We showed them chess games and they became unbeatable opponents; we made them read our texts and they began to write; they also learned to paint and retouch photographs. Did anyone doubt that artificial intelligence could do the same with speech and music?
Google’s research division has introduced AudioLM (paper), a framework for generating high-quality audio that stays coherent over the long term. Starting from a recording of just a few seconds, it is able to extend it naturally and coherently. More notably, it does so without having been trained on transcripts or annotations, yet the generated speech is syntactically and semantically plausible. Furthermore, it maintains the identity and prosody of the speaker to the point where the listener cannot tell which part of the audio is original and which part has been generated by artificial intelligence.
The examples produced by this artificial intelligence are striking. Not only is it capable of reproducing articulation, pitch, timbre, and intensity, but it also captures the sound of the speaker’s breath and forms meaningful sentences. If the source is not a studio recording but one with background noise, AudioLM reproduces that noise as well to preserve continuity. You can listen to more samples on the AudioLM website.
An artificial intelligence trained in semantics and acoustics
How does it work? Generating audio or music is nothing new. What is new is the method devised by Google researchers to solve the problem. Two kinds of markers (tokens) are extracted from each audio clip: semantic markers, which encode high-level structure (phonemes, lexicon, semantics…), and acoustic markers, which encode the details of the sound (speaker identity, recording quality, background noise…). With this data already processed into a form the artificial intelligence can understand, AudioLM works hierarchically: it first predicts semantic markers, which are then used as constraints to predict the acoustic markers. The acoustic markers are finally decoded into something humans can listen to.
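The hierarchy described above can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions, not AudioLM's code: the vocabulary sizes, the function names (`continue_semantic`, `predict_acoustic`, `decode`, `generate`) and the arithmetic that stands in for the neural models are all hypothetical, and the sketch assumes one acoustic marker per semantic marker. It only shows the order of operations: continue the semantic markers first, predict acoustic markers constrained by them, then decode.

```python
import random
from typing import List

# Assumed toy vocabulary sizes; these are NOT AudioLM's real values.
SEMANTIC_VOCAB = 8
ACOUSTIC_VOCAB = 16


def continue_semantic(prompt: List[int], n_new: int, rng: random.Random) -> List[int]:
    """Stage 1: extend the semantic markers one at a time (autoregressively)."""
    tokens = list(prompt)
    for _ in range(n_new):
        # Toy stand-in for a language model: the next marker depends on the last one.
        tokens.append((tokens[-1] + rng.choice([1, 2])) % SEMANTIC_VOCAB)
    return tokens


def predict_acoustic(semantic: List[int], acoustic_prompt: List[int]) -> List[int]:
    """Stage 2: predict acoustic markers, constrained by the semantic markers."""
    # Toy "speaker identity" carried over from the prompt.
    speaker_offset = acoustic_prompt[0] % 2
    acoustic = list(acoustic_prompt)
    for s in semantic[len(acoustic_prompt):]:
        acoustic.append((2 * s + speaker_offset) % ACOUSTIC_VOCAB)
    return acoustic


def decode(acoustic: List[int]) -> List[float]:
    """Stage 3: map acoustic markers to waveform samples in [-1, 1] (toy decoder)."""
    return [a / (ACOUSTIC_VOCAB - 1) * 2 - 1 for a in acoustic]


def generate(semantic_prompt: List[int], acoustic_prompt: List[int],
             n_new: int, seed: int = 0) -> List[float]:
    """Run the three stages in order on a short prompt."""
    if not semantic_prompt or len(semantic_prompt) != len(acoustic_prompt):
        raise ValueError("prompts must be non-empty and of equal length")
    rng = random.Random(seed)
    semantic = continue_semantic(semantic_prompt, n_new, rng)
    acoustic = predict_acoustic(semantic, acoustic_prompt)
    return decode(acoustic)


if __name__ == "__main__":
    # A three-marker "recording" extended by five more steps.
    samples = generate([1, 2, 3], [4, 5, 6], n_new=5)
    print(len(samples), "samples")
```

The point of the design is that the second stage never chooses what is said, only how it sounds, which is why the continuation can stay coherent while keeping the voice of the prompt.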
This separation of semantics from acoustics, together with the hierarchy between them, is not only beneficial for training language models to generate speech. According to the researchers, it is also more effective for continuing piano compositions, as they show on their website. The results are much better than those of models trained only on acoustic markers.
The most important thing about AudioLM’s artificial intelligence is not that it is able to continue speech and melodies, but that it can do both with the same approach. It is, then, a single language model that could be used to convert text to speech: a robot could read entire books aloud and give voiceover professionals a break, or any device could communicate with people using a familiar voice. This idea has already been explored by Amazon, which has considered using the voices of your loved ones in its Alexa speakers.
Exciting or dangerous?
Software like DALL-E 2 and Stable Diffusion is a great tool for sketching out ideas or generating creative assets in seconds, like the illustration used on the cover of this article. Audio could be even more important, and one can imagine companies using an announcer’s voice on demand. Movies could even be dubbed with the voices of deceased actors. The reader may wonder whether this possibility, although exciting, might also be dangerous. Any audio recording could be manipulated for political, legal or judicial purposes. According to Google, while humans struggle to tell what was produced by a human and what was produced by artificial intelligence, a computer can detect whether the audio is organic or not. In other words, not only can the machine replace us, but to assess its work we will need another machine.
For now, AudioLM is not open to the public; it is just a language model that can be integrated into different projects. But this demo, along with OpenAI’s Jukebox music program, shows how quickly we are entering a new world where no one will know, or care, whether that photo was taken by a person, or whether there is a person or an artificially generated voice on the other side of the screen or the other end of the line in real time.