On Tuesday, Meta introduced SeamlessM4T, a multimodal AI model for speech and text translations. As a neural network that can process both text and audio, it can perform text-to-speech, speech-to-text, speech-to-speech, and text-to-text translations for “up to 100 languages,” according to Meta. Its goal is to help people who speak different languages communicate with each other more effectively.
Continuing its relatively open approach to AI, Meta is releasing SeamlessM4T under a research license (CC BY-NC 4.0) that allows developers to build on the work. The company is also releasing SeamlessAlign, which Meta calls “the biggest open multimodal translation dataset to date, totaling 270,000 hours of mined speech and text alignments.” That will likely kick-start the training of future translation AI models from other researchers.
On its promotional blog, Meta says the model can perform speech recognition (you give it audio of speech, and it converts it to text), speech-to-text translation (it translates spoken audio into text in a different language), speech-to-speech translation (you feed it speech audio, and it outputs translated speech audio), text-to-text translation (similar to how Google Translate functions), and text-to-speech translation (you feed it text, and it translates and speaks it in another language). Each of the text translation functions supports nearly 100 languages, and the speech output functions support about 36 output languages.
In the SeamlessM4T announcement, Meta references the Babel Fish, a fictional fish from Douglas Adams’ classic sci-fi series that, when placed in one’s ear, can instantly translate any spoken language:
“Building a universal language translator, like the fictional Babel Fish in The Hitchhiker’s Guide to the Galaxy, is challenging because existing speech-to-speech and speech-to-text systems only cover a small fraction of the world’s languages. But we believe the work we’re announcing today is a significant step forward in this journey.”
How did they train it? According to the SeamlessM4T research paper, Meta’s researchers “created a multimodal corpus of automatically aligned speech translations of more than 470,000 hours, dubbed SeamlessAlign” (mentioned above). They then “filtered a subset of this corpus with human-labeled and pseudo-labeled data, totaling 406,000 hours.”
As usual, Meta is being a little vague about where it got its training data. The text data came from “the same dataset deployed in NLLB” (sets of sentences pulled from Wikipedia, news sources, scripted speeches, and other sources and translated by professional human translators). And SeamlessM4T’s speech data came from “4 million hours of raw audio originating from a publicly available repository of crawled web data,” of which 1 million hours were in English, according to the research paper. Meta did not specify which repository or the provenance of the audio clips used.
Meta is far from the first AI company to offer machine-learning translation tools. Google Translate has used machine-learning techniques since 2006, and large language models (such as GPT-4) are well known for their ability to translate between languages. But more recently, the tech has heated up on the audio processing front. In September 2022, OpenAI released its own open source speech-to-text translation model, called Whisper, that can recognize speech in audio and translate it to text with a high level of accuracy.
SeamlessM4T builds on that trend by expanding multimodal translation to many more languages. In addition, Meta says that SeamlessM4T’s “single system approach,” a monolithic AI model instead of multiple models combined in a chain (like some of Meta’s previous audio-processing techniques), reduces errors and increases the efficiency of the translation process.