This algorithm can learn a language just by watching videos
Researchers at the Massachusetts Institute of Technology (MIT) have introduced DenseAV, an algorithm that learns language solely by watching videos of people speaking. The model has potential applications in multimedia search, language learning, and robotics, and could one day help decode communication between animals.
Mark Hamilton, a PhD student in electrical engineering and computer science, leads the project with colleagues at MIT’s Computer Science and Artificial Intelligence Laboratory. Their long-term goal is to use machines to decode animal communication, and teaching a model to acquire human language is the first step.
The inspiration for the algorithm came from a film: in one scene, a penguin falls and lets out a groan as it struggles to get up. Hamilton noticed that the groan almost functioned as a word, which sparked the idea that sound and video together could be used to teach an algorithm a language.
This idea led to the development of DenseAV, a model designed to learn language by predicting visual content from sound. For example, hearing the phrase “bake the cake at 350” would lead the model to expect visuals of a cake or an oven.
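At its core this is a contrastive objective: the model is rewarded for matching an audio clip with the visual content it co-occurs with, and penalized for matching it with anything else. The sketch below illustrates that idea in PyTorch; it is a minimal illustration, not DenseAV's actual code, and the encoder outputs are assumed to already exist as fixed-size embeddings.

```python
import torch
import torch.nn.functional as F

def contrastive_av_loss(audio_emb, video_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of (audio, video) pairs.

    audio_emb, video_emb: (batch, dim) tensors from separate encoders.
    The i-th audio clip matches the i-th video frame; every other
    pairing in the batch serves as a negative example.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = audio_emb @ video_emb.T / temperature  # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric loss: retrieve the right video from audio and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```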
To match sound and visual content across millions of videos, DenseAV had to learn from context alone what people were talking about. After training the model on this matching task, the research team examined which pixels it focused on while processing a given sound.
When the word “dog” was spoken, the algorithm searched for dog images in the video stream, indicating it understood the meaning of the word. Similarly, when it heard a dog bark, it looked for dogs in the video.
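One way to picture this inspection step: if the model produces dense per-pixel visual features, the similarity between an audio embedding and each pixel's feature can be rendered as a heatmap that lights up wherever the sound "lands" in the frame. The function name and tensor shapes below are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

def sound_to_pixel_heatmap(audio_emb, pixel_feats):
    """audio_emb: (dim,) embedding of a sound, e.g. the spoken word "dog".
    pixel_feats: (dim, H, W) dense visual features for one video frame.
    Returns an (H, W) map of per-pixel cosine similarity."""
    a = F.normalize(audio_emb, dim=0)
    p = F.normalize(pixel_feats, dim=0)
    # High values mark the pixels most strongly coupled to the sound.
    return torch.einsum("d,dhw->hw", a, p)
```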
The team then wondered whether DenseAV could distinguish the word “dog” from the sound of a dog barking. By giving the model a “dual-brain” design, they found that one side naturally specialized in language, such as the spoken word “dog,” while the other specialized in sounds, such as barking.
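Reading “dual brain” as two parallel projection heads over shared audio features is one plausible interpretation; the sketch below is a guess at that structure, not the published architecture. Nothing in it forces the split: the contrastive training signal alone lets the heads divide the labor between speech and ambient sound.

```python
import torch.nn as nn

class TwoHeadAudioProjector(nn.Module):
    """Two heads over the same audio features; each is free to specialize."""
    def __init__(self, feat_dim=512, emb_dim=256):
        super().__init__()
        self.speech_head = nn.Linear(feat_dim, emb_dim)  # tends toward words
        self.sound_head = nn.Linear(feat_dim, emb_dim)   # tends toward noises

    def forward(self, audio_feats):
        # audio_feats: (batch, feat_dim). Both heads see the same input;
        # the training objective, not the architecture, drives specialization.
        return self.speech_head(audio_feats), self.sound_head(audio_feats)
```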
Because the team aimed to rediscover language from scratch, without pre-trained language models or any text input, DenseAV faced the harder task of learning purely from audio and video. The approach mirrors how children pick up language by observing and listening to their surroundings.
Sources:
https://www.techtimes.com/articles/305612/20240612/mit-unveils-new-algorithm-learns-language-watching-videos.htm