A closer look at training AI models using kids with GoPros

AI researchers looking for better ways to train large language models are turning to the masters of language acquisition – children – to find out how it’s done.

Large language models – the complex neural networks behind the generative AI boom – are trained on mountains of data. Yet many of these models are little more than an overgrown autocomplete, predicting the next word with an increasingly disconcerting degree of accuracy.

Rather more impressive is the way human children pick up language. Toddler brains are like sponges, soaking in information from all around them and processing it into a coherent view of the world. Sure, the results from an LLM are quicker – getting a nine-month-old to sing the alphabet is a feat – but, over time, the child will likely become much smarter and more creative than the model.

“The best AI systems learn from this astronomical amount of text mined from the web and all over the place. The best systems now train on trillions of words in order to learn language,” Brenden Lake, a psychologist at New York University studying human and artificial intelligence, told The Register. “It’s remarkable that they [AI models] do become fluent in language, but of course, children don’t need nearly that much experience.”

Linguists and child development experts are nowhere near agreement on exactly how children acquire language. Lake believes that researching how children learn may hold the secret to making AI models far more efficient – and could also help children who struggle to pick up language themselves.

Lake’s latest research project seeks to determine how effectively an AI model can be trained solely using the stimuli experienced by a child learning their first words. So naturally, it involves strapping GoPro-like cameras to toddlers’ heads.

To do this, Lake and his team are said to be gathering video and audio data from more than 25 children around the US – including his own daughter, Luna.

The model attempts to associate video footage from the child’s perspective with the words spoken by the tyke’s caregiver, in a manner similar to how OpenAI’s Clip model connects captions to images, he explained. Clip can take an image as input and pick out the best-matching caption from a set of candidates, based on its training data of image-caption pairs.

Lake and co’s model, meanwhile, can take an image of a scene as input and output language to describe that scene, based on the training data from the GoPro footage and audio of the caregivers. It can also run in the other direction, retrieving frames seen during training that match a given description.
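For the curious, here is a minimal sketch of the Clip-style contrastive idea described above: frames and utterances are embedded in a shared vector space, matched pairs are pulled together, and mismatched pairs are pushed apart. The random embeddings and the 0.07 temperature are illustrative stand-ins, not details from Lake's model:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # Scale each row to unit length so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Pretend batch: four video frames paired with four caregiver utterances.
# In a real system these would come from trained image and text encoders.
frame_emb = normalize(rng.standard_normal((4, 128)))
text_emb = normalize(rng.standard_normal((4, 128)))

# Similarity matrix: entry [i, j] scores frame i against utterance j.
logits = frame_emb @ text_emb.T / 0.07  # 0.07 is a typical temperature

def cross_entropy(logits, targets):
    # Log-softmax over each row, then pick out the target column
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

# Each frame's true match is its own utterance: the diagonal of the matrix.
targets = np.arange(4)
loss = (cross_entropy(logits, targets) + cross_entropy(logits.T, targets)) / 2
print(f"contrastive loss: {loss:.3f}")
```

Training drives that loss down, which is what gradually aligns the caregiver's words with what the camera saw.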

This might sound straightforward: the model learns to match spoken words to objects observed in the video frame, just like a kid would. But as Lake points out, children aren’t always looking at the object or action being described. There are also further abstractions – such as when a child is offered milk that’s served in an opaque cup, so the word is heard without the liquid ever being visible. These are very loose associations, Lake notes.

The point of the experiment, he explained, isn’t whether a model can be trained to match objects in images to their corresponding words – that’s already been done by OpenAI and others. Instead, the researchers hope to understand whether a model can learn to identify objects using nothing more than the incredibly sparse dataset available to a child.

This is more or less the opposite of what we’re seeing from model builders like OpenAI, Google, Meta, and others. Meta’s third-generation Llama models, for instance, were trained on 15 trillion tokens – the chunks of words and punctuation that text is broken into for training.
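As a rough illustration of what a token is – real models like Llama use learned subword vocabularies, so the split below is only an approximation – even a crude regex carve-up shows how a sentence becomes a sequence of tokens:

```python
import re

# Crude stand-in for a tokenizer: split text into words and punctuation.
# Production tokenizers learn subword pieces, so their counts differ.
sentence = "The toddler points at the cup and says: milk!"
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(tokens)
# ['The', 'toddler', 'points', 'at', 'the', 'cup', 'and', 'says', ':', 'milk', '!']
print(len(tokens))  # 11 tokens from a 9-word sentence
```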

“I think there should be more focus not just on training larger and larger language models from more and more data,” Lake told us. “Yes, you can get amazing capabilities that way, but it starts to become more distant from what we know of as human intelligence and what we admire about human intelligence … that is, the ability to learn from limited input and then generalize very far from the data that we see.”

Early successes

Lake’s team has reason to believe this is possible. In February, they trained a neural network on the experiences of a young child using 61 hours of video footage.

That research, published in the journal Science, claimed the model was able to connect various words and phrases heard by the subject to the experiences captured in the frames of the videos. Presented with a word or phrase, the model was able to recall relevant images.
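A hypothetical sketch of that recall test, under the assumption that the model stores one embedding per training frame and retrieval is a nearest-neighbour lookup – the encoders here are random placeholders, not the published model:

```python
import numpy as np

rng = np.random.default_rng(1)

# One row per stored training frame, normalized to unit length.
frame_embeddings = rng.standard_normal((1000, 128))
frame_embeddings /= np.linalg.norm(frame_embeddings, axis=1, keepdims=True)

def embed_word(word: str) -> np.ndarray:
    """Placeholder for the model's text encoder."""
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

def recall_frames(word: str, k: int = 5) -> np.ndarray:
    """Indices of the k training frames most similar to the query word."""
    scores = frame_embeddings @ embed_word(word)
    return np.argsort(scores)[::-1][:k]

print(recall_frames("ball"))  # e.g. indices of frames where a ball appeared
```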

Lake adds that the model was also able to generalize, naming objects in images it wasn’t trained on – though accuracy understandably suffered in those cases. Promising as the results were, he says the model was really only a proof of concept.

“It didn’t learn everything that a child would know. That’s why the project is unfinished,” he stressed. “It was only about 60 hours of annotated speech. So that’s only about one percent of the experience a child would have gotten in that two-year period. We need more data in order to get a better sense of what’s learnable.”

Lake also admits that the methodology behind the first model imposed certain limitations: only video segments associated with a caregiver’s words were analyzed, and the footage itself was converted to still images at a rate of five frames per second.

Because of this, “it wasn’t really capable of learning things like verbs or action words or abstract words, because it was only getting static slices of what the world looks like,” he said. “It had no notion of what happened before. What happened afterwards. What was the context of the conversation, right? So, learning a word – like walk, or jump, or push – is gonna be really difficult to learn from just the frame.”
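For illustration, here is a sketch of that preprocessing step under some assumptions – OpenCV for decoding, and a hypothetical filename – showing how downsampling to roughly five frames per second keeps only static slices of the stream:

```python
import cv2  # OpenCV; assumed tooling, not necessarily what Lake's team used

def sample_frames(path: str, fps_out: float = 5.0):
    """Keep roughly fps_out still frames per second of video."""
    cap = cv2.VideoCapture(path)
    fps_in = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if metadata missing
    step = max(int(round(fps_in / fps_out)), 1)  # keep every Nth frame
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(frame)  # a static slice; motion between frames is lost
        i += 1
    cap.release()
    return frames

# frames = sample_frames("headcam_session.mp4")  # hypothetical filename
```

Each kept frame stands alone, which is exactly why action words like "walk" or "jump" were out of reach for the first model.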

This is now changing. As the technology behind modeling videos becomes more mature, Lake is looking to incorporate more of it into future models. Longer term, he suggests there may be opportunities to extend well beyond building more efficient models.

“If we’re able to build a model that really begins to acquire language – a lot like a child or in close correspondence to how children learn – it would open up really important applications for understanding learning and development, potentially understanding developmental disorders or cases where children struggle to learn language,” Lake said.

Such a model, he argued, could eventually be used to test millions of different approaches to speech therapy to identify which are the most effective. ®
