John Naughton 

OpenAI’s new video generation tool could learn a lot from babies

The footage put together by Sora looks swish, but closer examination reveals it doesn’t understand physical reality
  
  

A dalmatian puppy leaning out the window of a colourful house
A still from a film made by OpenAI using its AI video-generation tool Sora. Photograph: OpenAI/Sora

“First text, then images, now OpenAI has a model for generating videos,” screamed Mashable the other day. The makers of ChatGPT and Dall-E had just announced Sora, a text-to-video diffusion model. Cue excited commentary all over the web about what will doubtless become known as T2V, covering the usual spectrum – from “Does this mark the end of [insert threatened activity here]?” to “meh” and everything in between.

Sora (the name is Japanese for “sky”) is not the first T2V tool, but it looks more sophisticated than earlier efforts like Meta’s Make-a-Video AI. It can turn a brief text description into a detailed, high-definition film clip up to a minute long. For example, the prompt “A cat waking up its sleeping owner, demanding breakfast. The owner tries to ignore the cat, but the cat tries new tactics, and finally, the owner pulls out his secret stash of treats from underneath the pillow to hold off the cat a little longer,” produces a slick video clip that would go viral on any social network.

Cute, eh? Well, up to a point. OpenAI seems uncharacteristically candid about the tool’s limitations. It may, for example, “struggle with accurately simulating the physics of a complex scene”.

That’s putting it mildly. One of the videos in its sample set illustrates the model’s difficulties. The prompt that produces the movie is “Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee”. At first sight, it’s impressive. But then one notices that one of the ships moves quickly in an inexplicable way, and it becomes clear that while Sora may know a lot about the reflection of light in fluids, it knows little or nothing about the physical laws that govern the movements of galleons.

Other limitations: Sora can be a bit hazy on cause and effect; “a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark”. Tut, tut. It may also “confuse spatial details of a prompt, for example mixing up left and right”. And so on.

Still, it’s a start, and will doubtless get better with another billion teraflops of computing power. And though Hollywood studio bosses can continue to sleep easy in their king-size beds, Sora will soon be good enough to replace some kinds of stock video, just as AIs such as Midjourney and Dall-E are replacing Shutterstock-type photography.

Despite its concessions about the tool’s limitations, though, OpenAI says Sora “serves as a foundation for models that can understand and simulate the real world”. This, it says, will be an “important milestone” in achieving artificial general intelligence (AGI).

And here’s where things get interesting. OpenAI’s corporate goal, remember, is to achieve the holy grail of AGI, and the company seems to believe that generative AIs represent a tangible step towards that goal. The problem is that getting to AGI means building machines that have an understanding of the real world that is at least on a par with ours. Among other things, that requires an understanding of the physics of objects in motion. So the implicit bet in the OpenAI project seems to be that one day, given enough computing power, machines that are able to predict how pixels move on a screen will also have learned how the physical objects they are portraying will behave in real life. In other words, it’s a wager that extrapolation of the machine-learning paradigm will eventually get us to superintelligent machines.

But AIs that are able to navigate the real world will need to understand more than how the laws of physics operate in that world. They will also need to figure out how humans operate in it. And to anyone who has followed the work of Alison Gopnik, that looks like a bit of a stretch for the kind of machines that the world currently regards as “AI”.

Gopnik is famous for her research into how children learn. Watching her TED talk, What Do Babies Think?, would be a salutary experience for techies who imagine that technology is the answer to the intelligence question. Decades of research exploring the sophisticated intelligence-gathering and decision-making that babies are doing when they play has led her to the conclusion that “babies and young children are like the R&D division of the human species”. Having spent the past year watching our granddaughter’s development, and in particular observing how she is beginning to figure out causality, this columnist is inclined to agree. If Sam Altman and the guys at OpenAI are really interested in AGI, maybe they should spend some time with babies.

What I’ve been reading

Algorithmic politics
Henry Farrell has written a seminal essay about the political economy of AI.

Bot habits
There is a reflective piece in the Atlantic by Albert Fox Cahn and Bruce Schneier on how chatbots will change the way we talk.

No call-up
Science fiction writer Charlie Stross has written a blogpost on why Britain couldn’t implement conscription, even if it wanted to.

 
