Google is preparing to merge its Gemini AI models with Veo, its advanced video-generating system, in a strategic move to build more intelligent, real-world-aware digital assistants. This plan was revealed by DeepMind CEO Demis Hassabis during an episode of Possible, a podcast co-hosted by LinkedIn co-founder Reid Hoffman.
Hassabis explained that Gemini was designed to be multimodal from the beginning. This means it can process and generate various forms of media—including text, audio, and images. “We’ve always built Gemini as a multimodal model,” he said. “And the reason is we envision a universal assistant—one that can actually help in real-life situations.”
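To make the idea of a multimodal prompt concrete, here is a minimal sketch using Google's google-generativeai Python SDK, which already lets developers send mixed image-and-text requests to Gemini. The model name, image path, and API key below are illustrative placeholders, not details from the podcast.

```python
# A minimal sketch of a multimodal Gemini request via the
# google-generativeai SDK. Model name, image path, and API key
# are illustrative placeholders, not details from the article.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential

model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name
photo = Image.open("street_scene.jpg")  # any local image

# One call mixes modalities: an image plus a text instruction.
response = model.generate_content(
    [photo, "Describe what is happening here and what is likely to happen next."]
)
print(response.text)
```

A single call accepts both an image and a text instruction, which is the "universal assistant" pattern Hassabis describes: the model reasons over several media types at once rather than handling each in a separate system.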
This development aligns with a larger trend in artificial intelligence. Tech companies are racing to build what are being called “omni” models—AI systems that can understand and synthesize text, audio, images, and video in a unified framework. Google’s latest Gemini version already supports image and audio generation. Similarly, OpenAI’s ChatGPT now includes image creation tools, while Amazon is expected to launch its own “any-to-any” AI model later this year.
By integrating Gemini with Veo, Google aims to deepen the AI's understanding of the physical world. The combined system could allow future digital assistants to go beyond conversation and observe, interpret, and even anticipate real-world events, drawing on what the models have learned from video.
YouTube’s Role in Teaching AI Real-World Physics
A key component of this strategy is video data. According to Hassabis, Google is leveraging its vast library of YouTube content to train Veo to understand real-world physics. “Basically, by watching a lot of YouTube videos, Veo 2 can figure out the physics of the world,” he said. This includes how objects move, interact, and respond to various conditions—something that’s difficult to learn from text alone.
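One common technique for absorbing implicit physics from raw footage is self-supervised next-frame prediction, where the video itself supplies the training signal. The toy PyTorch sketch below illustrates that general idea only; Hassabis did not describe Veo's actual architecture or training procedure, so every detail here is an assumption.

```python
# A toy sketch of self-supervised next-frame prediction, one common way
# video models pick up implicit physics (how objects move and interact).
# This illustrates the general idea only; it is NOT a description of
# Veo's actual architecture or training procedure.
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    """Predicts frame t+1 from frame t; real video models are far larger."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),
        )

    def forward(self, frame):
        return self.net(frame)

model = NextFramePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Stand-in for a batch of consecutive video frames (batch, channels, H, W).
frames_t = torch.rand(8, 3, 64, 64)
frames_t1 = torch.rand(8, 3, 64, 64)

# The supervision signal is the video itself: predict what happens next.
prediction = model(frames_t)
loss = loss_fn(prediction, frames_t1)
loss.backward()
optimizer.step()
print(f"reconstruction loss: {loss.item():.4f}")
```

To predict the next frame well, a model has to internalize regularities like gravity, momentum, and occlusion, which is the sense in which watching video can teach physics that text alone cannot.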
Although Google hasn’t confirmed the full extent of YouTube’s role, the company has previously said its models may be trained on some content from the platform. In 2023, Google updated its terms of service, reportedly in part to allow more data to be used for AI training. The change sparked debate over creator rights, transparency, and ethical data use.
Despite the concerns, the integration of Gemini and Veo is a logical next step in Google’s push toward more human-like AI systems. A digital assistant that can both understand and generate video content opens the door to smarter applications, from personal assistants and creative tools to educational platforms and enterprise software.
This shift could make AI not just more useful, but more intuitive. By learning from the world as humans do—through vision, sound, and interaction—future AI systems may better understand context, nuance, and intention.
With this new direction, Google continues to shape the future of multimodal AI, aiming to make assistants not only more powerful but also more grounded in how we experience reality.