Tencent AI Chief: Multimodal & Embodied AI Define Next Era

💡 Tencent's Shunyu Yao says AI is a long-term game and that the focus is shifting from single-path LLMs to multimodal and embodied intelligence.

Tencent Chief AI Scientist Shunyu Yao has declared that the artificial intelligence sector is entering a new phase characterized by diversification rather than a single dominant trajectory. Speaking at the Tencent Cloud AI Industry Application Conference, Yao emphasized that the industry must move beyond the current obsession with large language models (LLMs) to embrace multimodal and embodied intelligence.

This strategic pivot signals a maturing market where short-term profit chasing gives way to sustainable, long-term innovation. The following analysis breaks down Yao's key predictions and their implications for global tech players such as OpenAI and Microsoft, as well as for emerging startups.

Key Takeaways from the Tencent AI Forum

  • Long-Term Horizon: Yao compares AI development to the PC revolution of the 1970s, suggesting decades of growth remain.
  • Diversification Strategy: The industry is moving away from a single path (pre-training/post-training) toward multiple specialized directions.
  • Beyond Coding Agents: While coding assistants are popular, new opportunities exist in robotics, sensory integration, and physical interaction.
  • Rejection of Quick Profits: Yao criticizes the 'get rich quick' mentality prevalent in some Silicon Valley circles.
  • Unfilled Market Space: Significant portions of the AI application landscape remain unexplored and untapped.
  • Multimodal Dominance: Future 'super apps' will likely integrate text, image, audio, and video seamlessly.

Rejecting the 'Get Rich Quick' Mentality in Silicon Valley

Shunyu Yao’s first major assertion challenges the prevailing narrative among certain Western investors and developers who view AI as a transient boom. He explicitly criticized the mindset of those seeking to 'make money for two years and retire.' This perspective ignores the fundamental nature of technological revolutions. History shows that transformative technologies require decades to fully mature and integrate into society.

Yao draws a direct parallel between the current state of AI and the personal computer (PC) era of the 1970s. Just as the PC did not immediately replace all manual processes, AI is only beginning its journey. The initial hype cycle surrounding ChatGPT and similar tools represents merely the foundation, not the pinnacle. Developers expecting immediate, massive returns without sustained investment risk missing the broader evolution.

This long-term view aligns with strategies adopted by major US corporations. Companies like Microsoft and Google have committed billions to AI infrastructure, acknowledging that ROI will accrue over years, not quarters. By framing AI as a marathon, Yao encourages a more stable, research-driven approach. This stability is crucial for building robust systems that can handle complex, real-world tasks without frequent catastrophic failures.

The Evolution Beyond Single-Path Development

The second core judgment focuses on the structural evolution of AI technology. For the past several years, the industry has followed a relatively linear progression: pre-training foundational models, followed by post-training alignment, and finally deploying agents. While this path yielded significant breakthroughs, it is no longer the sole avenue for innovation.

Yao argues that the future lies in diversification. The market is fragmenting into specialized domains, each requiring unique architectures and training data. This shift prevents monopolization by a few generalist models and opens doors for niche players. It also reduces the risk associated with betting everything on a single technological stack.

Emerging Frontiers: Multimodal and Embodied Intelligence

The most actionable insight from Yao’s speech is the rise of multimodal and embodied intelligence. Multimodal AI refers to systems that can process and generate information across several formats (text, images, audio, and video), often within a single interaction. Unlike traditional LLMs, which primarily handle text, multimodal models also draw context from visual and audio inputs.

For example, a multimodal system can analyze a video feed, understand the spoken dialogue, and interpret visual gestures to provide a comprehensive response. This capability is essential for applications in healthcare, autonomous driving, and customer service. It moves AI from a passive text generator to an active perceptual agent.

The Rise of Embodied AI

Embodied intelligence takes this concept further by integrating AI with physical bodies, such as robots or autonomous vehicles. This field requires the convergence of software intelligence with hardware mechanics. Unlike cloud-based LLMs, embodied AI must operate in real time within physical constraints.

Yao notes that while coding agents have dominated recent headlines, they represent only a fraction of potential applications. Robotics offers a vast, largely empty space for innovation. Companies such as Tesla (with Optimus) and Boston Dynamics are early movers, but the ecosystem remains open. Startups focusing on specific industrial robots, home assistants, or agricultural drones will find ample opportunity.

This divergence suggests that the next generation of 'super apps' will not be chat interfaces. Instead, they will be physical devices or integrated systems that interact with the world. This shift demands new skills from developers, including knowledge of sensor fusion, control theory, and real-time computing.
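
To make the sensor fusion skill mentioned above more concrete, here is a minimal Python sketch of a complementary filter, a common introductory technique for combining a gyroscope and an accelerometer into a single tilt estimate. It is an illustration only and is not drawn from Yao's talk: the function names, the `alpha` weighting, and the sample readings are all assumptions chosen for the example.

```python
def complementary_filter(angle_prev, gyro_rate, accel_angle, dt, alpha=0.98):
    """Blend two noisy estimates of a tilt angle (degrees).

    gyro_rate   -- angular velocity from a gyroscope (deg/s); smooth but drifts
    accel_angle -- tilt derived from an accelerometer (deg); noisy but drift-free
    alpha       -- hypothetical weighting; closer to 1.0 trusts the gyro more
    """
    # Integrate the gyro rate over one time step to predict the new angle.
    gyro_angle = angle_prev + gyro_rate * dt
    # Weighted blend: gyro for short-term motion, accelerometer for long-term truth.
    return alpha * gyro_angle + (1.0 - alpha) * accel_angle


def fuse(samples, dt, alpha=0.98):
    """Run the filter over a list of (gyro_rate, accel_angle) readings."""
    if not samples:
        return []
    # Assumption: start from the first accelerometer reading.
    angle = samples[0][1]
    estimates = []
    for gyro_rate, accel_angle in samples:
        angle = complementary_filter(angle, gyro_rate, accel_angle, dt, alpha)
        estimates.append(angle)
    return estimates


if __name__ == "__main__":
    # Made-up readings sampled at 100 Hz (dt = 0.01 s), purely for illustration.
    readings = [(5.0, 0.0), (5.0, 0.2), (4.0, 0.1), (0.0, 0.15)]
    for step, estimate in enumerate(fuse(readings, dt=0.01)):
        print(f"step {step}: estimated tilt = {estimate:.4f} deg")
```

Real embodied systems typically use more sophisticated estimators, such as Kalman filters, and must also meet hard timing deadlines. The sketch only shows the basic idea of weighting sensors according to their different error characteristics.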

Industry Context and Strategic Implications

The global AI landscape is currently defined by a race for dominance in LLMs. However, Yao’s comments suggest that this race may reach a saturation point sooner than expected. As foundational models become commoditized, value will shift to application layers and specialized modalities.

Western companies must adapt to this multimodal reality. OpenAI’s GPT-4 already incorporates vision capabilities, and Anthropic’s Claude models are expanding into similar territory. However, integrating these features into cohesive, user-friendly products remains a challenge. The competition is no longer just about model accuracy but about a seamless user experience across different media types.

Furthermore, the emphasis on embodied intelligence highlights the importance of hardware-software synergy. This trend favors companies with vertical integration capabilities, such as Apple or Xiaomi, which control both the device and the operating system. Pure software players may need to form strategic partnerships with hardware manufacturers to stay relevant.

What This Means for Developers and Businesses

For developers, the message is clear: diversify your skill set. Mastery of prompt engineering alone will not suffice. Understanding how to train models on mixed datasets and how to deploy them on edge devices will become critical. Businesses should invest in R&D for multimodal applications, particularly in sectors where visual and auditory data are paramount, such as retail analytics and security.

Investors should look beyond pure LLM startups. Opportunities abound in companies developing sensors, robotic actuators, and middleware that connects AI brains to physical bodies. The market is rewarding innovation that bridges the digital and physical worlds.

Looking Ahead: The Next Decade of AI

As we look toward the future, the distinction between digital and physical intelligence will blur. We can expect to see more hybrid models that combine the reasoning power of LLMs with the perceptual abilities of multimodal systems and the action capabilities of robotics. This convergence will drive the next wave of productivity gains and economic transformation.

This transition is likely to be gradual rather than sudden. Within the next 3 to 5 years, we will likely see widespread adoption of embodied AI in industrial settings. Consumer-grade embodied AI may take longer due to cost and safety concerns, but progress should accelerate as hardware costs decrease.

Yao’s comparison to the 1970s PC era serves as a reminder that patience and persistence are virtues in AI development. The early adopters who build robust, versatile systems today will define the standards of tomorrow. The 'empty spaces' he mentions are not voids but canvases waiting for creative solutions.

Gogo's Take

  • 🔥 Why This Matters: The shift to multimodal and embodied AI marks the end of the 'chatbot-only' era. It signifies that AI is becoming a tangible, interactive force in the physical world, creating massive opportunities in robotics, smart manufacturing, and advanced consumer electronics beyond simple text generation.
  • ⚠️ Limitations & Risks: Developing embodied AI requires significant capital investment in hardware and rigorous safety testing. Unlike software bugs, errors in physical robots can cause real-world damage. Additionally, the complexity of multimodal data integration increases computational costs and energy consumption significantly.
  • 💡 Actionable Advice: Developers should start experimenting with multimodal APIs from providers like OpenAI and Google Cloud to understand cross-modal data handling. Investors should prioritize startups with strong hardware partnerships or proprietary sensor technology, rather than those relying solely on generic LLM wrappers.