Meituan LongCat-Next: A Native Multimodal Model That Sees and Hears Like a Human — Open-Sourced
On June 15, 2026, Meituan's LongCat team open-sourced LongCat-Next, a native multimodal model that treats vision and speech as first-class input channels rather than secondary add-ons.
This is different from the LongCat-Video-Avatar 1.5 released earlier this month. Where the avatar model generates digital human videos, LongCat-Next is a perception model — it understands the physical world through sight and sound, then acts on that understanding.
What Makes LongCat-Next Different
Most multimodal models today tack vision onto a language backbone: convert an image to tokens, feed it through a text-prediction pipeline, hope for the best.
LongCat-Next flips that architecture. Vision and speech are native languages in the model, not bolt-on capabilities. The model was designed from the ground up to:
- Perceive real-world environments through camera and microphone input
- Understand spatial relationships, object interactions, and human gestures
- Respond naturally through speech with context awareness
"We're not adding vision to a language model," the LongCat team stated. "We're building a model that happens to speak, see, and listen equally well."
What's Been Open-Sourced
The release includes two components:
- The core LongCat-Next model — weights and inference code for the native multimodal architecture
- A discrete tokenizer — converts visual and audio input into the model's native token format
This gives the developer community the building blocks to create context-aware AI systems that understand physical spaces — think warehouse robots that see what they're picking, AR assistants that understand your surroundings, or accessibility tools that describe the world to visually impaired users.
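To make the "discrete tokenizer" idea concrete, here is a minimal sketch of one common way to turn continuous features into tokens: nearest-codebook lookup (vector quantization), followed by shifting each modality's codes into its own range of a shared vocabulary. Everything in it is an assumption for illustration: the function names, the tiny codebooks, the vocabulary size, and the offsets are hypothetical, and the release notes do not say that LongCat-Next's tokenizer works this way.

```python
# Hypothetical sketch of discrete tokenization via nearest-codebook lookup.
# Names, codebook values, and offsets are illustrative assumptions only;
# this is NOT LongCat-Next's actual tokenizer or API.
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(vec, codebook[i]))
        for vec in features
    ]


def to_shared_vocab(codes: List[int], offset: int) -> List[int]:
    """Shift modality-local codes into a disjoint range of one shared vocabulary."""
    return [offset + code for code in codes]


# Assumed toy sizes: 1000 text tokens, 4 image codes, 3 audio codes.
TEXT_VOCAB_SIZE = 1000
IMAGE_CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
AUDIO_CODEBOOK = [(-1.0,), (0.0,), (1.0,)]

IMAGE_OFFSET = TEXT_VOCAB_SIZE
AUDIO_OFFSET = IMAGE_OFFSET + len(IMAGE_CODEBOOK)

# Stand-ins for encoder outputs (image patches and audio frames).
image_patches = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.9)]
audio_frames = [(-0.7,), (0.1,), (0.6,)]

image_tokens = to_shared_vocab(quantize(image_patches, IMAGE_CODEBOOK), IMAGE_OFFSET)
audio_tokens = to_shared_vocab(quantize(audio_frames, AUDIO_CODEBOOK), AUDIO_OFFSET)
text_tokens = [17, 42]  # arbitrary placeholder text token ids

# One flat sequence that a single next-token model could consume.
sequence = text_tokens + image_tokens + audio_tokens
print(sequence)  # [17, 42, 1000, 1003, 1002, 1004, 1005, 1006]
```

The point of the sketch is the shape of the data, not the numbers: once images and audio are discrete ids in the same vocabulary as text, one model can read and predict all three in a single sequence.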
Why This Matters for the Open-Source Ecosystem
The open-source AI world has been dominated by language models and image generators. LongCat-Next fills a missing piece: real-world perception.
For comparison:
- LLaMA / DeepSeek / Qwen → text reasoning
- Stable Diffusion / Flux → image generation
- Whisper → speech recognition
- LongCat-Next → unified perception + understanding
This is the model that could power the next generation of robotics, autonomous systems, and AR/VR — all without vendor lock-in.
Meanwhile: General 365 Benchmark Results Released
Alongside LongCat-Next, the team released the official results of their General 365 reasoning benchmark, evaluating 26 mainstream AI models.
The findings are sobering:
| Model | General 365 Score |
|---|---|
| Gemini 3 Pro | 62.8% 🥇 |
| Claude Opus 4.8 | ~58% |
| GPT-5.5 | ~56% |
| DeepSeek V4 | ~55% |
| Majority of 26 models | Below 60% |
The 60% threshold was set as the "passing mark." Most models failed.
This benchmark focuses on complex logical reasoning — multi-step deduction, counterfactual thinking, and causal inference — not just knowledge retrieval. The results suggest that even frontier models have significant blind spots in genuine reasoning.
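As a quick sanity check on the table, the sketch below applies the 60% passing mark to the four scores listed above. The values marked with "~" in the table are approximate, so the gaps are approximate too, and treating the mark as "greater than or equal to 60" is an assumption; the variable names are illustrative.

```python
# Apply the 60% passing mark to the scores reported in the table.
# Scores shown with "~" in the table are approximate values.
# Assumption: a score passes if it is >= the passing mark.
PASS_MARK = 60.0

reported_scores = {
    "Gemini 3 Pro": 62.8,
    "Claude Opus 4.8": 58.0,  # approximate
    "GPT-5.5": 56.0,          # approximate
    "DeepSeek V4": 55.0,      # approximate
}

passed = [name for name, score in reported_scores.items() if score >= PASS_MARK]

# Percentage points each failing model would need to reach the mark.
gap_to_pass = {
    name: round(PASS_MARK - score, 1)
    for name, score in reported_scores.items()
    if score < PASS_MARK
}

print(passed)       # ['Gemini 3 Pro']
print(gap_to_pass)  # {'Claude Opus 4.8': 2.0, 'GPT-5.5': 4.0, 'DeepSeek V4': 5.0}
```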
Monday AI Roundup: More June Stories Worth Watching
Apple's Siri Gets a Brain
At WWDC 2026, Apple unveiled a completely rebuilt Siri that can:
- Understand on-screen context
- Search messages and photos across apps
- Execute multi-step actions (book a ride, send a message, add a calendar event)
- Maintain cross-device conversation memory
With over 1 billion active iPhones, this may be the largest consumer AI agent deployment to date. Apple's privacy-first approach (on-device processing plus Private Cloud Compute) gives it a trust advantage that competitors will find hard to replicate.
Jeff Bezos Returns with Prometheus AI
For the first time since stepping down as Amazon's CEO, Jeff Bezos is backing a new AI venture, Prometheus, focused on industrial engineering.
Unlike chatbot companies, Prometheus aims to help engineers design, simulate, optimize, and manufacture physical products. If successful, this could be one of the most consequential AI applications outside of software.
Moonshot's Kimi Work: Multi-Agent Desktop
Chinese AI company Moonshot launched Kimi Work, a desktop platform that orchestrates hundreds of AI agents simultaneously for research, analysis, report generation, and workflow automation. Its local-first design may also appeal to organizations concerned about data privacy.
Google I/O 2026: Gemini Omni and Gemini 3.5 Flash
Google announced two new models at I/O:
- Gemini Omni — can create anything from any input, starting with video. A leap in world understanding and multimodal editing.
- Gemini 3.5 Flash — first in a new family combining frontier intelligence with action capabilities.
The Big Picture: June 2026's Two Axes
If you look across this week's news, two clear directions emerge:
Axis 1: Multimodal Perception — Models that see, hear, and understand the physical world (LongCat-Next, Gemini Omni, Apple's on-device AI)
Axis 2: Multi-Agent Orchestration — Systems that coordinate multiple AI agents for complex tasks (Kimi Work, Apple Siri, Hermes scheduled agents)
The companies that win the next phase won't be the ones with the best chatbot. They'll be the ones that bridge the gap between "AI that talks" and "AI that acts."
What's your take on the native multimodal approach vs. bolt-on vision?