Meituan LongCat-Next: A Native Multimodal Model That Sees and Hears Like a Human

On June 15, 2026, Meituan's LongCat team open-sourced LongCat-Next — a native multimodal model that treats vision and speech as first-class input channels, not secondary annotations.

This is different from the LongCat-Video-Avatar 1.5 released earlier this month. Where the avatar model generates digital human videos, LongCat-Next is a perception model — it understands the physical world through sight and sound, then acts on that understanding.

What Makes LongCat-Next Different

Most multimodal models today tack vision onto a language backbone: convert an image to tokens, feed it through a text-prediction pipeline, hope for the best.

LongCat-Next flips that architecture. Vision and speech are native languages in the model, not bolt-on capabilities. The model was designed from the ground up to:

Perceive real-world environments through camera and microphone input
Understand spatial relationships, object interactions, and human gestures
Respond naturally through speech with context awareness

"We're not adding vision to a language model," the LongCat team stated. "We're building a model that happens to speak, see, and listen equally well."

What's Been Open-Sourced

The release includes two components:

The core LongCat-Next model — weights and inference code for the native multimodal architecture
A discrete tokenizer — converts visual and audio input into the model's native token format

This gives the developer community the building blocks to create context-aware AI systems that understand physical spaces — think warehouse robots that see what they're picking, AR assistants that understand your surroundings, or accessibility tools that describe the world to visually impaired users.
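To make the "discrete tokenizer" idea concrete, here is a minimal sketch of one common way to turn continuous features into token IDs: nearest-neighbour lookup against a codebook (vector quantization). This is an illustration only, not LongCat-Next's actual tokenizer. The `quantize` function, the toy `CODEBOOK`, and the feature values are all hypothetical, and the article does not say which quantization scheme the release uses.

```python
# Hypothetical illustration of discrete tokenization via vector quantization.
# This is NOT the LongCat-Next tokenizer; codebook and features are made up.
from math import dist


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for vec in features:
        nearest = min(range(len(codebook)), key=lambda i: dist(vec, codebook[i]))
        tokens.append(nearest)
    return tokens


# Toy codebook with three 2-D entries. A real codebook would be learned
# during training and would be far larger and higher-dimensional.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Toy features standing in for encoded image patches or audio frames.
features = [(0.1, 0.1), (0.9, 0.2), (0.2, 0.8)]

tokens = quantize(features, CODEBOOK)
print(tokens)  # [0, 1, 2]
```

Once image patches and audio frames are integer IDs like these, a model can consume them in the same sequence as text tokens, which is the general idea behind treating vision and speech as "native" inputs.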

Why This Matters for the Open-Source Ecosystem

The open-source AI world has been dominated by language models and image generators. LongCat-Next fills a missing piece: real-world perception.

For comparison:

LLaMA / DeepSeek / Qwen → text reasoning
Stable Diffusion / Flux → image generation
Whisper → speech recognition
LongCat-Next → unified perception + understanding

This is the model that could power the next generation of robotics, autonomous systems, and AR/VR — all without vendor lock-in.

Meanwhile: General 365 Benchmark Results Released

Alongside LongCat-Next, the team released the official results of their General 365 reasoning benchmark, evaluating 26 mainstream AI models.

The findings are sobering:

Model	General 365 Score
Gemini 3 Pro	62.8% 🥇
Claude Opus 4.8	~58%
GPT-5.5	~56%
DeepSeek V4	~55%
Majority of 26 models	Below 60%

The team set 60% as the "passing mark," and most of the 26 models fell short of it. Of the four scores listed above, only Gemini 3 Pro cleared the bar.
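As a quick check on the numbers, the short sketch below applies the 60% pass mark to the four scores reported in the table. It assumes a score at or above the mark counts as a pass, which the article does not specify. The "~" scores are approximate in the source and are entered here as round numbers, and `split_by_pass_mark` is a hypothetical helper name.

```python
# Scores as reported in the table above; "~" values are approximate in the source.
PASS_MARK = 60.0

REPORTED_SCORES = {
    "Gemini 3 Pro": 62.8,
    "Claude Opus 4.8": 58.0,  # reported as ~58%
    "GPT-5.5": 56.0,          # reported as ~56%
    "DeepSeek V4": 55.0,      # reported as ~55%
}


def split_by_pass_mark(scores, pass_mark=PASS_MARK):
    """Return (passed, failed) model names, each ordered by score, highest first.

    Assumption: a score equal to the pass mark counts as a pass.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    passed = [m for m in ranked if scores[m] >= pass_mark]
    failed = [m for m in ranked if scores[m] < pass_mark]
    return passed, failed


passed, failed = split_by_pass_mark(REPORTED_SCORES)
print("passed:", passed)
print("failed:", failed)
for model in failed:
    # Gap to the pass mark, in percentage points.
    print(f"{model}: {PASS_MARK - REPORTED_SCORES[model]:.1f} points short")
```

On these figures, the models listed below the mark miss it by roughly two to five percentage points.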

This benchmark focuses on complex logical reasoning — multi-step deduction, counterfactual thinking, and causal inference — not just knowledge retrieval. The results suggest that even frontier models have significant blind spots in genuine reasoning.

Monday AI Roundup: More June Stories Worth Watching

Apple's Siri Gets a Brain

At WWDC 2026, Apple unveiled a completely rebuilt Siri that can:

Understand on-screen context
Search messages and photos across apps
Execute multi-step actions (book a ride, send a message, add a calendar event)
Maintain cross-device conversation memory

With over 1 billion active iPhones, this is arguably the largest consumer AI agent deployment to date. Apple's privacy-first approach (on-device processing plus private cloud compute) gives it a trust advantage that competitors can't easily replicate.

Jeff Bezos Returns with Prometheus AI

For the first time since stepping down from Amazon, Jeff Bezos is backing a new AI venture — Prometheus — focused on industrial engineering.

Unlike chatbot companies, Prometheus aims to help engineers design, simulate, optimize, and manufacture physical products. If successful, this could be one of the most consequential AI applications outside of software.

Moonshot's Kimi Work: Multi-Agent Desktop

Chinese AI company Moonshot launched Kimi Work, a desktop platform that orchestrates hundreds of AI agents simultaneously for research, analysis, report generation, and workflow automation. Because it runs local-first, it may also appeal to organizations concerned about data privacy.

Google I/O 2026: Gemini Omni and Gemini 3.5 Flash

Google announced two new models at I/O:

Gemini Omni — can create anything from any input, starting with video. A leap in world understanding and multimodal editing.
Gemini 3.5 Flash — first in a new family combining frontier intelligence with action capabilities.

Full coverage: Google I/O 2026 Roundup

The Big Picture: June 2026's Two Axes

If you look across this week's news, two clear directions emerge:

Axis 1: Multimodal Perception — Models that see, hear, and understand the physical world (LongCat-Next, Gemini Omni, Apple's on-device AI)

Axis 2: Multi-Agent Orchestration — Systems that coordinate multiple AI agents for complex tasks (Kimi Work, Apple Siri, Hermes scheduled agents)

The companies that win the next phase won't be the ones with the best chatbot. They'll be the ones that bridge the gap between "AI that talks" and "AI that acts."

What's your take on the native multimodal approach vs. bolt-on vision?
