
Yuan3.0 Ultra

Prune a third of its parameters and get smarter — one of only three trillion-parameter open-source multimodal LLMs in the world
1.2k stars · Python · Yuan 3.0 Model License Agreement

#llm #moe #multimodal #enterprise-ai #rag #text-to-sql #reinforcement-learning #open-source #trillion-parameter #document-understanding #agent #chinese-ai

What is it?

Yuan3.0 Ultra is a trillion-parameter open-source multimodal foundation LLM released by YuanLab.ai in March 2026, and one of only three open-source multimodal models at the trillion-parameter scale globally. Its language backbone employs a Mixture-of-Experts (MoE) architecture with 103 Transformer layers, starting pre-training at 1515B parameters and compressed to 1010B via the novel Layer-Adaptive Expert Pruning (LAEP) algorithm, with 68.8B activated parameters and a 49% gain in pre-training efficiency. It further integrates a Localized Filtering-based Attention (LFA) mechanism and a Reflection Inhibition Reward Mechanism (RIRM) to reduce reasoning token waste by 14.38%. Against frontier models like DeepSeek-V3, GPT-5.2, and Kimi K2.5, Yuan3.0 Ultra achieves top scores on ChatRAG (68.2%), Docmatix (67.4%), and SummEval (62.8%), making it a best-in-class core engine for enterprise document-driven and data-driven Agent AI applications.

Pain Points vs Innovation

✕ Traditional Pain Points vs ✓ Innovative Solutions

  • ✕ Traditional trillion-parameter MoE models suffer from severe expert load imbalance during pre-training — the gap between highest- and lowest-load experts can reach 500x, wasting massive compute resources.
    ✓ LAEP Algorithm: Adaptively prunes low-load experts layer-by-layer during the stable pre-training phase and applies greedy expert rearrangement for balanced device load, achieving a 33.3% parameter reduction and a 49% efficiency gain simultaneously.
  • ✕ Reasoning-oriented models like DeepSeek-R1 exhibit overthinking behavior, generating excessive reflection tokens even after reaching a correct answer, driving up inference costs.
    ✓ Enhanced RIRM: Under the RAPO fast-thinking RL framework, reward constraints on reflection step count yield a 16.33% accuracy improvement and a 14.38% reduction in output token length, delivering gains in both quality and compute efficiency.
  • ✕ Most open-source LLMs underperform in enterprise-specific verticals such as RAG, Text-to-SQL, and table understanding, limiting direct adoption for financial reports or approval workflow processing.
    ✓ LFA Mechanism: Localized Filtering-based Attention models semantic relationships more effectively than classical Softmax Attention, especially in long-document and cross-modal scenarios.
  • ✕ Closed or semi-open models like Kimi K2.5 and GPT-5.2 cannot be privately deployed or fine-tuned, creating data security risks for enterprises handling sensitive internal knowledge.
    ✓ Fully Open Release: Model weights, technical report, SFT fine-tuning scripts, and RL training scripts are publicly available, enabling community retraining and enterprise customization.

Architecture Deep Dive

Unified Multimodal Architecture
Yuan3.0 Ultra adopts a three-component unified architecture consisting of a Vision Encoder, a Language Backbone, and a Multimodal Alignment Module, enabling end-to-end synergistic modeling of visual and linguistic information. The vision encoder maps raw image pixel sequences into visual token representations, while the alignment module serves as a semantic bridge between visual and language spaces to ensure cross-modal consistency. The language backbone, built on a 103-layer deep MoE Transformer, forms the architectural core with a 64K token context window, enabling direct processing of multi-page enterprise documents and cross-document knowledge retrieval.
LAEP: Layer-Adaptive Expert Pruning
LAEP is the most critical engineering innovation in Yuan3.0 Ultra, purpose-built for the pre-training stage of MoE LLMs. Its key insight is that pre-training can be divided into an Initial Transition Phase and a Stable Phase, during which expert token load becomes highly imbalanced, with the gap between highest- and lowest-load experts reaching up to 500x. LAEP monitors the per-expert token distribution layer-by-layer during the stable phase, adaptively identifying and pruning persistently low-load redundant experts, compressing total parameters from 1515B to 1010B with a 33.3% reduction. A greedy expert rearrangement algorithm then redistributes surviving experts across devices for balanced load, ultimately boosting overall pre-training efficiency by 49%, achieving a real compute utilization of 92.8 TFLOP/GPU.
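The monitor-prune-rearrange loop described above can be sketched in a few lines. This is an illustrative reconstruction, not the published implementation: the load threshold, keep ratio, and greedy bin-packing heuristic are assumptions for demonstration.

```python
def laep_prune(load_history, keep_ratio=2/3, low_load_threshold=0.25):
    """Sketch of layer-adaptive expert pruning for one MoE layer.

    load_history: list of per-step lists, where load_history[t][e] is the
    number of tokens routed to expert e at step t during the stable phase.
    Returns the sorted indices of experts to keep. Threshold and keep
    ratio are illustrative, not the paper's hyperparameters.
    """
    n_experts = len(load_history[0])
    steps = len(load_history)
    mean_load = [sum(step[e] for step in load_history) / steps
                 for e in range(n_experts)]
    overall = sum(mean_load) / n_experts
    rel = [m / overall for m in mean_load]  # 1.0 == perfectly balanced
    n_low = sum(1 for r in rel if r < low_load_threshold)
    # Prune only persistently low-load experts, never below the keep ratio.
    n_keep = max(round(n_experts * keep_ratio), n_experts - n_low)
    keep = sorted(range(n_experts), key=lambda e: -rel[e])[:n_keep]
    return sorted(keep)


def greedy_rearrange(kept_loads, n_devices):
    """Greedy rebalancing of surviving experts across devices: place the
    heaviest remaining expert on the currently lightest device."""
    bins = [[] for _ in range(n_devices)]
    totals = [0.0] * n_devices
    for e in sorted(range(len(kept_loads)), key=lambda i: -kept_loads[i]):
        d = totals.index(min(totals))  # lightest device so far
        bins[d].append(e)
        totals[d] += kept_loads[e]
    return bins, totals
```

With a 2/3 keep ratio this mirrors the 1515B → 1010B compression; the greedy pass keeps per-device totals nearly equal, which is what restores balanced utilization after pruning.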
LFA: Localized Filtering-based Attention
LFA is a structural replacement for classical Softmax Self-Attention, introducing localized filtering operations into the attention computation to enable finer modeling of local semantic relationships and suppress attention noise in long sequences. Compared to standard attention, LFA achieves higher accuracy on structured text such as tables, code, and SQL queries, as well as cross-modal alignment tasks. This is one key reason Yuan3.0 Ultra leads on MMTab and Text-to-SQL benchmarks. In 64K long-context scenarios, LFA also helps reduce the computational complexity of global attention, balancing precision and efficiency.
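The technical report's exact LFA formulation is not reproduced here; the sketch below follows the general Yuan-series idea of mixing each token with its immediate neighbours (a small depthwise convolution over the sequence axis) before attention, with projection matrices omitted for brevity — treat the kernel and structure as assumptions.

```python
import math

def local_filter(x, kernel=(0.25, 0.5, 0.25)):
    """Mix each token vector with its immediate neighbours (a depthwise
    1D convolution over the sequence axis) before attention."""
    seq, dim, k = len(x), len(x[0]), len(kernel)
    pad = k // 2
    xp = [[0.0] * dim] * pad + x + [[0.0] * dim] * pad  # zero-pad ends
    return [[sum(kernel[j] * xp[i + j][d] for j in range(k))
             for d in range(dim)]
            for i in range(seq)]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def lfa_attention(x):
    """Scaled dot-product attention over locally filtered queries and
    keys: the filtering injects a local inductive bias and damps
    attention noise from distant tokens in long sequences."""
    q, k = local_filter(x), local_filter(x)
    d = len(x[0])
    out = []
    for qi in q:
        scores = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                          for kj in k])
        out.append([sum(w * x[j][t] for j, w in enumerate(scores))
                    for t in range(d)])
    return out
```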
RIRM: Reflection Inhibition Reward
RIRM is the core alignment innovation introduced during Yuan3.0 Ultra’s RL post-training phase under the RAPO framework, designed to address the overthinking problem prevalent in fast-thinking RL models. Its mechanism introduces explicit reward constraints on reflection step count: continued reflection after reaching the first correct answer is penalized, while maintaining necessary reasoning depth on complex problems receives a positive reward. This bi-directional constraint simultaneously delivers a 16.33% improvement in training accuracy and a 14.38% reduction in output token length, a true Pareto improvement of fewer tokens and higher accuracy that significantly reduces enterprise inference costs.
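The bi-directional constraint can be sketched as a reward function: penalize reflection after the first correct answer, reward the depth needed to reach it on hard problems. All constants and the `is_correct` interface are illustrative assumptions, not values from the technical report.

```python
def rirm_reward(steps, is_correct, difficulty=1.0,
                base=1.0, reflection_penalty=0.1, depth_bonus=0.05):
    """Sketch of a reflection-inhibition reward. `steps` is the chain of
    reasoning steps; `is_correct(prefix)` reports whether the answer is
    already correct after that prefix. Constants are illustrative."""
    first_correct = None
    for i in range(1, len(steps) + 1):
        if is_correct(steps[:i]):
            first_correct = i
            break
    if first_correct is None:
        return 0.0  # never reached a correct answer: no reward
    # Penalize reflection steps emitted after the first correct answer,
    # but positively reward the depth needed to get there on hard problems.
    wasted = len(steps) - first_correct
    return base - reflection_penalty * wasted \
                + depth_bonus * difficulty * first_correct
```

Under this shape, a chain that stops at the first correct answer strictly out-scores an otherwise identical chain that keeps reflecting — the "fewer tokens, higher reward" direction the mechanism optimizes for.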
vLLM Inference and RLHF Training Stack
The Yuan3.0 Ultra open-source repository contains two core submodules: vllm and rlhf. The vllm submodule provides high-throughput inference adaptation based on the vLLM framework, supporting both bfloat16 and int4 quantized inference modes with tensor-parallel multi-GPU deployment to minimize inference latency. The rlhf submodule provides complete Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training scripts, enabling enterprises to perform domain adaptation and alignment training on Yuan3.0 Ultra using private datasets, serving as essential engineering infrastructure for industry-specific customization scenarios.

Deployment Guide

1. Clone the repository and install vLLM inference dependencies

bash
git clone https://github.com/Yuan-lab-LLM/Yuan3.0-Ultra.git
cd Yuan3.0-Ultra/vllm
pip install -r requirements.txt

2. Download model weights from ModelScope or HuggingFace (int4 quantized version recommended to reduce VRAM usage)

bash
# HuggingFace
huggingface-cli download YuanLabAI/Yuan3.0-Ultra-int4 --local-dir ./models/Yuan3.0-Ultra-int4

# Or ModelScope
modelscope download --model YuanLabAI/Yuan3.0-Ultra-int4 --local_dir ./models/Yuan3.0-Ultra-int4

3. Launch the multi-GPU inference service using vLLM (example: 4x A100 80G)

bash
python -m vllm.entrypoints.openai.api_server \
  --model ./models/Yuan3.0-Ultra-int4 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --port 8000

4. Test model inference via the OpenAI-compatible API endpoint

bash
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Yuan3.0-Ultra-int4",
    "messages": [{"role": "user", "content": "Analyze the anomalous data in this financial report."}],
    "max_tokens": 2048
  }'
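The same OpenAI-compatible endpoint can be called from Python. The snippet below uses only the standard library and assumes a server is running locally with the model name and port from the example above.

```python
import json
import urllib.request

def build_chat_request(model, prompt, max_tokens=2048):
    """Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url, model, prompt):
    """Send one chat request and return the assistant's reply text.
    Requires the vLLM server from step 3 to be listening at base_url."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (with the server running):
# chat("http://localhost:8000", "Yuan3.0-Ultra-int4",
#      "Analyze the anomalous data in this financial report.")
```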

5. Optional: Run enterprise SFT fine-tuning on private data using the rlhf submodule

bash
cd ../rlhf
bash scripts/run_sft.sh \
  --model_path ../models/Yuan3.0-Ultra-int4 \
  --data_path ./data/your_enterprise_dataset.json \
  --output_dir ./output/yuan_sft_finetuned

Use Cases

  • Core Scene: Enterprise Knowledge Base RAG QA System
    Target Audience: AI platform engineers at knowledge-intensive enterprises in finance, legal, and healthcare
    Solution: Leverage Yuan3.0 Ultra’s top-tier ChatRAG score of 68.2% to build multi-turn conversational enterprise knowledge Q&A systems that precisely retrieve internal documents and historical case records
    Outcome: Knowledge retrieval accuracy surpasses GPT-4o and Claude Opus 4.6, significantly reducing manual knowledge query costs while enabling compliance auditing and decision support
  • Core Scene: Multimodal Financial Report Auto-Parsing
    Target Audience: Finance departments and BI data teams at large enterprises
    Solution: Utilize Yuan3.0 Ultra’s LFA attention mechanism and 62.3% MMTab multimodal table understanding score to auto-parse mixed-layout quarterly and annual reports and approval forms, extracting key figures and anomaly indicators
    Outcome: Compresses report parsing that previously required hours of manual review to minute-level processing, reducing financial analysis labor costs and improving data accuracy
  • Core Scene: Natural-Language-Driven Database Query Platform
    Target Audience: Business analysts and operations staff without SQL programming skills
    Solution: Deploy Yuan3.0 Ultra as a Text-to-SQL engine with a Spider 1.0 benchmark score of 83.9%, outperforming DeepSeek V3.2 and Kimi K2.5, allowing business users to query enterprise data warehouses via natural language and auto-generate and execute SQL
    Outcome: Eliminates technical barriers, enabling self-service real-time data queries and report generation that multiplies data-driven decision-making efficiency

Limitations & Gotchas

  • Extremely high hardware requirements: The 16bit full-precision version requires approximately 2TB of VRAM, roughly 25 A100 80G GPUs; even the int4 quantized version demands at least 4 to 8 high-end GPUs, far exceeding the self-deployment capacity of most SMEs and making cloud inference services a practical necessity
  • Higher inference latency: With 68.8B activated parameters, single-request inference latency is significantly higher than 7B to 70B class models, making it unsuitable for real-time interactive scenarios such as live customer service and better suited for batch processing and async tasks
  • General reasoning not best-in-class: Benchmark data shows Yuan3.0 Ultra trails Gemini 3.1 Pro and Claude Opus 4.6 on BFCL V3 tool invocation, and its MATH-500 mathematical reasoning score also lags slightly behind some models optimized specifically for reasoning tasks
  • Multi-turn tool calling weakness: The Multi-turn context maintenance dimension on BFCL V3 scores only 45.3%, below Gemini 3.1 Pro and Claude Opus 4.6, which may cause context loss or instruction drift in complex multi-step Agent workflows
  • Nascent community ecosystem: The repository was open-sourced only in March 2026, and the surrounding toolchain, third-party integration plugins, and technical community discussions are still sparse compared to LLaMA and Qwen series, limiting available community resources when issues arise
  • License compliance risk: Uses a custom Yuan 3.0 Model License Agreement rather than standard Apache 2.0 or MIT, requiring enterprises to carefully review licensing terms before commercial deployment, particularly clauses concerning derivative distribution and overseas deployment restrictions

Frequently Asked Questions

Does LAEP pruning risk damaging the model’s specialized capabilities in certain domains? Can the pruned 33% of parameters be recovered?
This is the most fundamental architectural debate in community discussions. LAEP pruning occurs during the Stable Phase of pre-training, targeting experts that have been persistently low-load. These experts contribute almost no actual computation during the stable phase, making them structurally redundant rather than functionally redundant. Pruning is therefore not random cutting but evidence-based structural compression. Benchmark results show the compressed 1010B model outperforms earlier checkpoints across enterprise evaluation sets, confirming low-load experts were not contributing to modeling. The pruned parameters cannot be directly recovered because pruning is an irreversible structural change, but complete training scripts are provided, allowing enterprises to perform SFT on the 1010B base to supplement domain-specific capabilities.
Yuan3.0 Ultra’s 68.2% on ChatRAG dramatically outperforms Claude Opus 4.6 and GPT-5.2. Are there concerns about data contamination or self-evaluation bias?
This is one of the most frequently challenged questions on Reddit and Hacker News. ChatRAG is an open-source standard RAG evaluation suite from NVIDIA, comprising 10 subtasks from diverse sources with fully transparent dataset composition and evaluation methodology, leaving limited room for custom bias. Yuan3.0 Ultra ranked first on 9 out of 10 subtasks, with its strongest advantages on the hardest long-context retrieval tasks, which aligns closely with its architectural strengths of a 64K context window and LFA attention. However, the technical report was self-published by the team, and independent third-party reproduction tests have not yet appeared. Caution regarding the magnitude of this lead is warranted until external validation is available.
For enterprise RAG production deployments, how does Yuan3.0 Ultra compare to DeepSeek-V3 in practice?
Benchmark data clearly favors Yuan3.0 Ultra on ChatRAG and SummEval, showing a clear RAG advantage over DeepSeek-V3. However, practical production selection must weigh multiple dimensions. First, inference cost: DeepSeek-V3 activates about 37B parameters versus Yuan3.0 Ultra’s 68.8B, implying roughly 1.9x more compute per request and lower throughput on identical hardware. Second, ecosystem maturity: DeepSeek-V3 has more mature vLLM optimization, quantization support, and third-party framework integrations such as LangChain and LlamaIndex. Third, licensing: DeepSeek-V3 uses MIT while Yuan3.0 Ultra uses a custom agreement requiring additional overseas deployment compliance evaluation. Enterprises prioritizing RAG accuracy with sufficient compute should prefer Yuan3.0 Ultra, while cost- and ecosystem-constrained scenarios favor DeepSeek-V3.
How does RIRM balance avoiding overthinking with preserving complex reasoning capability? Is there a risk of prematurely truncating reasoning chains on difficult problems?
The RIRM reward function design is critical. It does not apply uniform penalties to all reflective behavior, but specifically penalizes continuing to reflect after already reaching a correct answer, while positively rewarding deep reasoning chains on complex problems. This means the reasoning chain can continue as long as the model has not yet reached its confidence threshold. However, there is a latent risk: the model’s confidence judgment is itself a learned soft decision, which in out-of-distribution or adversarial scenarios may lead to premature truncation where the model believes it is correct but is actually wrong. Benchmark data shows the mechanism is robust on mathematics and scientific reasoning, but for highly open-ended questions or domain transfer scenarios, domain-specific SFT is recommended to recalibrate confidence thresholds before production deployment.
How significant is the accuracy loss of the int4 quantized version versus 16bit? Can enterprises safely use int4 for mission-critical business applications?
Yuan3.0 Ultra provides both BF16 and int4 versions; int4 quantization compresses VRAM requirements from roughly 2TB to approximately 500GB, making multi-GPU A100 cluster deployment feasible. For 1000B plus ultra-large models, int4 PTQ typically introduces relatively small accuracy loss, usually within a 1 to 3% benchmark score range, because larger parameter counts make the relative impact of quantization noise smaller. However, the technical report does not provide explicit comparative benchmarks between 16bit and int4 versions, which remains an important information gap. For enterprise mission-critical applications such as financial compliance or medical report analysis, A/B testing on target tasks before full int4 deployment is strongly recommended rather than relying solely on general benchmark inference.
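The recommended A/B test can be as simple as running both precision variants over the same labelled evaluation set. The harness below is a minimal sketch: `model_a` and `model_b` are any callables from prompt to answer string (for example, thin wrappers around a BF16 and an int4 endpoint); names and structure are illustrative.

```python
def ab_compare(eval_set, model_a, model_b):
    """Run two model variants over a labelled eval set and report
    per-variant accuracy plus the items where they disagree.

    eval_set: iterable of (prompt, expected_answer) pairs.
    """
    a_correct = b_correct = 0
    disagreements = []
    for prompt, expected in eval_set:
        a, b = model_a(prompt), model_b(prompt)
        a_correct += int(a == expected)
        b_correct += int(b == expected)
        if a != b:
            # Keep disagreeing items for manual review: these are where
            # quantization noise is most likely to matter.
            disagreements.append((prompt, a, b))
    n = len(eval_set)
    return {
        "a_acc": a_correct / n,
        "b_acc": b_correct / n,
        "disagreements": disagreements,
    }
```

For mission-critical tasks the disagreement list is often more informative than the aggregate accuracies, since it surfaces exactly which inputs the quantized variant handles differently.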
Yuan3.0 Ultra outperforms Kimi K2.5 and DeepSeek V3.2 on Text-to-SQL in Spider, but trails Kimi K2.5 on BIRD. Why?
Spider 1.0 and BIRD differ fundamentally in task design. Spider focuses on syntactic correctness and standard SQL pattern recognition, making it closer to a knowledge memorization evaluation. BIRD introduces real database noise, ambiguous column names, and multi-hop reasoning requirements, making it much closer to production-style reasoning. Yuan3.0 Ultra’s Spider lead demonstrates exceptional SQL syntax generation and Schema Linking capability, while its weaker result on BIRD reveals somewhat lower robustness when handling data noise and semantic ambiguity. This explains why database governance quality, such as naming conventions and field annotation completeness, is decisive for real-world Text-to-SQL performance. Yuan3.0 Ultra excels on well-structured schemas, while performance gaps widen in the presence of heavy legacy noise.
What are the core differences between Yuan 3.0’s custom license and Apache 2.0 or MIT? What legal risks should enterprises consider for commercial use?
The Yuan 3.0 Model License Agreement permits commercial use without requiring prior authorization, which is more permissive than some academically restrictive licenses. However, compared to Apache 2.0 or MIT, several key constraints exist. First, it prohibits uses that may harm the nation or society, a broadly worded clause with uncertain legal interpretation in some jurisdictions. Second, derivative model distribution terms must be reviewed carefully for whether they require preserving original license references. Third, restrictions on services that have not undergone safety assessment and registration may conflict with local regulations in overseas deployments. Legal teams should conduct a clause-by-clause compliance review, especially for enterprises planning deployments in the EU or US markets.
How does Yuan3.0 Ultra compare to Qwen3-235B-A22B in enterprise Agent tool-calling scenarios, and what are the fundamental architectural trade-offs?
From BFCL V3 benchmark data, Qwen3-235B-A22B scores 68.0%, marginally ahead of Yuan3.0 Ultra at 67.8%, but the two models show sharply different sub-dimension profiles. Qwen3 leads on Relevance, indicating higher tool selection accuracy, while Yuan3.0 Ultra is stronger on Irrelevance Detection, meaning it more reliably refuses tool calls when it should not make them. Architecturally, Qwen3-235B-A22B activates 22B parameters versus Yuan3.0 Ultra’s 68.8B, giving it clearer inference efficiency advantages and higher concurrency per unit of compute. Yuan3.0 Ultra’s 64K context window versus Qwen3’s 32K gives it a stronger position on long-document Agent tasks. Concurrency-sensitive Agent platforms should prefer Qwen3, while long-document processing and strict tool-refusal scenarios favor Yuan3.0 Ultra.

Project Metrics

Stars: 1.2k
Language: Python
License: Yuan 3.0 Model License Agreement
Deploy Difficulty: Hard

Table of Contents

  1. What is it?
  2. Pain Points vs Innovation
  3. Architecture Deep Dive
  4. Deployment Guide
  5. Use Cases
  6. Limitations & Gotchas
  7. Frequently Asked Questions

Related Projects

  • Awesome LLM Apps — 96.4k · Python
  • RAG_Techniques — 25.5k · Jupyter Notebook
  • DeerFlow — ByteDance Open-Source SuperAgent Harness — 26.1k · Python
  • gstack — 0 · TypeScript