A.I. Smart Router
Intelligently routes requests to the optimal AI model via tiered classification, automatic failover handling, and cost optimization.
How It Works (Silent by Default)
The router operates transparently: users send messages normally and get a response from the model best suited to the task. No special commands are required.

Optional visibility: include [show routing] in any message to see the routing decision.
Tiered Classification System
The router uses a three-tier decision process:
┌─────────────────────────────────────────────────────────────────┐
│ TIER 1: INTENT DETECTION │
│ Classify the primary purpose of the request │
├─────────────────────────────────────────────────────────────────┤
│ CODE │ ANALYSIS │ CREATIVE │ REALTIME │ GENERAL │
│ write/debug │ research │ writing │ news/live │ Q&A/chat │
│ refactor │ explain │ stories │ X/Twitter │ translate │
│ review │ compare │ brainstorm │ prices │ summarize │
└──────┬───────┴──────┬──────┴─────┬──────┴─────┬─────┴─────┬─────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ TIER 2: COMPLEXITY ESTIMATION │
├─────────────────────────────────────────────────────────────────┤
│ SIMPLE (Tier $) │ MEDIUM (Tier $$) │ COMPLEX (Tier $$$)│
│ • One-step task │ • Multi-step task │ • Deep reasoning │
│ • Short response OK │ • Some nuance │ • Extensive output│
│ • Factual lookup │ • Moderate context │ • Critical task │
│ → Haiku/Flash │ → Sonnet/Grok/GPT │ → Opus/GPT-5 │
└──────────────────────────┴─────────────────────┴───────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ TIER 3: SPECIAL CASE OVERRIDES │
├─────────────────────────────────────────────────────────────────┤
│ CONDITION │ OVERRIDE TO │
│ ─────────────────────────────────────┼─────────────────────────│
│ Context >100K tokens │ → Gemini Pro (1M ctx) │
│ Context >500K tokens │ → Gemini Pro ONLY │
│ Needs real-time data │ → Grok (regardless) │
│ Image/vision input │ → Opus or Gemini Pro │
│ User explicit override │ → Requested model │
└──────────────────────────────────────┴──────────────────────────┘
Intent Detection Patterns
CODE Intent
- Keywords: write, code, debug, fix, refactor, implement, function, class, script, API, error, bug, compile, test, PR, commit
- File extensions mentioned: .py, .js, .ts, .go, .rs, .java, etc.
- Code blocks in the input
ANALYSIS Intent
- Keywords: analyze, explain, compare, research, understand, why, how, evaluate, assess, review, investigate, examine
- Long-form questions
- "Help me understand..."
CREATIVE Intent
- Keywords: write (story/poem/essay), create, brainstorm, imagine, design, draft, ideate
- Fiction/narrative requests
- Marketing/copywriting requests
REALTIME Intent
- Keywords: now, today, current, latest, trending, news, happening, live, price, score, weather
- X/Twitter mentions
- Stock/crypto tickers
- Sports scores
GENERAL Intent (default)
- Simple Q&A
- Translation
- Summarization
- Conversational chat
MIXED Intent (multiple intents detected)
When a request contains several clear intents (e.g., "write code to analyze this data and explain it creatively"):
- Identify the primary intent — what is the main deliverable?
- Route to the most capable model — mixed tasks demand versatility
- Default to COMPLEX complexity — multiple intents = multiple steps
Examples:
- "Write code and explain how it works" → CODE (primary) + ANALYSIS → route to Opus
- "Summarize this and get the latest news about it" → realtime wins → Grok
- "Write a story based on real current events" → REALTIME + CREATIVE → Grok (realtime wins)
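As a sketch, the keyword patterns above can be implemented with word-boundary matching plus a language-agnostic check for code signals. The keyword lists are abbreviated and illustrative, not the router's full pattern set:

```python
import re

# Abbreviated keyword lists (illustrative — the real router includes more patterns)
INTENT_KEYWORDS = {
    "CODE":     ["debug", "fix", "refactor", "implement", "function", "class", "api", "bug", "test"],
    "ANALYSIS": ["analyze", "explain", "compare", "research", "why", "evaluate", "review"],
    "CREATIVE": ["story", "poem", "essay", "brainstorm", "imagine", "draft"],
    "REALTIME": ["now", "today", "latest", "news", "live", "price", "score", "weather"],
}

# Code blocks and file extensions are language-agnostic signals
CODE_SIGNALS = re.compile(r"```|\.(py|js|ts|go|rs|java)\b")

def classify_intent(request: str) -> str:
    """Tier 1: pick the intent with the most keyword hits; GENERAL by default."""
    text = request.lower()
    if CODE_SIGNALS.search(text):
        return "CODE"
    scores = {intent: sum(bool(re.search(rf"\b{re.escape(kw)}\b", text)) for kw in kws)
              for intent, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "GENERAL"
```

Note the `\b` boundaries: without them, "latest" would spuriously match the CODE keyword "test".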
Language Handling
Non-English requests are handled normally — all supported models are multilingual:
| Model | Non-English Support |
|---|---|
| Opus/Sonnet/Haiku | Excellent (100+ languages) |
| GPT-5 | Excellent (100+ languages) |
| Gemini Pro/Flash | Excellent (100+ languages) |
| Grok | Good (major languages) |
Intent detection still works because:
- Keyword patterns include common non-English equivalents
- CODE intent is detected via file extensions and code blocks (language-agnostic)
- Complexity is estimated from query length (works across languages)
Edge case: if the intent is unclear due to language, default to GENERAL intent and MEDIUM complexity.
Complexity Signals
SIMPLE ($)
- Short query (<50 words)
- Single question mark
- "quick question", "just tell me", "briefly"
- Yes/no format
- Unit conversions, definitions
MEDIUM ($$)
- Moderate-length query (50-200 words)
- Multiple aspects involved
- "explain", "describe", "compare"
- Some context provided
COMPLEX ($$$)
- Long query (>200 words) or a complex task
- "step by step", "thoroughly", "in detail"
- Multi-part questions
- Critical/important qualifiers
- Research, analysis, or creative work
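These signals combine into a simple heuristic. The thresholds come from the lists above; the exact weighting of signal phrases is an illustrative assumption:

```python
def estimate_complexity(request: str) -> str:
    """Tier 2: word count plus explicit signal phrases (illustrative heuristic)."""
    words = len(request.split())
    text = request.lower()
    complex_signals = ["step by step", "thoroughly", "in detail", "critical", "important"]
    simple_signals = ["quick question", "just tell me", "briefly"]
    if words > 200 or any(s in text for s in complex_signals):
        return "COMPLEX"
    if words < 50 and (any(s in text for s in simple_signals) or request.count("?") <= 1):
        return "SIMPLE"
    return "MEDIUM"
```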
Routing Matrix
| Intent | SIMPLE | MEDIUM | COMPLEX |
|---|---|---|---|
| CODE | Sonnet | Opus | Opus |
| ANALYSIS | Flash | GPT-5 | Opus |
| CREATIVE | Sonnet | Opus | Opus |
| REALTIME | Grok | Grok | Grok-3 |
| GENERAL | Flash | Sonnet | Opus |
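The matrix translates directly into a lookup table. The lowercase model handles and the `sonnet` default for unknown pairs are illustrative assumptions:

```python
ROUTING_MATRIX = {
    # (intent, complexity): preferred model — mirrors the table above
    ("CODE", "SIMPLE"): "sonnet",     ("CODE", "MEDIUM"): "opus",      ("CODE", "COMPLEX"): "opus",
    ("ANALYSIS", "SIMPLE"): "flash",  ("ANALYSIS", "MEDIUM"): "gpt-5", ("ANALYSIS", "COMPLEX"): "opus",
    ("CREATIVE", "SIMPLE"): "sonnet", ("CREATIVE", "MEDIUM"): "opus",  ("CREATIVE", "COMPLEX"): "opus",
    ("REALTIME", "SIMPLE"): "grok",   ("REALTIME", "MEDIUM"): "grok",  ("REALTIME", "COMPLEX"): "grok-3",
    ("GENERAL", "SIMPLE"): "flash",   ("GENERAL", "MEDIUM"): "sonnet", ("GENERAL", "COMPLEX"): "opus",
}

def select_model(intent: str, complexity: str) -> str:
    """Look up the preferred model; fall back to a safe default for unknown pairs."""
    return ROUTING_MATRIX.get((intent, complexity), "sonnet")
```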
Token Exhaustion & Automatic Model Switching
When a model becomes unavailable mid-session (token quota exhausted, rate limit reached, API error), the router automatically switches to the next best available model and notifies the user.
Notification Format
When a model switch occurs due to exhaustion, the user is notified:
┌─────────────────────────────────────────────────────────────────┐
│ ⚠️ MODEL SWITCH NOTICE │
│ │
│ Your request could not be completed on claude-opus-4-5 │
│ (reason: token quota exhausted). │
│ │
│ ✅ Request completed using: anthropic/claude-sonnet-4-5 │
│ │
│ The response below was generated by the fallback model. │
└─────────────────────────────────────────────────────────────────┘
Switch Reasons
| Reason | Description |
|---|---|
| Token quota exhausted | Daily/monthly token limit reached |
| Rate limit exceeded | Too many requests per minute |
| Context window exceeded | Input too large for the model |
| API timeout | Model took too long to respond |
| API error | Provider returned an error |
| Model unavailable | Model temporarily offline |
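The handler code in this document catches a typed exception per switch reason (`TokenQuotaExhausted`, `RateLimitExceeded`, etc.). A minimal exception hierarchy matching those names — assumed here, not a published API — might look like:

```python
class ModelSwitchError(Exception):
    """Base class for conditions that trigger fallback to the next model."""

class TokenQuotaExhausted(ModelSwitchError): ...
class RateLimitExceeded(ModelSwitchError): ...
class ContextWindowExceeded(ModelSwitchError): ...
class APITimeout(ModelSwitchError): ...

class APIError(ModelSwitchError):
    """Provider returned an error; keeps the code for the notification text."""
    def __init__(self, code: int, message: str = ""):
        super().__init__(f"API error {code}: {message}")
        self.code = code
```

A shared base class lets generic callers catch `ModelSwitchError` while the fallback loop still distinguishes reasons.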
Implementation

```python
def execute_with_fallback(primary_model: str, fallback_chain: list[str], request: str) -> Response:
    """
    Execute request with automatic fallback and user notification.
    """
    attempted_models = []
    switch_reason = None
    # Try primary model first
    models_to_try = [primary_model] + fallback_chain
    for model in models_to_try:
        try:
            response = call_model(model, request)
            # If we switched models, prepend notification
            if attempted_models:
                notification = build_switch_notification(
                    failed_model=attempted_models[0],
                    reason=switch_reason,
                    success_model=model
                )
                return Response(
                    content=notification + "\n\n---\n\n" + response.content,
                    model_used=model,
                    switched=True
                )
            return Response(content=response.content, model_used=model, switched=False)
        except TokenQuotaExhausted:
            attempted_models.append(model)
            switch_reason = "token quota exhausted"
            log_fallback(model, switch_reason)
            continue
        except RateLimitExceeded:
            attempted_models.append(model)
            switch_reason = "rate limit exceeded"
            log_fallback(model, switch_reason)
            continue
        except ContextWindowExceeded:
            attempted_models.append(model)
            switch_reason = "context window exceeded"
            log_fallback(model, switch_reason)
            continue
        except APITimeout:
            attempted_models.append(model)
            switch_reason = "API timeout"
            log_fallback(model, switch_reason)
            continue
        except APIError as e:
            attempted_models.append(model)
            switch_reason = f"API error: {e.code}"
            log_fallback(model, switch_reason)
            continue
    # All models exhausted
    return build_exhaustion_error(attempted_models)

def build_switch_notification(failed_model: str, reason: str, success_model: str) -> str:
    """Build user-facing notification when model switch occurs."""
    return f"""⚠️ **MODEL SWITCH NOTICE**

Your request could not be completed on `{failed_model}` (reason: {reason}).

✅ **Request completed using:** `{success_model}`

The response below was generated by the fallback model."""

def build_exhaustion_error(attempted_models: list[str]) -> Response:
    """Build error when all models are exhausted."""
    models_tried = ", ".join(attempted_models)
    return Response(
        content=f"""❌ **REQUEST FAILED**

Unable to complete your request. All available models have been exhausted.

**Models attempted:** {models_tried}

**What you can do:**
1. **Wait** — Token quotas typically reset hourly or daily
2. **Simplify** — Try a shorter or simpler request
3. **Check status** — Run `/router status` to see model availability

If this persists, your human may need to check API quotas or add additional providers.""",
        model_used=None,
        switched=False,
        failed=True
    )
```
Fallback Priority on Token Exhaustion
When a model is exhausted, the router selects the next best model for the same task type:
| Original Model | Fallback Priority (equivalent capability) |
|---|---|
| Opus | Sonnet → GPT-5 → Grok-3 → Gemini Pro |
| Sonnet | GPT-5 → Grok-3 → Opus → Haiku |
| GPT-5 | Sonnet → Opus → Grok-3 → Gemini Pro |
| Gemini Pro | Flash → GPT-5 → Opus → Sonnet |
| Grok-2/3 | (warn: no real-time fallback available) |
User Confirmation
After a model switch, the agent should note in its response:
- The original model was unavailable
- Response quality may differ from the original model's typical output
This ensures transparency and sets appropriate expectations.
Streaming Responses and Fallback
Streaming responses need special consideration for fallback handling:

```python
async def execute_with_streaming_fallback(primary_model: str, fallback_chain: list[str], request: str):
    """
    Handle streaming responses with mid-stream fallback.

    If a model fails DURING streaming (not before), the partial response is lost.
    Strategy: Don't start streaming until first chunk received successfully.
    """
    models_to_try = [primary_model] + fallback_chain
    for model in models_to_try:
        try:
            # Test with non-streaming ping first (optional, adds latency)
            # await test_model_availability(model)

            # Start streaming
            stream = await call_model_streaming(model, request)
            first_chunk = await stream.get_first_chunk(timeout=10_000)  # 10s timeout for first chunk

            # If we got here, model is responding — continue streaming
            yield first_chunk
            async for chunk in stream:
                yield chunk
            return  # Success
        except (FirstChunkTimeout, StreamError) as e:
            log_fallback(model, str(e))
            continue  # Try next model

    # All models failed
    yield build_exhaustion_error(models_to_try)
```

Key insight: wait for a model's first chunk before committing to it. If the first chunk times out, fall back before showing the user any partial response.
Retry Timing Configuration

```python
RETRY_CONFIG = {
    "initial_timeout_ms": 30_000,        # 30s for first attempt
    "fallback_timeout_ms": 20_000,       # 20s for fallback attempts (faster fail)
    "max_retries_per_model": 1,          # Don't retry same model
    "backoff_multiplier": 1.5,           # Not used (no same-model retry)
    "circuit_breaker_threshold": 3,      # Failures before skipping model entirely
    "circuit_breaker_reset_ms": 300_000  # 5 min before trying failed model again
}
```

Circuit breaker: if a model fails 3 times within 5 minutes, skip it entirely for the next 5 minutes. This avoids hammering a service that is already down.
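A minimal circuit breaker honoring these thresholds could look like the sketch below. The injectable `clock` parameter is an assumption added for testability, not part of RETRY_CONFIG:

```python
import time

class CircuitBreaker:
    """Skip a model after `threshold` failures; allow retries after `reset_s` seconds."""

    def __init__(self, threshold: int = 3, reset_s: float = 300.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_s = reset_s
        self.clock = clock
        self.failures: dict[str, list[float]] = {}  # model -> failure timestamps

    def record_failure(self, model: str) -> None:
        self.failures.setdefault(model, []).append(self.clock())

    def is_open(self, model: str) -> bool:
        """True = circuit open = skip this model for now."""
        now = self.clock()
        recent = [t for t in self.failures.get(model, []) if now - t < self.reset_s]
        self.failures[model] = recent  # drop stale failures
        return len(recent) >= self.threshold
```

The fallback loop would check `is_open(model)` before each attempt and call `record_failure(model)` on each caught exception.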
Fallback Chains
When the preferred model fails (rate limited, API down, errors, etc.), cascade to the next option:
CODE tasks
Opus → Sonnet → GPT-5 → Gemini Pro
ANALYSIS tasks
Opus → GPT-5 → Gemini Pro → Sonnet
CREATIVE tasks
Opus → GPT-5 → Sonnet → Gemini Pro
REALTIME tasks
Grok-2 → Grok-3 → (warn: no real-time fallback)
GENERAL tasks
Flash → Haiku → Sonnet → GPT-5
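These chains can live in one table. The routing code elsewhere in this document references `MASTER_FALLBACK_CHAINS`; the dict below is a sketch of its likely shape, filtered against whatever models are actually available:

```python
MASTER_FALLBACK_CHAINS = {
    "code":     ["opus", "sonnet", "gpt-5", "gemini-pro"],
    "analysis": ["opus", "gpt-5", "gemini-pro", "sonnet"],
    "creative": ["opus", "gpt-5", "sonnet", "gemini-pro"],
    "realtime": ["grok-2", "grok-3"],   # warn: no real-time fallback beyond these
    "general":  ["flash", "haiku", "sonnet", "gpt-5"],
}

def fallback_chain(task_type: str, available: set[str]) -> list[str]:
    """Return the chain for a task type, keeping only models that are available."""
    return [m for m in MASTER_FALLBACK_CHAINS.get(task_type, []) if m in available]
```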
Long-Context Handling (tiered by size)
┌─────────────────────────────────────────────────────────────────┐
│ LONG CONTEXT FALLBACK CHAIN │
├─────────────────────────────────────────────────────────────────┤
│ TOKEN COUNT │ FALLBACK CHAIN │
│ ───────────────────┼───────────────────────────────────────────│
│ 128K - 200K │ Opus (200K) → Sonnet (200K) → Gemini Pro │
│ 200K - 1M │ Gemini Pro → Flash (1M) → ERROR_MESSAGE │
│ > 1M │ ERROR_MESSAGE (no model supports this) │
└─────────────────────┴───────────────────────────────────────────┘
Implementation:

```python
def handle_long_context(token_count: int, available_models: dict) -> str | ErrorMessage:
    """Route long-context requests with graceful degradation."""
    # Tier 1: 128K - 200K tokens (Opus/Sonnet can handle)
    if token_count <= 200_000:
        for model in ["opus", "sonnet", "haiku", "gemini-pro", "flash"]:
            if model in available_models and get_context_limit(model) >= token_count:
                return model
    # Tier 2: 200K - 1M tokens (only Gemini)
    elif token_count <= 1_000_000:
        for model in ["gemini-pro", "flash"]:
            if model in available_models:
                return model
    # Tier 3: > 1M tokens (nothing available)
    # Fall through to error

    # No suitable model found — return helpful error
    return build_context_error(token_count, available_models)

def build_context_error(token_count: int, available_models: dict) -> ErrorMessage:
    """Build a helpful error message when no model can handle the input."""
    # Find the largest available context window
    max_available = max(
        (get_context_limit(m) for m in available_models),
        default=0
    )
    # Determine what's missing
    missing_models = []
    if "gemini-pro" not in available_models and "flash" not in available_models:
        missing_models.append("Gemini Pro/Flash (1M context)")
    if token_count <= 200_000 and "opus" not in available_models:
        missing_models.append("Opus (200K context)")
    # Format token count for readability
    if token_count >= 1_000_000:
        token_display = f"{token_count / 1_000_000:.1f}M"
    else:
        token_display = f"{token_count // 1000}K"
    return ErrorMessage(
        title="Context Window Exceeded",
        message=f"""Your input is approximately **{token_display} tokens**, which exceeds the context window of all currently available models.

**Required:** Gemini Pro (1M context) {"— currently unavailable" if "gemini-pro" not in available_models else ""}
**Your max available:** {max_available // 1000}K tokens

**Options:**
1. **Wait and retry** — Gemini may be temporarily down
2. **Reduce input size** — Remove unnecessary content to fit within {max_available // 1000}K tokens
3. **Split into chunks** — I can process your input sequentially in smaller pieces

Would you like me to help split this into manageable chunks?""",
        recoverable=True,
        suggested_action="split_chunks"
    )
```
Example error output:
⚠️ Context Window Exceeded
Your input is approximately **340K tokens**, which exceeds the context
window of all currently available models.
Required: Gemini Pro (1M context) — currently unavailable
Your max available: 200K tokens
Options:
1. Wait and retry — Gemini may be temporarily down
2. Reduce input size — Remove unnecessary content to fit within 200K tokens
3. Split into chunks — I can process your input sequentially in smaller pieces
Would you like me to help split this into manageable chunks?
Dynamic Model Discovery
The router detects available providers automatically at runtime:
1. Check configured auth profiles
2. Build available model list from authenticated providers
3. Construct routing table using ONLY available models
4. If preferred model unavailable, use best available alternative
Example: if only Anthropic and Google are configured:
- CODE tasks → Opus (Anthropic available ✓)
- REALTIME tasks → ⚠️ no Grok → fall back to Opus and warn the user
- Long documents → Gemini Pro (Google available ✓)
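A sketch of the discovery step, assuming a simple provider-to-models map and boolean auth-profile results (both hypothetical — the real auth-profile format is not specified here):

```python
# Hypothetical static map of which models each provider serves
PROVIDER_MODELS = {
    "anthropic": ["opus", "sonnet", "haiku"],
    "openai":    ["gpt-5"],
    "google":    ["gemini-pro", "flash"],
    "xai":       ["grok-2", "grok-3"],
}

def discover_models(auth_profiles: dict[str, bool]) -> set[str]:
    """Build the available-model set from providers with valid credentials."""
    return {model
            for provider, authenticated in auth_profiles.items() if authenticated
            for model in PROVIDER_MODELS.get(provider, [])}
```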
Cost Optimization
When complexity is low, the router factors in cost:
| Model | Cost Tier | Use Case |
|---|---|---|
| Gemini Flash | $ | Simple tasks, high volume |
| Claude Haiku | $ | Simple tasks, fast responses |
| Claude Sonnet | $$ | Medium complexity |
| Grok 2 | $$ | Real-time needs only |
| GPT-5 | $$ | General-purpose fallback |
| Gemini Pro | $$$ | Long-context needs |
| Claude Opus | $$$$ | Complex/critical tasks |
Rule: never use Opus ($$$$) for a task that Flash ($) can handle.
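One way to encode this rule is to tag each model with its tier and always prefer the cheapest capable candidate. The tier map mirrors the table above; `cheapest_capable` is an illustrative helper, not part of a published API:

```python
# Cost tiers from the table above (tier string length doubles as a sort key)
COST_TIERS = {
    "flash": "$", "haiku": "$",
    "sonnet": "$$", "grok-2": "$$", "gpt-5": "$$",
    "gemini-pro": "$$$",
    "opus": "$$$$",
}

def cheapest_capable(candidates: list[str]) -> str:
    """Among capable candidates, pick the one in the lowest cost tier."""
    # Unknown models sort as most expensive, so they are never preferred
    return min(candidates, key=lambda m: len(COST_TIERS.get(m, "$$$$$")))
```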
User Controls
Show the Routing Decision
Add [show routing] to any message:
[show routing] What's the weather in NYC?
The output includes:
[Routed → xai/grok-2-latest | Reason: REALTIME intent detected | Fallback: none available]
Force a Specific Model
Explicit overrides:
- "use grok: ..." → force Grok
- "use claude: ..." → force Opus
- "use gemini: ..." → force Gemini Pro
- "use flash: ..." → force Gemini Flash
- "use gpt: ..." → force GPT-5
Check Router Status
Ask "router status" or "/router" to see:
- Available providers
- Configured models
- Current routing table
- Recent routing decisions
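A status report could be rendered from the discovered state. The exact layout below is illustrative, not a specified output format:

```python
def build_status_report(available: dict[str, bool], recent_decisions: list[str]) -> str:
    """Render a `/router status`-style report (format is illustrative)."""
    lines = ["ROUTER STATUS", ""]
    for model, up in sorted(available.items()):
        lines.append(f"  {'✓' if up else '✗'} {model}")
    lines.append("")
    lines.append("Recent decisions:")
    lines.extend(f"  - {d}" for d in recent_decisions[-5:])  # last five only
    return "\n".join(lines)
```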
Implementation Notes
For the Agent Implementation
When processing a request:
1. DETECT available models (check auth profiles)
2. CLASSIFY intent (code/analysis/creative/realtime/general)
3. ESTIMATE complexity (simple/medium/complex)
4. CHECK special cases (context size, vision, explicit override)
5. FILTER by cost tier based on complexity ← BEFORE model selection
6. SELECT model from filtered pool using routing matrix
7. VERIFY model available, else use fallback chain (also cost-filtered)
8. EXECUTE request with selected model
9. IF failure, try next in fallback chain
10. LOG routing decision (for debugging)
Cost-Aware Routing Flow (critical ordering)

```python
def route_with_fallback(request):
    """
    Main routing function with CORRECT execution order.

    Cost filtering MUST happen BEFORE routing table lookup.
    """
    # Step 1: Discover available models
    available_models = discover_providers()

    # Step 2: Classify intent
    intent = classify_intent(request)

    # Step 3: Estimate complexity
    complexity = estimate_complexity(request)
    token_count = estimate_tokens(request)  # (assumed helper: estimates input size in tokens)

    # Step 4: Check special-case overrides (these bypass cost filtering)
    if user_override := get_user_model_override(request):
        return execute_with_fallback(user_override, [], request)  # No cost filter for explicit override
    if token_count > 128_000:
        return handle_long_context(token_count, available_models)  # Special handling
    if needs_realtime(request):
        return execute_with_fallback("grok-2", ["grok-3"], request)  # Realtime bypasses cost

    # ┌─────────────────────────────────────────────────────────────┐
    # │ STEP 5: FILTER BY COST TIER — THIS MUST COME FIRST!         │
    # │                                                             │
    # │ Cost filtering happens BEFORE the routing table lookup,     │
    # │ NOT after. This ensures "what's 2+2?" never considers       │
    # │ Opus even momentarily.                                      │
    # └─────────────────────────────────────────────────────────────┘
    allowed_tiers = get_allowed_tiers(complexity)
    # SIMPLE  → ["$"]
    # MEDIUM  → ["$", "$$"]
    # COMPLEX → ["$", "$$", "$$$", "$$$$"]
    cost_filtered_models = {
        model: meta for model, meta in available_models.items()
        if COST_TIERS.get(model) in allowed_tiers
    }

    # Step 6: NOW select from cost-filtered pool using routing preferences
    preferences = ROUTING_PREFERENCES.get((intent, complexity), [])
    for model in preferences:
        if model in cost_filtered_models:  # Only consider cost-appropriate models
            selected_model = model
            break
    else:
        # No preferred model in cost-filtered pool — use cheapest available
        selected_model = select_cheapest(cost_filtered_models)

    # Step 7: Build cost-filtered fallback chain
    task_type = get_task_type(intent, complexity)
    full_chain = MASTER_FALLBACK_CHAINS.get(task_type, [])
    filtered_chain = [m for m in full_chain if m in cost_filtered_models and m != selected_model]

    # Step 8-10: Execute with fallback + logging
    return execute_with_fallback(selected_model, filtered_chain, request)

def get_allowed_tiers(complexity: str) -> list[str]:
    """Return allowed cost tiers for a given complexity level."""
    return {
        "SIMPLE": ["$"],                        # Budget only — no exceptions
        "MEDIUM": ["$", "$$"],                  # Budget + standard
        "COMPLEX": ["$", "$$", "$$$", "$$$$"],  # All tiers — complex tasks deserve the best
    }.get(complexity, ["$", "$$"])

# Example flow for "what's 2+2?":
#
# 1. available_models = {opus, sonnet, haiku, flash, grok-2, ...}
# 2. intent = GENERAL
# 3. complexity = SIMPLE
# 4. (no special cases)
# 5. allowed_tiers = ["$"]  ← SIMPLE means $ only
#    cost_filtered_models = {haiku, flash, grok-2}  ← Opus/Sonnet EXCLUDED
# 6. preferences for (GENERAL, SIMPLE) = [flash, haiku, grok-2, sonnet]
#    first match in cost_filtered = flash ✓
# 7. fallback_chain = [haiku, grok-2]  ← Also cost-filtered
# 8. execute with flash
#
# Result: Opus is NEVER considered, not even momentarily.
```
Cost Optimization: Two Approaches
┌─────────────────────────────────────────────────────────────────┐
│ COST OPTIMIZATION IMPLEMENTATION OPTIONS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ APPROACH 1: Explicit filter_by_cost() (shown above) │
│ ───────────────────────────────────────────────────────────── │
│ • Calls get_allowed_tiers(complexity) explicitly │
│ • Filters available_models BEFORE routing table lookup │
│ • Most defensive — impossible to route wrong tier │
│ • Recommended for security-critical deployments │
│ │
│ APPROACH 2: Preference ordering (implicit) │
│ ───────────────────────────────────────────────────────────── │
│ • ROUTING_PREFERENCES lists cheapest capable models first │
│ • For SIMPLE tasks: [flash, haiku, grok-2, sonnet] │
│ • First available match wins → naturally picks cheapest │
│ • Simpler code, relies on correct preference ordering │
│ │
│ This implementation uses BOTH for defense-in-depth: │
│ • Preference ordering provides first line of cost awareness │
│ • Explicit filter_by_cost() guarantees tier enforcement │
│ │
│ For alternative implementations that rely solely on │
│ preference ordering, see references/models.md for the │
│ filter_by_cost() function if explicit enforcement is needed. │
│ │
└─────────────────────────────────────────────────────────────────┘
Spawning with a Different Model
Use sessions_spawn for model routing:

```
sessions_spawn(
    task: "user's request",
    model: "selected/model-id",
    label: "task-type-query"
)
```

Security
- Never send sensitive data to untrusted models
- API keys are handled only via environment/auth profiles
- See references/security.md for full security guidelines
Model Details
See references/models.md for detailed capabilities and pricing.

