A.I. Smart Router

2026-03-31 · Source: 网淘吧

Intelligently routes requests to the optimal AI model through tiered classification, automatic failover handling, and cost optimization.

How It Works (silent by default)

The router operates transparently: users simply send messages and get a response from the model best suited to their task. No special commands are required.


Optional visibility: include [show routing] in any message to see the routing decision.

Tiered Classification System

The router uses a three-tier decision process:

┌─────────────────────────────────────────────────────────────────┐
│                    TIER 1: INTENT DETECTION                      │
│  Classify the primary purpose of the request                     │
├─────────────────────────────────────────────────────────────────┤
│  CODE        │ ANALYSIS    │ CREATIVE   │ REALTIME  │ GENERAL   │
│  write/debug │ research    │ writing    │ news/live │ Q&A/chat  │
│  refactor    │ explain     │ stories    │ X/Twitter │ translate │
│  review      │ compare     │ brainstorm │ prices    │ summarize │
└──────┬───────┴──────┬──────┴─────┬──────┴─────┬─────┴─────┬─────┘
       │              │            │            │           │
       ▼              ▼            ▼            ▼           ▼
┌─────────────────────────────────────────────────────────────────┐
│                  TIER 2: COMPLEXITY ESTIMATION                   │
├─────────────────────────────────────────────────────────────────┤
│  SIMPLE (Tier $)        │ MEDIUM (Tier $$)    │ COMPLEX (Tier $$$)│
│  • One-step task        │ • Multi-step task   │ • Deep reasoning  │
│  • Short response OK    │ • Some nuance       │ • Extensive output│
│  • Factual lookup       │ • Moderate context  │ • Critical task   │
│  → Haiku/Flash          │ → Sonnet/Grok/GPT   │ → Opus/GPT-5      │
└──────────────────────────┴─────────────────────┴───────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                TIER 3: SPECIAL CASE OVERRIDES                    │
├─────────────────────────────────────────────────────────────────┤
│  CONDITION                           │ OVERRIDE TO              │
│  ─────────────────────────────────────┼─────────────────────────│
│  Context >100K tokens                │ → Gemini Pro (1M ctx)    │
│  Context >500K tokens                │ → Gemini Pro ONLY        │
│  Needs real-time data                │ → Grok (regardless)      │
│  Image/vision input                  │ → Opus or Gemini Pro     │
│  User explicit override              │ → Requested model        │
└──────────────────────────────────────┴──────────────────────────┘
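The three tiers above can be sketched as a single routing function. This is an illustrative sketch only: the intent and complexity checks are drastically simplified stand-ins for the patterns described below, and the lowercase model aliases are shorthand.

```python
def route(request: str, context_tokens: int = 0, has_image: bool = False) -> str:
    """Minimal three-tier routing sketch: intent -> complexity -> overrides."""
    # Tier 1: intent detection (toy keyword check; see the full patterns below)
    intent = "CODE" if "debug" in request.lower() else "GENERAL"

    # Tier 2: complexity estimation (toy length heuristic)
    words = len(request.split())
    complexity = "SIMPLE" if words < 50 else "MEDIUM" if words <= 200 else "COMPLEX"

    # Tier 3: special-case overrides take precedence over the matrix
    if context_tokens > 100_000:
        return "gemini-pro"  # only family with a 1M context window
    if has_image:
        return "opus"        # vision-capable override

    # Otherwise fall through to the routing matrix
    matrix = {
        ("CODE", "SIMPLE"): "sonnet", ("CODE", "MEDIUM"): "opus",
        ("GENERAL", "SIMPLE"): "flash", ("GENERAL", "MEDIUM"): "sonnet",
    }
    return matrix.get((intent, complexity), "sonnet")
```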

Intent Detection Patterns

CODE intent

  • Keywords: write, code, debug, fix, refactor, implement, function, class, script, API, error, bug, compile, test, PR, commit
  • File extensions mentioned: .py, .js, .ts, .go, .rs, .java, etc.
  • Code blocks in the input

ANALYSIS intent

  • Keywords: analyze, explain, compare, research, understand, why, how, evaluate, assess, review, investigate, examine
  • Long-form questions
  • "Help me understand..."

CREATIVE intent

  • Keywords: write (story/poem/essay), create, brainstorm, imagine, design, draft, compose
  • Fiction/narrative requests
  • Marketing/copywriting requests

REALTIME intent

  • Keywords: now, today, current, latest, trending, news, happening, live, price, score, weather
  • X/Twitter mentions
  • Stock/crypto tickers
  • Sports scores

GENERAL intent (default)

  • Simple Q&A
  • Translation
  • Summarization
  • Conversational chat
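A minimal first-match keyword classifier for these patterns (a sketch: keyword lists are abbreviated, and the more specific intents are checked before falling back to GENERAL):

```python
import re

INTENT_KEYWORDS = {
    # Checked in insertion order; first match wins. Lists abbreviated.
    "REALTIME": ["today", "current", "latest", "news", "price", "weather"],
    "CODE":     ["write code", "debug", "refactor", "implement", "compile"],
    "CREATIVE": ["story", "poem", "brainstorm", "draft", "imagine"],
    "ANALYSIS": ["analyze", "explain", "compare", "research", "evaluate"],
}

def classify_intent(request: str) -> str:
    text = request.lower()
    # Code blocks and file extensions signal CODE regardless of natural language
    if "```" in request or re.search(r"\.(py|js|ts|go|rs|java)\b", text):
        return "CODE"
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return intent
    return "GENERAL"  # default when nothing matches
```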

MIXED intent (multiple intents detected)

When a request contains multiple clear intents (e.g., "write code to analyze this data and explain it creatively"):

  1. Identify the primary intent: what is the main deliverable?
  2. Route to the most capable model: mixed tasks need versatility
  3. Default to COMPLEX complexity: multiple intents = multiple steps

Examples:

  • "Write code and explain how it works" → CODE (primary) + ANALYSIS → route to Opus
  • "Summarize this and get the latest news about it" → REALTIME wins → Grok
  • "Write a story based on real current events" → REALTIME + CREATIVE → Grok (real-time data wins)

Language Handling

Non-English requests are handled normally, since all supported models are multilingual:

| Model | Non-English Support |
|---|---|
| Opus/Sonnet/Haiku | Excellent (100+ languages) |
| GPT-5 | Excellent (100+ languages) |
| Gemini Pro/Flash | Excellent (100+ languages) |
| Grok | Good (major languages) |

Intent detection still works because:

  • Keyword patterns include common non-English equivalents
  • CODE intent is detected via file extensions and code blocks (language-independent)
  • Complexity is estimated from query length (applies across languages)

Edge case: if intent is unclear due to language, default to GENERAL intent and MEDIUM complexity.

Complexity Signals

SIMPLE complexity ($)

  • Short query (<50 words)
  • Single question
  • "quick question", "just tell me", "briefly"
  • Yes/no format
  • Unit conversions, definitions

MEDIUM complexity ($$)

  • Moderate-length query (50-200 words)
  • Multiple aspects involved
  • "explain", "describe", "compare"
  • Some context provided

COMPLEX complexity ($$$)

  • Long query (>200 words) or complex task
  • "step by step", "thoroughly", "in detail"
  • Multi-part questions
  • Critical/important qualifiers
  • Research, analysis, or creative work
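These signals can be combined into a small scorer (a sketch: thresholds follow the word counts above, and explicit markers override the length heuristic):

```python
def estimate_complexity(request: str) -> str:
    """Map the signals above to SIMPLE / MEDIUM / COMPLEX."""
    text = request.lower()
    words = len(request.split())

    # Explicit markers win over length
    if any(m in text for m in ("step by step", "thoroughly", "in detail", "critical")):
        return "COMPLEX"
    if any(m in text for m in ("quick question", "briefly", "just tell me")):
        return "SIMPLE"

    # Fall back to query length
    if words < 50:
        return "SIMPLE"
    if words <= 200:
        return "MEDIUM"
    return "COMPLEX"
```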

Routing Matrix

| Intent | Simple | Medium | Complex |
|---|---|---|---|
| CODE | Sonnet | Opus | Opus |
| ANALYSIS | Flash | GPT-5 | Opus |
| CREATIVE | Sonnet | Opus | Opus |
| REALTIME | Grok | Grok | Grok-3 |
| GENERAL | Flash | Sonnet | Opus |
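The matrix maps directly to a lookup table (a sketch using lowercase model aliases; defaulting unknown combinations to `sonnet` is an assumption, not part of the spec):

```python
ROUTING_MATRIX = {
    # (intent, complexity) -> model alias
    ("CODE",     "SIMPLE"): "sonnet", ("CODE",     "MEDIUM"): "opus",   ("CODE",     "COMPLEX"): "opus",
    ("ANALYSIS", "SIMPLE"): "flash",  ("ANALYSIS", "MEDIUM"): "gpt-5",  ("ANALYSIS", "COMPLEX"): "opus",
    ("CREATIVE", "SIMPLE"): "sonnet", ("CREATIVE", "MEDIUM"): "opus",   ("CREATIVE", "COMPLEX"): "opus",
    ("REALTIME", "SIMPLE"): "grok-2", ("REALTIME", "MEDIUM"): "grok-2", ("REALTIME", "COMPLEX"): "grok-3",
    ("GENERAL",  "SIMPLE"): "flash",  ("GENERAL",  "MEDIUM"): "sonnet", ("GENERAL",  "COMPLEX"): "opus",
}

def lookup(intent: str, complexity: str) -> str:
    # A mid-tier generalist is a safe default for unknown combinations
    return ROUTING_MATRIX.get((intent, complexity), "sonnet")
```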

Token Exhaustion and Automatic Model Switching

When a model becomes unavailable mid-session (token quota exhausted, rate limit hit, API error), the router automatically switches to the next best available model and notifies the user.

Notification Format

When a switch occurs due to exhaustion, the user sees:

┌─────────────────────────────────────────────────────────────────┐
│  ⚠️ MODEL SWITCH NOTICE                                         │
│                                                                  │
│  Your request could not be completed on claude-opus-4-5         │
│  (reason: token quota exhausted).                               │
│                                                                  │
│  ✅ Request completed using: anthropic/claude-sonnet-4-5        │
│                                                                  │
│  The response below was generated by the fallback model.        │
└─────────────────────────────────────────────────────────────────┘

Switch Reasons

| Reason | Description |
|---|---|
| Token quota exhausted | Daily/monthly token limit reached |
| Rate limit exceeded | Too many requests per minute |
| Context window exceeded | Input too large for the model |
| API timeout | Model took too long to respond |
| API error | Provider returned an error |
| Model unavailable | Model temporarily offline |
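One way to represent these reasons is a small exception hierarchy, which the `execute_with_fallback` implementation below catches (a sketch of the assumed classes; only `APIError` carries a provider code):

```python
class ModelCallError(Exception):
    """Base class for recoverable model-call failures."""

class TokenQuotaExhausted(ModelCallError): ...
class RateLimitExceeded(ModelCallError): ...
class ContextWindowExceeded(ModelCallError): ...
class APITimeout(ModelCallError): ...

class APIError(ModelCallError):
    """Provider returned an error response with a status code."""
    def __init__(self, code: str):
        super().__init__(f"provider error {code}")
        self.code = code
```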

Implementation

def execute_with_fallback(primary_model: str, fallback_chain: list[str], request: str) -> Response:
    """
    Execute request with automatic fallback and user notification.
    """
    attempted_models = []
    switch_reason = None
    
    # Try primary model first
    models_to_try = [primary_model] + fallback_chain
    
    for model in models_to_try:
        try:
            response = call_model(model, request)
            
            # If we switched models, prepend notification
            if attempted_models:
                notification = build_switch_notification(
                    failed_model=attempted_models[0],
                    reason=switch_reason,
                    success_model=model
                )
                return Response(
                    content=notification + "\n\n---\n\n" + response.content,
                    model_used=model,
                    switched=True
                )
            
            return Response(content=response.content, model_used=model, switched=False)
            
        except TokenQuotaExhausted:
            attempted_models.append(model)
            switch_reason = "token quota exhausted"
            log_fallback(model, switch_reason)
            continue
            
        except RateLimitExceeded:
            attempted_models.append(model)
            switch_reason = "rate limit exceeded"
            log_fallback(model, switch_reason)
            continue
            
        except ContextWindowExceeded:
            attempted_models.append(model)
            switch_reason = "context window exceeded"
            log_fallback(model, switch_reason)
            continue
            
        except APITimeout:
            attempted_models.append(model)
            switch_reason = "API timeout"
            log_fallback(model, switch_reason)
            continue
            
        except APIError as e:
            attempted_models.append(model)
            switch_reason = f"API error: {e.code}"
            log_fallback(model, switch_reason)
            continue
    
    # All models exhausted
    return build_exhaustion_error(attempted_models)


def build_switch_notification(failed_model: str, reason: str, success_model: str) -> str:
    """Build user-facing notification when model switch occurs."""
    return f"""⚠️ **MODEL SWITCH NOTICE**

Your request could not be completed on `{failed_model}` (reason: {reason}).

✅ **Request completed using:** `{success_model}`

The response below was generated by the fallback model."""


def build_exhaustion_error(attempted_models: list[str]) -> Response:
    """Build error when all models are exhausted."""
    models_tried = ", ".join(attempted_models)
    return Response(
        content=f"""❌ **REQUEST FAILED**

Unable to complete your request. All available models have been exhausted.

**Models attempted:** {models_tried}

**What you can do:**
1. **Wait** — Token quotas typically reset hourly or daily
2. **Simplify** — Try a shorter or simpler request
3. **Check status** — Run `/router status` to see model availability

If this persists, your human may need to check API quotas or add additional providers.""",
        model_used=None,
        switched=False,
        failed=True
    )

Fallback Priority on Token Exhaustion

When a model is exhausted, the router selects the next best model for the same task type:

| Original Model | Fallback Priority (equivalent capability) |
|---|---|
| Opus | Sonnet → GPT-5 → Grok-3 → Gemini Pro |
| Sonnet | GPT-5 → Grok-3 → Opus → Haiku |
| GPT-5 | Sonnet → Opus → Grok-3 → Gemini Pro |
| Gemini Pro | Flash → GPT-5 → Opus → Sonnet |
| Grok-2/3 | (warn: no real-time fallback available) |

User Acknowledgment

After a model switch, the agent should note in its response:

  1. The original model was unavailable
  2. Response quality may differ from the original model's typical output

This ensures transparency and sets appropriate expectations.

Streaming Responses and Fallback

With streaming responses, fallback handling needs special care:

async def execute_with_streaming_fallback(primary_model: str, fallback_chain: list[str], request: str):
    """
    Handle streaming responses with mid-stream fallback.
    
    If a model fails DURING streaming (not before), the partial response is lost.
    Strategy: Don't start streaming until first chunk received successfully.
    """
    models_to_try = [primary_model] + fallback_chain
    
    for model in models_to_try:
        try:
            # Test with non-streaming ping first (optional, adds latency)
            # await test_model_availability(model)
            
            # Start streaming
            stream = await call_model_streaming(model, request)
            first_chunk = await stream.get_first_chunk(timeout=10_000)  # 10s timeout for first chunk
            
            # If we got here, model is responding — continue streaming
            yield first_chunk
            async for chunk in stream:
                yield chunk
            return  # Success
            
        except (FirstChunkTimeout, StreamError) as e:
            log_fallback(model, str(e))
            continue  # Try next model
    
    # All models failed
    yield build_exhaustion_error(models_to_try)

Key insight: wait for the first chunk before committing to a model. If the first chunk times out, fall back before showing the user any partial response.

Retry Timing Configuration

RETRY_CONFIG = {
    "initial_timeout_ms": 30_000,     # 30s for first attempt
    "fallback_timeout_ms": 20_000,    # 20s for fallback attempts (faster fail)
    "max_retries_per_model": 1,       # Don't retry same model
    "backoff_multiplier": 1.5,        # Not used (no same-model retry)
    "circuit_breaker_threshold": 3,   # Failures before skipping model entirely
    "circuit_breaker_reset_ms": 300_000  # 5 min before trying failed model again
}

Circuit breaker: if a model fails 3 times, skip it entirely for the next 5 minutes. This avoids repeatedly hammering a service that is down.

Fallback Chains

When the preferred model fails (rate limited, API down, error, etc.), cascade to the next option:

CODE tasks

Opus → Sonnet → GPT-5 → Gemini Pro

ANALYSIS tasks

Opus → GPT-5 → Gemini Pro → Sonnet

CREATIVE tasks

Opus → GPT-5 → Sonnet → Gemini Pro

REALTIME tasks

Grok-2 → Grok-3 → (warn: no real-time fallback)

GENERAL tasks

Flash → Haiku → Sonnet → GPT-5
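These chains can be kept as plain ordered lists and filtered by availability before use (a sketch):

```python
FALLBACK_CHAINS = {
    # task type -> ordered chain; first entry is the primary model
    "code":     ["opus", "sonnet", "gpt-5", "gemini-pro"],
    "analysis": ["opus", "gpt-5", "gemini-pro", "sonnet"],
    "creative": ["opus", "gpt-5", "sonnet", "gemini-pro"],
    "realtime": ["grok-2", "grok-3"],  # then warn: no real-time fallback
    "general":  ["flash", "haiku", "sonnet", "gpt-5"],
}

def chain_for(task: str, available: set[str]) -> list[str]:
    """Restrict a chain to currently available models, preserving order."""
    return [m for m in FALLBACK_CHAINS.get(task, []) if m in available]
```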

Long-Context Handling (tiered by scale)

┌─────────────────────────────────────────────────────────────────┐
│                  LONG CONTEXT FALLBACK CHAIN                     │
├─────────────────────────────────────────────────────────────────┤
│  TOKEN COUNT        │ FALLBACK CHAIN                            │
│  ───────────────────┼───────────────────────────────────────────│
│  128K - 200K        │ Opus (200K) → Sonnet (200K) → Gemini Pro  │
│  200K - 1M          │ Gemini Pro → Flash (1M) → ERROR_MESSAGE   │
│  > 1M               │ ERROR_MESSAGE (no model supports this)    │
└─────────────────────┴───────────────────────────────────────────┘

Implementation:

def handle_long_context(token_count: int, available_models: dict) -> str | ErrorMessage:
    """Route long-context requests with graceful degradation."""
    
    # Tier 1: 128K - 200K tokens (Opus/Sonnet can handle)
    if token_count <= 200_000:
        for model in ["opus", "sonnet", "haiku", "gemini-pro", "flash"]:
            if model in available_models and get_context_limit(model) >= token_count:
                return model
    
    # Tier 2: 200K - 1M tokens (only Gemini)
    elif token_count <= 1_000_000:
        for model in ["gemini-pro", "flash"]:
            if model in available_models:
                return model
    
    # Tier 3: > 1M tokens (nothing available)
    # Fall through to error
    
    # No suitable model found — return helpful error
    return build_context_error(token_count, available_models)


def build_context_error(token_count: int, available_models: dict) -> ErrorMessage:
    """Build a helpful error message when no model can handle the input."""
    
    # Find the largest available context window
    max_available = max(
        (get_context_limit(m) for m in available_models),
        default=0
    )
    
    # Determine what's missing
    missing_models = []
    if "gemini-pro" not in available_models and "flash" not in available_models:
        missing_models.append("Gemini Pro/Flash (1M context)")
    if token_count <= 200_000 and "opus" not in available_models:
        missing_models.append("Opus (200K context)")
    
    # Format token count for readability
    if token_count >= 1_000_000:
        token_display = f"{token_count / 1_000_000:.1f}M"
    else:
        token_display = f"{token_count // 1000}K"
    
    return ErrorMessage(
        title="Context Window Exceeded",
        message=f"""Your input is approximately **{token_display} tokens**, which exceeds the context window of all currently available models.

**Required:** Gemini Pro (1M context) {"— currently unavailable" if "gemini-pro" not in available_models else ""}
**Your max available:** {max_available // 1000}K tokens

**Options:**
1. **Wait and retry** — Gemini may be temporarily down
2. **Reduce input size** — Remove unnecessary content to fit within {max_available // 1000}K tokens
3. **Split into chunks** — I can process your input sequentially in smaller pieces

Would you like me to help split this into manageable chunks?""",
        
        recoverable=True,
        suggested_action="split_chunks"
    )

Example error output:

⚠️ Context Window Exceeded

Your input is approximately **340K tokens**, which exceeds the context 
window of all currently available models.

Required: Gemini Pro (1M context) — currently unavailable
Your max available: 200K tokens

Options:
1. Wait and retry — Gemini may be temporarily down
2. Reduce input size — Remove unnecessary content to fit within 200K tokens
3. Split into chunks — I can process your input sequentially in smaller pieces

Would you like me to help split this into manageable chunks?

Dynamic Model Discovery

The router detects available providers automatically at runtime:

1. Check configured auth profiles
2. Build available model list from authenticated providers
3. Construct routing table using ONLY available models
4. If preferred model unavailable, use best available alternative

Example: if only Anthropic and Google are configured:

  • CODE tasks → Opus (Anthropic available ✓)
  • REALTIME tasks → ⚠️ no Grok → fall back to Opus and warn the user
  • Long documents → Gemini Pro (Google available ✓)

Cost Optimization

When complexity is low, the router factors in cost:

| Model | Cost Tier | Use Case |
|---|---|---|
| Gemini Flash | $ | Simple tasks, high volume |
| Claude Haiku | $ | Simple tasks, fast responses |
| Claude Sonnet | $$ | Medium complexity |
| Grok 2 | $$ | Real-time needs only |
| GPT-5 | $$ | General fallback |
| Gemini Pro | $$$ | Long-context needs |
| Claude Opus | $$$$ | Complex/critical tasks |

Rule: never use Opus ($$$$) for a task Flash ($) can handle.
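The tier table and the rule can be encoded directly; the routing code later in this document assumes a `COST_TIERS` mapping of this shape (a sketch):

```python
COST_TIERS = {
    "flash": "$", "haiku": "$",
    "sonnet": "$$", "grok-2": "$$", "gpt-5": "$$",
    "gemini-pro": "$$$",
    "opus": "$$$$",
}

def cheapest_capable(candidates: list[str]) -> str:
    """Among capable models, prefer the cheapest tier (fewest $ signs)."""
    return min(candidates, key=lambda m: len(COST_TIERS.get(m, "$$$$")))
```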

User Controls

Show the Routing Decision

Add [show routing] to any message:

[show routing] What's the weather in NYC?

The output includes:

[Routed → xai/grok-2-latest | Reason: REALTIME intent detected | Fallback: none available]

Force a Specific Model

Explicit overrides:

  • "use grok: ..." → force Grok
  • "use claude: ..." → force Opus
  • "use gemini: ..." → force Gemini Pro
  • "use flash: ..." → force Gemini Flash
  • "use gpt: ..." → force GPT-5

Check Router Status

Ask "router status" or "/router" to see:

  • Available providers
  • Configured models
  • Current routing table
  • Recent routing decisions

Implementation Notes

For Agent Implementations

When processing a request:

1. DETECT available models (check auth profiles)
2. CLASSIFY intent (code/analysis/creative/realtime/general)
3. ESTIMATE complexity (simple/medium/complex)
4. CHECK special cases (context size, vision, explicit override)
5. FILTER by cost tier based on complexity ← BEFORE model selection
6. SELECT model from filtered pool using routing matrix
7. VERIFY model available, else use fallback chain (also cost-filtered)
8. EXECUTE request with selected model
9. IF failure, try next in fallback chain
10. LOG routing decision (for debugging)

Cost-Aware Routing Flow (critical ordering)

def route_with_fallback(request):
    """
    Main routing function with CORRECT execution order.
    Cost filtering MUST happen BEFORE routing table lookup.
    """
    
    # Step 1: Discover available models
    available_models = discover_providers()
    
    # Step 2: Classify intent
    intent = classify_intent(request)
    
    # Step 3: Estimate complexity
    complexity = estimate_complexity(request)
    
    # Step 4: Check special-case overrides (these bypass cost filtering)
    if user_override := get_user_model_override(request):
        return execute_with_fallback(user_override, [])  # No cost filter for explicit override
    
    if (token_count := estimate_tokens(request)) > 128_000:
        return handle_long_context(token_count, available_models)  # Special handling
    
    if needs_realtime(request):
        return execute_with_fallback("grok-2", ["grok-3"])  # Realtime bypasses cost
    
    # ┌─────────────────────────────────────────────────────────────┐
    # │  STEP 5: FILTER BY COST TIER — THIS MUST COME FIRST!       │
    # │                                                             │
    # │  Cost filtering happens BEFORE the routing table lookup,   │
    # │  NOT after. This ensures "what's 2+2?" never considers     │
    # │  Opus even momentarily.                                    │
    # └─────────────────────────────────────────────────────────────┘
    
    allowed_tiers = get_allowed_tiers(complexity)
    # SIMPLE  → ["$"]
    # MEDIUM  → ["$", "$$"]
    # COMPLEX → ["$", "$$", "$$$", "$$$$"]
    
    cost_filtered_models = {
        model: meta for model, meta in available_models.items()
        if COST_TIERS.get(model) in allowed_tiers
    }
    
    # Step 6: NOW select from cost-filtered pool using routing preferences
    preferences = ROUTING_PREFERENCES.get((intent, complexity), [])
    
    for model in preferences:
        if model in cost_filtered_models:  # Only consider cost-appropriate models
            selected_model = model
            break
    else:
        # No preferred model in cost-filtered pool — use cheapest available
        selected_model = select_cheapest(cost_filtered_models)
    
    # Step 7: Build cost-filtered fallback chain
    task_type = get_task_type(intent, complexity)
    full_chain = MASTER_FALLBACK_CHAINS.get(task_type, [])
    filtered_chain = [m for m in full_chain if m in cost_filtered_models and m != selected_model]
    
    # Step 8-10: Execute with fallback + logging
    return execute_with_fallback(selected_model, filtered_chain)


def get_allowed_tiers(complexity: str) -> list[str]:
    """Return allowed cost tiers for a given complexity level."""
    return {
        "SIMPLE":  ["$"],                      # Budget only — no exceptions
        "MEDIUM":  ["$", "$$"],                # Budget + standard
        "COMPLEX": ["$", "$$", "$$$", "$$$$"], # All tiers — complex tasks deserve the best
    }.get(complexity, ["$", "$$"])


# Example flow for "what's 2+2?":
#
# 1. available_models = {opus, sonnet, haiku, flash, grok-2, ...}
# 2. intent = GENERAL
# 3. complexity = SIMPLE
# 4. (no special cases)
# 5. allowed_tiers = ["$"]  ← SIMPLE means $ only
#    cost_filtered_models = {haiku, flash, grok-2}  ← Opus/Sonnet EXCLUDED
# 6. preferences for (GENERAL, SIMPLE) = [flash, haiku, grok-2, sonnet]
#    first match in cost_filtered = flash ✓
# 7. fallback_chain = [haiku, grok-2]  ← Also cost-filtered
# 8. execute with flash
#
# Result: Opus is NEVER considered, not even momentarily.

Cost Optimization: Two Approaches

┌─────────────────────────────────────────────────────────────────┐
│           COST OPTIMIZATION IMPLEMENTATION OPTIONS               │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  APPROACH 1: Explicit filter_by_cost() (shown above)            │
│  ─────────────────────────────────────────────────────────────  │
│  • Calls get_allowed_tiers(complexity) explicitly               │
│  • Filters available_models BEFORE routing table lookup         │
│  • Most defensive — impossible to route wrong tier              │
│  • Recommended for security-critical deployments                │
│                                                                  │
│  APPROACH 2: Preference ordering (implicit)                     │
│  ─────────────────────────────────────────────────────────────  │
│  • ROUTING_PREFERENCES lists cheapest capable models first      │
│  • For SIMPLE tasks: [flash, haiku, grok-2, sonnet]            │
│  • First available match wins → naturally picks cheapest        │
│  • Simpler code, relies on correct preference ordering          │
│                                                                  │
│  This implementation uses BOTH for defense-in-depth:            │
│  • Preference ordering provides first line of cost awareness    │
│  • Explicit filter_by_cost() guarantees tier enforcement        │
│                                                                  │
│  For alternative implementations that rely solely on            │
│  preference ordering, see references/models.md for the          │
│  filter_by_cost() function if explicit enforcement is needed.   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Spawning with a Different Model

Use sessions_spawn for model routing:

sessions_spawn(
  task: "user's request",
  model: "selected/model-id",
  label: "task-type-query"
)

Security

  • Never send sensitive data to untrusted models
  • API keys are handled only via environment/auth profiles
  • See references/security.md for full security guidelines

Model Details

See references/models.md for detailed capabilities and pricing.
