Implementation Plan: Multi-Model Hybrid Configuration
Branch: 003-multi-model-hybrid | Date: 2026-01-09 | Spec: spec.md
Input: Feature specification from /specs/003-multi-model-hybrid/spec.md
Summary
Implement multi-provider LLM support (Gemini, OpenAI, Anthropic) with role-based optimal model assignment to achieve 50% cost reduction ($1.00 → $0.50), 70% speed improvement (10 min → 2-3 min), and 100% elimination of fuzzy replacement errors through full-rewrite architecture.
Technical Approach:
- Create `BaseLLMClient` abstraction layer with provider-specific implementations
- Configure agent-to-model mapping: Recruiter→Gemini Flash, Technical Writer→o3-mini, Copywriter→Claude Sonnet, Designers/Revisor→Gemini Flash
- Replace fuzzy string matching in RevisionService with full-file rewrite using Gemini’s long-context capabilities
- Add CLI options for 3 API keys with a `--model` override for testing
- Log token usage per agent for cost validation
Technical Context
Language/Version: Python 3.13 (existing, LangGraph compatible)
Primary Dependencies:
- Existing: `anthropic>=0.25.0`, `langgraph>=1.0.0`, `click>=8.1.0`
- New: `google-genai[aiohttp]>=1.0.0`, `openai>=1.0.0`
Storage: In-memory state (LangGraph StateGraph), filesystem for resume files
Testing: pytest with async support (pytest-asyncio)
Target Platform: macOS/Linux CLI tool (existing)
Project Type: Single Python package (monorepo package packages/resume-review)
Performance Goals:
- Review execution time ≤ 3 minutes (SC-001)
- API cost ≤ $0.55 per review (SC-002)
Constraints:
- Must maintain backward compatibility with existing CLI
- Must preserve YAML frontmatter exactly (SC-006)
- Zero tolerance for revision errors (SC-003)
Scale/Scope:
- 6 agents (Recruiter, Technical Writer, Copywriter, 2 Designers, Revisor)
- 3 LLM providers (Gemini, OpenAI, Anthropic)
- ~500 new lines of code, ~100 lines modified
Constitution Check
GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.
I. Single Source of Truth
Status: ✅ PASS - Not applicable (no resume content changes)
Reasoning: Feature modifies the evaluation/revision system, not resume source files; `resume/resume-ja.qmd` remains the sole source.
II. Automated Generation
Status: ✅ PASS - Not applicable (no build pipeline changes)
Reasoning: Feature enhances AI review process but doesn’t change Quarto build pipeline (QMD→PDF/HTML/MDX).
III. Preview-First Workflow
Status: ✅ PASS - Compatible with existing workflow
Reasoning: Feature improves quality of generated revisions, making preview step even more valuable. No changes to preview commands.
IV. Deployment Simplicity
Status: ✅ PASS - No deployment changes
Reasoning: Feature is CLI tool enhancement, not web deployment change. Vercel auto-deploy unaffected.
V. Toolchain Consistency
Status: ✅ PASS - Adds new optional toolchain components
Reasoning:
- Adds optional API keys for Gemini/OpenAI (users can still use Claude-only mode)
- No changes to Quarto/LaTeX/font requirements
- New Python dependencies documented in `pyproject.toml`
Documentation Updates Required:
- README: Add Gemini/OpenAI API key setup instructions (optional)
- README: Document the `--model` override flag for testing
Constitution Compliance Summary
| Principle | Status | Notes |
|---|---|---|
| I. Single Source of Truth | ✅ PASS | No resume source changes |
| II. Automated Generation | ✅ PASS | No build pipeline changes |
| III. Preview-First | ✅ PASS | Compatible with existing workflow |
| IV. Deployment Simplicity | ✅ PASS | CLI-only changes |
| V. Toolchain Consistency | ✅ PASS | Adds optional dependencies |
Gate Result: ✅ PROCEED - No violations, all principles satisfied
Project Structure
Documentation (this feature)
specs/003-multi-model-hybrid/
├── spec.md # Feature specification (P0, P1, P2 user stories)
├── plan.md # This file (implementation plan)
├── research.md # Phase 0: API research, model names, best practices
├── data-model.md # Phase 1: LLMResponse, ModelConfig, TokenUsage entities
├── quickstart.md # Phase 1: Developer onboarding guide
├── contracts/ # Phase 1: Interface contracts
│ └── llm-client-interface.md # BaseLLMClient contract with 3 implementations
└── tasks.md # Phase 2: NOT created by /speckit.plan (see /speckit.tasks)
Source Code (repository root)
Existing Structure (no changes):
packages/resume-review/ # Python package root
├── src/
│ ├── agents/ # Existing agents (MODIFIED)
│ │ ├── base.py # BaseAgent (MODIFIED: accept llm_client param)
│ │ ├── recruiter.py # RecruiterAgent (minimal changes)
│ │ ├── technical_writer.py # TechnicalWriterAgent (minimal changes)
│ │ ├── copywriter.py # CopywriterAgent (minimal changes)
│ │ ├── ux_designer.py # UXDesignerAgent (minimal changes)
│ │ └── visual_designer.py # VisualDesignerAgent (minimal changes)
│ │
│ ├── models/ # Existing data models (unchanged)
│ │ ├── feedback.py
│ │ ├── portfolio.py
│ │ └── session.py
│ │
│ ├── workflow/ # Existing workflow (MODIFIED)
│ │ ├── state.py # ReviewState (MODIFIED: add API key fields)
│ │ ├── nodes/
│ │ │ ├── supervisor.py # MODIFIED: construct clients for agents
│ │ │ ├── revisor.py # MODIFIED: full rewrite logic
│ │ │ ├── aggregator.py # (unchanged)
│ │ │ └── portfolio.py # (unchanged)
│ │ ├── graph.py # (unchanged)
│ │ ├── runner.py # ReviewWorkflow (MODIFIED: pass API keys to state)
│ │ └── conditions.py # (unchanged)
│ │
│ ├── services/ # NEW FILES HERE
│ │ ├── llm_client.py # NEW: BaseLLMClient + 3 implementations
│ │ ├── llm_factory.py # NEW: LLMClientFactory
│ │ ├── qmd_parser.py # (existing, unchanged)
│ │ ├── quarto_validator.py # (existing, unchanged)
│ │ └── screenshot.py # (existing, unchanged)
│ │
│ ├── config/ # Existing config (MODIFIED)
│ │ ├── model_config.py # NEW: Agent-to-model mappings
│ │ ├── settings.py # (existing, possibly extended)
│ │ ├── prompts.py # (existing, possibly extended for Revisor)
│ │ └── weights.py # (existing, unchanged)
│ │
│ ├── utils/ # (existing, unchanged)
│ └── cli.py # MODIFIED: add --gemini/openai/anthropic-api-key options
│
├── tests/
│ ├── unit/
│ │ ├── services/
│ │ │ ├── test_llm_client.py # NEW: Test 3 client implementations
│ │ │ └── test_llm_factory.py # NEW: Test model selection logic
│ │ ├── agents/
│ │ │ └── test_base.py # MODIFIED: Test with injected clients
│ │ └── workflow/
│ │ └── test_state.py # MODIFIED: Test new state fields
│ │
│ └── integration/
│ └── test_multi_model.py # NEW: End-to-end with real APIs
│
├── pyproject.toml # MODIFIED: add google-genai, openai deps
└── README.md # MODIFIED: document new API keys
Structure Decision: Monorepo package structure (existing). New code fits cleanly into existing services/ and config/ directories. Agents modified via dependency injection (no structural changes).
Complexity Tracking
Not applicable - No constitution violations to justify.
Implementation Phases
Phase 0: Research ✅ COMPLETE
Artifact: research.md (generated)
Key Findings:
- Gemini uses the `google-genai` SDK with the `gemini-3.0-flash` model (stable, recommended)
- OpenAI uses `AsyncOpenAI` with the `o3-mini` model
- Anthropic already integrated, no changes needed
- Full rewrite eliminates 30-40% fuzzy matching error rate
Phase 1: Design & Contracts ✅ COMPLETE
Artifacts:
- `data-model.md`: Entities (LLMResponse, ModelConfiguration, TokenUsage, RevisionInstruction)
- `contracts/llm-client-interface.md`: BaseLLMClient interface with 3 implementations
- `quickstart.md`: Developer onboarding guide
Key Designs:
- `BaseLLMClient` abstract base class with a `generate_async()` method (see the sketch below)
- `LLMClientFactory` for model selection (hybrid vs override mode)
- `AGENT_MODEL_MAP` for agent-to-model assignments
- Token usage tracking via the `ReviewState.token_usage` dict
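A minimal sketch of the client contract follows; the field and parameter names are assumptions for illustration, and the authoritative definitions live in `contracts/llm-client-interface.md` and `data-model.md`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class LLMResponse:
    """Normalized response returned by every provider client."""
    text: str
    model: str
    input_tokens: int
    output_tokens: int


class BaseLLMClient(ABC):
    """Provider-agnostic interface; Gemini/OpenAI/Anthropic clients implement it."""

    def __init__(self, api_key: str, model: str) -> None:
        self.api_key = api_key
        self.model = model

    @abstractmethod
    async def generate_async(
        self,
        system_prompt: str,
        user_prompt: str,
        max_tokens: int = 4096,
    ) -> LLMResponse:
        """Return the model's completion plus token counts for cost tracking."""
```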
Phase 2: Tasks (Next Step)
Command: /speckit.tasks
Purpose: Generate detailed task breakdown for implementation (NOT done by /speckit.plan)
File-Level Change Summary
New Files (Create)
| File | Lines | Purpose |
|---|---|---|
| `src/services/llm_client.py` | ~400 | BaseLLMClient + Gemini/OpenAI/Anthropic implementations |
| `src/services/llm_factory.py` | ~150 | LLMClientFactory with model selection logic |
| `src/config/model_config.py` | ~50 | AGENT_MODEL_MAP, MODEL_PRICING constants |
| `tests/unit/services/test_llm_client.py` | ~200 | Unit tests for 3 clients |
| `tests/unit/services/test_llm_factory.py` | ~100 | Unit tests for factory |
| `tests/integration/test_multi_model.py` | ~150 | Integration tests with real APIs |
Total New: ~1050 lines
Modified Files (Edit)
| File | Changes | Impact |
|---|---|---|
| `src/agents/base.py` | ~20 lines | Change `__init__` to accept `llm_client`, update `evaluate_async` |
| `src/workflow/state.py` | ~10 lines | Add `gemini_api_key`, `openai_api_key`, `anthropic_api_key`, `token_usage` fields |
| `src/workflow/nodes/supervisor.py` | ~30 lines | Construct clients via LLMClientFactory, pass to agents |
| `src/workflow/nodes/revisor.py` | ~50 lines | Implement full rewrite logic, YAML validation |
| `src/workflow/runner.py` | ~15 lines | Pass 3 API keys to initial state |
| `src/cli.py` | ~40 lines | Add 3 API key options, `--model` override, verbose output |
| `pyproject.toml` | ~5 lines | Add `google-genai[aiohttp]`, `openai` dependencies |
| `README.md` | ~20 lines | Document new API keys, hybrid mode, override mode |
Total Modified: ~190 lines
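For orientation, the state change is only a handful of new fields. A sketch of the additions to `src/workflow/state.py`, assuming `ReviewState` is a TypedDict as is typical for LangGraph state (existing fields omitted):

```python
from typing import TypedDict


class ReviewState(TypedDict, total=False):
    # New fields only; the existing review fields are unchanged.
    gemini_api_key: str | None
    openai_api_key: str | None
    anthropic_api_key: str | None
    # agent name -> {"model": str, "input_tokens": int, "output_tokens": int}
    token_usage: dict[str, dict]
```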
Deleted Files (Optional Cleanup)
| File | Reason |
|---|---|
| `src/services/revision.py` | Replaced by full rewrite logic in RevisorAgent |
Note: Can be kept initially and marked deprecated, removed in future cleanup.
Critical Implementation Decisions
Decision 1: Dependency Injection vs Direct Construction
Chosen: Dependency injection (pass llm_client to agent constructor)
Rationale:
- Enables testing with a `MockLLMClient`
- Agents don't need provider-specific knowledge
- Factory pattern centralizes model selection logic
Alternative Rejected: Agents construct clients directly
- Would require agents to know about providers
- Harder to test (need to mock SDK clients)
- Duplicates model selection logic across 6 agents
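A sketch of the injection pattern, building on the client sketch above; the method bodies and the mock's attributes are illustrative rather than the final API:

```python
class BaseAgent:
    """Agent receives a ready-made client; it never touches provider SDKs."""

    def __init__(self, llm_client: BaseLLMClient) -> None:
        self.llm_client = llm_client

    async def evaluate_async(self, resume_text: str) -> str:
        response = await self.llm_client.generate_async(
            system_prompt=self.system_prompt(),
            user_prompt=resume_text,
        )
        return response.text

    def system_prompt(self) -> str:
        raise NotImplementedError  # each concrete agent supplies its prompt


class MockLLMClient(BaseLLMClient):
    """Test double: returns canned text, records prompts, makes no API calls."""

    def __init__(self, canned_text: str = "ok") -> None:
        super().__init__(api_key="test", model="mock")
        self.canned_text = canned_text
        self.calls: list[str] = []

    async def generate_async(self, system_prompt, user_prompt, max_tokens=4096):
        self.calls.append(user_prompt)
        return LLMResponse(text=self.canned_text, model="mock",
                           input_tokens=0, output_tokens=0)
```

Unit tests construct an agent with `MockLLMClient` directly; the workflow's supervisor node constructs real clients via `LLMClientFactory` and injects them the same way.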
Decision 2: Full Rewrite vs Improved Fuzzy Matching
Chosen: Full rewrite (Revisor generates complete file)
Rationale:
- Eliminates 30-40% error rate completely (SC-003)
- Simpler implementation (remove RevisionService complexity)
- Gemini Flash handles an 8,000-token output efficiently (~$0.024 per rewrite)
Alternative Rejected: Improve fuzzy matching with AST parsing
- Still error-prone (markdown structure variability)
- More complex code to maintain
- Doesn’t address root cause (string matching fragility)
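A rough sketch of the full-rewrite step; `split_frontmatter` is a simplified stand-in for the real QMD parsing, and the prompt wording is illustrative:

```python
def split_frontmatter(qmd_text: str) -> tuple[str, str]:
    """Split the leading '---' YAML frontmatter from the body (simplified sketch)."""
    _, frontmatter, body = qmd_text.split("---\n", 2)
    return f"---\n{frontmatter}---\n", body


async def rewrite_resume(client: BaseLLMClient, qmd_text: str,
                         instructions: list[str]) -> str:
    """Ask the model for the complete revised file, then guard the frontmatter."""
    original_frontmatter, _ = split_frontmatter(qmd_text)

    prompt = (
        "Rewrite the resume below in full, applying every instruction. "
        "Return the entire document; do not truncate or summarize.\n\n"
        "Instructions:\n" + "\n".join(f"- {i}" for i in instructions) +
        "\n\nResume:\n" + qmd_text
    )
    response = await client.generate_async(
        system_prompt="You are a resume revisor.",
        user_prompt=prompt,
        max_tokens=8000,
    )

    revised = response.text
    revised_frontmatter, revised_body = split_frontmatter(revised)
    if revised_frontmatter != original_frontmatter:
        # SC-006: never trust the model with YAML metadata; restore the original.
        revised = original_frontmatter + revised_body
    return revised
```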
Decision 3: Static Config vs Dynamic Config
Chosen: Static `AGENT_MODEL_MAP` with a CLI `--model` override
Rationale:
- Simple, no external files or databases
- Easy to test (hardcoded mappings)
- The `--model` flag provides flexibility when needed
Alternative Rejected: YAML/JSON config file
- Adds file I/O complexity
- Users unlikely to change mappings (optimized based on research)
- Harder to version control (separate config file)
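Under this decision, `model_config.py` can be as small as the sketch below. The constants are the ones named in this plan; the `resolve_model` helper is illustrative (the real selection lives in `LLMClientFactory`), and the Claude model ID is a placeholder pinned elsewhere:

```python
# src/config/model_config.py (sketch; exact model IDs are pinned in research.md)
AGENT_MODEL_MAP: dict[str, tuple[str, str]] = {
    # agent role -> (provider, model id)
    "recruiter":        ("gemini", "gemini-3.0-flash"),
    "technical_writer": ("openai", "o3-mini"),
    "copywriter":       ("anthropic", "claude-sonnet"),   # placeholder id
    "ux_designer":      ("gemini", "gemini-3.0-flash"),
    "visual_designer":  ("gemini", "gemini-3.0-flash"),
    "revisor":          ("gemini", "gemini-3.0-flash"),
}

PROVIDER_BY_MODEL = {model: provider for provider, model in AGENT_MODEL_MAP.values()}


def resolve_model(agent: str, override: str | None = None) -> tuple[str, str]:
    """Hybrid mode reads the static map; a --model override applies to every agent."""
    if override:
        return PROVIDER_BY_MODEL[override], override
    return AGENT_MODEL_MAP[agent]
```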
Decision 4: LiteLLM vs Custom Abstraction
Chosen: Custom BaseLLMClient abstraction
Rationale:
- Only 3 providers needed (LiteLLM is built for 100+ providers; overkill here)
- Full control over error handling and token tracking
- No external service dependencies
- ~400 lines of code (manageable)
Alternative Rejected: LiteLLM router
- Heavy dependency for simple use case
- Adds complexity (router, load balancing, fallback logic)
- Overkill for sequential agent execution (no load balancing needed)
Risk Mitigation
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Gemini model names change | Medium | High | Use versioned IDs (gemini-3.0-flash), add alias support |
| Provider rate limits hit | Medium | Medium | Exponential backoff (tenacity), clear error messages |
| Full rewrite truncates content | Low | High | Validate output length, max_tokens=8000, explicit “no truncation” prompt |
| YAML frontmatter corruption | Low | High | Post-generation validation, use original frontmatter if mismatch |
| Cost exceeds projections | Medium | Low | Log every API call cost, warn if exceeds $0.70 threshold |
| Japanese quality degrades | Medium | Medium | Keep Copywriter on Claude (Japanese specialist), test with real resumes |
| API key management confusion | High | Low | Clear error messages, README documentation, backward compatibility |
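For the rate-limit mitigation, a sketch of the backoff wrapper using tenacity; the exception type is a placeholder that each client would map its SDK's rate-limit error onto:

```python
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


class RateLimitError(Exception):
    """Placeholder: each provider client maps its SDK's rate-limit error onto this."""


@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    stop=stop_after_attempt(5),
    reraise=True,
)
async def call_with_backoff(client: BaseLLMClient, system_prompt: str,
                            user_prompt: str) -> LLMResponse:
    """Retries with exponential backoff on rate limits, then surfaces the error."""
    return await client.generate_async(system_prompt=system_prompt,
                                       user_prompt=user_prompt)
```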
Testing Strategy
Unit Tests (No API Calls)
Coverage:
- `test_llm_client.py`: Test `MockLLMClient`, interface compliance
- `test_llm_factory.py`: Test model selection (hybrid vs override)
- `test_base.py`: Test agent with injected client
Run: `pytest tests/unit -v`
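A representative unit test for the selection logic, assuming the module path and the `resolve_model` helper from the configuration sketch earlier (no network access):

```python
# tests/unit/services/test_llm_factory.py (sketch)
from src.config.model_config import AGENT_MODEL_MAP, resolve_model


def test_hybrid_mode_uses_per_agent_mapping():
    assert resolve_model("technical_writer") == ("openai", "o3-mini")
    assert resolve_model("copywriter")[0] == "anthropic"


def test_model_override_applies_to_every_agent():
    for agent in AGENT_MODEL_MAP:
        provider, model = resolve_model(agent, override="gemini-3.0-flash")
        assert (provider, model) == ("gemini", "gemini-3.0-flash")
```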
Integration Tests (Real API Calls)
Coverage:
- `test_multi_model.py`: Full review with all 3 providers
- Validate SC-001 through SC-009
Run: `pytest tests/integration -v -m integration`
Note: Marked with `@pytest.mark.integration`, skipped by default
End-to-End Validation
Test Resume: Create test file with known issues:
- Anachronistic technology (Next.js 14 in 2015)
- Incompatible stack (Django + Flask)
- Incomplete sentences (e.g., "開発を担", a Japanese phrase cut off mid-word)
- Duplicate projects
Validation:
- Run review: `resume-review review --input test_resume.qmd --save-iterations`
- Verify SC-001: Time < 3 minutes
- Verify SC-002: Cost < $0.55 (check logs)
- Verify SC-003: No “Fuzzy replacement failed” errors
- Verify SC-004: Technical Writer detects anachronistic tech
- Verify SC-005: Copywriter fixes incomplete sentences
- Verify SC-006: YAML frontmatter unchanged
Success Criteria Validation
| ID | Criteria | Measurement | Target | Test Method |
|---|---|---|---|---|
| SC-001 | Execution time | time command | ≤ 3 min | End-to-end test |
| SC-002 | API cost | Sum of token_usage logs | ≤ $0.55 | Cost calculation script |
| SC-003 | Revision success | Count errors in 20 runs | 0 errors | Batch test script |
| SC-004 | Technical detection | Inject test inconsistencies | ≥ 90% | Test resume with issues |
| SC-005 | Copywriter corrections | Inject incomplete sentences | 100% | Test resume with issues |
| SC-006 | YAML preservation | Compare pre/post frontmatter | 100% | YAML diff validation |
| SC-007 | Hybrid execution | Check logs for different models | All agents | Verbose output check |
| SC-008 | Override functionality | Run with --model flag | All use override | Override test |
| SC-009 | Verbose reporting | Run with --verbose | Model assignments visible | Output inspection |
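For SC-002, the cost check can be a small script over the logged token usage. The per-million-token prices below are placeholders; the real numbers come from `MODEL_PRICING`:

```python
# Placeholder per-1M-token prices (USD); real values live in MODEL_PRICING.
PRICING: dict[str, dict[str, float]] = {
    "gemini-3.0-flash": {"input": 0.10, "output": 0.40},
    "o3-mini":          {"input": 1.10, "output": 4.40},
    "claude-sonnet":    {"input": 3.00, "output": 15.00},
}


def review_cost(token_usage: dict[str, dict]) -> float:
    """token_usage maps agent -> {"model": str, "input_tokens": int, "output_tokens": int}."""
    total = 0.0
    for usage in token_usage.values():
        price = PRICING[usage["model"]]
        total += usage["input_tokens"] / 1_000_000 * price["input"]
        total += usage["output_tokens"] / 1_000_000 * price["output"]
    return total
```

The returned figure is compared against the ≤ $0.55 target (SC-002) and the $0.70 warning threshold from the risk table.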
Rollout Plan
Step 1: Feature Flag (Optional)
Add an `--enable-hybrid-models` flag to the CLI, default False initially.
Rationale: Allows users to opt-in while feature stabilizes.
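A sketch of the corresponding `cli.py` options using click; option names follow this plan, while the feature flag wiring and anything beyond the signature are assumptions:

```python
import click


@click.command()
@click.option("--input", "input_path", required=True, help="Resume .qmd file to review.")
@click.option("--gemini-api-key", envvar="GEMINI_API_KEY", default=None)
@click.option("--openai-api-key", envvar="OPENAI_API_KEY", default=None)
@click.option("--anthropic-api-key", envvar="ANTHROPIC_API_KEY", default=None)
@click.option("--model", "model_override", default=None,
              help="Force a single model for every agent (testing).")
@click.option("--enable-hybrid-models", is_flag=True, default=False,
              help="Opt in to per-agent model assignment while the feature stabilizes.")
@click.option("--verbose", is_flag=True, default=False)
def review(input_path, gemini_api_key, openai_api_key, anthropic_api_key,
           model_override, enable_hybrid_models, verbose):
    """Run the multi-agent resume review."""
    ...  # construct clients via LLMClientFactory and invoke ReviewWorkflow
```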
Step 2: Phased Rollout
- Week 1: Test with personal resume (author)
- Week 2: Share with early adopters (document findings)
- Week 3: Make default (remove feature flag)
Step 3: Monitoring
- Log token usage for first 100 reviews
- Track cost distribution across providers
- Monitor for unexpected errors (rate limits, API failures)
Step 4: Documentation Updates
- README: Add “Multi-Model Configuration” section
- Add troubleshooting guide for API key issues
- Document cost optimization benefits
Dependencies & Prerequisites
Python Packages
Add to pyproject.toml (PEP 621 `[project]` table):

```toml
[project]
dependencies = [
    "anthropic>=0.25.0",             # already present
    "google-genai[aiohttp]>=1.0.0",  # new
    "openai>=1.0.0",                 # new
]
```
API Keys
Required for Hybrid Mode:
- `GEMINI_API_KEY` or `GOOGLE_API_KEY`
- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`
Required for Override Mode:
- API key for the provider of the model selected via `--model`
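A sketch of key resolution for hybrid mode, covering the `GEMINI_API_KEY`/`GOOGLE_API_KEY` fallback and a clear error when a key is missing; the helper name and error wording are illustrative:

```python
import os


def resolve_api_key(provider: str, cli_value: str | None = None) -> str:
    """CLI flag wins; otherwise fall back to the documented environment variables."""
    env_names = {
        "gemini": ("GEMINI_API_KEY", "GOOGLE_API_KEY"),
        "openai": ("OPENAI_API_KEY",),
        "anthropic": ("ANTHROPIC_API_KEY",),
    }[provider]

    if cli_value:
        return cli_value
    for name in env_names:
        if value := os.environ.get(name):
            return value
    raise SystemExit(
        f"Missing API key for {provider}: set {' or '.join(env_names)} "
        f"or pass the matching --{provider}-api-key option."
    )
```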
Development Environment
- Python 3.13
- pytest with `pytest-asyncio`
- Access to Gemini, OpenAI, Anthropic APIs (free tier OK for testing)
Next Steps
- ✅ Phase 0 Complete: Research documented in `research.md`
- ✅ Phase 1 Complete: Design artifacts created (`data-model.md`, `contracts/`, `quickstart.md`)
- Phase 2 Next: Run `/speckit.tasks` to generate detailed task breakdown
- Implementation: Follow task order from `tasks.md` (after `/speckit.tasks`)
- Validation: Run success criteria tests (SC-001 through SC-009)
- Review: Get user feedback on cost/speed improvements
- Merge: Integrate into main branch after validation
Implementation Timeline
| Phase | Duration | Deliverable |
|---|---|---|
| Phase 0: Research | ✅ Complete | research.md |
| Phase 1: Design | ✅ Complete | data-model.md, contracts/, quickstart.md |
| Phase 2: Tasks | 1 hour | tasks.md (run /speckit.tasks) |
| Phase 3: LLM Clients | 2 days | llm_client.py, llm_factory.py, model_config.py |
| Phase 4: Agent Integration | 1 day | Modified agents + workflow nodes |
| Phase 5: Full Rewrite | 1 day | New revisor logic |
| Phase 6: CLI & Logging | 1 day | CLI options, token logging |
| Phase 7: Testing | 2 days | Unit + integration tests |
| Phase 8: Validation | 1 day | SC-001 through SC-009 verification |
Total: ~9 working days (1.8 weeks)
Maintenance & Future Enhancements
Maintenance Tasks
- Monitor model deprecations (Gemini, OpenAI, Anthropic)
- Update pricing constants if API costs change
- Test compatibility with new SDK versions
Future Enhancements (Out of Scope for MVP)
- Cost tracking dashboard: Visualize token usage over time
- Automatic fallback: Retry with alternative provider if primary fails
- Model benchmarking: A/B test different model combinations
- Streaming responses: Use `generate_stream_async()` for real-time feedback
- Custom prompts per provider: Fine-tune prompts for each model's strengths
Summary
This plan provides a complete blueprint for implementing multi-model hybrid configuration:
- Architecture: Dependency-injected `BaseLLMClient` with 3 provider implementations
- Configuration: Static agent-to-model mapping with CLI override
- Reliability: Full rewrite eliminates fuzzy matching errors
- Cost: 50% reduction through optimal model selection
- Speed: 70% improvement through faster models
- Testing: Comprehensive unit, integration, and end-to-end tests
- Validation: 9 measurable success criteria
Ready for: Phase 2 task generation (/speckit.tasks)
Artifacts Generated
✅ specs/003-multi-model-hybrid/spec.md (user scenarios, requirements, success criteria)
✅ specs/003-multi-model-hybrid/research.md (API research, model selection, best practices)
✅ specs/003-multi-model-hybrid/data-model.md (entities, relationships, validation rules)
✅ specs/003-multi-model-hybrid/contracts/llm-client-interface.md (interface contract)
✅ specs/003-multi-model-hybrid/quickstart.md (developer onboarding guide)
✅ specs/003-multi-model-hybrid/plan.md (this file - implementation plan)
Next: Run /speckit.tasks to generate tasks.md with detailed implementation steps.