Last updated: April 2026. This document reflects the post-refactor architecture. The original design document is preserved in `docs/EXPERIMENTAL_DETAILS.md`.
AgentQuant is a regime-adaptive quantitative research platform. The v2.0 refactor transforms it from a stateless parameter-tuning script into a defensible research tool with:
- A real ReAct agent loop (analyze → hypothesize → backtest → reflect → store)
- Scientific comparability: LLM proposals are constrained to a canonical `ParameterGrid`, making LLM vs. grid-search comparisons valid
- Statistical regime detection: VIX percentile (relative) instead of hardcoded absolute thresholds
- Look-ahead bias guards enforced at the backtest engine level
- Cross-session memory via SQLite so the agent can learn across runs
- 42 unit tests and a GitHub Actions CI gate
The agent runs a bounded ReAct loop with 5 explicit typed nodes:

```
analyze ──► hypothesize ──► backtest ──► reflect
                ▲                           │
                └──────── (retry ≤ N) ◄─────┤
                                            │
                                            ▼
                                          store
```
```python
class AgentState(TypedDict, total=False):
    ohlcv_data: Dict[str, pd.DataFrame]
    features_df: pd.DataFrame
    context: Optional[RegimeContext]
    proposals: List[Proposal]
    results: List[Dict]
    best_result: Optional[Dict]
    iteration: int
    max_iterations: int
    strategy_type: str
    asset: str
    should_continue: bool
    memory_context: str
    run_log: List[str]
```

Nodes:
| Node | Purpose |
|---|---|
| `analyze` | Build `RegimeContext` from features + regime detector |
| `hypothesize` | Generate proposals via `ProposalGenerator` |
| `backtest` | Run tournament: all proposals → rank by Sharpe |
| `reflect` | Accept if Sharpe ≥ `min_acceptable_sharpe`, else retry |
| `store` | Persist best result to `StrategyMemory` SQLite |
The loop retries at most `config.agent.max_iterations` times (default 3). Once the iteration budget is exhausted, it accepts the best result found regardless of the threshold.
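The accept-or-retry control flow can be sketched as follows; `run_tournament` and the result dicts are hypothetical stand-ins for the graph nodes, not the actual implementation:

```python
from typing import Callable, Dict, List, Optional


def run_agent_loop(
    run_tournament: Callable[[], List[Dict]],
    min_acceptable_sharpe: float = 0.5,
    max_iterations: int = 3,
) -> Optional[Dict]:
    """Retry the hypothesize -> backtest cycle until a proposal clears
    the Sharpe threshold or the iteration budget is spent."""
    best: Optional[Dict] = None
    for _ in range(max_iterations):
        results = run_tournament()
        top = max(results, key=lambda r: r["sharpe"])
        if best is None or top["sharpe"] > best["sharpe"]:
            best = top
        if top["sharpe"] >= min_acceptable_sharpe:
            return best  # accepted: passes the reflect gate
    return best  # budget exhausted: accept best-so-far regardless
```

Tracking best-so-far separately from the accept gate is what makes the "accept on max iteration" behavior safe even when every iteration misses the threshold.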
```python
class BasePlanner(ABC):
    def generate_proposals(self, prompt: str, n: int) -> List[Dict]: ...
    def is_available(self) -> bool: ...
```

Concrete implementations:
- `GeminiPlanner` — `google-generativeai` SDK
- `LangChainPlanner` — `langchain-google-genai`
- `OpenAIPlanner` — `openai`
- `FallbackPlanner` — returns `[]`, triggering grid search
create_planner() factory checks credential availability in order and returns the first viable planner.
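An illustrative sketch of that credential-check ordering; the planner class here is a stand-in for the concrete implementations above, and the env-var names are assumptions:

```python
import os
from typing import Dict, List


class FallbackPlanner:
    """Always available; returns no proposals, triggering grid search."""

    def is_available(self) -> bool:
        return True

    def generate_proposals(self, prompt: str, n: int) -> List[Dict]:
        return []


def create_planner(candidates=None):
    """Return the first planner whose credentials are present."""
    # Each entry pairs a required env var with a planner factory.
    candidates = candidates or [
        ("GOOGLE_API_KEY", FallbackPlanner),  # stand-in for GeminiPlanner
        ("OPENAI_API_KEY", FallbackPlanner),  # stand-in for OpenAIPlanner
    ]
    for env_var, factory in candidates:
        if os.environ.get(env_var):
            return factory()
    return FallbackPlanner()  # guaranteed-available terminal fallback
```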
A single entrypoint replacing four legacy planner files.
Fallback chain:
1. LLM (via `BasePlanner`) → validates JSON response
2. Grid search with regime-aware prior (shorter windows in Crisis/Bear, longer in LowVol/Bull)
3. Random sample from the canonical grid
All proposals are constrained to the `ParameterGrid` — the LLM selects from the grid rather than generating free-form integers. This makes A/B comparison against grid-search baselines statistically valid.
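A minimal sketch of that constraint, using an illustrative grid (the project's actual `ParameterGrid` values are not shown here):

```python
from typing import Any, Dict

# Illustrative canonical grid; not the project's real values.
PARAMETER_GRID: Dict[str, list] = {
    "fast_window": [5, 10, 21],
    "slow_window": [50, 100, 200],
}


def validate_proposal(params: Dict[str, Any]) -> bool:
    """A proposal is valid only if every parameter value comes from the
    grid, so LLM and grid-search runs draw from the same search space."""
    return all(
        key in PARAMETER_GRID and value in PARAMETER_GRID[key]
        for key, value in params.items()
    )
```

Rejecting off-grid values (rather than rounding them to the nearest grid point) keeps the comparison clean: any accepted LLM proposal is also reachable by the grid-search baseline.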
```python
@dataclass
class Proposal:
    params: Dict[str, Any]
    confidence: float                # LLM self-assessed
    regime_characteristic_used: str
    reasoning: str
    generation_method: str           # "llm" | "grid_search" | "random"
```

SQLite-backed memory providing cross-session learning:
```sql
CREATE TABLE strategy_runs (
    run_id TEXT PRIMARY KEY,
    timestamp TEXT,
    regime TEXT,
    strategy_type TEXT,
    params TEXT,            -- JSON
    sharpe REAL,
    total_return REAL,
    max_drawdown REAL,
    confidence REAL,
    generation_method TEXT,
    reasoning TEXT
)
```

`StrategyMemory.to_prompt_context(regime)` injects past results into the next LLM prompt so the agent can avoid repeating failures.
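A hypothetical sketch of how such prompt context could be summarized from the schema above; the query shape and formatting are assumptions, not the real `StrategyMemory` internals:

```python
import sqlite3


def to_prompt_context(conn: sqlite3.Connection, regime: str, limit: int = 5) -> str:
    """Render the top past runs for a regime as a prompt snippet."""
    rows = conn.execute(
        "SELECT strategy_type, params, sharpe FROM strategy_runs "
        "WHERE regime = ? ORDER BY sharpe DESC LIMIT ?",
        (regime, limit),
    ).fetchall()
    lines = [f"- {stype} {params}: Sharpe {sharpe:.2f}" for stype, params, sharpe in rows]
    # Empty string when the regime has no history, so the prompt stays clean.
    return "Past results in this regime:\n" + "\n".join(lines) if lines else ""
```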
Old approach (removed):

```python
if vix > 30: return "High Volatility"
if vix > 20: return "Medium Volatility"
```

This breaks when VIX structurally shifts (e.g., the post-2020 baseline is higher).
New approach:
```python
vix_percentile = percentileofscore(vix_history_252d, current_vix)
# Crisis: >85th | HighVol: >65th | MidVol: >35th | LowVol: <35th
```

Regime labels combine `{VolRegime}-{TrendRegime}`: e.g., `LowVol-Bull`, `Crisis-Bear`, `HighVol-Neutral`.

- Vol regimes: based on VIX percentile vs. the trailing 252d history
- Trend regimes: based on 63-day momentum: Bull (>5%), Bear (<-5%), Neutral otherwise
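A self-contained sketch of the combined label, with a pure-Python percentile standing in for `scipy.stats.percentileofscore`; the thresholds follow the doc, the function names are illustrative:

```python
from typing import List


def vol_regime(vix_history: List[float], current_vix: float) -> str:
    """Percentile of current VIX vs. the trailing window, bucketed."""
    pct = 100.0 * sum(v <= current_vix for v in vix_history) / len(vix_history)
    if pct > 85:
        return "Crisis"
    if pct > 65:
        return "HighVol"
    if pct > 35:
        return "MidVol"
    return "LowVol"


def trend_regime(momentum_63d: float) -> str:
    """63-day momentum bucketed at +/-5%."""
    if momentum_63d > 0.05:
        return "Bull"
    if momentum_63d < -0.05:
        return "Bear"
    return "Neutral"


def regime_label(vix_history: List[float], current_vix: float, momentum_63d: float) -> str:
    return f"{vol_regime(vix_history, current_vix)}-{trend_regime(momentum_63d)}"
```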
```python
vol_confidence = 2 * abs(vix_percentile / 100 - 0.5)  # distance from 50th pct
mom_confidence = min(abs(momentum_63d) / 0.10, 1.0)   # normalized abs momentum
regime_confidence = (vol_confidence + mom_confidence) / 2
```

If `hmmlearn` is installed, `_try_hmm_regime()` fits a 3-state Gaussian HMM on (returns, realized_vol) features and labels states by volatility rank. It is used informatively; the rule-based label takes precedence.
Features computed (all single-level string columns after flattening):
| Feature | Details |
|---|---|
| `volatility_5d/21d/63d` | Realized vol, annualized (√252 scale) |
| `momentum_21d/63d/252d` | `close.pct_change(N)` |
| `sma_21/50/63/200` | Simple moving averages |
| `price_vs_sma63/200` | `(close/sma) - 1` |
| `rsi_14` | Wilder's smoothing RSI, bounded [0, 100] |
| `macd/macd_signal/macd_hist` | EMA(12, 26, 9) standard |
| `bb_upper/lower/width/pct_b` | Bollinger Bands (20, 2σ) |
| `atr_14` | Average True Range, Wilder's EWM |
| `drawdown_from_peak` | Rolling 252d drawdown |
| `vix_close` | Forward-filled from ^VIX data |
Stationarity check: ADF test (via statsmodels) on momentum_63d, volatility_21d, rsi_14 — logs a warning if non-stationary. Informational; does not block execution.
```python
@enforce_lookback(min_periods=200)
def compute_sma200(close): ...

enforcer = WarmupEnforcer(min_warmup_periods=252)
enforcer.check(df, eval_start)  # raises InsufficientWarmupError if violated
```

All strategies produce a `pd.Series` of {-1, 0, 1} via the `Strategy` ABC. The runner applies:
```
strategy_return[t] = daily_return[t] × signal[t-1] − transaction_cost[t]
```
Transaction costs:
```python
cost = commission + slippage + (market_impact_bps / 10000)
```

Applied proportionally to `signal.diff().abs()` (trade turnover).
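The return arithmetic above can be sketched in pandas; the helper name and the cost constant are illustrative, not the runner's actual API:

```python
import pandas as pd


def strategy_returns(
    daily_return: pd.Series,
    signal: pd.Series,
    cost_per_unit_turnover: float = 0.001,
) -> pd.Series:
    """Yesterday's signal times today's return, minus costs on turnover.

    Using signal.shift(1) is the look-ahead guard: the position held on
    day t was decided with data available through day t-1.
    """
    turnover = signal.diff().abs().fillna(0.0)
    position = signal.shift(1).fillna(0.0)
    return daily_return * position - cost_per_unit_turnover * turnover
```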
Warmup enforcement: if `eval_start` is provided, the runner calls `WarmupEnforcer.check()` using `slow_window` (or `window`) as `min_window`, raising `InsufficientWarmupError` rather than silently using stale indicators.
Single source of truth — replaces all inline Sharpe calculations:
```python
PerformanceMetrics.sharpe(returns)                          # annualized
PerformanceMetrics.max_drawdown(equity)                     # positive fraction
PerformanceMetrics.calmar(equity)                           # ann_ret / max_dd
PerformanceMetrics.sortino(returns)                         # downside deviation
PerformanceMetrics.bootstrap_sharpe(returns, n=200, pct=5)  # p5 bootstrap
PerformanceMetrics.from_equity(equity)                      # → dict of all metrics
```

All strategies implement `Strategy(ABC)` with:
- `generate_signal(df, params) → pd.Series` — values in `{-1, 0, 1}`
- `param_schema → Dict` — JSON-serializable schema for validation
| Class | Strategy |
|---|---|
| `MomentumStrategy` | Dual MA crossover |
| `MeanReversionStrategy` | Bollinger Band reversion |
| `VolatilityStrategy` | Go long when realized vol < threshold |
| `TrendFollowingStrategy` | Triple MA (short > mid > long) |
| `BreakoutStrategy` | Price breaks rolling high/low + threshold |
| `RegimeBasedStrategy` | Dispatches to momentum/mean-reversion/vol by regime |
Registry: `STRATEGY_REGISTRY: Dict[str, Strategy]`
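As a concrete illustration of the signal contract, a minimal dual-MA crossover in the spirit of `MomentumStrategy`; the function name and window defaults are assumptions:

```python
import pandas as pd


def momentum_signal(close: pd.Series, fast: int = 21, slow: int = 50) -> pd.Series:
    """Long when the fast MA is above the slow MA, short when below.

    Returns a pd.Series of {-1, 0, 1}, matching the Strategy ABC contract.
    """
    fast_ma = close.rolling(fast).mean()
    slow_ma = close.rolling(slow).mean()
    signal = pd.Series(0, index=close.index)
    signal[fast_ma > slow_ma] = 1
    signal[fast_ma < slow_ma] = -1
    signal[slow_ma.isna()] = 0  # flat during warmup while the slow MA forms
    return signal
```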
Legacy code using `calculate_momentum_signal()` or `run_multi_asset_strategy()` from `multi_strategy.py` continues to work via thin shims.
Pydantic v2 AppConfig with field-level validation:
```python
cfg = load_config()                 # returns AppConfig
cfg["reference_asset"]              # dict-like access (backward compat)
cfg.backtest.min_warmup_periods     # typed attribute access
cfg.get_strategy("momentum")        # → StrategyConfig with param grid
```

Invalid config (bad log level, unknown LLM provider) raises `ValidationError` at import time — fails fast instead of silently using defaults.
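A minimal sketch of this fail-fast pattern with Pydantic v2; the field names mirror the doc, while the allowed log levels and defaults are assumptions:

```python
from pydantic import BaseModel, field_validator


class BacktestConfig(BaseModel):
    min_warmup_periods: int = 252


class AppConfig(BaseModel):
    log_level: str = "INFO"
    backtest: BacktestConfig = BacktestConfig()

    @field_validator("log_level")
    @classmethod
    def _valid_level(cls, v: str) -> str:
        # Reject unknown levels at construction time instead of at runtime.
        if v.upper() not in {"DEBUG", "INFO", "WARNING", "ERROR"}:
            raise ValueError(f"invalid log level: {v}")
        return v.upper()
```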
TTL-based cache invalidation:
```python
def _is_cache_valid(file_path) -> bool:
    age_hours = (now - file_mtime).total_seconds() / 3600
    return age_hours <= config.cache.ttl_hours  # default 24h
```

All `print()` calls are replaced with `logger`. The FRED fetch is fully optional and is gracefully skipped if `FRED_API_KEY` is absent.
```python
run = ResultsStore.make_run("walk_forward", windows=[...], notes="v2 run")
store = ResultsStore()
run_id = store.save_run(run)
```

Each record stores: `run_id`, `timestamp`, `git_hash`, `config_snapshot` (JSON), `windows` (JSON), `aggregate_metrics`.
| File | Coverage |
|---|---|
| `test_config.py` | Pydantic validation, invalid log level, invalid provider, dict access |
| `test_metrics.py` | Sharpe (flat, known), drawdown (zero, known), `from_equity` keys, zero-signal flat equity |
| `test_regime.py` | LowVol-Bull, Crisis-Bear, confidence score, no-VIX fallback, VIX spike |
| `test_features.py` | Expected columns, RSI bounds, momentum accuracy, no-VIX, no-NaN |
| `test_strategies.py` | All 6 strategies registered, valid signals, backward compat, invalid strategy error |
| `test_backtest.py` | Returns result dict, invalid strategy raises, zero-signal flat equity, metrics keys |
| `test_proposal_generator.py` | Fallback without API key, valid params, correct count |
Run: `pytest tests/ -v`
- Matrix: Python 3.10, 3.11, 3.12
- Steps: `pip install -e ".[dev]"` → `ruff check` → `pytest tests/`
- Security: checks that `.env` is not tracked by git
| What was removed | Why | Replacement |
|---|---|---|
| `src/agent/planner.py` | Hardcoded Gemini, one-shot | `base_planner.py` + `proposal_generator.py` |
| `src/agent/langchain_planner.py` | Inconsistent, no fallback | `proposal_generator.py` |
| `src/agent/langchain_planner_new.py` | Dead code | Deleted |
| `src/agent/simple_planner.py` | Disconnected | Merged into fallback chain |
| `config.json` | Duplicate config | Deleted; `config.yaml` is canonical |
| `new_env.ini` | Duplicate env template | Deleted; `.env.example` is canonical |
| `quick_test.py` | Root-level test script | Moved to `tests/` |
| `requirements.txt` | Duplicate dep list | `pyproject.toml` with optional groups |
- Walk-forward with `ResultsStore`: update `experiments/walk_forward.py` to write each window to `ResultsStore` with git hash for full reproducibility
- Mann-Whitney U ablation: compare LLM-generated proposals vs. grid-search baselines using a statistical test over ≥30 windows
- Bootstrap Sharpe tournament: rank proposals by `bootstrap_sharpe_p5` instead of raw Sharpe to penalize lucky parameter sets
- Alpaca data source: add as an optional alternative to yfinance (avoids adjusted-price inconsistencies)
- LangGraph native StateGraph: replace the pure-Python loop in `agent_graph.py` with `langgraph.StateGraph` for conditional-edge visualization and checkpointing
For usage details, see `README.md`. For experiment results, see `docs/EXPERIMENTAL_DETAILS.md`.