Caliper: Quantitative ML Trading Platform

Overview

Caliper is a private, full-stack quantitative ML trading monorepo. It combines Python services and shared packages for market data, features, backtesting, execution, risk, ML (including probability_model), simulation, evaluation, regime allocation, cross-sectional ranking, and wallet intelligence with a Next.js 14 Model Observatory dashboard. Between January and April 2026, 17 sprints landed in main, through v2.7.0. They progressed from core equity tooling through Polymarket support, unified FeatureSnapshot features, simulation and evaluation, probability modeling, regime detection with HRP allocation, fleet strategies, and on-chain-informed signals.

Design priority is correctness and safety over raw PnL: paper mode by default, strict RiskManager gating, and observability for every automated decision.

Problem & Context

Most retail-facing algo stacks hide risk and model behavior. I wanted an end-to-end system that could:

Ingest and store time-series data efficiently (TimescaleDB hypertables).
Backtest with realistic slippage and commissions, plus walk-forward optimization.
Enforce layered automated risk (kill switch, circuit breaker, limits).
Integrate ML with confidence gating, drift detection, SHAP explainability, and human-in-the-loop approvals.
Extend to prediction-market execution without forking the risk story, by reusing a shared FeatureSnapshot abstraction and the same allocator/risk path.
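
As a rough illustration of the backtest goals above, the sketch below shows one way to generate rolling walk-forward windows and to charge slippage and commission on a fill. It is a minimal standard-library sketch, not Caliper's backtester: the names WalkForwardWindow, walk_forward_windows, fill_price, and net_cash_flow are hypothetical, and the basis-point slippage and per-share commission models are assumptions.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class WalkForwardWindow:
    train_start: int  # inclusive bar index
    train_end: int    # exclusive; the test slice starts here
    test_end: int     # exclusive


def walk_forward_windows(n_bars: int, train_size: int, test_size: int) -> Iterator[WalkForwardWindow]:
    """Yield rolling train/test windows; each test slice follows its train slice in time."""
    if train_size <= 0 or test_size <= 0:
        raise ValueError("train_size and test_size must be positive")
    start = 0
    while start + train_size + test_size <= n_bars:
        train_end = start + train_size
        yield WalkForwardWindow(start, train_end, train_end + test_size)
        start += test_size  # roll forward by one test block


def fill_price(side: str, mid: float, slippage_bps: float) -> float:
    """Assumed slippage model: buys fill above mid, sells fill below, by a fixed number of bps."""
    if side not in ("buy", "sell"):
        raise ValueError("side must be 'buy' or 'sell'")
    sign = 1.0 if side == "buy" else -1.0
    return mid * (1.0 + sign * slippage_bps / 10_000.0)


def net_cash_flow(side: str, qty: int, mid: float, slippage_bps: float, commission_per_share: float) -> float:
    """Cash change from one fill: negative for buys, positive for sells, after commission."""
    gross = qty * fill_price(side, mid, slippage_bps)
    commission = qty * commission_per_share
    return -(gross + commission) if side == "buy" else gross - commission
```

With 10 bars, a train size of 4, and a test size of 2, this yields three windows. Parameters tuned on each train slice are only ever scored on the test slice that follows it.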

Constraints

Paper trading by default; live mode requires explicit env validation.
No secrets in git: only .env.example is committed; real keys go through a Doppler-style workflow.
All orders go through RiskManager; nothing bypasses the kill switch or circuit breaker.
Python 3.11: some TA libraries target 3.12+, so indicators are implemented with pandas/numpy where needed.
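
The first and third constraints can be sketched as a single gate that every order passes through, plus a mode resolver that falls back to paper. This is an illustrative sketch only: RiskGate, Order, Decision, and resolve_mode are hypothetical names, the environment variables CALIPER_MODE and CALIPER_LIVE_CONFIRMED are invented for the example, and the real RiskManager's checks and thresholds are not described here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Mapping, Optional


class Decision(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass(frozen=True)
class Order:
    symbol: str
    qty: int
    price: float

    @property
    def notional(self) -> float:
        return abs(self.qty) * self.price


@dataclass
class RiskGate:
    max_order_notional: float
    max_daily_loss: float       # circuit-breaker threshold, as a positive number
    kill_switch: bool = False   # manual hard stop
    daily_pnl: float = 0.0
    audit_log: list = field(default_factory=list)

    def check(self, order: Order) -> Decision:
        """Apply the layers in order: kill switch, circuit breaker, then per-order limit."""
        reason: Optional[str] = None
        if self.kill_switch:
            reason = "kill switch engaged"
        elif self.daily_pnl <= -self.max_daily_loss:
            reason = "circuit breaker tripped"
        elif order.notional > self.max_order_notional:
            reason = "order notional above limit"
        decision = Decision.REJECTED if reason else Decision.APPROVED
        # Record every decision, approved or not, so it can be inspected later.
        self.audit_log.append((order.symbol, decision.value, reason))
        return decision


def resolve_mode(env: Mapping[str, str]) -> str:
    """Return 'paper' unless live mode is requested and explicitly confirmed."""
    if env.get("CALIPER_MODE", "paper") != "live":
        return "paper"
    if env.get("CALIPER_LIVE_CONFIRMED") != "yes":
        raise RuntimeError("live mode requested without explicit confirmation")
    return "live"
```

Because the audit log records rejections alongside approvals, a blocked order leaves the same trail as an executed one.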

Approach & Design Decisions

Monorepo (Python + Next.js): atomic schema and consumer changes; one Docker Compose for API + Timescale + Redis.
TimescaleDB for bars, pm.features, simulation/evaluation tables, and probability predictions (Alembic migrations through revision 005).
BFF pattern: dashboard calls FastAPI; Vercel rewrites keep the backend URL off the client.
Adapter execution: BrokerClient → AlpacaClient; Polymarket path uses session orchestration + PolymarketMMStrategy.
ML safety first: drift detection (PSI, KL divergence, mean shift), ABSTAIN outputs, baselines/regret, and HITL approval before trusting production models.
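
To make the drift-plus-ABSTAIN idea concrete, here is a minimal sketch of a Population Stability Index (PSI) check feeding a confidence gate. It uses only the standard library. The function names psi and gate_prediction, the equal-width binning over the baseline's range, and the default thresholds (0.6 confidence, 0.25 PSI, the latter a common rule of thumb) are assumptions for illustration, not values taken from Caliper.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `actual` against the `expected` baseline sample."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("baseline has no spread; bins are undefined")
    width = (hi - lo) / n_bins

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range are clamped into the edge bins.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor empty bins at eps so the log ratio stays finite.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def gate_prediction(confidence: float, drift_psi: float,
                    min_confidence: float = 0.6, max_psi: float = 0.25) -> str:
    """Return 'ABSTAIN' when inputs have drifted or the model is unsure, else 'ACT'."""
    if drift_psi > max_psi or confidence < min_confidence:
        return "ABSTAIN"
    return "ACT"
```

An identical sample scores a PSI of exactly zero, while a sample shifted outside the baseline range scores far above the threshold and forces an abstain regardless of model confidence.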

Implementation Highlights

Equities: DataProvider → PriceBar feature pipeline; event-driven backtest; OMS with client_order_id idempotency.
Polymarket (Sprint 10): Gamma/CLOB clients, fee engine, session orchestrator, quoting strategy, DB schema for orders/trades.
Sprints 11–12: UnifiedSignal, FeatureSnapshot (four feature families), CLOBSource + BinanceSource, FeatureBuilder + FeatureStore, GET /v1/features/{market_id}/latest|history.
Sprint 13: SimulatedOrderBook, ExecutionSimulator, FeeEngine, AdverseSelectionModel, ReplayEngine, SimulationRunner, SimulationValidator, evaluation metrics + regime matrix + baselines; /v1/simulation/* and /v1/evaluation/* (some responses still stub-backed until full DB wiring).
Sprint 14: probability_model with calibration and lead-lag tests, exposed at /v1/probability/* (AC-9 test wiring still open per project status).
Sprints 15–16: regime detection + HRP allocator (/v1/regime/*, /v1/allocation/*); cross-sectional 5-factor ranker, cooldown selection, four paper fleet strategies; dashboard overhaul.
Sprint 17: reward density, wallet intelligence (KMeans k=4), smart-money signals, composite aggregation with weight learning.
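
The client_order_id idempotency mentioned for the equities OMS can be sketched as follows. This is a hypothetical in-memory illustration: OrderIntent, client_order_id, and InMemoryOMS are invented names, and deriving the ID from a hash of the order intent is one possible scheme, not necessarily the one Caliper uses.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class OrderIntent:
    strategy: str
    symbol: str
    side: str
    qty: int
    bar_timestamp: str  # ISO-8601 timestamp of the bar that produced the signal


def client_order_id(intent: OrderIntent) -> str:
    """Derive a deterministic ID so that a retried intent maps to the same order."""
    raw = "|".join([intent.strategy, intent.symbol, intent.side,
                    str(intent.qty), intent.bar_timestamp])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:32]


class InMemoryOMS:
    """Toy order store that forwards each distinct client_order_id at most once."""

    def __init__(self) -> None:
        self._orders: Dict[str, OrderIntent] = {}
        self.broker_submissions = 0  # stands in for calls to a real broker client

    def submit(self, intent: OrderIntent) -> Tuple[str, bool]:
        """Return (client_order_id, accepted); a duplicate is acknowledged but not re-sent."""
        coid = client_order_id(intent)
        if coid in self._orders:
            return coid, False
        self._orders[coid] = intent
        self.broker_submissions += 1
        return coid, True
```

Submitting the same intent twice, for example after a timeout and retry, returns the same ID both times and reaches the broker only once.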

Results & Evaluation

17 sprints shipped through v2.7.0; 550+ pytest tests in repo (per workflow-core portfolio extraction). SMA crossover backtest math verified on sample AAPL bars.
Polymarket bot: session orchestrator and quoting implemented; extended paper PnL validation still on the roadmap (no fabricated production metrics).
Simulation + evaluation: determinism, fill-rate, and regime test criteria exercised; some API responses still stub-backed until full DB integration.
Probability stack: library + migration + router merged; AC-9 and live DB reads called out as remaining work in source docs.
Roadmap: further live/paper validation and out-of-sample ML metrics depend on training runs and are not claimed here.
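
For readers unfamiliar with the SMA crossover check mentioned in the first result, the sketch below shows the kind of signal math involved. It runs on a short synthetic price list, not AAPL data, and uses plain Python where the project would use pandas/numpy. The function names sma and crossover_signals and the 2/3-bar windows are assumptions chosen to keep the example hand-checkable.

```python
from typing import List, Optional


def sma(values: List[float], window: int) -> List[Optional[float]]:
    """Simple moving average; None until a full window of values is available."""
    out: List[Optional[float]] = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        out.append(running / window if i >= window - 1 else None)
    return out


def crossover_signals(prices: List[float], fast: int, slow: int) -> List[int]:
    """+1 on the bar where the fast SMA crosses above the slow SMA, -1 on a cross below, else 0."""
    f, s = sma(prices, fast), sma(prices, slow)
    signals = [0] * len(prices)
    for i in range(1, len(prices)):
        f_prev, s_prev, f_now, s_now = f[i - 1], s[i - 1], f[i], s[i]
        if f_prev is None or s_prev is None or f_now is None or s_now is None:
            continue  # not enough history for both averages yet
        prev_diff = f_prev - s_prev
        diff = f_now - s_now
        if prev_diff <= 0 < diff:
            signals[i] = 1
        elif prev_diff >= 0 > diff:
            signals[i] = -1
    return signals


# Synthetic prices: a dip, a recovery, then a decline.
prices = [10.0, 9.0, 8.0, 7.0, 8.0, 9.0, 10.0, 11.0, 10.0, 9.0, 8.0, 7.0]
signals = crossover_signals(prices, fast=2, slow=3)
```

On this series the fast average crosses above the slow one at index 5 and back below at index 9, which is easy to confirm by hand.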

Tradeoffs & Limitations

Simulation/evaluation/probability APIs: some routes still return stub or mock data until persisted runs and reports are read in full from pm.* tables.
Sprint 14 AC-9 (probability module test suite) has not landed, per the quant ticket index.
No CI/CD in repo at last extraction; tests run locally.
Dashboard uses polling, not WebSockets.
Repo and detailed metrics stay private until a deliberate public-safe review.

Notes / Redactions

Private project: no live credentials, no real-money results, and no fabricated performance numbers in this case study.