An intelligent proxy agent that sits between your app and LLM providers. It deduplicates, caches, routes, and falls back — so every free token counts.
StretchLLM operates as a local proxy with zero data collection. Every optimization runs on your machine.
Classifies each prompt by complexity and routes it to the cheapest model that can handle it — simple lookups go to small models, complex reasoning to large ones.
Caches responses by embedding similarity, not just exact string match. Paraphrased questions hit the cache too, saving an average of 18% of daily calls.
Tracks token and request-count budgets per provider in real time. Visual dashboard shows remaining capacity and burn rate.
When one provider's quota is exhausted, requests automatically cascade down a priority ladder of alternative models with zero downtime.
Built-in guardrails prevent runaway loops, abuse from downstream clients, and accidental quota burns. Configurable per-client and per-minute caps.
Install with npx or pip, point your app to localhost:4117, and you're running. No cloud account, no billing, no configuration file required.
Each incoming prompt is classified, deduplicated, and dispatched in under 5ms.
Your app sends a standard OpenAI-compatible request to localhost:4117. No SDK changes needed.
A lightweight local classifier (< 2MB) scores the prompt from 1-5.
Embedding-based similarity search in local SQLite (cosine ≥ 0.93). Cache hit → instant response in ~12ms.
Routes to the cheapest model matching the complexity score. If quota is depleted, falls through the ladder automatically.
Streams the response back to your app. Simultaneously writes embedding + response to the cache for future reuse.
Real sample data from a test session showing cache hits across paraphrased queries.
Live dashboard showing remaining free-tier capacity across all configured providers.
When your primary provider runs out, requests seamlessly fall to the next available model. Zero downtime, zero code changes.
Primary — largest free quota, fast inference
Fast fallback — strong reasoning at low cost
Quality fallback — good at nuanced tasks
Efficient reserve — open weights, self-hostable
Last resort — runs locally via Ollama, unlimited
StretchLLM works as a drop-in proxy. No changes to your existing code needed.
One command — pick your ecosystem.
Drop your free-tier keys into the config or env vars.
Change one line — your base URL.
Open the built-in dashboard to watch savings in real time.
Hard limits and circuit breakers keep runaway processes from burning through your free tier in seconds.
Each downstream client is capped at 60 requests/min by default. Configurable per API key.
If error rate exceeds 40% in a 30-second window, the provider is temporarily removed from the ladder for 5 minutes.
Hard daily cap per provider prevents accidental overuse. Resets at midnight UTC.
Identifies repeated identical prompts within a sliding window and auto-blocks after 5 duplicates in 10 seconds.
Every request, routing decision, and cache event is logged locally in JSON Lines format for full traceability.
API keys are stored in OS keychain (macOS/Linux) or Windows Credential Manager. Never written to disk in plaintext.