Open Source · MIT License · v2.4.1

Stretch your free LLM quota by ~30%

An intelligent proxy agent that sits between your app and LLM providers. It deduplicates, caches, routes, and falls back — so every free token counts.

~30%
Quota Savings
4.2k
GitHub Stars
12ms
Avg Cache Latency
6
Providers Supported
Everything you need to maximize free-tier tokens

StretchLLM operates as a local proxy with zero data collection. Every optimization runs on your machine.

🔀

Smart Request Routing

Classifies each prompt by complexity and routes it to the cheapest model that can handle it — simple lookups go to small models, complex reasoning to large ones.

💾

Semantic Cache & Reuse

Caches responses by embedding similarity, not just exact string match. Paraphrased questions hit the cache too, saving an average of 18% of daily calls.

📊

Real-time Quota Gauge

Tracks token and request-count budgets per provider in real time. Visual dashboard shows remaining capacity and burn rate.

🪜

Fallback Model Ladder

When one provider's quota is exhausted, requests automatically cascade down a priority ladder of alternative models with zero downtime.

🛡️

Safety & Rate Limits

Built-in guardrails prevent runaway loops, abuse from downstream clients, and accidental quota burns. Configurable per-client and per-minute caps.

🚀

One-Command Setup

Install with npx or pip, point your app to localhost:4117, and you're running. No cloud account, no billing, no configuration file required.

How requests find the right model

Each incoming prompt is classified, deduplicated, and dispatched in under 5ms.

1

Prompt Intake

Your app sends a standard OpenAI-compatible request to localhost:4117. No SDK changes needed.

2

Complexity Classifier

A lightweight local classifier (under 2 MB) scores each prompt on a scale of 1 to 5.

Score 1-2: simple factual / lookup
Score 3: moderate reasoning
Score 4-5: complex / creative

3

Cache Lookup

Embedding-based similarity search in local SQLite (cosine ≥ 0.93). Cache hit → instant response in ~12ms.

4

Model Dispatch

Routes to the cheapest model matching the complexity score. If quota is depleted, falls through the ladder automatically.

5

Response & Cache Write

Streams the response back to your app. Simultaneously writes embedding + response to the cache for future reuse.
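
The cache lookup in step 3 can be pictured as a nearest-neighbour search over stored embeddings. The sketch below is illustrative only: `CacheEntry`, `cosineSimilarity`, and `lookup` are hypothetical names, the vectors are toy 3-dimensional values rather than real embeddings, and it scans an in-memory array instead of the SQLite store described above. Only the 0.93 threshold comes from this page.

```typescript
// Hypothetical shape of a cached response; not StretchLLM's actual schema.
interface CacheEntry {
  prompt: string;
  embedding: number[];
  response: string;
}

interface CacheMatch {
  entry: CacheEntry;
  score: number;
}

// Threshold quoted in the Cache Lookup step.
const SIMILARITY_THRESHOLD = 0.93;

// Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the best entry at or above the threshold, or null on a cache miss.
function lookup(query: number[], candidates: CacheEntry[]): CacheMatch | null {
  let best: CacheMatch | null = null;
  for (const entry of candidates) {
    const score = cosineSimilarity(query, entry.embedding);
    if (score >= SIMILARITY_THRESHOLD && (best === null || score > best.score)) {
      best = { entry, score };
    }
  }
  return best;
}

// Toy data: the embeddings are made up for illustration.
const sampleEntries: CacheEntry[] = [
  { prompt: "What is the capital of France?", embedding: [0.9, 0.1, 0.0], response: "Paris" },
  { prompt: "Explain quicksort algorithm", embedding: [0.0, 0.2, 0.9], response: "(cached explanation)" },
];

const paraphraseHit = lookup([0.88, 0.15, 0.02], sampleEntries); // close to the first entry
const unrelatedMiss = lookup([0.5, 0.5, 0.5], sampleEntries);    // below 0.93 for both entries
```

A linear scan like this is fine for a few hundred entries; a larger cache would likely need an index, which this sketch does not attempt.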

Semantic deduplication in action

Real sample data from a test session showing cache hits across paraphrased queries.

🗂️ Recent Cache Log

"What is the capital of France?"      HIT
"Capital city of France?"             HIT (0.97)
"Explain quicksort algorithm"         MISS → GPT-4o-mini
"How does quicksort work?"            HIT (0.95)
"Write a haiku about rain"            MISS → Claude Haiku
"Compose a rain-themed haiku"         HIT (0.93)
"Python list comprehension syntax"    HIT
"Prove P ≠ NP"                        MISS → GPT-4o

📈 Session Statistics

62%
Hit Rate
847
Cache Entries
12ms
Avg Hit Latency
4.1MB
DB Size

Token Savings Breakdown

Cache reuse 18.2%
Smart routing 9.6%
Deduplication 3.8%
Total savings: ~31.6%
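
The total above is the plain sum of the three contributions. The short check below reproduces that arithmetic; it assumes the three percentages are measured against the same baseline and do not overlap, which this page does not state explicitly. The object and variable names are illustrative.

```typescript
// Percentages copied from the Token Savings Breakdown above.
const savingsPercent = {
  cacheReuse: 18.2,
  smartRouting: 9.6,
  deduplication: 3.8,
};

// Sum the contributions, then round to one decimal to hide floating-point noise.
const rawTotal =
  savingsPercent.cacheReuse + savingsPercent.smartRouting + savingsPercent.deduplication;
const totalSavingsPercent = Math.round(rawTotal * 10) / 10;
```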
Track every provider in real time

Live dashboard showing remaining free-tier capacity across all configured providers.

OpenAI

Free tier · GPT-4o-mini
65%
Used Remaining

Anthropic

Free tier · Claude 3.5 Haiku
50%
Used Remaining

Google

Free tier · Gemini 2.0 Flash
84%
Used Remaining
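
As a rough illustration of what a gauge like this has to compute, the sketch below derives remaining capacity, percent used, and burn rate from a usage snapshot. `QuotaSnapshot` and `gauge` are hypothetical names, not StretchLLM's API. The 82.5k / 150k figures are taken from the Gemini token budget example on this page; the 11 elapsed hours are an assumption made up for the example.

```typescript
// Hypothetical snapshot of one provider's daily budget.
interface QuotaSnapshot {
  limit: number;        // tokens (or requests) allowed per day
  used: number;         // amount consumed since the last reset
  elapsedHours: number; // hours since the last reset
}

interface GaugeReading {
  remaining: number;
  usedPercent: number;
  burnRatePerHour: number;
  hoursUntilEmpty: number; // Infinity when nothing has been used yet
}

function gauge(s: QuotaSnapshot): GaugeReading {
  const remaining = Math.max(0, s.limit - s.used);
  const usedPercent = s.limit > 0 ? (s.used * 100) / s.limit : 100;
  const burnRatePerHour = s.elapsedHours > 0 ? s.used / s.elapsedHours : 0;
  const hoursUntilEmpty = burnRatePerHour > 0 ? remaining / burnRatePerHour : Infinity;
  return { remaining, usedPercent, burnRatePerHour, hoursUntilEmpty };
}

// 82.5k of 150k tokens used, 11 hours into the day (the 11 hours are assumed).
const reading = gauge({ limit: 150000, used: 82500, elapsedHours: 11 });
```

The projection is linear, so it will be off whenever traffic is bursty; a real dashboard might smooth the burn rate over a shorter window.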
Automatic model cascade on quota exhaustion

When your primary provider runs out, requests seamlessly fall to the next available model. Zero downtime, zero code changes.

1

Gemini 2.0 Flash

Primary — largest free quota, fast inference

Free: 1500 RPD
~380ms
2

GPT-4o-mini

Fast fallback — strong reasoning at low cost

Free: 200 RPD
~420ms
3

Claude 3.5 Haiku

Quality fallback — good at nuanced tasks

Free: 100 RPD
~510ms
4

Mistral Small

Efficient reserve — open weights, self-hostable

Free: 500 RPD
~290ms
5

Llama 3.1 8B (local)

Last resort — runs locally via Ollama, unlimited

Unlimited
~850ms
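
The cascade can be thought of as walking the ladder top to bottom and taking the first rung that still has quota. The sketch below assumes exactly that; `Rung` and `pickModel` are hypothetical names, the daily limits are the RPD figures listed above, and the `usedToday` counts and lowercase model labels are invented for the example.

```typescript
// Hypothetical description of one rung on the fallback ladder.
interface Rung {
  model: string;
  dailyLimit: number; // requests per day; Infinity for the local model
  usedToday: number;  // made-up usage counts for this example
}

const ladder: Rung[] = [
  { model: "gemini-2.0-flash", dailyLimit: 1500, usedToday: 1500 },  // exhausted
  { model: "gpt-4o-mini", dailyLimit: 200, usedToday: 200 },         // exhausted
  { model: "claude-3.5-haiku", dailyLimit: 100, usedToday: 37 },
  { model: "mistral-small", dailyLimit: 500, usedToday: 0 },
  { model: "llama-3.1-8b-local", dailyLimit: Infinity, usedToday: 0 },
];

// Return the highest-priority rung with quota left, or null if none remains.
function pickModel(rungs: Rung[]): Rung | null {
  for (const rung of rungs) {
    if (rung.usedToday < rung.dailyLimit) return rung;
  }
  return null;
}

const chosen = pickModel(ladder); // falls through to the third rung
```

With an unlimited local model on the last rung, `pickModel` only returns null when that rung is removed from the ladder.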
Running in under two minutes

StretchLLM works as a drop-in proxy. No changes to your existing code needed.

01

Install

One command — pick your ecosystem.

# npm
npx stretchllm@latest

# pip
pip install stretchllm && stretchllm
02

Add API Keys

Drop your free-tier keys into the config or env vars.

# .env
OPENAI_API_KEY=sk-free-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
03

Point Your App

Change one line — your base URL.

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:4117/v1",
});
04

Monitor & Tune

Open the built-in dashboard to watch savings in real time.

# Open dashboard
stretchllm dashboard
# Or visit http://localhost:4117/ui
Guardrails that protect your quota

Hard limits and circuit breakers keep runaway processes from burning through your free tier in seconds.

🔒 Per-Client Rate Limit

Each downstream client is capped at 60 requests/min by default. Configurable per API key.

client_web_app: 44/60 RPM
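
One common way to enforce a cap like this is a sliding-window counter per client. The sketch below is a minimal version under that assumption; `ClientRateLimiter` is a hypothetical name, and the page does not say which algorithm StretchLLM actually uses. Only the default of 60 requests per minute comes from the text above.

```typescript
// Sliding-window limiter: at most `maxPerMinute` requests per client in any 60 s span.
class ClientRateLimiter {
  private readonly maxPerMinute: number;
  private readonly timestamps = new Map<string, number[]>();

  constructor(maxPerMinute: number = 60) {
    this.maxPerMinute = maxPerMinute;
  }

  // Returns true and records the request if the client is under its cap.
  allow(clientId: string, nowMs: number): boolean {
    const windowStart = nowMs - 60000;
    const recent = (this.timestamps.get(clientId) ?? []).filter((t) => t > windowStart);
    const permitted = recent.length < this.maxPerMinute;
    if (permitted) recent.push(nowMs);
    this.timestamps.set(clientId, recent);
    return permitted;
  }
}

// 61 requests within one second: the first 60 pass, the last is rejected.
const limiter = new ClientRateLimiter(60);
let allowedCount = 0;
for (let i = 0; i < 61; i++) {
  if (limiter.allow("client_web_app", i * 10)) allowedCount++;
}
```

Timestamps are passed in explicitly so the behaviour can be tested without waiting on a real clock.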

Circuit Breaker

If error rate exceeds 40% in a 30-second window, the provider is temporarily removed from the ladder for 5 minutes.

openai error rate: 4.8% ✓
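
The rule above can be sketched as a small state machine per provider. In the example below, the 40% threshold, 30-second window, and 5-minute cooldown come from the text; the class name `CircuitBreaker` and the minimum sample count are assumptions, the latter added so a single early failure does not count as a 100% error rate.

```typescript
const BREAKER_WINDOW_MS = 30000;          // 30-second window
const BREAKER_COOLDOWN_MS = 5 * 60 * 1000; // 5 minutes off the ladder
const ERROR_THRESHOLD = 0.4;              // trip when error rate exceeds 40%
const MIN_SAMPLES = 5;                    // assumption, not stated on this page

class CircuitBreaker {
  private events: { at: number; ok: boolean }[] = [];
  private openUntil = 0;

  // Record one request outcome and trip the breaker if the window looks unhealthy.
  record(ok: boolean, nowMs: number): void {
    const windowStart = nowMs - BREAKER_WINDOW_MS;
    this.events = this.events.filter((e) => e.at > windowStart);
    this.events.push({ at: nowMs, ok });
    const total = this.events.length;
    const errors = this.events.filter((e) => !e.ok).length;
    if (total >= MIN_SAMPLES && errors / total > ERROR_THRESHOLD) {
      this.openUntil = nowMs + BREAKER_COOLDOWN_MS;
      this.events = []; // start fresh after the cooldown
    }
  }

  // False while the provider is temporarily removed from the ladder.
  isAvailable(nowMs: number): boolean {
    return nowMs >= this.openUntil;
  }
}

const breaker = new CircuitBreaker();
for (let i = 0; i < 6; i++) breaker.record(true, i * 1000);   // 6 successes
for (let i = 6; i < 10; i++) breaker.record(false, i * 1000); // 4 failures: exactly 40%
const availableAtThreshold = breaker.isAvailable(9000);
breaker.record(false, 10000);                                 // 5 of 11 failed: above 40%
const availableAfterTrip = breaker.isAvailable(10001);
const availableAfterCooldown = breaker.isAvailable(10000 + BREAKER_COOLDOWN_MS);
```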

🧱 Daily Token Budget

Hard daily cap per provider prevents accidental overuse. Resets at midnight UTC.

gemini: 82.5k / 150k tokens
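
A hard cap with a midnight UTC reset can be modelled by keying usage on the UTC calendar date. The sketch below is one way to do that; `DailyTokenBudget` and `utcDay` are hypothetical names, and the 150k limit and 82.5k usage are the Gemini figures shown above.

```typescript
// Returns the UTC calendar date, e.g. "2025-01-01".
function utcDay(d: Date): string {
  return d.toISOString().slice(0, 10);
}

class DailyTokenBudget {
  private readonly limit: number;
  private used = 0;
  private day: string;

  constructor(limit: number, now: Date) {
    this.limit = limit;
    this.day = utcDay(now);
  }

  // Spend tokens if the cap allows it; usage resets when the UTC date changes.
  trySpend(tokens: number, now: Date): boolean {
    const today = utcDay(now);
    if (today !== this.day) {
      this.day = today;
      this.used = 0;
    }
    if (this.used + tokens > this.limit) return false;
    this.used += tokens;
    return true;
  }

  remaining(): number {
    return this.limit - this.used;
  }
}

const budget = new DailyTokenBudget(150000, new Date("2025-01-01T10:00:00Z"));
const firstSpend = budget.trySpend(82500, new Date("2025-01-01T10:00:00Z")); // fits
const overSpend = budget.trySpend(70000, new Date("2025-01-01T11:00:00Z"));  // would exceed the cap
const nextDay = budget.trySpend(70000, new Date("2025-01-02T00:00:00Z"));    // new UTC day
```

This version rejects a request that would cross the cap rather than partially serving it; whether the real proxy behaves the same way is not stated here.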

🚫 Loop Detection

Identifies repeated identical prompts within a sliding window and auto-blocks after 5 duplicates in 10 seconds.

0 loops blocked today
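
A sliding window keyed by prompt text is one simple way to implement this check. The sketch below assumes that "after 5 duplicates" means the first five identical prompts in the window pass and the sixth is blocked; that reading, and the name `LoopDetector`, are assumptions rather than documented behaviour.

```typescript
const LOOP_WINDOW_MS = 10000; // 10-second sliding window
const MAX_DUPLICATES = 5;     // identical prompts allowed inside the window

class LoopDetector {
  // Timestamps of recent requests, keyed by exact prompt text.
  private readonly seen = new Map<string, number[]>();

  // Records the prompt and returns true when it should be blocked.
  shouldBlock(prompt: string, nowMs: number): boolean {
    const windowStart = nowMs - LOOP_WINDOW_MS;
    const recent = (this.seen.get(prompt) ?? []).filter((t) => t > windowStart);
    recent.push(nowMs);
    this.seen.set(prompt, recent);
    return recent.length > MAX_DUPLICATES;
  }
}

const detector = new LoopDetector();
const decisions: boolean[] = [];
for (let i = 0; i < 6; i++) {
  decisions.push(detector.shouldBlock("retry this task", i * 1000)); // one per second
}
const blockedLater = detector.shouldBlock("retry this task", 20000); // window has emptied
```

The map grows with every distinct prompt, so a long-running process would also need to evict stale keys.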

📋 Audit Log

Every request, routing decision, and cache event is logged locally in JSON Lines format for full traceability.

{"ts":"14:32:07","action":"route","model":"gemini-2.0-flash","score":2,"cache":"miss","latency_ms":391}
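
Because the log is JSON Lines, each line can be parsed on its own. The sketch below reads the sample line above into a typed record; the `AuditEvent` interface is inferred from that single line, so the real log may contain other fields or event shapes, and `parseAuditLog` is a hypothetical helper.

```typescript
// Field names inferred from the single sample line on this page.
interface AuditEvent {
  ts: string;
  action: string;
  model: string;
  score: number;
  cache: string;
  latency_ms: number;
}

// Parse JSON Lines text: one JSON object per non-empty line.
function parseAuditLog(text: string): AuditEvent[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as AuditEvent);
}

const sampleLog =
  '{"ts":"14:32:07","action":"route","model":"gemini-2.0-flash","score":2,"cache":"miss","latency_ms":391}\n';

const events = parseAuditLog(sampleLog);
const cacheMisses = events.filter((e) => e.cache === "miss");
```

The `as AuditEvent` cast does no runtime validation, so untrusted or evolving log files would warrant an explicit field check.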

🔑 Key Isolation

API keys are stored in the OS keychain (macOS/Linux) or Windows Credential Manager and are never written to disk in plaintext.

✓ 3 keys secured in system keychain