⚡ Open Source · MIT License · v2.4.1

Stretch your free LLM quota by ~30%

An intelligent proxy agent that sits between your app and LLM providers. It deduplicates, caches, routes, and falls back — so every free token counts.

~30%

Quota Savings

4.2k

GitHub Stars

12ms

Avg Cache Latency

Providers Supported

Core Features

Everything you need to maximize free-tier tokens

StretchLLM operates as a local proxy with zero data collection. Every optimization runs on your machine.

🔀

Smart Request Routing

Classifies each prompt by complexity and routes it to the cheapest model that can handle it — simple lookups go to small models, complex reasoning to large ones.

💾

Semantic Cache & Reuse

Caches responses by embedding similarity, not just exact string match. Paraphrased questions hit the cache too, saving an average of 18% of daily calls.

📊

Real-time Quota Gauge

Tracks token and request-count budgets per provider in real time. Visual dashboard shows remaining capacity and burn rate.

🪜

Fallback Model Ladder

When one provider's quota is exhausted, requests automatically cascade down a priority ladder of alternative models with zero downtime.

🛡️

Safety & Rate Limits

Built-in guardrails prevent runaway loops, abuse from downstream clients, and accidental quota burns. Configurable per-client and per-minute caps.

🚀

One-Command Setup

Install with npx or pip, point your app to localhost:4117, and you're running. No cloud account, no billing, and no configuration file required; API keys can be supplied through environment variables.

Routing Engine

How requests find the right model

Each incoming prompt is classified, deduplicated, and dispatched in under 5ms.

Prompt Intake

Your app sends a standard OpenAI-compatible request to localhost:4117. No SDK changes needed.

Complexity Classifier

A lightweight local classifier (< 2MB) scores the prompt from 1-5.

1-2: Simple factual / lookup
3: Moderate reasoning
4-5: Complex / creative

Cache Lookup

Embedding-based similarity search in local SQLite (cosine ≥ 0.93). Cache hit → instant response in ~12ms.
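The lookup step can be sketched roughly as follows. This is a minimal illustration, not StretchLLM's actual code: the names `CacheEntry`, `cosineSimilarity`, and `findCacheHit` are hypothetical, an in-memory array stands in for the SQLite store, and embeddings are assumed to be precomputed number arrays. Only the 0.93 threshold comes from the description above.

```typescript
// Hypothetical cache entry: an embedding vector plus the stored response.
interface CacheEntry {
  embedding: number[];
  response: string;
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the most similar entry at or above the threshold, or null on a miss.
function findCacheHit(
  query: number[],
  entries: CacheEntry[],
  threshold = 0.93,
): { entry: CacheEntry; score: number } | null {
  let best: { entry: CacheEntry; score: number } | null = null;
  for (const entry of entries) {
    const score = cosineSimilarity(query, entry.embedding);
    if (score >= threshold && (best === null || score > best.score)) {
      best = { entry, score };
    }
  }
  return best;
}
```

A linear scan like this is fine for a few hundred entries; a larger cache would likely want an index, which is outside the scope of this sketch.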

Model Dispatch

Routes to the cheapest model matching the complexity score. If quota is depleted, falls through the ladder automatically.
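One way to picture the dispatch rule is the sketch below. It assumes each model carries a maximum complexity score it is trusted with, a relative cost, and a remaining request count; the type `ModelOption`, its fields, and `pickModel` are hypothetical names chosen for illustration, and the real selection logic may differ.

```typescript
// Hypothetical description of one routable model.
interface ModelOption {
  name: string;
  maxScore: number; // highest complexity score (1-5) this model is trusted with
  relativeCost: number; // lower is cheaper; units are arbitrary
  remainingRequests: number; // quota left for the current period
}

// Pick the cheapest model that can handle the score and still has quota.
// Returns null when no model qualifies.
function pickModel(score: number, models: ModelOption[]): ModelOption | null {
  const candidates = models
    .filter((m) => m.maxScore >= score && m.remainingRequests > 0)
    .sort((a, b) => a.relativeCost - b.relativeCost);
  return candidates[0] ?? null;
}
```

Because depleted models are filtered out before sorting, a request that would normally go to a cheap model falls through to the next cheapest one that still has quota.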

Response & Cache Write

Streams the response back to your app. Simultaneously writes embedding + response to the cache for future reuse.

Cache & Reuse

Semantic deduplication in action

Real sample data from a test session showing cache hits across paraphrased queries.

🗂️ Recent Cache Log

"What is the capital of France?" HIT

"Capital city of France?" HIT (0.97)

"Explain quicksort algorithm" MISS → GPT-4o-mini

"How does quicksort work?" HIT (0.95)

"Write a haiku about rain" MISS → Claude Haiku

"Compose a rain-themed haiku" HIT (0.93)

"Python list comprehension syntax" HIT

"Prove P ≠ NP" MISS → GPT-4o

📈 Session Statistics

62%

Hit Rate

847

Cache Entries

12ms

Avg Hit Latency

4.1MB

DB Size

Token Savings Breakdown

Cache reuse 18.2%

Smart routing 9.6%

Deduplication 3.8%

Total savings: ~31.6%

Quota Gauge

Track every provider in real time

Live dashboard showing remaining free-tier capacity across all configured providers.

Gauge legend: Used / Remaining

OpenAI

Free tier · GPT-4o-mini

65%

Anthropic

Free tier · Claude 3.5 Haiku

50%

Google

Free tier · Gemini 2.0 Flash

84%

Fallback Ladder

Automatic model cascade on quota exhaustion

When your primary provider runs out, requests seamlessly fall to the next available model. Zero downtime, zero code changes.

Gemini 2.0 Flash

Primary — largest free quota, fast inference

Free: 1500 RPD

~380ms

GPT-4o-mini

Fast fallback — strong reasoning at low cost

Free: 200 RPD

~420ms

Claude 3.5 Haiku

Quality fallback — good at nuanced tasks

Free: 100 RPD

~510ms

Mistral Small

Efficient reserve — open weights, self-hostable

Free: 500 RPD

~290ms

Llama 3.1 8B (local)

Last resort — runs locally via Ollama, unlimited

Unlimited

~850ms
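The cascade can be sketched as a walk down an ordered list. The snippet below assumes the listing order above is the priority order and that quotas are counted in requests per day; the `Rung` type and `claimRequest` function are hypothetical names, not part of StretchLLM's API.

```typescript
// One rung of the ladder, with its daily request quota.
interface Rung {
  model: string;
  dailyLimit: number; // requests per day; Infinity for the local model
  usedToday: number;
}

// Quotas taken from the ladder listed above.
const ladder: Rung[] = [
  { model: "Gemini 2.0 Flash", dailyLimit: 1500, usedToday: 0 },
  { model: "GPT-4o-mini", dailyLimit: 200, usedToday: 0 },
  { model: "Claude 3.5 Haiku", dailyLimit: 100, usedToday: 0 },
  { model: "Mistral Small", dailyLimit: 500, usedToday: 0 },
  { model: "Llama 3.1 8B (local)", dailyLimit: Infinity, usedToday: 0 },
];

// Walk the ladder top-down and claim one request on the first rung
// that still has quota. Returns the chosen model name, or null if
// every rung is exhausted.
function claimRequest(rungs: Rung[]): string | null {
  for (const rung of rungs) {
    if (rung.usedToday < rung.dailyLimit) {
      rung.usedToday += 1;
      return rung.model;
    }
  }
  return null;
}
```

With an unlimited local rung at the bottom, `claimRequest` never returns null for this particular ladder, which is what makes the "last resort" entry useful.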

Get Started

Running in under two minutes

StretchLLM works as a drop-in proxy. The only change to your existing code is the base URL.

Install

One command — pick your ecosystem.

# npm
npx stretchllm@latest

# pip
pip install stretchllm && stretchllm

Add API Keys

Drop your free-tier keys into the config or env vars.

# .env
OPENAI_API_KEY=sk-free-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...

Point Your App

Change one line — your base URL.

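import OpenAI from "openai";

// Only the base URL changes. The SDK still reads OPENAI_API_KEY from the
// environment by default.
const client = new OpenAI({
  baseURL: "http://localhost:4117/v1"
});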

Monitor & Tune

Open the built-in dashboard to watch savings in real time.

# Open dashboard
stretchllm dashboard

# Or visit
http://localhost:4117/ui

Safety & Limits

Guardrails that protect your quota

Hard limits and circuit breakers keep runaway processes from burning through your free tier in seconds.

🔒 Per-Client Rate Limit

Each downstream client is capped at 60 requests/min by default. Configurable per API key.

client_web_app: 44/60 RPM
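A per-client cap like this is commonly built as a sliding-window counter. The sketch below assumes that approach (the page does not say whether the window is sliding or fixed); `SlidingWindowLimiter` and `allow` are hypothetical names, and the clock is passed in as a parameter so the behavior is easy to follow.

```typescript
// Sliding-window limiter: at most `limit` requests per `windowMs` per client.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit = 60, windowMs = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if the client is under its cap.
  allow(clientId: string, nowMs: number): boolean {
    // Keep only timestamps that are still inside the window.
    const recent = (this.hits.get(clientId) ?? []).filter(
      (t) => nowMs - t < this.windowMs,
    );
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(nowMs);
    this.hits.set(clientId, recent);
    return allowed;
  }
}
```

Each client has its own timestamp list, so one noisy client reaching 60/60 does not affect another client's budget.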

⚡ Circuit Breaker

If error rate exceeds 40% in a 30-second window, the provider is temporarily removed from the ladder for 5 minutes.

openai error rate: 4.8% ✓
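The breaker rule can be sketched as below. The 40% threshold, 30-second window, and 5-minute cooldown come from the description above; the `ProviderBreaker` class, its method names, and the `minSamples` floor (added so a single early failure does not trip the breaker) are assumptions made for this illustration.

```typescript
// Tracks recent request outcomes for one provider and opens the breaker
// when the error rate in the window exceeds the threshold.
class ProviderBreaker {
  private outcomes: { at: number; ok: boolean }[] = [];
  private openUntil = 0;
  private readonly threshold = 0.4; // error rate above 40% trips the breaker
  private readonly windowMs = 30_000; // 30-second observation window
  private readonly cooldownMs = 300_000; // provider removed for 5 minutes
  private readonly minSamples = 5; // assumption: ignore tiny samples

  // Record one request outcome at time nowMs (milliseconds).
  record(ok: boolean, nowMs: number): void {
    this.outcomes = this.outcomes.filter((o) => nowMs - o.at < this.windowMs);
    this.outcomes.push({ at: nowMs, ok });
    const errors = this.outcomes.filter((o) => !o.ok).length;
    if (
      this.outcomes.length >= this.minSamples &&
      errors / this.outcomes.length > this.threshold
    ) {
      this.openUntil = nowMs + this.cooldownMs;
      this.outcomes = []; // start fresh once the cooldown ends
    }
  }

  // True when the provider may be used at time nowMs.
  isAvailable(nowMs: number): boolean {
    return nowMs >= this.openUntil;
  }
}
```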

🧱 Daily Token Budget

Hard daily cap per provider prevents accidental overuse. Resets at midnight UTC.

gemini: 82.5k / 150k tokens
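A daily cap with a midnight UTC reset might look like the sketch below. `DailyTokenBudget` and `tryConsume` are hypothetical names; the reset works by comparing the UTC calendar day of each request against the day of the last one, which is one simple way to implement it and not necessarily how StretchLLM does.

```typescript
// Hard per-provider token cap that resets when the UTC calendar day changes.
class DailyTokenBudget {
  private used = 0;
  private day = "";
  private readonly capTokens: number;

  constructor(capTokens: number) {
    this.capTokens = capTokens;
  }

  // Returns true and records usage if the request fits in today's budget.
  tryConsume(tokens: number, now: Date): boolean {
    // toISOString() is always in UTC, so the first 10 characters
    // are the UTC calendar day (YYYY-MM-DD).
    const today = now.toISOString().slice(0, 10);
    if (today !== this.day) {
      this.day = today;
      this.used = 0;
    }
    if (this.used + tokens > this.capTokens) return false;
    this.used += tokens;
    return true;
  }

  // Tokens left for the most recently seen day.
  remaining(): number {
    return this.capTokens - this.used;
  }
}
```

Using the figures shown above, a 150k cap with 82.5k already spent leaves 67.5k, so a further 70k request would be refused until the next UTC day.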

🚫 Loop Detection

Identifies repeated identical prompts within a sliding window and auto-blocks after 5 duplicates in 10 seconds.

0 loops blocked today
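Loop detection of this kind can be sketched with a per-prompt list of recent timestamps. The snippet assumes "after 5 duplicates" means that five identical prompts are tolerated inside the 10-second window and the sixth is blocked; that reading, along with the `LoopDetector` and `shouldBlock` names, is an assumption for illustration only.

```typescript
// Blocks a prompt once it has been repeated too often inside a sliding window.
class LoopDetector {
  private readonly seen = new Map<string, number[]>();
  private readonly maxDuplicates: number;
  private readonly windowMs: number;

  constructor(maxDuplicates = 5, windowMs = 10_000) {
    this.maxDuplicates = maxDuplicates;
    this.windowMs = windowMs;
  }

  // Records the prompt at nowMs and returns true if it should be blocked.
  shouldBlock(prompt: string, nowMs: number): boolean {
    const recent = (this.seen.get(prompt) ?? []).filter(
      (t) => nowMs - t < this.windowMs,
    );
    recent.push(nowMs);
    this.seen.set(prompt, recent);
    return recent.length > this.maxDuplicates;
  }
}
```

Keying the map on the raw prompt string keeps the sketch short; a real implementation would more likely key on a hash to bound memory use.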

📋 Audit Log

Every request, routing decision, and cache event is logged locally in JSON Lines format for full traceability.

{"ts":"14:32:07","action":"route","model":"gemini-2.0-flash","score":2,"cache":"miss","latency_ms":391}
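Because the log is JSON Lines, each line can be parsed independently. The sketch below reads the sample entry shown above; the `AuditEvent` type is inferred from that single line and `parseAuditLog` is a hypothetical helper, so other event types may carry different fields.

```typescript
// Shape inferred from the sample entry; real events may have more fields.
interface AuditEvent {
  ts: string;
  action: string;
  model: string;
  score: number;
  cache: string;
  latency_ms: number;
}

// Parse JSON Lines text: one JSON object per non-empty line.
function parseAuditLog(text: string): AuditEvent[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as AuditEvent);
}

// The sample line from the audit log above.
const sample =
  '{"ts":"14:32:07","action":"route","model":"gemini-2.0-flash","score":2,"cache":"miss","latency_ms":391}\n';

const events = parseAuditLog(sample);
const misses = events.filter((e) => e.cache === "miss").length;
```

From here, figures such as hit rate or average latency per model are simple aggregations over `events`.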

🔑 Key Isolation

API keys are stored in the OS keychain (macOS/Linux) or Windows Credential Manager. StretchLLM never writes them to disk in plaintext.

✓ 3 keys secured in system keychain