FreeLLMAPI
One OpenAI-compatible endpoint. Every free LLM provider. Real-time dashboard.
Aggregate the free tiers from Google, Groq, Cerebras, NVIDIA, Mistral, OpenRouter, GitHub Models, Cohere, Cloudflare, HuggingFace, Z.ai, Ollama, Kilo, Pollinations, LLM7, OVH AI Endpoints, plus any custom OpenAI-compatible endpoint — behind a single /v1/chat/completions drop-in. A smart router picks the best available key for each request, fails over transparently when a provider rate-limits you, queues requests instead of dropping them, routes outbound traffic through an HTTP proxy pool of your choice, and tracks everything in a live WebSocket dashboard.
What's new in v1.3
- 🛰️ HTTP proxy pool: pool many proxies, choose a strategy (round-robin, least-latency, random, or single), per-key assignment that rotates hourly, dashboard with stats, geolocation, health-checker, bulk select / enable / disable / test / delete, and a France-only find-proxies scraper. A single proxy URL set in Settings is also tracked in the dashboard.
- 📊 Live dashboard v2: in-flight requests now show the proxy in use, input tokens, and a growing output token counter that ticks up as the stream comes back. New session totals (Tokens IN / Tokens OUT).
- 🤖 AI Profiles v2: the profile context window is auto-set to the smallest model's window, so a profile mixing 128K and 8K models never advertises more context than its smallest model can accept. Round-robin API, per-profile stats panel, profile strategies (failover, round-robin, least-latency).
- 🔑 Keys page v2: model discovery now syncs the catalog: new models from the provider are added, stale ones are removed. Bulk auto-discover for custom endpoints, search bar, inline label/RPM/TPM editing.
- 🌐 17 providers: Groq, Cerebras, Mistral, OpenRouter, NVIDIA NIM, GitHub Models, Cohere, Cloudflare, Google, Z.ai, HuggingFace, Ollama Cloud, Kilo Gateway, Pollinations, LLM7, OVH AI Endpoints, plus any custom OpenAI-compatible URL.
- 🛠 Operational fixes: fixed fd-leak crash on test-all, dispatcher cache, SOCKS test endpoint, profile creation bug, model-stats scoping per user, pool-strategy persistence across restarts.
- 🧪 505 server tests passing in 10s. Forgejo Actions workflow on every push and every v* tag.
Contents
- What's new in v1.3
- Features
- Supported providers
- Quick start
- Install script
- Docker
- Node.js + PM2
- Migrating from an older install
- Using the API
- HTTP proxy pool
- Dashboard
- Configuration
- How routing works
- Project structure
- Continuous integration
- Contributing
- Disclaimer
Features
Gateway
- Single OpenAI-compatible endpoint: /v1/chat/completions, /v1/models, /v1/embeddings. Any OpenAI client library works unchanged.
- Smart failover router: tries provider keys in scored order (success rate, latency, rate-limit headroom). Fails over to the next key/provider silently.
- Request queueing: when all keys for a model are rate-limited, requests are held and retried instead of immediately returning 429. Configurable timeout.
- Per-user isolation: each user has their own provider keys, gateway keys, profiles, proxies, and request history. Zero cross-tenant data leakage.
- Gateway API keys: mint scoped keys (freellmapi-…) for your apps with per-key RPM/TPD limits. Enable/disable without deleting.
- Outbound rate limits: set per-provider-key RPM/TPM caps enforced between the router and the upstream API.
- Context handoff: injects a compact system message when a session switches model between turns, so the new model knows where things left off.
- HTTP proxy pool: distribute outbound calls across many HTTP proxies (round-robin, least-latency, random), or pin to a single proxy. Per-key hourly assignment.
- AES-256-GCM encryption: all provider API keys are encrypted at rest. The plaintext never leaves your server.
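The request queueing behaviour in the list above can be pictured with a small sketch. It assumes cooldown expiry times are tracked per key as epoch milliseconds. The names `Cooldown`, `QueueDecision`, and `queueDecision` are hypothetical and are not taken from the FreeLLMAPI source, and the early rejection when no key frees up before the deadline is an assumption.

```typescript
// Hypothetical sketch of the queueing decision; not FreeLLMAPI's actual code.

interface Cooldown {
  keyId: string;
  expiresAtMs: number; // epoch ms at which the key becomes usable again
}

type QueueDecision =
  | { action: "retry"; waitMs: number }
  | { action: "reject"; status: 429 };

function queueDecision(
  cooldowns: Cooldown[],
  nowMs: number,
  enqueuedAtMs: number,
  timeoutSeconds: number, // plays the role of the configurable queue timeout
): QueueDecision {
  const deadlineMs = enqueuedAtMs + timeoutSeconds * 1000;
  // The request has already waited as long as it is allowed to.
  if (nowMs >= deadlineMs) return { action: "reject", status: 429 };
  // No key is cooling down, so the request can be retried immediately.
  if (cooldowns.length === 0) return { action: "retry", waitMs: 0 };
  const earliest = Math.min(...cooldowns.map((c) => c.expiresAtMs));
  // Assumption: if no key frees up before the deadline, give up early.
  if (earliest > deadlineMs) return { action: "reject", status: 429 };
  return { action: "retry", waitMs: Math.max(0, earliest - nowMs) };
}
```

The caller would sleep for `waitMs` and then run the normal routing pass again, so the queue itself never needs to know which key eventually serves the request.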
AI Profiles
- Virtual models: create a named profile that fans out across a list of real models. Appears in /v1/models like any other model.
- Routing strategies: failover, round-robin, or least-latency per profile.
- Auto context window: the profile's context_window is automatically set to the smallest model's window, so a profile mixing big and small models never advertises more context than its smallest model can accept.
- AUTO model: pseudo-model that picks the best available option across all your configured keys by score.
- Embeddings parity: profiles work for embedding requests too.
- Per-profile stats: request volume, error rate, latency, and token counts scoped to the profile.
Dashboard
- Live WebSocket dashboard — in-flight request counter, per-request routing events (which provider/model/proxy is being tried), per-key RPM gauges, overview counters — all pushed in real time without polling.
- In-flight proxy + tokens — active rows show the proxy name, input tokens, and a growing output token counter that ticks up live as the model streams.
- Session token totals — Tokens IN and Tokens OUT aggregated over the dashboard session.
- Per-user live counts — counts and events are scoped to the logged-in user.
- Keys page — masked key display, inline label/RPM/TPM editing, health status, one-click model discovery, custom endpoint with bulk auto-discover.
- Per-key deep stats — click any key to expand: latency percentiles (p50/p95/p99), 24-hour hourly bar chart, per-model breakdown, error breakdown, recent 25 requests, active cooldowns.
- Live status dots — each key's indicator updates in real time when the server detects a 401/403/429. No manual refresh needed.
- Playground — interactive chat with model selector, system prompt editor, and a live routing indicator showing which provider/model/proxy is handling the current request.
- Profiles page — create and manage AI Profiles with a searchable model picker; add and remove models from the provider with one click.
- Proxy pool page — bulk select, enable / disable / delete / test, per-proxy stats panel, geolocation, health checker.
- Analytics — request volume, error rates, cost estimates, per-provider breakdown, exportable history.
Operations
- Structured JSON logger: every request logs timestamp, level, req_id, user, key prefix, proxy, provider, and model. Secrets masked. Daily rotation with configurable retention.
- Universal installer: install.sh auto-detects Docker vs Node.js+PM2, generates keys, writes .env, builds, and starts the service. Prompts to migrate from an older install.
- Migration script: migrate-from-old.sh finds an older FreeLLMAPI install, verifies the encryption key, and imports provider keys and request history non-destructively.
- find-proxies script: scans a list of candidate free-proxy URLs, geo-IP-verifies the country, tests connectivity, and writes the working set to working-proxies.txt for one-click pool import.
- Docker: multi-stage Dockerfile, docker-compose.yml, named volume for the database.
- PM2: ecosystem.config.cjs with autorestart, memory cap, and structured log rotation.
- Kubernetes / Podman: k8s/ manifests for podman play kube or kubectl apply.
- Forgejo Actions: .gitea/workflows/ci.yml and release.yml run install + build + test on every push and every v* tag.
Supported providers
| Provider | Models | Auth |
|---|---|---|
| Google Gemini | Gemini 2.5 Flash, 2.0 Flash, 2.5 Pro preview | API key |
| Groq | Llama 3.3/4, Qwen3, Gemma, compound-beta | API key |
| Cerebras | Qwen3 235B, Llama 3.3 70B | API key |
| Mistral | Large 3, Medium 3.5, Codestral, Devstral | API key |
| OpenRouter | 20+ free-tier models via :free routes | API key |
| GitHub Models | GPT-4.1, GPT-4o, Phi-4, Llama 3.3 | GitHub PAT |
| Cloudflare Workers AI | Kimi K2, GLM-4.7, Llama, Granite | Account ID + token |
| Cohere | Command R+, Command-A (trial key) | API key |
| NVIDIA NIM | 40 RPM free tier (eval-only ToS) | API key |
| HuggingFace | Inference router → DeepSeek, Kimi, Qwen3 | API key |
| Z.ai (Zhipu) | GLM-4.5, GLM-4.7 Flash | API key |
| Ollama Cloud | GLM-4.7, Kimi K2, Qwen3 | API key |
| Kilo Gateway | :free routes, no key required | Keyless |
| Pollinations | GPT-OSS 20B, no key required | Keyless |
| LLM7 | GPT-OSS, Llama 3.1, GLM, no key required | Keyless |
| OVH AI Endpoints | Qwen3.5 397B, Llama 3.3, no key required | Keyless |
| Custom endpoint | Any OpenAI-compatible URL — llama.cpp, LM Studio, vLLM, Ollama, Novita, etc. | API key (or keyless) |
Keyless providers don't require an API key. The Keys page stores a sentinel row for them so routing treats the platform as configured.
Quick start
git clone https://git.pandem.fr/outage.sh/FreeLLMapi freellmapi
cd freellmapi
# Recommended — universal installer
bash install.sh
# Or manually
cp .env.example .env
# Edit .env — set ENCRYPTION_KEY, ADMIN_USERNAME, ADMIN_PASSWORD
npm install && npm run build
node server/dist/index.js
Open http://localhost:3001 and log in with your admin credentials.
Install script
bash install.sh # auto-detects Docker vs Node.js+PM2
bash install.sh --docker # force Docker Compose
bash install.sh --node # force Node.js + PM2
bash install.sh --port 8080 # custom port
bash install.sh --lan # bind to 0.0.0.0 (LAN access)
bash install.sh --yes # non-interactive / CI
The script detects Docker or Node.js, generates ENCRYPTION_KEY, prompts for admin credentials, writes .env, builds, starts the service, waits for the health check, and prints the access URL and management commands. After .env is written it asks if you want to migrate keys from an older install.
Docker
cp .env.example .env
# Set ENCRYPTION_KEY, ADMIN_USERNAME, ADMIN_PASSWORD in .env
docker compose up -d
# Verify
curl http://localhost:3001/api/ping
# Logs
docker compose logs -f
# Update
git pull && docker compose build && docker compose up -d
The database is stored in a named Docker volume (freellmapi-data) and survives container restarts and image rebuilds.
LAN access: set HOST_BIND=0.0.0.0 in .env to expose the port to your local network.
Node.js + PM2
Requires Node.js 20+.
cp .env.example .env # edit before proceeding
npm install
npm run build
pm2 start ecosystem.config.cjs
pm2 save
pm2 startup # enable autostart on reboot
# Logs
pm2 logs api-gateway
# Update
git pull && npm run build && pm2 restart api-gateway
The database lives at server/data/freeapi.db. Back it up before updates.
Migrating from an older install
If you ran a previous version of FreeLLMAPI and want to carry your provider keys forward:
# Auto-search the machine for an old install
bash migrate-from-old.sh
# Or point directly at the old repo directory
bash migrate-from-old.sh /path/to/old/freellmapi
The script:
- Locates the old freeapi.db and reads its ENCRYPTION_KEY
- Verifies decryption works before touching anything
- Handles ENCRYPTION_KEY mismatches: adopt the old key (recommended if the new DB is empty) or re-encrypt everything on the fly with the new key
- Imports all api_keys rows, skipping duplicates. Optionally imports request history.
- Never deletes from either database, so the migration is fully non-destructive
After running, restart the server and go to Keys → Check all to confirm the imported keys are still valid with each provider.
Using the API
FreeLLMAPI speaks the standard OpenAI API. Point any OpenAI-compatible client at your server:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key="freellmapi-your-gateway-key-here",
)
response = client.chat.completions.create(
    model="auto",  # router picks best available, or name a specific model
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
curl http://localhost:3001/v1/chat/completions \
-H "Authorization: Bearer freellmapi-your-gateway-key-here" \
-H "Content-Type: application/json" \
-d '{"model":"auto","messages":[{"role":"user","content":"Hello!"}]}'
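For TypeScript callers, the same request can be prepared without a client library. The sketch below only builds the request. `buildChatRequest` and its types are hypothetical helpers, and the commented usage assumes a running gateway and a runtime with a global `fetch` (such as Node.js 20+).

```typescript
// Hypothetical helper: prepare a chat-completion request for the gateway.

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildChatRequest(
  baseUrl: string, // e.g. "http://localhost:3001/v1"
  gatewayKey: string,
  model: string, // "auto" or a specific model name
  messages: ChatMessage[],
): PreparedRequest {
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${gatewayKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages }),
  };
}

// Usage against a running gateway:
// const req = buildChatRequest("http://localhost:3001/v1",
//   "freellmapi-your-gateway-key-here", "auto",
//   [{ role: "user", content: "Hello!" }]);
// const res = await fetch(req.url, { method: req.method, headers: req.headers, body: req.body });
// const data = await res.json();
// console.log(data.choices[0].message.content);
```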
Listing models
curl http://localhost:3001/v1/models \
-H "Authorization: Bearer freellmapi-your-gateway-key-here"
Returns all models from your configured providers plus any AI Profiles. Use auto to let the router choose, or request a specific model by name.
Gateway API keys
Create keys in the dashboard under Keys → Gateway keys. The raw key is shown once at creation — only the hash is stored.
HTTP proxy pool
v1.3 ships a full HTTP proxy pool. Configure it from the Proxies page or via env / .env:
| Env var | Default | Description |
|---|---|---|
| PROXY_URL | (unset) | Single proxy URL. Wins over DB. e.g. http://user:pass@host:port |
| PROXY_POOL_STRATEGY | none | none (off), random, round-robin, least-latency |
| PROXY_BYPASS | (unset) | Comma-separated platform names that skip the proxy (e.g. groq,google) |
Single proxy — set PROXY_URL once. The dashboard tracks request count, success rate, latency, and bytes.
Pool — add rows in the Proxies page (label, URL, optional country), pick a strategy, and the router distributes each request across the pool. Per-key assignment rotates every hour to spread load and avoid burning any single egress.
Bypass list — platforms you don't want proxied (e.g. low-latency providers like Groq that you want to reach directly).
SOCKS proxies — supported via socks5:// and socks4:// URLs.
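The pool strategies, the hourly per-key assignment, and the bypass list can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: `PoolProxy`, `pickProxy`, `hourlyProxyForKey`, and `shouldBypass` are hypothetical names, the `avgLatencyMs` field is assumed, and the README does not say how the hourly rotation is actually computed (here it is a key hash offset by the hour number).

```typescript
// Hypothetical sketch of the pool behaviour; names and fields are assumptions.

type PoolStrategy = "none" | "random" | "round-robin" | "least-latency";

interface PoolProxy {
  url: string;
  enabled: boolean;
  avgLatencyMs: number; // assumed rolling average of past request latency
}

// Choose a proxy for one request. `counter` is a request counter used by
// round-robin; `rand` is injectable so the random strategy can be tested.
function pickProxy(
  strategy: PoolStrategy,
  pool: PoolProxy[],
  counter: number,
  rand: () => number = Math.random,
): PoolProxy | undefined {
  const live = pool.filter((p) => p.enabled);
  if (strategy === "none" || live.length === 0) return undefined;
  if (strategy === "round-robin") return live[counter % live.length];
  if (strategy === "random") return live[Math.floor(rand() * live.length)];
  // least-latency: keep whichever enabled proxy has the lowest average.
  return live.reduce((best, p) => (p.avgLatencyMs < best.avgLatencyMs ? p : best));
}

// Per-key assignment that advances once per hour: offset a stable hash of
// the key id by the current hour number, then index into the enabled proxies.
function hourlyProxyForKey(keyId: string, pool: PoolProxy[], nowMs: number): PoolProxy | undefined {
  const live = pool.filter((p) => p.enabled);
  if (live.length === 0) return undefined;
  let hash = 0;
  for (let i = 0; i < keyId.length; i++) {
    hash = (hash * 31 + keyId.charCodeAt(i)) >>> 0; // keep an unsigned 32-bit value
  }
  const hourNumber = Math.floor(nowMs / 3600000);
  return live[(hash + hourNumber) % live.length];
}

// PROXY_BYPASS handling: true when the platform should skip the proxy.
// Case-insensitive matching is an assumption.
function shouldBypass(platform: string, proxyBypass: string | undefined): boolean {
  if (!proxyBypass) return false;
  return proxyBypass
    .split(",")
    .map((name) => name.trim().toLowerCase())
    .some((name) => name === platform.toLowerCase());
}
```

Offsetting by the hour number means every key moves to the next enabled proxy when the hour changes, while different keys start at different positions in the pool.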
find-proxies script — scans a candidate URL list, geo-IP-verifies the country (France by default), probes connectivity, and writes the working set to working-proxies.txt for bulk import:
# 1. Find a batch of working proxies
node scripts/find-proxies.mjs --out working-proxies.txt
# 2. Import in the dashboard: Proxies → Bulk import → paste file
Dashboard
Live dashboard (v1.3 highlights)
- In-flight proxy name: every active row shows which proxy the request is using
- Growing token counts: output_tokens ticks up live as the model streams
- Session totals: Tokens IN and Tokens OUT aggregated over the dashboard session
- Per-user counts: events scoped to the logged-in user
- Real-time routing events: which provider/model was tried, latency, outcome, all pushed over WebSocket
Keys page
Add provider credentials for any supported service. Features per key:
- Live status dot: updates in real time. Green (healthy), amber (rate-limited), red (error/disabled/401)
- Inline editing: pencil icon to edit label, RPM cap, TPM cap without re-entering the key
- Stats panel: click the dot or ⌄ chevron to expand deep stats:
  - Total requests, success rate, error count, rate-limit hits
  - Latency: min / p50 / avg / p95 / p99 / max
  - 24-hour bar chart, colour-coded by error rate
  - Per-model breakdown with success rate and token counts
  - Error breakdown with occurrence counts
  - Last 25 requests with model, latency, tokens, and timestamp
  - Active cooldowns with reset times
- Model discovery: one click calls the provider's /v1/models endpoint, adds new models to the catalog, and removes stale ones. Custom endpoint mode imports all advertised models in one shot.
Custom endpoints: use the "Custom" provider to connect any OpenAI-compatible URL. Toggle "Auto-discover" to bulk-register all models the endpoint advertises.
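Model discovery as described on the Keys page amounts to a set difference between the stored catalog and what the provider advertises. A minimal sketch, assuming models are identified by plain string IDs; `syncCatalog` and `CatalogSync` are hypothetical names, not the project's API:

```typescript
// Hypothetical sketch of catalog sync for one provider key.

interface CatalogSync {
  added: string[]; // advertised by the provider but not yet in the catalog
  removed: string[]; // in the catalog but no longer advertised (stale)
  catalog: string[]; // resulting catalog after the sync
}

function syncCatalog(current: string[], advertised: string[]): CatalogSync {
  const currentIds = new Set(current);
  const advertisedIds = new Set(advertised); // also removes duplicate IDs
  const added = Array.from(advertisedIds).filter((id) => !currentIds.has(id));
  const removed = current.filter((id) => !advertisedIds.has(id));
  // Keep surviving entries in their existing order, then append new ones.
  const catalog = current.filter((id) => advertisedIds.has(id)).concat(added);
  return { added, removed, catalog };
}
```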
AI Profiles
Profiles act as virtual models. Build a ranked list of real models, choose a routing strategy, and the profile appears in /v1/models like any other model.
- Context window: auto-set to the smallest model's window. Add a 4K model to a profile that has a 128K model and the cap drops to 4K; remove it and the cap goes back up. This keeps any request the profile accepts within the limit of every model it might be routed to.
- Routing strategies: failover (try in order), round-robin, least-latency
- Search bar: filter by model name, provider, or model ID when building a profile
- Per-profile stats: request volume, error rate, latency, and token counts
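The context window rule in the list above is a minimum over the profile's models. A small sketch, assuming each model carries its window size in tokens; `ProfileModel` and `profileContextWindow` are hypothetical names:

```typescript
// Hypothetical sketch: a profile's context window is the smallest member window.

interface ProfileModel {
  id: string;
  contextWindow: number; // tokens
}

function profileContextWindow(models: ProfileModel[]): number | undefined {
  if (models.length === 0) return undefined; // an empty profile has no cap yet
  return Math.min(...models.map((m) => m.contextWindow));
}

const largeModel: ProfileModel = { id: "large-128k", contextWindow: 131072 };
const smallModel: ProfileModel = { id: "small-4k", contextWindow: 4096 };

// Adding the 4K model drops the cap; removing it restores the larger cap.
const withSmall = profileContextWindow([largeModel, smallModel]);
const withoutSmall = profileContextWindow([largeModel]);
```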
Proxies
The proxy pool page lets you:
- Add HTTP / HTTPS / SOCKS proxies with labels and (optional) country
- Pick a pool strategy: none, random, round-robin, least-latency
- Bulk select with checkboxes and apply enable / disable / delete / test across many rows
- Click a row to expand its stats panel (request count, success rate, latency, bytes)
- Health-checker pings every enabled proxy on a schedule and marks failing ones
Playground
Interactive chat with:
- Full model selector (providers, profiles, auto)
- System prompt editor
- Live routing indicator — shows the provider, model, and proxy the router is currently trying while a request is in flight
Analytics
Request history with per-provider breakdown, token usage, latency trends, and cost estimates.
Configuration
| Variable | Default | Description |
|---|---|---|
| ENCRYPTION_KEY | (required) | 64-char hex string for AES-256-GCM. Generate: node -e "console.log(require('crypto').randomBytes(32).toString('hex'))" |
| PORT | 3001 | HTTP listen port |
| HOST_BIND | 127.0.0.1 | Bind interface. Set 0.0.0.0 for LAN access. |
| ADMIN_USERNAME | (unset) | Bootstrap admin username (first run only) |
| ADMIN_PASSWORD | (unset) | Bootstrap admin password, min 8 chars (first run only) |
| PROXY_RATE_LIMIT_RPM | 120 | Max /v1 requests per minute per client IP. 0 = disabled. |
| REQUEST_ANALYTICS_RETENTION_DAYS | 90 | Days to keep request history |
| REQUEST_ANALYTICS_MAX_ROWS | 100000 | Maximum request history rows |
| REQUEST_QUEUE_TIMEOUT_SECONDS | 60 | Seconds to hold a queued request before returning 429 |
| MODEL_REFRESH_INTERVAL_MINUTES | (unset) | Periodic live model list refresh. Unset = boot-time only. |
| FREELLMAPI_CONTEXT_HANDOFF | (unset) | Set on_model_switch to inject context messages on model switch |
| PROXY_URL | (unset) | Single proxy URL. e.g. http://user:pass@host:port |
| PROXY_POOL_STRATEGY | none | none / random / round-robin / least-latency |
| PROXY_BYPASS | (unset) | Comma-separated platform names to skip the proxy |
| DASHBOARD_ORIGINS | (unset) | Extra CORS origins for the dashboard (comma-separated) |
| LOG_LEVEL | INFO | DEBUG, INFO, WARN, ERROR, or FATAL |
| GATEWAY_INSTANCES | 1 | PM2: number of gateway worker processes |
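To show how the defaults in this table might be applied, here is a sketch of a loader that reads a subset of the variables from a plain object and validates them. `loadConfig`, `GatewayConfig`, and the helper functions are hypothetical; the server's real startup code may validate differently.

```typescript
// Hypothetical config loader; only a subset of the variables is shown.

const POOL_STRATEGIES = ["none", "random", "round-robin", "least-latency"] as const;
type PoolStrategyName = (typeof POOL_STRATEGIES)[number];

interface GatewayConfig {
  encryptionKey: string;
  port: number;
  hostBind: string;
  queueTimeoutSeconds: number;
  proxyPoolStrategy: PoolStrategyName;
  proxyBypass: string[];
}

function isPoolStrategy(value: string): value is PoolStrategyName {
  return (POOL_STRATEGIES as readonly string[]).indexOf(value) !== -1;
}

// Parse a non-negative integer, falling back to the default when unset.
function intOr(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`expected a non-negative integer, got "${raw}"`);
  }
  return n;
}

function loadConfig(env: Record<string, string | undefined>): GatewayConfig {
  const encryptionKey = env["ENCRYPTION_KEY"] ?? "";
  // 32 random bytes rendered as hex give exactly 64 characters.
  if (!/^[0-9a-fA-F]{64}$/.test(encryptionKey)) {
    throw new Error("ENCRYPTION_KEY must be a 64-char hex string");
  }
  const strategy = env["PROXY_POOL_STRATEGY"] ?? "none";
  if (!isPoolStrategy(strategy)) {
    throw new Error(`unknown PROXY_POOL_STRATEGY "${strategy}"`);
  }
  return {
    encryptionKey,
    port: intOr(env["PORT"], 3001),
    hostBind: env["HOST_BIND"] ?? "127.0.0.1",
    queueTimeoutSeconds: intOr(env["REQUEST_QUEUE_TIMEOUT_SECONDS"], 60),
    proxyPoolStrategy: strategy,
    proxyBypass: (env["PROXY_BYPASS"] ?? "")
      .split(",")
      .map((name) => name.trim())
      .filter((name) => name.length > 0),
  };
}
```

In a Node.js process this would typically be called as `loadConfig(process.env)`; taking the environment as a parameter keeps the function easy to test.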
How routing works
When a request arrives at /v1/chat/completions:
- Resolve the model: a Profile name expands to its candidate list; auto scores all available models; anything else looks up provider keys for that model ID.
- Score and sort: each candidate key is scored by recent success rate, average latency, and rate-limit headroom. Lower-scored candidates move to the back of the queue.
- Pick a proxy: for the resolved key, look up its current pool assignment (or fall back to the single PROXY_URL); the assignment rotates every hour so each key gets a fair share.
- Try in order: the router picks the first key that isn't on cooldown, hasn't hit its outbound RPM/TPM cap, and has a healthy proxy.
- On failure: a rate-limit response (429) puts the key on a cooldown; an auth failure (401/403) marks it invalid and pushes a real-time status update to the dashboard.
- Queue if exhausted: if all candidates are on cooldown, the request parks in an in-memory queue and retries when the earliest cooldown expires, up to REQUEST_QUEUE_TIMEOUT_SECONDS.
- Context handoff: if the winning model differs from the previous turn in the same session, a compact system message is prepended so the new model has context.
Every routing decision is logged (provider, model, key prefix, proxy, latency) and streamed to the live dashboard via WebSocket.
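The score-and-sort and on-failure steps above can be sketched as below. The scoring weights, the 60-second default cooldown, and all names (`RouteCandidate`, `scoreCandidate`, `orderCandidates`, `applyFailure`) are assumptions made for illustration; this README does not document the real formula, and the sketch leaves out the RPM/TPM caps and proxy health checks.

```typescript
// Hypothetical sketch of candidate ordering and failure handling.

interface RouteCandidate {
  keyId: string;
  provider: string;
  successRate: number; // 0..1 over a recent window
  avgLatencyMs: number;
  headroom: number; // 0..1 share of the rate limit still unused
  cooldownUntilMs: number; // 0 when the key is not cooling down
  valid: boolean; // false after an auth failure
}

// Assumed weights: success rate counts most, then latency, then headroom.
function scoreCandidate(c: RouteCandidate): number {
  const latencyScore = 1 / (1 + c.avgLatencyMs / 1000);
  return 0.5 * c.successRate + 0.3 * latencyScore + 0.2 * c.headroom;
}

// Drop keys that are invalid or cooling down, then put the best score first.
function orderCandidates(candidates: RouteCandidate[], nowMs: number): RouteCandidate[] {
  return candidates
    .filter((c) => c.valid && c.cooldownUntilMs <= nowMs)
    .sort((a, b) => scoreCandidate(b) - scoreCandidate(a));
}

// 429 starts a cooldown; 401/403 marks the key invalid; other statuses
// leave the candidate unchanged.
function applyFailure(
  c: RouteCandidate,
  status: number,
  nowMs: number,
  cooldownMs: number = 60000,
): RouteCandidate {
  if (status === 429) return { ...c, cooldownUntilMs: nowMs + cooldownMs };
  if (status === 401 || status === 403) return { ...c, valid: false };
  return c;
}
```

A retry loop would take the first entry of `orderCandidates`, attempt the upstream call, and on failure replace that candidate with the result of `applyFailure` before ordering again.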
Project structure
freellmapi/
├── server/src/
│ ├── routes/ REST endpoints — /api/*, /v1/*
│ │ (proxy, keys, profiles, gateway-keys, admin, fallback, embeddings, responses, settings…)
│ ├── services/ Router, queue, health checker, scoring, WebSocket push, auth, context handoff, proxy-health
│ ├── providers/ Per-provider OpenAI-compatible adapters (google, openai-compat, cohere, cloudflare)
│ ├── db/ SQLite (better-sqlite3), migration runner, model catalog migrations
│ └── lib/ Crypto, proxy pool, logger, error handling
├── client/src/
│ ├── pages/ Keys, Profiles, Proxies, Gateway keys, Playground, Analytics, Live dashboard, Fallback, Premium, Admin…
│ └── components/ Shared UI components (shadcn/ui)
├── shared/ TypeScript types shared between server and client
├── desktop/ Electron desktop wrapper
├── scripts/ find-proxies.mjs and other ops helpers
├── k8s/ Kubernetes / Podman play kube manifests
├── .gitea/workflows/ Forgejo Actions (ci.yml, release.yml)
├── .github/workflows/ GitHub Actions (ci.yml, docker.yml)
├── ecosystem.config.cjs PM2 process definition
├── Dockerfile Multi-stage Docker build
├── docker-compose.yml
├── install.sh Universal installer
└── migrate-from-old.sh Import keys from an older FreeLLMAPI install
Continuous integration
Two Forgejo Actions workflows ship with the repo and are picked up automatically by your Forgejo runner:
| Workflow | Trigger | What it does |
|---|---|---|
| .gitea/workflows/ci.yml | push to main, pull request | npm ci → build server → build client → run 505 server tests |
| .gitea/workflows/release.yml | push of any v* tag (e.g. v1.3) | same as CI, then posts a green-build summary to the run page |
A push of the v1.3 tag will start a release build on the runner and report success or failure on the release page. Run the same suite locally with:
npm install
npm run build
npm test -w server
Contributing
git clone https://git.pandem.fr/outage.sh/FreeLLMapi freellmapi
cd freellmapi
npm install
npm run dev # server on :3001, Vite dev server on :5173
- npm run build (full production build: server + client)
- npm test -w server (server test suite, 505 tests)
- npm run build:server (server only)
- npm run desktop:dev (Electron desktop wrapper in dev mode)
- npm run desktop:dist (build a distributable Electron binary)
PRs welcome. Keep changes focused; include tests for new server behaviour.
Disclaimer
FreeLLMAPI routes requests to third-party AI provider APIs under their respective free-tier terms. It does not circumvent rate limits — it distributes load across multiple API keys you legitimately own. Review each provider's Terms of Service before use. NVIDIA's free NIM tier is for evaluation only. The authors are not responsible for ToS violations or API key misuse.