Hagalaz

A lightweight, OpenAI-compatible API server for running Hugging Face models locally. Supports text generation with real-time streaming, reasoning model detection, image generation via Stable Diffusion, audio transcription, and text-to-speech.

Features

  • OpenAI-compatible API - Drop-in replacement for OpenAI API endpoints
  • Real-time Streaming - True token-by-token streaming (not batched post-generation)
  • Reasoning Models - Automatic <think> tag detection for DeepSeek-R1 and similar models
  • 4-bit Quantization - Run large models on GPUs with ~6GB VRAM via BitsAndBytes
  • Response Caching - In-memory LRU cache plus aiocache integration for model lists and non-streaming chat completions
  • Image Generation - Stable Diffusion support via /v1/images/generations
  • Audio Transcription - Whisper-based speech-to-text with streaming support via /v1/audio/transcriptions
  • Text-to-Speech - Bark-based speech synthesis via /v1/audio/speech
  • Rate Limiting - Per-endpoint rate limits using slowapi (IP-based)
  • API Key Authentication - SQLite-backed key management with bcrypt hashing, admin keys, and hot-reload
  • Hugging Face Auth - Gated model support via HF_TOKEN

Quick Start

# Install dependencies
uv sync

# Run with all models
uv run python main.py \
  --model-id unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
  --image-model-id stabilityai/sd-turbo \
  --audio-model-id openai/whisper-base \
  --tts-model-id suno/bark-small \
  --port 8000

# Run with only a text model
uv run python main.py \
  --load llm \
  --model-id unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
  --port 8000

# Run with a GGUF model (auto-detect best quantization)
uv run python main.py \
  --load llm \
  --model-id unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF \
  --port 8000

# Run with a specific GGUF file
uv run python main.py \
  --load llm \
  --model-id TheBloke/Llama-2-7B-GGUF \
  --gguf-file llama-2-7b.Q5_K_M.gguf \
  --port 8000

# Run with only an image model
uv run python main.py \
  --load image \
  --image-model-id stabilityai/sd-turbo \
  --port 8000

# Run with only audio transcription
uv run python main.py \
  --load audio \
  --audio-model-id openai/whisper-base \
  --port 8000

# Run with only text-to-speech
uv run python main.py \
  --load tts \
  --tts-model-id suno/bark-small \
  --port 8000

Configuration

Create a .env file for gated models:

HF_TOKEN=hf_your_token_here

Or export directly:

export HF_TOKEN=hf_your_token_here

API Key Authentication

The server supports API key authentication with SQLite-backed persistence and in-memory caching for fast validation. All endpoints except /health require authentication when enabled.

Enable/Disable

Set in .env:

# Enable API key authentication (default: true)
API_KEY_ENABLED=true

# Database path (default: ./data/api_keys.db)
API_KEY_DB_PATH=./data/api_keys.db

# bcrypt rounds for hashing (default: 12)
API_KEY_BCRYPT_ROUNDS=12

# Key prefix (default: sk-)
API_KEY_KEY_PREFIX=sk-

# Random key length in hex chars (default: 48)
API_KEY_KEY_LENGTH=48
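
As a rough illustration of how the prefix and length settings above might combine, the sketch below builds a key with the standard library only. The function name generate_api_key is hypothetical and not taken from the project, and the sketch assumes API_KEY_KEY_LENGTH counts the hex characters that follow the prefix.

```python
import secrets


def generate_api_key(prefix: str = "sk-", hex_length: int = 48) -> str:
    """Return prefix + `hex_length` random hex characters (hypothetical helper)."""
    if hex_length <= 0 or hex_length % 2:
        raise ValueError("hex_length must be a positive even number")
    # token_hex(n) returns 2 * n hex characters, so halve the requested length.
    return prefix + secrets.token_hex(hex_length // 2)


key = generate_api_key()
print(len(key))  # 51: a 3-character prefix plus 48 hex characters
```

With the defaults, a key is the sk- prefix followed by 48 hex characters, which corresponds to 24 random bytes.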

Managing Keys

Use the included CLI tool manage_keys.py:

# Create a new key
uv run python -m src.manage_keys add "My App Key"

# Create an admin key (required for hot-reload)
uv run python -m src.manage_keys add --admin "Admin Key"

# List all keys
uv run python -m src.manage_keys list

# Deactivate a key
uv run python -m src.manage_keys deactivate <key_id>

# Reactivate a key
uv run python -m src.manage_keys reactivate <key_id>

# Hot-reload active keys without restarting server
uv run python -m src.manage_keys reload

Important: The full key is shown only once on creation. Store it securely.

Using Keys in Requests

Include the key in the Authorization header:

# Chat completions with authentication
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

# Image generation with authentication
curl -X POST http://localhost:8000/v1/images/generations \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A serene mountain landscape",
    "n": 1,
    "size": "512x512"
  }'

# Hot-reload keys (admin only)
curl -X POST http://localhost:8000/v1/admin/keys/reload \
  -H "Authorization: Bearer sk-your-admin-key-here"
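
The same chat request can be prepared from Python. The sketch below uses only the standard library and builds the request without sending it; the helper name build_chat_request is an assumption made for illustration, while the URL, header, and body mirror the curl example above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request with a Bearer key."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000",
    "sk-your-key-here",
    "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    "Hello!",
)
print(req.get_method(), req.full_url)
# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```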

Key Features

  • bcrypt hashing: Keys stored as hashes, full key shown only once on creation
  • Soft delete: Deactivated keys are kept in DB with deleted_at for audit trail
  • In-memory cache: Active keys loaded into memory for O(1) validation
  • Hot-reload: Add/remove keys without server restart via admin endpoint
  • Admin flag: Admin keys can trigger hot-reload via POST /v1/admin/keys/reload
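
The soft-delete behaviour can be pictured with a small sqlite3 sketch. The table layout here is hypothetical (only the deleted_at column is named in this README), and the real schema in src/core/api_keys.py may differ.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: only `deleted_at` is named in this README.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE api_keys ("
    "id INTEGER PRIMARY KEY, name TEXT, key_hash TEXT, "
    "is_admin INTEGER DEFAULT 0, deleted_at TEXT)"
)
conn.execute("INSERT INTO api_keys (name, key_hash) VALUES (?, ?)", ("My App Key", "hash-a"))
conn.execute("INSERT INTO api_keys (name, key_hash) VALUES (?, ?)", ("Old Key", "hash-b"))


def deactivate(key_id: int) -> None:
    """Soft delete: stamp deleted_at instead of removing the row."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute("UPDATE api_keys SET deleted_at = ? WHERE id = ?", (now, key_id))


def load_active_names() -> list[str]:
    """What a hot-reload might read: only rows without deleted_at."""
    rows = conn.execute("SELECT name FROM api_keys WHERE deleted_at IS NULL ORDER BY id")
    return [name for (name,) in rows]


deactivate(2)
active = load_active_names()
total = conn.execute("SELECT COUNT(*) FROM api_keys").fetchone()[0]
print(active, total)  # ['My App Key'] 2
```

The deactivated row stays in the table for auditing, but it no longer appears in the set that would be loaded into memory.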

Docker Deployment

The project includes Docker support for easy deployment on cloud platforms like RunPod, as well as local testing.

Quick Start with Docker

# Pull and run with default settings (LOAD=llm, minimal model)
docker run -p 8000:8000 \
  -e AUTO_CREATE_ADMIN_KEY=true \
  yourusername/hagalaz:latest

# Run with all models
docker run -p 8000:8000 \
  -e LOAD=all \
  -e MODEL_ID=unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
  -e IMAGE_MODEL_ID=stabilityai/sd-turbo \
  -e AUDIO_MODEL_ID=openai/whisper-base \
  -e TTS_MODEL_ID=suno/bark-small \
  -e HF_TOKEN=hf_your_token_here \
  -e AUTO_CREATE_ADMIN_KEY=true \
  -v /path/to/models:/app/models \
  yourusername/hagalaz:cuda

Building Images

# General purpose (CPU/GPU-agnostic)
docker build -f docker/Dockerfile -t hagalaz:latest .

# CUDA for GPU hosts
docker build -f docker/Dockerfile.cuda -t hagalaz:cuda .

# RunPod optimized
docker build -f docker/Dockerfile.runpod -t hagalaz:runpod .

Docker Compose

cd docker

# Start with general purpose image
docker-compose up hagalaz

# Start with CUDA image
docker-compose up hagalaz-cuda

Environment Variables

All configuration is done via environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| LOAD | Models to load: llm, image, audio, tts, both, all | llm |
| MODEL_ID | Chat model ID | unsloth/Llama-3.2-1B-Instruct-unsloth-bnb-4bit |
| IMAGE_MODEL_ID | Image generation model ID | (none) |
| AUDIO_MODEL_ID | Audio transcription model ID | (none) |
| TTS_MODEL_ID | Text-to-speech model ID | (none) |
| PORT | Server port | 8000 |
| CACHE_SIZE | Response cache size | 128 |
| WORKERS | Uvicorn workers | 1 |
| MODELS_DIR | Models storage directory | /app/models |
| HF_TOKEN | HuggingFace authentication token | (none) |
| AUTO_CREATE_ADMIN_KEY | Auto-create admin key on startup | false |
| ADMIN_KEY_NAME | Name for auto-created admin key | admin |
| AUTO_CREATE_KEYS | Comma-separated regular key names | (none) |
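
To make the defaults concrete, here is a minimal sketch of how an entrypoint could read a few of these variables. It is not the project's docker_start.py; the function name read_settings is hypothetical, and treating only the string "true" as truthy is an assumption.

```python
import os
from typing import Mapping


def read_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect a few of the variables above, applying the documented defaults."""
    raw_keys = env.get("AUTO_CREATE_KEYS", "")
    return {
        "load": env.get("LOAD", "llm"),
        "port": int(env.get("PORT", "8000")),
        "cache_size": int(env.get("CACHE_SIZE", "128")),
        # Assumption: only the string "true" (any case) enables the flag.
        "auto_create_admin_key": env.get("AUTO_CREATE_ADMIN_KEY", "false").lower() == "true",
        # Comma-separated names; blank entries are dropped.
        "auto_create_keys": [n.strip() for n in raw_keys.split(",") if n.strip()],
    }


settings = read_settings({"LOAD": "all", "AUTO_CREATE_KEYS": "app, ci"})
print(settings["port"], settings["auto_create_keys"])  # 8000 ['app', 'ci']
```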

RunPod Deployment

Pod Mode (Persistent Server)

Deploy as a persistent pod on RunPod:

docker run -p 8000:8000 \
  -e LOAD=all \
  -e MODEL_ID=unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
  -e AUTO_CREATE_ADMIN_KEY=true \
  -v /runpod-volume:/app/models \
  yourusername/hagalaz:runpod

The admin key will be printed to logs on first startup.

Serverless Mode

Use the included handler for RunPod serverless:

{
  "input": {
    "endpoint": "chat/completions",
    "model": "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 512
  }
}

Supported endpoints: chat/completions, images/generations, audio/transcriptions, audio/speech, models/list

Note: Streaming is disabled in serverless mode. All responses are returned as complete JSON.
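
A client might assemble the job payload as sketched below. The helper build_job is hypothetical; the envelope and the endpoint names are taken from the example and list above, and rejecting "stream" on the client side is a design choice of this sketch rather than documented handler behaviour.

```python
import json

# Endpoint names as listed in this README for serverless mode.
SUPPORTED_ENDPOINTS = {
    "chat/completions",
    "images/generations",
    "audio/transcriptions",
    "audio/speech",
    "models/list",
}


def build_job(endpoint: str, **params) -> str:
    """Wrap request parameters in the {"input": {...}} envelope shown above."""
    if endpoint not in SUPPORTED_ENDPOINTS:
        raise ValueError(f"unsupported endpoint: {endpoint}")
    # Streaming is disabled in serverless mode, so reject it up front.
    if params.get("stream"):
        raise ValueError("streaming is not available in serverless mode")
    return json.dumps({"input": {"endpoint": endpoint, **params}})


job = build_job(
    "chat/completions",
    model="unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
)
print(job)
```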

Volume Mounts

For persistent storage, mount these directories:

| Container Path | Description |
| --- | --- |
| /app/models | Downloaded HuggingFace models |
| /app/data | API key database |

Example:

docker run -p 8000:8000 \
  -v /path/to/models:/app/models \
  -v /path/to/data:/app/data \
  hagalaz:cuda

See DEPLOY_DOCKER.md for detailed build and publish instructions.

OpenCode Integration

Add to ~/.config/opencode/opencode.json:

{
  "model": "local/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
  "provider": {
    "local": {
      "name": "Local",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://localhost:8000/v1",
        "apiKey": "empty-for-local"
      },
      "models": {
        "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit": {
          "name": "DeepSeek 1.5B (Unsloth)",
          "reasoning": true,
          "interleaved": {
            "field": "reasoning_content"
          }
        }
      }
    }
  }
}

With API key authentication enabled:

{
  "model": "local/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
  "provider": {
    "local": {
      "name": "Local",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://localhost:8000/v1",
        "apiKey": "sk-your-key-here"
      },
      "models": {
        "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit": {
          "name": "DeepSeek 1.5B (Unsloth)",
          "reasoning": true,
          "interleaved": {
            "field": "reasoning_content"
          }
        }
      }
    }
  }
}

For image generation support, add the image model:

{
  "model": "local/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
  "provider": {
    "local": {
      "name": "Local",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://localhost:8000/v1",
        "apiKey": "empty-for-local"
      },
      "models": {
        "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit": {
          "name": "DeepSeek 1.5B (Unsloth)",
          "reasoning": true,
          "interleaved": {
            "field": "reasoning_content"
          }
        },
        "stabilityai/sd-turbo": {
          "name": "SD Turbo",
          "attachment": true,
          "modalities": {
            "input": ["text"],
            "output": ["image"]
          }
        }
      }
    }
  }
}

Project Structure

.
├── main.py              # Application entry point
├── app.py               # Backward-compatible wrapper
├── src/
│   ├── __init__.py
│   ├── config.py        # CLI arguments and global config
│   ├── manage_keys.py   # CLI tool for API key management
│   ├── api/             # HTTP endpoints
│   │   ├── __init__.py
│   │   ├── routes.py    # OpenAI-compatible endpoints
│   │   ├── middleware.py # Auth middleware (Bearer token validation)
│   │   ├── rate_limit.py # Rate limiting configuration
│   │   └── cache.py     # aiocache configuration
│   ├── core/            # Business logic and services
│   │   ├── __init__.py
│   │   ├── models.py    # HF auth and text model loading
│   │   ├── images.py    # Stable Diffusion image generation
│   │   ├── cache.py     # LRU response cache
│   │   ├── inference.py # Async inference worker queue
│   │   ├── api_keys.py  # API key management (SQLite + bcrypt)
│   │   └── audio.py     # Audio transcription & TTS
│   ├── streaming/       # Real-time token streaming
│   │   ├── __init__.py
│   │   └── streamer.py  # Token streamer with reasoning detection
│   └── utils/           # Utilities
│       ├── __init__.py
│       └── text.py      # Text parsing and conversation utilities
├── docker/              # Docker deployment files
│   ├── Dockerfile               # General purpose image
│   ├── Dockerfile.cuda          # CUDA image for GPU hosts
│   ├── Dockerfile.runpod        # RunPod-optimized image
│   ├── docker_start.py          # Container entrypoint
│   ├── docker-compose.yml       # Local testing
│   └── .dockerignore            # Build exclusions
├── runpod/              # RunPod serverless handler
│   ├── handler.py       # Serverless handler interface
│   └── README.md        # RunPod deployment guide
├── DEPLOY_DOCKER.md     # Docker build & publish guide
├── tests/               # Test suite
│   ├── conftest.py      # Shared pytest fixtures
│   └── unit/
│       ├── core/        # Core module tests
│       │   ├── test_config.py
│       │   ├── test_cache.py
│       │   ├── test_models.py
│       │   ├── test_images.py
│       │   ├── test_audio.py
│       │   └── test_api_keys.py
│       └── api/         # API endpoint tests
│           ├── test_routes.py
│           └── test_rate_limit.py
├── .env.example         # Example environment variables
├── pyproject.toml       # Project dependencies
├── uv.lock              # Locked dependency versions
└── .env                 # Environment variables

CLI Options

| Option | Description | Default |
| --- | --- | --- |
| --load | Which models to load: llm, image, audio, tts, both, or all | all |
| --model-id | Hugging Face chat model ID | Required when loading llm/both/all |
| --image-model-id | Hugging Face image model ID | Required when loading image/both/all |
| --audio-model-id | Hugging Face audio transcription model ID | Required when loading audio/all |
| --tts-model-id | Hugging Face text-to-speech model ID | Required when loading tts/all |
| --port | Server port | 8000 |
| --cache-size | Response cache size | 128 |
| --workers | Uvicorn workers (1 recommended) | 1 |
| --strip-reasoning | Remove <think> tags from output | False |
| --models-dir | Directory to store downloaded models | ./models |
| --gguf-file | Specific GGUF file to load (e.g., model-Q4_K_M.gguf) | Auto-detected |
| --gguf-auto-detect | Auto-detect best GGUF from repository | True |
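
The "required when loading" rules in the table can be read as a small validation step. The sketch below is an illustration with argparse, not the project's src/config.py; it covers only a subset of the options, and the REQUIRED mapping is inferred from the table.

```python
import argparse

# Which model-id options each --load value requires, per the table above.
REQUIRED = {
    "llm": ["model_id"],
    "image": ["image_model_id"],
    "audio": ["audio_model_id"],
    "tts": ["tts_model_id"],
    "both": ["model_id", "image_model_id"],
    "all": ["model_id", "image_model_id", "audio_model_id", "tts_model_id"],
}


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--load", choices=sorted(REQUIRED), default="all")
    parser.add_argument("--model-id")
    parser.add_argument("--image-model-id")
    parser.add_argument("--audio-model-id")
    parser.add_argument("--tts-model-id")
    parser.add_argument("--port", type=int, default=8000)
    args = parser.parse_args(argv)
    # Report every missing model ID for the selected --load value at once.
    missing = [name for name in REQUIRED[args.load] if getattr(args, name) is None]
    if missing:
        flags = ", ".join("--" + m.replace("_", "-") for m in missing)
        parser.error(f"--load {args.load} requires {flags}")
    return args


args = parse_args(["--load", "llm", "--model-id", "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit"])
print(args.load, args.port)  # llm 8000
```

Calling parse_args(["--load", "image"]) without --image-model-id exits with a usage error, which matches the table's requirement.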

API Endpoints

Chat Completions

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false,
    "max_tokens": 512,
    "temperature": 0.7
  }'

Streaming example:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "stream": true
  }'
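
On the client side, a streamed response can be consumed line by line. The sketch below assumes the server emits OpenAI-style server-sent events ("data: {...}" lines ending with "data: [DONE]"), which this README implies but does not spell out; the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the `delta` object from each streamed chunk line."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            yield choice.get("delta", {})


# Sample lines in the OpenAI chunk format (made up for illustration).
sample = [
    'data: {"choices": [{"delta": {"reasoning_content": "Think. "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " there"}}]}',
    "data: [DONE]",
]
deltas = list(iter_deltas(sample))
answer = "".join(d.get("content", "") for d in deltas)
print(answer)  # Hello there
```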

Image Generation

curl -X POST http://localhost:8000/v1/images/generations \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A serene mountain landscape at sunset, digital art",
    "n": 1,
    "size": "512x512",
    "response_format": "b64_json"
  }'

List Models

curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer sk-your-key-here"

Audio Transcription

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer sk-your-key-here" \
  -F "file=@audio.mp3"

Streaming example:

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer sk-your-key-here" \
  -F "file=@audio.mp3" \
  -F "stream=true"

Text-to-Speech

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, this is a test of the text to speech system.",
    "voice": "v2/en_speaker_6",
    "response_format": "mp3"
  }'

Health Check

curl http://localhost:8000/health

Text Models (Chat)

| Model | Size | VRAM | Notes |
| --- | --- | --- | --- |
| unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit | 1.5B | ~3GB | Recommended - Reasoning model, optimized 4-bit |
| unsloth/Llama-3.2-1B-Instruct-unsloth-bnb-4bit | 1B | ~2GB | Very fast, minimal VRAM |
| google/gemma-2-2b-it | 2B | ~4GB | Lightweight, decent quality |
| unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit | 3B | ~4GB | Good balance speed/quality |
| meta-llama/Llama-3.2-3B-Instruct | 3B | ~5GB | Requires HF_TOKEN (gated) |
| microsoft/Phi-3-mini-4k-instruct | 3.8B | ~5GB | Fast, good instruction following |
| unsloth/Phi-3-mini-4k-instruct-bnb-4bit | 3.8B | ~4GB | Unsloth optimized version |
| unsloth/Qwen2.5-7B-Instruct-bnb-4bit | 7B | ~6GB | Recommended - Higher quality, efficient |
| Qwen/Qwen2.5-7B-Instruct | 7B | ~8GB | Higher quality, needs more VRAM |
| unsloth/Mistral-Small-3.2-24B-Instruct-2506-bnb-4bit | 24B | ~14GB | Best quality, needs 16GB+ VRAM |

GGUF Models (Alternative Format)

| Model | Size | VRAM | Notes |
| --- | --- | --- | --- |
| unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF | 1.5B | ~3GB | Recommended - GGUF Q4_K_M, fast inference |
| unsloth/Qwen2.5-7B-Instruct-GGUF | 7B | ~6GB | High quality GGUF format |
| TheBloke/Llama-2-7B-GGUF | 7B | ~6GB | Llama 2, widely compatible |
| TheBloke/Mistral-7B-Instruct-v0.2-GGUF | 7B | ~6GB | Mistral, good instruction following |

GGUF vs BNB:

  • GGUF: Single-file format, easier to manage, dequantized at load time
  • BNB 4-bit: Native transformers quantization, potentially better VRAM efficiency
  • Both formats are supported; use whichever fits your workflow

Unsloth models ship pre-quantized in optimized 4-bit BNB format, so they load faster and use less VRAM than the original full-precision checkpoints. They are the recommended choice for this server.

Image Models

| Model | VRAM | Speed | Quality |
| --- | --- | --- | --- |
| stabilityai/sd-turbo | 3-5GB | Very Fast | Good |
| stabilityai/sdxl-turbo | 4-6GB | Very Fast | Very Good |
| runwayml/stable-diffusion-v1-5 | 4-6GB | Medium | Good |
| stabilityai/stable-diffusion-2-1 | 5-7GB | Medium | Better |
| ByteDance/SDXL-Lightning | 6-8GB | Fast | Very Good |
| stabilityai/stable-diffusion-xl-base-1.0 | 8-10GB | Slow | Best |
| stabilityai/stable-diffusion-3-medium-diffusers | 10-12GB | Slow | Best |

Audio Models (Transcription)

| Model | VRAM | Speed | Quality |
| --- | --- | --- | --- |
| openai/whisper-tiny | 2-4GB | Very Fast | Good |
| openai/whisper-base | 3-5GB | Fast | Good |
| openai/whisper-small | 4-6GB | Medium | Better |
| openai/whisper-medium | 6-8GB | Medium | Better |
| openai/whisper-large-v3-turbo | 6-8GB | Fast | Best |
| openai/whisper-large-v3 | 8-10GB | Slow | Best |

Text-to-Speech Models

| Model | VRAM | Speed | Quality |
| --- | --- | --- | --- |
| suno/bark-small | 4-6GB | Medium | Good |
| suno/bark | 6-8GB | Slow | Better |
| microsoft/speecht5_tts | 2-4GB | Fast | Good |

Note: Running multiple models simultaneously requires sufficient VRAM. With 6-8GB VRAM, use smaller models (DeepSeek 1.5B + SD Turbo + Whisper Tiny). With 12GB+ VRAM, you can run larger combinations. Use --load to selectively load only the models you need.

Reasoning Models

Models like DeepSeek-R1 output reasoning inside <think> tags. The server automatically:

  1. Streams reasoning tokens as reasoning_content / reasoning_text
  2. Detects </think> to switch to content streaming
  3. Emits both fields for maximum client compatibility
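
The switch from reasoning to content can be sketched as a small stateful splitter. This is an illustration only, not the code in src/streaming/streamer.py; the class name ReasoningSplitter is hypothetical, and it assumes the stream starts inside the reasoning block (the opening <think> tag has already been consumed).

```python
CLOSE_TAG = "</think>"


class ReasoningSplitter:
    """Route streamed text to reasoning_content until </think>, then to content."""

    def __init__(self) -> None:
        self.in_reasoning = True
        self.buffer = ""

    def feed(self, token: str) -> list[tuple[str, str]]:
        if not self.in_reasoning:
            return [("content", token)] if token else []
        self.buffer += token
        out = []
        if CLOSE_TAG in self.buffer:
            reasoning, rest = self.buffer.split(CLOSE_TAG, 1)
            self.in_reasoning = False
            self.buffer = ""
            if reasoning:
                out.append(("reasoning_content", reasoning))
            if rest:
                out.append(("content", rest))
            return out
        # Hold back any suffix that could be the start of a split closing tag.
        keep = 0
        for n in range(min(len(CLOSE_TAG) - 1, len(self.buffer)), 0, -1):
            if CLOSE_TAG.startswith(self.buffer[-n:]):
                keep = n
                break
        cut = len(self.buffer) - keep
        emit, self.buffer = self.buffer[:cut], self.buffer[cut:]
        if emit:
            out.append(("reasoning_content", emit))
        return out


splitter = ReasoningSplitter()
events = []
for tok in ["Let me ", "check.</th", "ink>The answer", " is 4."]:
    events.extend(splitter.feed(tok))
print(events)
```

The closing tag arrives split across two tokens here, which is why the splitter holds back a possible tag prefix instead of emitting it as reasoning text. A fuller version would also flush the held-back buffer when the stream ends.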

In OpenCode, use /thinking to toggle reasoning visibility.

To strip reasoning entirely (faster, less output):

uv run python main.py \
  --model-id unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
  --strip-reasoning \
  --port 8000

Rate Limiting

The server includes per-endpoint rate limiting based on client IP address using slowapi:

| Endpoint | Rate Limit |
| --- | --- |
| /v1/chat/completions | 30 requests/minute |
| /v1/images/generations | 10 requests/minute |
| /v1/audio/transcriptions | 20 requests/minute |
| /v1/audio/speech | 20 requests/minute |
| /v1/models | 60 requests/minute |
| /health | 100 requests/minute |

When a rate limit is exceeded, the server returns HTTP 429 with:

{
  "error": {
    "message": "Rate limit exceeded: ...",
    "type": "rate_limit_exceeded",
    "code": "rate_limit_exceeded"
  }
}
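
Clients can react to a 429 by waiting and retrying. The helper below is a client-side sketch with exponential backoff; call_with_backoff and its parameters are hypothetical names, and the send callable stands in for whatever HTTP call the client actually makes.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Call `send` until it returns something other than HTTP 429."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return status, body


# Stand-in for a real HTTP call: two rate-limited replies, then success.
replies = iter([
    (429, {"error": {"code": "rate_limit_exceeded"}}),
    (429, {"error": {"code": "rate_limit_exceeded"}}),
    (200, {"ok": True}),
])
delays = []
status, body = call_with_backoff(lambda: next(replies), sleep=delays.append)
print(status, delays)  # 200 [1.0, 2.0]
```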

Caching

The server uses aiocache for response caching:

  • Chat completions (non-streaming): Cached for 5 minutes based on messages, max_tokens, and temperature
  • Model list: Cached for 1 minute
  • Streaming responses: Not cached
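
The keying described above can be illustrated with a hand-rolled example. This sketch is not aiocache and does not implement the time-based expiry; it only shows one way to derive a stable key from messages, max_tokens, and temperature, paired with a small LRU store. The names cache_key and LRUCache are hypothetical.

```python
import hashlib
import json
from collections import OrderedDict


def cache_key(messages: list, max_tokens: int, temperature: float) -> str:
    """Stable key built from the three fields the cache is keyed on."""
    blob = json.dumps(
        {"messages": messages, "max_tokens": max_tokens, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class LRUCache:
    """Tiny LRU cache: the least recently used entry is evicted first."""

    def __init__(self, max_size: int = 128) -> None:
        self.max_size = max_size
        self._items: OrderedDict = OrderedDict()

    def get(self, key: str):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop the oldest entry


cache = LRUCache(max_size=2)
k1 = cache_key([{"role": "user", "content": "Hello!"}], 512, 0.7)
k2 = cache_key([{"role": "user", "content": "Hello!"}], 256, 0.7)
cache.put(k1, "first")
cache.put(k2, "second")
cache.get(k1)              # touch k1 so k2 becomes the oldest
cache.put("k3", "third")   # evicts k2
print(cache.get(k1), cache.get(k2))  # first None
```

Requests that differ only in max_tokens produce different keys, so they never share a cached response.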

Cache is stored in memory by default. For production with multiple workers, configure Redis:

# In src/api/cache.py
from aiocache import Cache
from aiocache.serializers import JsonSerializer

cache = Cache(Cache.REDIS, endpoint="localhost", port=6379, serializer=JsonSerializer())

Development

# Run with auto-reload (development only)
uv run uvicorn main:create_app --factory --reload --port 8000

# Format code
uv run ruff format .

# Type check
uv run pyright

Testing

The project uses pytest for testing. Tests are organized under the tests/ directory:

tests/
├── conftest.py                  # Shared fixtures
├── unit/
│   ├── core/                    # Core module tests
│   │   ├── test_config.py       # Configuration tests
│   │   ├── test_cache.py        # Cache tests
│   │   ├── test_models.py       # Model loading & GGUF tests
│   │   ├── test_images.py       # Image generation tests
│   │   ├── test_audio.py        # Audio/TTS tests
│   │   └── test_api_keys.py     # API key management tests
│   └── api/
│       ├── test_routes.py       # API endpoint tests
│       └── test_rate_limit.py   # Rate limiting tests
└── integration/
    └── (future e2e tests)

Running Tests

# Run all tests
uv run pytest tests/ -v

# Run specific test file
uv run pytest tests/unit/core/test_models.py -v

# Run with coverage report
uv run pytest tests/ -v --cov=src --cov-report=term-missing

# Run with HTML coverage report
uv run pytest tests/ --cov=src --cov-report=html

# Run with coverage and fail if below threshold
uv run pytest tests/ --cov=src --cov-fail-under=80

Test Coverage

Current test coverage focuses on:

  • Configuration: Settings classes, environment variables, CLI arguments
  • Cache: LRU eviction, key generation, hit/miss logic
  • Model Loading: HF auth, GGUF detection, auto-quantization selection
  • Image Generation: Pipeline calls, base64 conversion
  • Audio: Transcription, text-to-speech, base64 conversion
  • API Key Management: Key generation, bcrypt hashing, database operations, validation
  • API Routes: Health check, model listing, error handling

Coverage reports are generated in htmlcov/ when using --cov-report=html.

Docker Testing

# Build and test locally
docker build -f docker/Dockerfile -t hagalaz:test .
docker run -p 8000:8000 -e AUTO_CREATE_ADMIN_KEY=true hagalaz:test

# Test with docker-compose
cd docker
docker-compose up --build

Troubleshooting

ImportError: cannot import name 'BaseStreamer'

The BaseStreamer class path changed in newer transformers versions. This is handled automatically in src/streaming/streamer.py.

Out of Memory

  • Reduce --cache-size (default: 128)
  • Use smaller models
  • Enable CPU offload for image models (edit src/core/images.py)
  • Run only specific models with --load llm, --load image, --load audio, or --load tts

Model Access Denied

Gated models require authentication:

  1. Get token from https://huggingface.co/settings/tokens
  2. Add to .env: HF_TOKEN=hf_...
  3. Or run: huggingface-cli login

API Key Authentication Issues

401 Unauthorized:

  • Ensure Authorization: Bearer sk-... header is present in requests
  • The /health endpoint is the only one that does not require authentication
  • Create a key first: uv run python -m src.manage_keys add "My Key"

Key not recognized after creation:

  • Hot-reload keys: uv run python -m src.manage_keys reload
  • Or restart the server

Admin operations failing:

  • Admin keys are created with --admin flag: uv run python -m src.manage_keys add --admin "Admin"
  • Only admin keys can trigger POST /v1/admin/keys/reload

Slow Generation

  • Use sd-turbo instead of full SD models for images
  • Reduce max_tokens in requests
  • Keep use_cache=True (the default)

License

MIT