Rodolfo De Nadai 00769da682 fix: Small fixes		2026-05-07 08:50:13 -03:00

Hagalaz

A lightweight, OpenAI-compatible API server for running Hugging Face models locally. Supports text generation with real-time streaming, reasoning model detection, image generation via Stable Diffusion, audio transcription, and text-to-speech.

Features

OpenAI-compatible API - Drop-in replacement for OpenAI API endpoints
Real-time Streaming - True token-by-token streaming (not batched post-generation)
Reasoning Models - Automatic <think> tag detection for DeepSeek-R1 and similar models
4-bit Quantization - Run large models on GPUs with ~6GB VRAM via BitsAndBytes
Response Caching - LRU cache for non-streaming requests, with aiocache integration for model lists and chat completions
Image Generation - Stable Diffusion support via /v1/images/generations
Audio Transcription - Whisper-based speech-to-text with streaming support via /v1/audio/transcriptions
Text-to-Speech - Bark-based speech synthesis via /v1/audio/speech
Rate Limiting - Per-endpoint rate limits using slowapi (IP-based)
API Key Authentication - SQLite-backed key management with bcrypt hashing, admin keys, and hot-reload
Hugging Face Auth - Gated model support via HF_TOKEN

Quick Start

# Install dependencies
uv sync

# Run with all models
uv run python main.py \
  --model-id unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
  --image-model-id stabilityai/sd-turbo \
  --audio-model-id openai/whisper-base \
  --tts-model-id suno/bark-small \
  --port 8000

# Run with only a text model
uv run python main.py \
  --load llm \
  --model-id unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
  --port 8000

# Run with a GGUF model (auto-detect best quantization)
uv run python main.py \
  --load llm \
  --model-id unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF \
  --port 8000

# Run with a specific GGUF file
uv run python main.py \
  --load llm \
  --model-id TheBloke/Llama-2-7B-GGUF \
  --gguf-file llama-2-7b.Q5_K_M.gguf \
  --port 8000

# Run with only an image model
uv run python main.py \
  --load image \
  --image-model-id stabilityai/sd-turbo \
  --port 8000

# Run with only audio transcription
uv run python main.py \
  --load audio \
  --audio-model-id openai/whisper-base \
  --port 8000

# Run with only text-to-speech
uv run python main.py \
  --load tts \
  --tts-model-id suno/bark-small \
  --port 8000
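
The "auto-detect best quantization" behavior mentioned above could work roughly as follows. This is only a sketch: the preference order and the function name `pick_gguf_file` are assumptions made for illustration, not the server's actual selection logic.

```python
from typing import Optional, Sequence

# Assumed preference order (illustrative only).
PREFERRED_QUANTS = ("Q4_K_M", "Q5_K_M", "Q8_0")


def pick_gguf_file(filenames: Sequence[str]) -> Optional[str]:
    """Return the first .gguf file matching the preferred quantization order."""
    gguf_files = sorted(f for f in filenames if f.lower().endswith(".gguf"))
    for quant in PREFERRED_QUANTS:
        for name in gguf_files:
            if quant.lower() in name.lower():
                return name
    # Fall back to any GGUF file, or None when the repository has none.
    return gguf_files[0] if gguf_files else None
```

Passing --gguf-file bypasses any such heuristic and loads exactly the file you name.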

Configuration

Create a .env file for gated models:

HF_TOKEN=hf_your_token_here

Or export directly:

export HF_TOKEN=hf_your_token_here

API Key Authentication

The server supports API key authentication with SQLite-backed persistence and in-memory caching for fast validation. All endpoints except /health require authentication when enabled.

Enable/Disable

Set in .env:

# Enable API key authentication (default: true)
API_KEY_ENABLED=true

# Database path (default: ./data/api_keys.db)
API_KEY_DB_PATH=./data/api_keys.db

# bcrypt rounds for hashing (default: 12)
API_KEY_BCRYPT_ROUNDS=12

# Key prefix (default: sk-)
API_KEY_KEY_PREFIX=sk-

# Random key length in hex chars (default: 48)
API_KEY_KEY_LENGTH=48
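
To show how the prefix and length settings could combine, here is a minimal sketch. It assumes a key is simply the prefix followed by random hex characters; `generate_api_key` is a hypothetical name, not necessarily the function used in src/core/api_keys.py.

```python
import secrets


def generate_api_key(prefix: str = "sk-", length: int = 48) -> str:
    """Build a key as prefix + `length` random hex characters."""
    # token_hex(n) returns 2*n hex characters, so request enough bytes and trim.
    n_bytes = (length + 1) // 2
    return prefix + secrets.token_hex(n_bytes)[:length]
```

With the defaults above, this yields a 51-character string: the 3-character prefix plus 48 hex characters.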

Managing Keys

Use the included CLI tool manage_keys.py:

# Create a new key
uv run python -m src.manage_keys add "My App Key"

# Create an admin key (required for hot-reload)
uv run python -m src.manage_keys add --admin "Admin Key"

# List all keys
uv run python -m src.manage_keys list

# Deactivate a key
uv run python -m src.manage_keys deactivate <key_id>

# Reactivate a key
uv run python -m src.manage_keys reactivate <key_id>

# Hot-reload active keys without restarting server
uv run python -m src.manage_keys reload

Important: The full key is shown only once on creation. Store it securely.

Using Keys in Requests

Include the key in the Authorization header:

# Chat completions with authentication
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

# Image generation with authentication
curl -X POST http://localhost:8000/v1/images/generations \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A serene mountain landscape",
    "n": 1,
    "size": "512x512"
  }'

# Hot-reload keys (admin only)
curl -X POST http://localhost:8000/v1/admin/keys/reload \
  -H "Authorization: Bearer sk-your-admin-key-here"
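
The same authenticated call can be made from Python. The sketch below uses only the standard library and mirrors the chat completions curl example; `build_chat_request` is a hypothetical helper name, and the request is built but not sent.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it against a running server, pass the result to `urllib.request.urlopen` and decode the JSON body of the response.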

Key Features

bcrypt hashing: Keys stored as hashes, full key shown only once on creation
Soft delete: Deactivated keys are kept in DB with deleted_at for audit trail
In-memory cache: Active keys loaded into memory for O(1) validation
Hot-reload: Add/remove keys without server restart via admin endpoint
Admin flag: Admin keys can trigger hot-reload via POST /v1/admin/keys/reload
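
The soft-delete and hot-reload behavior in the list above can be sketched with the standard sqlite3 module. The table layout, column names (other than deleted_at), and function names here are assumptions for illustration; the real schema is not shown in this README, and bcrypt verification is left out.

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative schema; only deleted_at is named in the README.
SCHEMA = """
CREATE TABLE api_keys (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    key_hash TEXT NOT NULL,
    is_admin INTEGER NOT NULL DEFAULT 0,
    deleted_at TEXT
)
"""


def deactivate(conn: sqlite3.Connection, key_id: int) -> None:
    """Soft delete: stamp deleted_at instead of removing the row."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute("UPDATE api_keys SET deleted_at = ? WHERE id = ?", (now, key_id))


def load_active_keys(conn: sqlite3.Connection) -> dict:
    """Rebuild the in-memory cache: id -> (key_hash, is_admin), active keys only."""
    rows = conn.execute(
        "SELECT id, key_hash, is_admin FROM api_keys WHERE deleted_at IS NULL"
    )
    return {key_id: (key_hash, bool(is_admin)) for key_id, key_hash, is_admin in rows}
```

A hot-reload then amounts to calling `load_active_keys` again and swapping the resulting dictionary in for the old one, while deactivated rows stay in the database for auditing.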

Docker Deployment

The project includes Docker support for easy deployment on cloud platforms like RunPod, as well as local testing.

Quick Start with Docker

# Pull and run with default settings (LOAD=llm, minimal model)
docker run -p 8000:8000 \
  -e AUTO_CREATE_ADMIN_KEY=true \
  yourusername/hagalaz:latest

# Run with all models
docker run -p 8000:8000 \
  -e LOAD=all \
  -e MODEL_ID=unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
  -e IMAGE_MODEL_ID=stabilityai/sd-turbo \
  -e AUDIO_MODEL_ID=openai/whisper-base \
  -e TTS_MODEL_ID=suno/bark-small \
  -e HF_TOKEN=hf_your_token_here \
  -e AUTO_CREATE_ADMIN_KEY=true \
  -v /path/to/models:/app/models \
  yourusername/hagalaz:cuda

Building Images

# General purpose (CPU/GPU-agnostic)
docker build -f docker/Dockerfile -t hagalaz:latest .

# CUDA for GPU hosts
docker build -f docker/Dockerfile.cuda -t hagalaz:cuda .

# RunPod optimized
docker build -f docker/Dockerfile.runpod -t hagalaz:runpod .

Docker Compose

cd docker

# Start with general purpose image
docker-compose up hagalaz

# Start with CUDA image
docker-compose up hagalaz-cuda

Environment Variables

All configuration is done via environment variables:

Variable	Description	Default
`LOAD`	Models to load: `llm`, `image`, `audio`, `tts`, `both` (`llm` + `image`), `all`	`llm`
`MODEL_ID`	Chat model ID	`unsloth/Llama-3.2-1B-Instruct-unsloth-bnb-4bit`
`IMAGE_MODEL_ID`	Image generation model ID	(none)
`AUDIO_MODEL_ID`	Audio transcription model ID	(none)
`TTS_MODEL_ID`	Text-to-speech model ID	(none)
`PORT`	Server port	`8000`
`CACHE_SIZE`	Response cache size	`128`
`WORKERS`	Uvicorn workers	`1`
`MODELS_DIR`	Models storage directory	`/app/models`
`HF_TOKEN`	HuggingFace authentication token	(none)
`AUTO_CREATE_ADMIN_KEY`	Auto-create admin key on startup	`false`
`ADMIN_KEY_NAME`	Name for auto-created admin key	`admin`
`AUTO_CREATE_KEYS`	Comma-separated regular key names	(none)
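
As an illustration of how a container entrypoint might map these variables onto the CLI options of main.py, consider the sketch below. The function name `build_server_args` and the mapping itself are assumptions; docker/docker_start.py may do this differently.

```python
from typing import List, Mapping

# Defaults copied from the table above.
DEFAULTS = {
    "LOAD": "llm",
    "MODEL_ID": "unsloth/Llama-3.2-1B-Instruct-unsloth-bnb-4bit",
    "PORT": "8000",
    "CACHE_SIZE": "128",
    "WORKERS": "1",
    "MODELS_DIR": "/app/models",
}

# Optional variables have no default and are only passed through when set.
OPTIONAL_FLAGS = {
    "IMAGE_MODEL_ID": "--image-model-id",
    "AUDIO_MODEL_ID": "--audio-model-id",
    "TTS_MODEL_ID": "--tts-model-id",
}


def build_server_args(env: Mapping[str, str]) -> List[str]:
    """Translate container environment variables into main.py CLI arguments."""
    cfg = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    args = [
        "--load", cfg["LOAD"],
        "--model-id", cfg["MODEL_ID"],
        "--port", cfg["PORT"],
        "--cache-size", cfg["CACHE_SIZE"],
        "--workers", cfg["WORKERS"],
        "--models-dir", cfg["MODELS_DIR"],
    ]
    for name, flag in OPTIONAL_FLAGS.items():
        if env.get(name):
            args += [flag, env[name]]
    return args
```

Calling `build_server_args(os.environ)` would produce the argument list for the current environment.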

RunPod Deployment

Pod Mode (Persistent Server)

Deploy as a persistent pod on RunPod:

docker run -p 8000:8000 \
  -e LOAD=all \
  -e MODEL_ID=unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
  -e AUTO_CREATE_ADMIN_KEY=true \
  -v /runpod-volume:/app/models \
  yourusername/hagalaz:runpod

The admin key will be printed to logs on first startup.

Serverless Mode

Use the included handler for RunPod serverless:

{
  "input": {
    "endpoint": "chat/completions",
    "model": "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 512
  }
}

Supported endpoints: chat/completions, images/generations, audio/transcriptions, audio/speech, models/list

Note: Streaming is disabled in serverless mode. All responses are returned as complete JSON.
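
A client can build and pre-validate the serverless payload before submitting it. The sketch below wraps request parameters in the envelope shown above; `build_serverless_input` is a hypothetical helper, and rejecting `stream` on the client side is an assumption about what is convenient rather than documented handler behavior.

```python
SUPPORTED_ENDPOINTS = {
    "chat/completions",
    "images/generations",
    "audio/transcriptions",
    "audio/speech",
    "models/list",
}


def build_serverless_input(endpoint: str, **params) -> dict:
    """Wrap request parameters in the {"input": {...}} envelope shown above."""
    if endpoint not in SUPPORTED_ENDPOINTS:
        raise ValueError(f"Unsupported endpoint: {endpoint}")
    if params.get("stream"):
        # Streaming is disabled in serverless mode, so reject it early.
        raise ValueError("stream=True is not available in serverless mode")
    return {"input": {"endpoint": endpoint, **params}}
```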

Volume Mounts

For persistent storage, mount these directories:

Container Path	Description
`/app/models`	Downloaded HuggingFace models
`/app/data`	API key database

Example:

docker run -p 8000:8000 \
  -v /path/to/models:/app/models \
  -v /path/to/data:/app/data \
  hagalaz:cuda

See DEPLOY_DOCKER.md for detailed build and publish instructions.

OpenCode Integration

Add to ~/.config/opencode/opencode.json:

{
  "model": "local/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
  "provider": {
    "local": {
      "name": "Local",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://localhost:8000/v1",
        "apiKey": "empty-for-local"
      },
      "models": {
        "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit": {
          "name": "DeepSeek 1.5B (Unsloth)",
          "reasoning": true,
          "interleaved": {
            "field": "reasoning_content"
          }
        }
      }
    }
  }
}

With API key authentication enabled:

{
  "model": "local/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
  "provider": {
    "local": {
      "name": "Local",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://localhost:8000/v1",
        "apiKey": "sk-your-key-here"
      },
      "models": {
        "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit": {
          "name": "DeepSeek 1.5B (Unsloth)",
          "reasoning": true,
          "interleaved": {
            "field": "reasoning_content"
          }
        }
      }
    }
  }
}

For image generation support, add the image model:

{
  "model": "local/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
  "provider": {
    "local": {
      "name": "Local",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://localhost:8000/v1",
        "apiKey": "empty-for-local"
      },
      "models": {
        "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit": {
          "name": "DeepSeek 1.5B (Unsloth)",
          "reasoning": true,
          "interleaved": {
            "field": "reasoning_content"
          }
        },
        "stabilityai/sd-turbo": {
          "name": "SD Turbo",
          "attachment": true,
          "modalities": {
            "input": ["text"],
            "output": ["image"]
          }
        }
      }
    }
  }
}

Project Structure

.
├── main.py              # Application entry point
├── app.py               # Backward-compatible wrapper
├── src/
│   ├── __init__.py
│   ├── config.py        # CLI arguments and global config
│   ├── api/             # HTTP endpoints
│   │   ├── __init__.py
│   │   ├── routes.py    # OpenAI-compatible endpoints
│   │   ├── middleware.py # Auth middleware (Bearer token validation)
│   │   ├── rate_limit.py # Rate limiting configuration
│   │   └── cache.py     # aiocache configuration
│   ├── core/            # Business logic and services
│   │   ├── __init__.py
│   │   ├── models.py    # HF auth and text model loading
│   │   ├── images.py    # Stable Diffusion image generation
│   │   ├── cache.py     # LRU response cache
│   │   ├── inference.py # Async inference worker queue
│   │   ├── api_keys.py  # API key management (SQLite + bcrypt)
│   │   └── audio.py     # Audio transcription & TTS
│   ├── streaming/       # Real-time token streaming
│   │   ├── __init__.py
│   │   └── streamer.py  # Token streamer with reasoning detection
│   └── utils/           # Utilities
│       ├── __init__.py
│       └── text.py      # Text parsing and conversation utilities
├── src/manage_keys.py   # CLI tool for API key management
├── docker/              # Docker deployment files
│   ├── Dockerfile               # General purpose image
│   ├── Dockerfile.cuda          # CUDA image for GPU hosts
│   ├── Dockerfile.runpod        # RunPod-optimized image
│   ├── docker_start.py          # Container entrypoint
│   ├── docker-compose.yml       # Local testing
│   └── .dockerignore            # Build exclusions
├── runpod/              # RunPod serverless handler
│   ├── handler.py       # Serverless handler interface
│   └── README.md        # RunPod deployment guide
├── DEPLOY_DOCKER.md     # Docker build & publish guide
├── tests/               # Test suite
│   ├── conftest.py      # Shared pytest fixtures
│   └── unit/
│       ├── core/        # Core module tests
│       │   ├── test_config.py
│       │   ├── test_cache.py
│       │   ├── test_models.py
│       │   ├── test_images.py
│       │   ├── test_audio.py
│       │   └── test_api_keys.py
│       └── api/         # API endpoint tests
│           ├── test_routes.py
│           └── test_rate_limit.py
├── .env.example         # Example environment variables
├── pyproject.toml       # Project dependencies
├── uv.lock              # Locked dependency versions
└── .env                 # Environment variables

CLI Options

Option	Description	Default
`--load`	Which models to load: `llm`, `image`, `audio`, `tts`, `both` (`llm` + `image`), or `all`	`all`
`--model-id`	Hugging Face chat model ID	Required when loading llm/both/all
`--image-model-id`	Hugging Face image model ID	Required when loading image/both/all
`--audio-model-id`	Hugging Face audio transcription model ID	Required when loading audio/all
`--tts-model-id`	Hugging Face text-to-speech model ID	Required when loading tts/all
`--port`	Server port	8000
`--cache-size`	Response cache size	128
`--workers`	Uvicorn workers (1 recommended)	1
`--strip-reasoning`	Remove `<think>` reasoning blocks from output	False
`--models-dir`	Directory to store downloaded models	`./models`
`--gguf-file`	Specific GGUF file to load (e.g., `model-Q4_K_M.gguf`)	Auto-detected
`--gguf-auto-detect`	Auto-detect best GGUF from repository	True
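
The "Required when loading ..." rules in the table can be expressed with argparse along these lines. This is a sketch under the assumption that validation happens right after parsing; the real src/config.py may be organized differently, and `--gguf-auto-detect` is omitted for brevity.

```python
import argparse
from typing import Optional, Sequence

# Which --load values need which model ID, per the table above.
REQUIRED_IDS = {
    "model_id": {"llm", "both", "all"},
    "image_model_id": {"image", "both", "all"},
    "audio_model_id": {"audio", "all"},
    "tts_model_id": {"tts", "all"},
}


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Sketch of the server CLI")
    parser.add_argument("--load", default="all",
                        choices=["llm", "image", "audio", "tts", "both", "all"])
    parser.add_argument("--model-id")
    parser.add_argument("--image-model-id")
    parser.add_argument("--audio-model-id")
    parser.add_argument("--tts-model-id")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--cache-size", type=int, default=128)
    parser.add_argument("--workers", type=int, default=1)
    parser.add_argument("--strip-reasoning", action="store_true")
    parser.add_argument("--models-dir", default="./models")
    parser.add_argument("--gguf-file")
    args = parser.parse_args(argv)

    # Enforce the "Required when loading ..." rules.
    for dest, load_values in REQUIRED_IDS.items():
        if args.load in load_values and getattr(args, dest) is None:
            flag = "--" + dest.replace("_", "-")
            parser.error(f"{flag} is required when --load is {args.load}")
    return args
```

For example, `parse_args(["--load", "image"])` exits with a usage error because `--image-model-id` is missing.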

API Endpoints

Chat Completions

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false,
    "max_tokens": 512,
    "temperature": 0.7
  }'

Streaming example:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "stream": true
  }'
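
On the client side, a streamed response can be reassembled as sketched below. This assumes the server emits OpenAI-style server-sent events (`data: {...}` lines terminated by `data: [DONE]`) with text in `choices[0].delta`; the `reasoning_content` field name comes from the Reasoning Models section, while `collect_stream` is a hypothetical helper.

```python
import json
from typing import Iterable, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, str]:
    """Accumulate (reasoning, content) text from OpenAI-style SSE lines."""
    reasoning, content = [], []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        choices = chunk.get("choices") or [{}]
        delta = choices[0].get("delta", {})
        reasoning.append(delta.get("reasoning_content") or "")
        content.append(delta.get("content") or "")
    return "".join(reasoning), "".join(content)
```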

Image Generation

curl -X POST http://localhost:8000/v1/images/generations \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A serene mountain landscape at sunset, digital art",
    "n": 1,
    "size": "512x512",
    "response_format": "b64_json"
  }'
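
With `response_format` set to `b64_json`, the image arrives base64-encoded inside the JSON body. The sketch below decodes it, assuming the response follows the OpenAI images layout (`data[0].b64_json`); `decode_first_image` is a hypothetical helper name.

```python
import base64


def decode_first_image(response: dict) -> bytes:
    """Return the raw image bytes from the first entry of a b64_json response."""
    items = response.get("data") or []
    if not items or "b64_json" not in items[0]:
        raise ValueError("response contains no b64_json image")
    return base64.b64decode(items[0]["b64_json"])
```

The returned bytes can be written straight to a file opened in binary mode.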

List Models

curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer sk-your-key-here"

Audio Transcription

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.mp3"

Streaming example:

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.mp3" \
  -F "stream=true"

Text-to-Speech

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Authorization: Bearer sk-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, this is a test of the text to speech system.",
    "voice": "v2/en_speaker_6",
    "response_format": "mp3"
  }'

Health Check

curl http://localhost:8000/health

Recommended Models

Text Models (Chat)

Model	Size	VRAM	Notes
`unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit`	1.5B	~3GB	Recommended - Reasoning model, optimized 4-bit
`unsloth/Llama-3.2-1B-Instruct-unsloth-bnb-4bit`	1B	~2GB	Very fast, minimal VRAM
`google/gemma-2-2b-it`	2B	~4GB	Lightweight, decent quality
`unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit`	3B	~4GB	Good balance speed/quality
`meta-llama/Llama-3.2-3B-Instruct`	3B	~5GB	Requires HF_TOKEN (gated)
`microsoft/Phi-3-mini-4k-instruct`	3.8B	~5GB	Fast, good instruction following
`unsloth/Phi-3-mini-4k-instruct-bnb-4bit`	3.8B	~4GB	Unsloth optimized version
`unsloth/Qwen2.5-7B-Instruct-bnb-4bit`	7B	~6GB	Recommended - Higher quality, efficient
`Qwen/Qwen2.5-7B-Instruct`	7B	~8GB	Higher quality, needs more VRAM
`unsloth/Mistral-Small-3.2-24B-Instruct-2506-bnb-4bit`	24B	~14GB	Best quality, needs 16GB+ VRAM

GGUF Models (Alternative Format)

Model	Size	VRAM	Notes
`unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF`	1.5B	~3GB	Recommended - GGUF Q4_K_M, fast inference
`unsloth/Qwen2.5-7B-Instruct-GGUF`	7B	~6GB	High quality GGUF format
`TheBloke/Llama-2-7B-GGUF`	7B	~6GB	Llama 2, widely compatible
`TheBloke/Mistral-7B-Instruct-v0.2-GGUF`	7B	~6GB	Mistral, good instruction following

GGUF vs BNB:

GGUF: Single-file format, easier to manage, dequantized at load time
BNB 4-bit: Native transformers quantization, potentially better VRAM efficiency
Both formats are supported; use whichever fits your workflow

Unsloth models are pre-quantized with optimized 4-bit BNB, loading faster and using less VRAM than original models. They are the recommended choice for this server.

Image Models

Model	VRAM	Speed	Quality
`stabilityai/sd-turbo`	3-5GB	Very Fast	Good
`stabilityai/sdxl-turbo`	4-6GB	Very Fast	Very Good
`runwayml/stable-diffusion-v1-5`	4-6GB	Medium	Good
`stabilityai/stable-diffusion-2-1`	5-7GB	Medium	Better
`ByteDance/SDXL-Lightning`	6-8GB	Fast	Very Good
`stabilityai/stable-diffusion-xl-base-1.0`	8-10GB	Slow	Best
`stabilityai/stable-diffusion-3-medium-diffusers`	10-12GB	Slow	Best

Audio Models (Transcription)

Model	VRAM	Speed	Quality
`openai/whisper-tiny`	2-4GB	Very Fast	Good
`openai/whisper-base`	3-5GB	Fast	Good
`openai/whisper-small`	4-6GB	Medium	Better
`openai/whisper-medium`	6-8GB	Medium	Better
`openai/whisper-large-v3-turbo`	6-8GB	Fast	Best
`openai/whisper-large-v3`	8-10GB	Slow	Best

Text-to-Speech Models

Model	VRAM	Speed	Quality
`suno/bark-small`	4-6GB	Medium	Good
`suno/bark`	6-8GB	Slow	Better
`microsoft/speecht5_tts`	2-4GB	Fast	Good

Note: Running multiple models simultaneously requires sufficient VRAM. With 6-8GB VRAM, use smaller models (DeepSeek 1.5B + SD Turbo + Whisper Tiny). With 12GB+ VRAM, you can run larger combinations. Use --load to selectively load only the models you need.

Reasoning Models

Models like DeepSeek-R1 output reasoning inside <think> tags. The server automatically:

Streams reasoning tokens as reasoning_content / reasoning_text
Detects </think> to switch to content streaming
Emits both fields for maximum client compatibility
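
The split at `</think>` can be illustrated on a complete, non-streamed response. This is a simplified sketch with a hypothetical function name; the streaming detector in src/streaming/streamer.py works token by token and is not reproduced here.

```python
from typing import Tuple

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split model output into (reasoning, answer) at the closing </think> tag."""
    if THINK_CLOSE not in text:
        return "", text.strip()  # no reasoning block found
    reasoning, answer = text.split(THINK_CLOSE, 1)
    # The opening tag may be absent if the prompt template already supplied it.
    reasoning = reasoning.replace(THINK_OPEN, "", 1)
    return reasoning.strip(), answer.strip()
```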

In OpenCode, use /thinking to toggle reasoning visibility.

To strip reasoning entirely (faster, less output):

uv run python main.py \
  --model-id unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
  --strip-reasoning \
  --port 8000

Rate Limiting

The server includes per-endpoint rate limiting based on client IP address using slowapi:

Endpoint	Rate Limit
`/v1/chat/completions`	30 requests/minute
`/v1/images/generations`	10 requests/minute
`/v1/audio/transcriptions`	20 requests/minute
`/v1/audio/speech`	20 requests/minute
`/v1/models`	60 requests/minute
`/health`	100 requests/minute

When rate limit is exceeded, the server returns HTTP 429 with:

{
  "error": {
    "message": "Rate limit exceeded: ...",
    "type": "rate_limit_exceeded",
    "code": "rate_limit_exceeded"
  }
}
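
Clients can handle HTTP 429 by retrying with exponential backoff. The helper below is a generic sketch, not part of this project: `send` is any callable returning a `(status, body)` pair, and the retry count and delays are illustrative defaults.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Retry `send` while it returns HTTP 429, doubling the delay each time."""
    status, body = send()
    for attempt in range(max_retries):
        if status != 429:
            break
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
        status, body = send()
    return status, body
```

The `sleep` parameter is injectable so the helper can be tested without actually waiting.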

Caching

The server uses aiocache for response caching:

Chat completions (non-streaming): Cached for 5 minutes based on messages, max_tokens, and temperature
Model list: Cached for 1 minute
Streaming responses: Not cached
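
A cache key based on messages, max_tokens, and temperature could be derived as in the sketch below. The hashing scheme and the name `chat_cache_key` are assumptions for illustration; the actual key builder in this project may differ.

```python
import hashlib
import json
from typing import Dict, List


def chat_cache_key(messages: List[Dict[str, str]], max_tokens: int, temperature: float) -> str:
    """Derive a deterministic cache key from the fields that affect the response."""
    canonical = json.dumps(
        {"messages": messages, "max_tokens": max_tokens, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys makes the key independent of dictionary ordering, so two requests with identical parameters map to the same entry.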

Cache is stored in memory by default. For production with multiple workers, configure Redis:

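# In src/api/cache.py
# Requires aiocache installed with its Redis extra (aiocache[redis])
from aiocache import Cache
from aiocache.serializers import JsonSerializer

cache = Cache(Cache.REDIS, endpoint="localhost", port=6379, serializer=JsonSerializer())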

Development

# Run with auto-reload (development only)
uv run uvicorn main:create_app --factory --reload --port 8000

# Format code
uv run ruff format .

# Type check
uv run pyright

Testing

The project uses pytest for testing. Tests are organized under the tests/ directory:

tests/
├── conftest.py                  # Shared fixtures
├── unit/
│   ├── core/                    # Core module tests
│   │   ├── test_config.py       # Configuration tests
│   │   ├── test_cache.py        # Cache tests
│   │   ├── test_models.py       # Model loading & GGUF tests
│   │   ├── test_images.py       # Image generation tests
│   │   ├── test_audio.py        # Audio/TTS tests
│   │   └── test_api_keys.py     # API key management tests
│   └── api/
│       ├── test_routes.py       # API endpoint tests
│       └── test_rate_limit.py   # Rate limiting tests
└── integration/
    └── (future e2e tests)

Running Tests

# Run all tests
uv run pytest tests/ -v

# Run specific test file
uv run pytest tests/unit/core/test_models.py -v

# Run with coverage report
uv run pytest tests/ -v --cov=src --cov-report=term-missing

# Run with HTML coverage report
uv run pytest tests/ --cov=src --cov-report=html

# Run with coverage and fail if below threshold
uv run pytest tests/ --cov=src --cov-fail-under=80

Test Coverage

Current test coverage focuses on:

Configuration: Settings classes, environment variables, CLI arguments
Cache: LRU eviction, key generation, hit/miss logic
Model Loading: HF auth, GGUF detection, auto-quantization selection
Image Generation: Pipeline calls, base64 conversion
Audio: Transcription, text-to-speech, base64 conversion
API Key Management: Key generation, bcrypt hashing, database operations, validation
API Routes: Health check, model listing, error handling

Coverage reports are generated in htmlcov/ when using --cov-report=html.

Docker Testing

# Build and test locally
docker build -f docker/Dockerfile -t hagalaz:test .
docker run -p 8000:8000 -e AUTO_CREATE_ADMIN_KEY=true hagalaz:test

# Test with docker-compose
cd docker
docker-compose up --build

Troubleshooting

ImportError: cannot import name 'BaseStreamer'

The BaseStreamer class path changed in newer transformers versions. This is handled automatically in src/streaming/streamer.py.

Out of Memory

Reduce --cache-size (default: 128)
Use smaller models
Enable CPU offload for image models (edit src/core/images.py)
Run only specific models with --load llm, --load image, --load audio, or --load tts
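
To see why lowering --cache-size reduces memory use, consider how an LRU cache bounds the number of stored responses. The class below is a minimal sketch built on `collections.OrderedDict`; it is not the implementation in src/core/cache.py.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUCache:
    """Keep at most `max_size` entries, evicting the least recently used one."""

    def __init__(self, max_size: int = 128) -> None:
        self.max_size = max_size
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None  # cache miss
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop the oldest entry
```

Each cached entry holds a full response, so the memory ceiling scales with `max_size`.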

Model Access Denied

Gated models require authentication:

Get token from https://huggingface.co/settings/tokens
Add to .env: HF_TOKEN=hf_...
Or run: huggingface-cli login

API Key Authentication Issues

401 Unauthorized:

Ensure Authorization: Bearer sk-... header is present in requests
The /health endpoint is the only one that does not require authentication
Create a key first: uv run python -m src.manage_keys add "My Key"

Key not recognized after creation:

Hot-reload keys: uv run python -m src.manage_keys reload
Or restart the server

Admin operations failing:

Admin keys are created with --admin flag: uv run python -m src.manage_keys add --admin "Admin"
Only admin keys can trigger POST /v1/admin/keys/reload

Slow Generation

Use sd-turbo instead of full SD models for images
Reduce max_tokens in requests
Enable use_cache=True (already enabled by default)

License

MIT