Hagalaz
A lightweight, OpenAI-compatible API server for running Hugging Face models locally. Supports text generation with real-time streaming, reasoning model detection, image generation via Stable Diffusion, audio transcription, and text-to-speech.
Features
- OpenAI-compatible API - Drop-in replacement for OpenAI API endpoints
- Real-time Streaming - True token-by-token streaming (not batched post-generation)
- Reasoning Models - Automatic <think> tag detection for DeepSeek-R1 and similar models
- 4-bit Quantization - Run large models on GPUs with ~6GB VRAM via BitsAndBytes
- Response Caching - LRU cache for non-streaming requests
- Image Generation - Stable Diffusion support via /v1/images/generations
- Audio Transcription - Whisper-based speech-to-text with streaming support via /v1/audio/transcriptions
- Text-to-Speech - Bark-based speech synthesis via /v1/audio/speech
- Rate Limiting - Per-endpoint rate limits using slowapi (IP-based)
- Response Caching - aiocache integration for model lists and chat completions
- API Key Authentication - SQLite-backed key management with bcrypt hashing, admin keys, and hot-reload
- Hugging Face Auth - Gated model support via HF_TOKEN
Quick Start
# Install dependencies
uv sync
# Run with all models
uv run python main.py \
--model-id unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
--image-model-id stabilityai/sd-turbo \
--audio-model-id openai/whisper-base \
--tts-model-id suno/bark-small \
--port 8000
# Run with only a text model
uv run python main.py \
--load llm \
--model-id unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
--port 8000
# Run with a GGUF model (auto-detect best quantization)
uv run python main.py \
--load llm \
--model-id unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF \
--port 8000
# Run with a specific GGUF file
uv run python main.py \
--load llm \
--model-id TheBloke/Llama-2-7B-GGUF \
--gguf-file llama-2-7b.Q5_K_M.gguf \
--port 8000
# Run with only an image model
uv run python main.py \
--load image \
--image-model-id stabilityai/sd-turbo \
--port 8000
# Run with only audio transcription
uv run python main.py \
--load audio \
--audio-model-id openai/whisper-base \
--port 8000
# Run with only text-to-speech
uv run python main.py \
--load tts \
--tts-model-id suno/bark-small \
--port 8000
Configuration
Create a .env file for gated models:
HF_TOKEN=hf_your_token_here
Or export directly:
export HF_TOKEN=hf_your_token_here
API Key Authentication
The server supports API key authentication with SQLite-backed persistence and in-memory caching for fast validation. All endpoints except /health require authentication when enabled.
Enable/Disable
Set in .env:
# Enable API key authentication (default: true)
API_KEY_ENABLED=true
# Database path (default: ./data/api_keys.db)
API_KEY_DB_PATH=./data/api_keys.db
# bcrypt rounds for hashing (default: 12)
API_KEY_BCRYPT_ROUNDS=12
# Key prefix (default: sk-)
API_KEY_KEY_PREFIX=sk-
# Random key length in hex chars (default: 48)
API_KEY_KEY_LENGTH=48
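To make the prefix and length settings concrete, here is a minimal sketch of how a key with the default shape (sk- followed by 48 hex characters) could be generated. The function name generate_api_key is hypothetical and this is not necessarily how src/core/api_keys.py builds keys; it only illustrates the format the settings describe.

```python
import secrets


def generate_api_key(prefix: str = "sk-", hex_length: int = 48) -> str:
    """Return `prefix` followed by `hex_length` random hex characters.

    Hypothetical helper: mirrors API_KEY_KEY_PREFIX and API_KEY_KEY_LENGTH,
    but is not taken from the project source.
    """
    if hex_length <= 0 or hex_length % 2:
        raise ValueError("hex_length must be a positive even number")
    # secrets.token_hex(n) returns 2 * n hex characters from a CSPRNG.
    return prefix + secrets.token_hex(hex_length // 2)


key = generate_api_key()
print(len(key))  # 51: the 3-character prefix plus 48 hex characters
```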
Managing Keys
Use the included CLI tool manage_keys.py:
# Create a new key
uv run python -m src.manage_keys add "My App Key"
# Create an admin key (required for hot-reload)
uv run python -m src.manage_keys add --admin "Admin Key"
# List all keys
uv run python -m src.manage_keys list
# Deactivate a key
uv run python -m src.manage_keys deactivate <key_id>
# Reactivate a key
uv run python -m src.manage_keys reactivate <key_id>
# Hot-reload active keys without restarting server
uv run python -m src.manage_keys reload
Important: The full key is shown only once on creation. Store it securely.
Using Keys in Requests
Include the key in the Authorization header:
# Chat completions with authentication
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer sk-your-key-here" \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false
}'
# Image generation with authentication
curl -X POST http://localhost:8000/v1/images/generations \
-H "Authorization: Bearer sk-your-key-here" \
-H "Content-Type: application/json" \
-d '{
"prompt": "A serene mountain landscape",
"n": 1,
"size": "512x512"
}'
# Hot-reload keys (admin only)
curl -X POST http://localhost:8000/v1/admin/keys/reload \
-H "Authorization: Bearer sk-your-admin-key-here"
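The same authenticated request can be built from Python with only the standard library. This is a sketch: build_chat_request is a hypothetical helper name, and the key and base URL are the placeholders used in the curl examples above. The request is only constructed here; sending it requires a running server.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    base_url: str, api_key: str, model: str, messages: List[Dict[str, str]]
) -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat completion request."""
    body = json.dumps({"model": model, "messages": messages, "stream": False})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=body.encode("utf-8"),
        headers={
            # The key travels in the Authorization header as a Bearer token.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",
    "sk-your-key-here",
    "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    [{"role": "user", "content": "Hello!"}],
)
# To send it against a running server:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response))
```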
Key Features
- bcrypt hashing: Keys stored as hashes, full key shown only once on creation
- Soft delete: Deactivated keys are kept in DB with deleted_at for audit trail
- In-memory cache: Active keys loaded into memory for O(1) validation
- Hot-reload: Add/remove keys without server restart via admin endpoint
- Admin flag: Admin keys can trigger hot-reload via POST /v1/admin/keys/reload
Docker Deployment
The project includes Docker support for easy deployment on cloud platforms like RunPod, as well as local testing.
Quick Start with Docker
# Pull and run with default settings (LOAD=llm, minimal model)
docker run -p 8000:8000 \
-e AUTO_CREATE_ADMIN_KEY=true \
yourusername/hagalaz:latest
# Run with all models
docker run -p 8000:8000 \
-e LOAD=all \
-e MODEL_ID=unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
-e IMAGE_MODEL_ID=stabilityai/sd-turbo \
-e AUDIO_MODEL_ID=openai/whisper-base \
-e TTS_MODEL_ID=suno/bark-small \
-e HF_TOKEN=hf_your_token_here \
-e AUTO_CREATE_ADMIN_KEY=true \
-v /path/to/models:/app/models \
yourusername/hagalaz:cuda
Building Images
# General purpose (CPU/GPU-agnostic)
docker build -f docker/Dockerfile -t hagalaz:latest .
# CUDA for GPU hosts
docker build -f docker/Dockerfile.cuda -t hagalaz:cuda .
# RunPod optimized
docker build -f docker/Dockerfile.runpod -t hagalaz:runpod .
Docker Compose
cd docker
# Start with general purpose image
docker-compose up hagalaz
# Start with CUDA image
docker-compose up hagalaz-cuda
Environment Variables
All configuration is done via environment variables:
| Variable | Description | Default |
|---|---|---|
| LOAD | Models to load: llm, image, audio, tts, both, all | llm |
| MODEL_ID | Chat model ID | unsloth/Llama-3.2-1B-Instruct-unsloth-bnb-4bit |
| IMAGE_MODEL_ID | Image generation model ID | (none) |
| AUDIO_MODEL_ID | Audio transcription model ID | (none) |
| TTS_MODEL_ID | Text-to-speech model ID | (none) |
| PORT | Server port | 8000 |
| CACHE_SIZE | Response cache size | 128 |
| WORKERS | Uvicorn workers | 1 |
| MODELS_DIR | Models storage directory | /app/models |
| HF_TOKEN | HuggingFace authentication token | (none) |
| AUTO_CREATE_ADMIN_KEY | Auto-create admin key on startup | false |
| ADMIN_KEY_NAME | Name for auto-created admin key | admin |
| AUTO_CREATE_KEYS | Comma-separated regular key names | (none) |
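As an illustration of how these variables and defaults fit together, the sketch below reads a subset of them into a typed settings object. ContainerSettings and read_settings are hypothetical names, and the set of accepted truthy strings is an assumption; the real entrypoint (docker/docker_start.py) may parse them differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ContainerSettings:
    load: str
    model_id: str
    port: int
    cache_size: int
    workers: int
    models_dir: str
    hf_token: Optional[str]
    auto_create_admin_key: bool


def _as_bool(value: str) -> bool:
    # Assumption: which spellings count as "true" is a guess, not project behavior.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def read_settings(env: Mapping[str, str] = os.environ) -> ContainerSettings:
    """Apply the defaults from the table above to an environment mapping."""
    return ContainerSettings(
        load=env.get("LOAD", "llm"),
        model_id=env.get("MODEL_ID", "unsloth/Llama-3.2-1B-Instruct-unsloth-bnb-4bit"),
        port=int(env.get("PORT", "8000")),
        cache_size=int(env.get("CACHE_SIZE", "128")),
        workers=int(env.get("WORKERS", "1")),
        models_dir=env.get("MODELS_DIR", "/app/models"),
        hf_token=env.get("HF_TOKEN") or None,
        auto_create_admin_key=_as_bool(env.get("AUTO_CREATE_ADMIN_KEY", "false")),
    )


print(read_settings({"PORT": "9000"}).port)  # 9000
```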
RunPod Deployment
Pod Mode (Persistent Server)
Deploy as a persistent pod on RunPod:
docker run -p 8000:8000 \
-e LOAD=all \
-e MODEL_ID=unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
-e AUTO_CREATE_ADMIN_KEY=true \
-v /runpod-volume:/app/models \
yourusername/hagalaz:runpod
The admin key will be printed to logs on first startup.
Serverless Mode
Use the included handler for RunPod serverless:
{
"input": {
"endpoint": "chat/completions",
"model": "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
"messages": [
{"role": "user", "content": "Hello!"}
],
"max_tokens": 512
}
}
Supported endpoints: chat/completions, images/generations, audio/transcriptions, audio/speech, models/list
Note: Streaming is disabled in serverless mode. All responses are returned as complete JSON.
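A small helper can assemble the serverless payload shown above and catch unsupported endpoints before a job is submitted. This is a sketch: build_serverless_input is a hypothetical name, and rejecting stream=True on the client side is a design choice made here, not documented handler behavior.

```python
from typing import Any, Dict

# Endpoint names as listed in the "Supported endpoints" line above.
SUPPORTED_ENDPOINTS = {
    "chat/completions",
    "images/generations",
    "audio/transcriptions",
    "audio/speech",
    "models/list",
}


def build_serverless_input(endpoint: str, **params: Any) -> Dict[str, Any]:
    """Wrap request parameters in the {"input": {...}} envelope."""
    if endpoint not in SUPPORTED_ENDPOINTS:
        raise ValueError(f"unsupported endpoint: {endpoint!r}")
    if params.get("stream"):
        # Streaming is disabled in serverless mode, so fail early (assumed policy).
        raise ValueError("streaming is not available in serverless mode")
    return {"input": {"endpoint": endpoint, **params}}


job = build_serverless_input(
    "chat/completions",
    model="unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
)
```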
Volume Mounts
For persistent storage, mount these directories:
| Container Path | Description |
|---|---|
| /app/models | Downloaded HuggingFace models |
| /app/data | API key database |
Example:
docker run -p 8000:8000 \
-v /path/to/models:/app/models \
-v /path/to/data:/app/data \
hagalaz:cuda
See DEPLOY_DOCKER.md for detailed build and publish instructions.
OpenCode Integration
Add to ~/.config/opencode/opencode.json:
{
"model": "local/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
"provider": {
"local": {
"name": "Local",
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "http://localhost:8000/v1",
"apiKey": "empty-for-local"
},
"models": {
"unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit": {
"name": "DeepSeek 1.5B (Unsloth)",
"reasoning": true,
"interleaved": {
"field": "reasoning_content"
}
}
}
}
}
}
With API key authentication enabled:
{
"model": "local/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
"provider": {
"local": {
"name": "Local",
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "http://localhost:8000/v1",
"apiKey": "sk-your-key-here"
},
"models": {
"unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit": {
"name": "DeepSeek 1.5B (Unsloth)",
"reasoning": true,
"interleaved": {
"field": "reasoning_content"
}
}
}
}
}
}
For image generation support, add the image model:
{
"model": "local/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
"provider": {
"local": {
"name": "Local",
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "http://localhost:8000/v1",
"apiKey": "empty-for-local"
},
"models": {
"unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit": {
"name": "DeepSeek 1.5B (Unsloth)",
"reasoning": true,
"interleaved": {
"field": "reasoning_content"
}
},
"stabilityai/sd-turbo": {
"name": "SD Turbo",
"attachment": true,
"modalities": {
"input": ["text"],
"output": ["image"]
}
}
}
}
}
}
Project Structure
.
├── main.py # Application entry point
├── app.py # Backward-compatible wrapper
├── src/
│   ├── __init__.py
│   ├── config.py # CLI arguments and global config
│   ├── api/ # HTTP endpoints
│   │   ├── __init__.py
│   │   ├── routes.py # OpenAI-compatible endpoints
│   │   ├── middleware.py # Auth middleware (Bearer token validation)
│   │   ├── rate_limit.py # Rate limiting configuration
│   │   └── cache.py # aiocache configuration
│   ├── core/ # Business logic and services
│   │   ├── __init__.py
│   │   ├── models.py # HF auth and text model loading
│   │   ├── images.py # Stable Diffusion image generation
│   │   ├── cache.py # LRU response cache
│   │   ├── inference.py # Async inference worker queue
│   │   ├── api_keys.py # API key management (SQLite + bcrypt)
│   │   └── audio.py # Audio transcription & TTS
│   ├── streaming/ # Real-time token streaming
│   │   ├── __init__.py
│   │   └── streamer.py # Token streamer with reasoning detection
│   └── utils/ # Utilities
│       ├── __init__.py
│       └── text.py # Text parsing and conversation utilities
├── src/manage_keys.py # CLI tool for API key management
├── docker/ # Docker deployment files
│   ├── Dockerfile # General purpose image
│   ├── Dockerfile.cuda # CUDA image for GPU hosts
│   ├── Dockerfile.runpod # RunPod-optimized image
│   ├── docker_start.py # Container entrypoint
│   ├── docker-compose.yml # Local testing
│   └── .dockerignore # Build exclusions
├── runpod/ # RunPod serverless handler
│   ├── handler.py # Serverless handler interface
│   └── README.md # RunPod deployment guide
├── DEPLOY_DOCKER.md # Docker build & publish guide
├── tests/ # Test suite
│   ├── conftest.py # Shared pytest fixtures
│   └── unit/
│       ├── core/ # Core module tests
│       │   ├── test_config.py
│       │   ├── test_cache.py
│       │   ├── test_models.py
│       │   ├── test_images.py
│       │   ├── test_audio.py
│       │   └── test_api_keys.py
│       └── api/ # API endpoint tests
│           ├── test_routes.py
│           └── test_rate_limit.py
├── .env.example # Example environment variables
├── pyproject.toml # Project dependencies
├── uv.lock # Locked dependency versions
└── .env # Environment variables
CLI Options
| Option | Description | Default |
|---|---|---|
| --load | Which models to load: llm, image, audio, tts, both, or all | all |
| --model-id | Hugging Face chat model ID | Required when loading llm/both/all |
| --image-model-id | Hugging Face image model ID | Required when loading image/both/all |
| --audio-model-id | Hugging Face audio transcription model ID | Required when loading audio/all |
| --tts-model-id | Hugging Face text-to-speech model ID | Required when loading tts/all |
| --port | Server port | 8000 |
| --cache-size | Response cache size | 128 |
| --workers | Uvicorn workers (1 recommended) | 1 |
| --strip-reasoning | Remove <think> tags from output | False |
| --models-dir | Directory to store downloaded models | ./models |
| --gguf-file | Specific GGUF file to load (e.g., model-Q4_K_M.gguf) | Auto-detected |
| --gguf-auto-detect | Auto-detect best GGUF from repository | True |
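The "Required when loading" column implies a mapping from each --load value to the model flags it needs. The sketch below spells that mapping out as derived from the table (so "both" means llm plus image); missing_flags is a hypothetical helper, not the validation code in src/config.py.

```python
from typing import Dict, Iterable, List

# Derived from the "Required when loading ..." entries in the table above.
REQUIRED_FLAGS: Dict[str, List[str]] = {
    "llm": ["--model-id"],
    "image": ["--image-model-id"],
    "audio": ["--audio-model-id"],
    "tts": ["--tts-model-id"],
    "both": ["--model-id", "--image-model-id"],
    "all": ["--model-id", "--image-model-id", "--audio-model-id", "--tts-model-id"],
}


def missing_flags(load: str, provided: Iterable[str]) -> List[str]:
    """Return the required flags for `load` that are not in `provided`."""
    if load not in REQUIRED_FLAGS:
        raise ValueError(f"unknown --load value: {load!r}")
    given = set(provided)
    return [flag for flag in REQUIRED_FLAGS[load] if flag not in given]


print(missing_flags("both", ["--model-id"]))  # ['--image-model-id']
```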
API Endpoints
Chat Completions
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer sk-your-key-here" \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false,
"max_tokens": 512,
"temperature": 0.7
}'
Streaming example:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer sk-your-key-here" \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
"messages": [{"role": "user", "content": "Explain quantum computing"}],
"stream": true
}'
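A client consuming the streamed response needs to parse server-sent event lines. The sketch below assumes the server follows OpenAI's streaming chunk format ("data: {json}" lines ending with "data: [DONE]") and uses the reasoning_content field described under Reasoning Models; the exact chunk shape is an assumption, and collect_stream is a hypothetical helper.

```python
import json
from typing import Iterable, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, str]:
    """Accumulate (reasoning, content) text from SSE lines."""
    reasoning, content = [], []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel in OpenAI's format
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            reasoning.append(delta.get("reasoning_content") or "")
            content.append(delta.get("content") or "")
    return "".join(reasoning), "".join(content)


sample = [
    'data: {"choices": [{"delta": {"reasoning_content": "Thinking. "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " there"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # ('Thinking. ', 'Hello there')
```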
Image Generation
curl -X POST http://localhost:8000/v1/images/generations \
-H "Authorization: Bearer sk-your-key-here" \
-H "Content-Type: application/json" \
-d '{
"prompt": "A serene mountain landscape at sunset, digital art",
"n": 1,
"size": "512x512",
"response_format": "b64_json"
}'
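With response_format set to b64_json, the image bytes arrive base64-encoded in the JSON body. The sketch below assumes the response follows OpenAI's images format, a "data" list whose items carry a "b64_json" string; decode_images is a hypothetical helper.

```python
import base64
from typing import Any, Dict, List


def decode_images(response: Dict[str, Any]) -> List[bytes]:
    """Return the raw image bytes for every b64_json entry in the response."""
    return [
        base64.b64decode(item["b64_json"])
        for item in response.get("data", [])
        if "b64_json" in item
    ]


# Stand-in payload for illustration; a real response would hold full image data.
fake_response = {"data": [{"b64_json": base64.b64encode(b"\x89PNG demo").decode("ascii")}]}
images = decode_images(fake_response)
# Each entry can then be written to disk, for example:
#   with open("image_0.png", "wb") as fh:
#       fh.write(images[0])
```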
List Models
curl http://localhost:8000/v1/models \
-H "Authorization: Bearer sk-your-key-here"
Audio Transcription
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-H "Authorization: Bearer sk-your-key-here" \
-H "Content-Type: multipart/form-data" \
-F "file=@audio.mp3"
Streaming example:
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-H "Authorization: Bearer sk-your-key-here" \
-H "Content-Type: multipart/form-data" \
-F "file=@audio.mp3" \
-F "stream=true"
Text-to-Speech
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Authorization: Bearer sk-your-key-here" \
-H "Content-Type: application/json" \
-d '{
"input": "Hello, this is a test of the text to speech system.",
"voice": "v2/en_speaker_6",
"response_format": "mp3"
}'
Health Check
curl http://localhost:8000/health
Recommended Models
Text Models (Chat)
| Model | Size | VRAM | Notes |
|---|---|---|---|
| unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit | 1.5B | ~3GB | Recommended - Reasoning model, optimized 4-bit |
| unsloth/Llama-3.2-1B-Instruct-unsloth-bnb-4bit | 1B | ~2GB | Very fast, minimal VRAM |
| google/gemma-2-2b-it | 2B | ~4GB | Lightweight, decent quality |
| unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit | 3B | ~4GB | Good balance speed/quality |
| meta-llama/Llama-3.2-3B-Instruct | 3B | ~5GB | Requires HF_TOKEN (gated) |
| microsoft/Phi-3-mini-4k-instruct | 3.8B | ~5GB | Fast, good instruction following |
| unsloth/Phi-3-mini-4k-instruct-bnb-4bit | 3.8B | ~4GB | Unsloth optimized version |
| unsloth/Qwen2.5-7B-Instruct-bnb-4bit | 7B | ~6GB | Recommended - Higher quality, efficient |
| Qwen/Qwen2.5-7B-Instruct | 7B | ~8GB | Higher quality, needs more VRAM |
| unsloth/Mistral-Small-3.2-24B-Instruct-2506-bnb-4bit | 24B | ~14GB | Best quality, needs 16GB+ VRAM |
GGUF Models (Alternative Format)
| Model | Size | VRAM | Notes |
|---|---|---|---|
| unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF | 1.5B | ~3GB | Recommended - GGUF Q4_K_M, fast inference |
| unsloth/Qwen2.5-7B-Instruct-GGUF | 7B | ~6GB | High quality GGUF format |
| TheBloke/Llama-2-7B-GGUF | 7B | ~6GB | Llama 2, widely compatible |
| TheBloke/Mistral-7B-Instruct-v0.2-GGUF | 7B | ~6GB | Mistral, good instruction following |
GGUF vs BNB:
- GGUF: Single-file format, easier to manage, dequantized at load time
- BNB 4-bit: Native transformers quantization, potentially better VRAM efficiency
- Both supported simultaneously — use whichever fits your workflow
Unsloth models are pre-quantized with optimized 4-bit BNB, loading faster and using less VRAM than original models. They are the recommended choice for this server.
Image Models
| Model | VRAM | Speed | Quality |
|---|---|---|---|
| stabilityai/sd-turbo | 3-5GB | Very Fast | Good |
| stabilityai/sdxl-turbo | 4-6GB | Very Fast | Very Good |
| runwayml/stable-diffusion-v1-5 | 4-6GB | Medium | Good |
| stabilityai/stable-diffusion-2-1 | 5-7GB | Medium | Better |
| ByteDance/SDXL-Lightning | 6-8GB | Fast | Very Good |
| stabilityai/stable-diffusion-xl-base-1.0 | 8-10GB | Slow | Best |
| stabilityai/stable-diffusion-3-medium-diffusers | 10-12GB | Slow | Best |
Audio Models (Transcription)
| Model | VRAM | Speed | Quality |
|---|---|---|---|
| openai/whisper-tiny | 2-4GB | Very Fast | Good |
| openai/whisper-base | 3-5GB | Fast | Good |
| openai/whisper-small | 4-6GB | Medium | Better |
| openai/whisper-medium | 6-8GB | Medium | Better |
| openai/whisper-large-v3-turbo | 6-8GB | Fast | Best |
| openai/whisper-large-v3 | 8-10GB | Slow | Best |
Text-to-Speech Models
| Model | VRAM | Speed | Quality |
|---|---|---|---|
| suno/bark-small | 4-6GB | Medium | Good |
| suno/bark | 6-8GB | Slow | Better |
| microsoft/speecht5_tts | 2-4GB | Fast | Good |
Note: Running multiple models simultaneously requires sufficient VRAM. With 6-8GB VRAM, use smaller models (DeepSeek 1.5B + SD Turbo + Whisper Tiny). With 12GB+ VRAM, you can run larger combinations. Use --load to selectively load only the models you need.
Reasoning Models
Models like DeepSeek-R1 output reasoning inside <think> tags. The server automatically:
- Streams reasoning tokens as reasoning_content/reasoning_text
- Detects </think> to switch to content streaming
- Emits both fields for maximum client compatibility
In OpenCode, use /thinking to toggle reasoning visibility.
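The switch from reasoning to content can be pictured as a small state machine over incoming text chunks. The sketch below is an illustration of the idea, not the code in src/streaming/streamer.py: it assumes output starts in reasoning mode, drops an optional opening <think> tag, and holds back a partial tag when it is split across chunks. The names split_reasoning and _held_back are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def _held_back(text: str, tag: str) -> int:
    """Length of the longest suffix of `text` that is a proper prefix of `tag`."""
    for size in range(min(len(tag) - 1, len(text)), 0, -1):
        if text.endswith(tag[:size]):
            return size
    return 0


def split_reasoning(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (field, text) pairs; field is 'reasoning_content' or 'content'."""
    buffer = ""
    reasoning = True  # assumption: the model starts out inside its reasoning block
    for chunk in chunks:
        if not reasoning:
            if chunk:
                yield ("content", chunk)
            continue
        buffer = (buffer + chunk).replace(OPEN_TAG, "")
        if CLOSE_TAG in buffer:
            before, after = buffer.split(CLOSE_TAG, 1)
            if before:
                yield ("reasoning_content", before)
            if after:
                yield ("content", after)
            reasoning, buffer = False, ""
            continue
        # Keep back any trailing text that might be the start of a tag.
        keep = max(_held_back(buffer, OPEN_TAG), _held_back(buffer, CLOSE_TAG))
        emit, buffer = buffer[: len(buffer) - keep], buffer[len(buffer) - keep:]
        if emit:
            yield ("reasoning_content", emit)
    if buffer:
        yield ("reasoning_content", buffer)  # stream ended before </think>


parts = list(split_reasoning(["<think>Let me ", "add.</th", "ink>4"]))
print(parts)
# [('reasoning_content', 'Let me '), ('reasoning_content', 'add.'), ('content', '4')]
```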
To strip reasoning entirely (faster, less output):
uv run python main.py \
--model-id unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit \
--strip-reasoning \
--port 8000
Rate Limiting
The server includes per-endpoint rate limiting based on client IP address using slowapi:
| Endpoint | Rate Limit |
|---|---|
| /v1/chat/completions | 30 requests/minute |
| /v1/images/generations | 10 requests/minute |
| /v1/audio/transcriptions | 20 requests/minute |
| /v1/audio/speech | 20 requests/minute |
| /v1/models | 60 requests/minute |
| /health | 100 requests/minute |
When a rate limit is exceeded, the server returns HTTP 429 with:
{
"error": {
"message": "Rate limit exceeded: ...",
"type": "rate_limit_exceeded",
"code": "rate_limit_exceeded"
}
}
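To illustrate what an N-requests-per-minute limit per client IP means, here is a minimal sliding-window limiter in plain Python. It is a teaching sketch only: the server uses slowapi, whose internals and exact windowing strategy may differ, and SlidingWindowLimiter is a hypothetical class. The clock is injectable so the behavior can be checked without waiting.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each client."""

    def __init__(
        self,
        limit: int,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        now = self.clock()
        hits = self._hits[client_ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # the caller would answer with HTTP 429 here
        hits.append(now)
        return True


# 10 requests/minute, as listed for /v1/images/generations.
limiter = SlidingWindowLimiter(limit=10)
results = [limiter.allow("203.0.113.7") for _ in range(11)]
print(results.count(True), results[-1])  # 10 False
```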
Caching
The server uses aiocache for response caching:
- Chat completions (non-streaming): Cached for 5 minutes based on messages, max_tokens, and temperature
- Model list: Cached for 1 minute
- Streaming responses: Not cached
Cache is stored in memory by default. For production with multiple workers, configure Redis:
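# In src/api/cache.py
from aiocache import Cache
from aiocache.serializers import JsonSerializer
# Redis-backed cache shared across workers; adjust endpoint and port to your deployment
cache = Cache(Cache.REDIS, endpoint="localhost", port=6379, serializer=JsonSerializer())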
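Caching chat completions by messages, max_tokens, and temperature requires a stable key derived from those three values. The sketch below shows one way to do that with a canonical JSON dump and SHA-256; chat_cache_key and the "chat:" prefix are hypothetical, and the project's actual key function may be built differently.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def chat_cache_key(
    messages: List[Dict[str, Any]],
    max_tokens: Optional[int],
    temperature: Optional[float],
) -> str:
    """Derive a deterministic cache key from the request fields that affect output."""
    canonical = json.dumps(
        {"messages": messages, "max_tokens": max_tokens, "temperature": temperature},
        sort_keys=True,          # dict ordering must not change the key
        separators=(",", ":"),   # no whitespace variation
    )
    return "chat:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = chat_cache_key([{"role": "user", "content": "Hello!"}], 512, 0.7)
b = chat_cache_key([{"content": "Hello!", "role": "user"}], 512, 0.7)
c = chat_cache_key([{"role": "user", "content": "Hello!"}], 512, 0.2)
print(a == b, a == c)  # True False
```

Two requests that differ only in dictionary key order share an entry, while any change to the messages or sampling parameters produces a different key.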
Development
# Run with auto-reload (development only)
uv run uvicorn main:create_app --factory --reload --port 8000
# Format code
uv run ruff format .
# Type check
uv run pyright
Testing
The project uses pytest for testing. Tests are organized under the tests/ directory:
tests/
├── conftest.py # Shared fixtures
├── unit/
│   ├── core/ # Core module tests
│   │   ├── test_config.py # Configuration tests
│   │   ├── test_cache.py # Cache tests
│   │   ├── test_models.py # Model loading & GGUF tests
│   │   ├── test_images.py # Image generation tests
│   │   ├── test_audio.py # Audio/TTS tests
│   │   └── test_api_keys.py # API key management tests
│   └── api/
│       ├── test_routes.py # API endpoint tests
│       └── test_rate_limit.py # Rate limiting tests
└── integration/
    └── (future e2e tests)
Running Tests
# Run all tests
uv run pytest tests/ -v
# Run specific test file
uv run pytest tests/unit/core/test_models.py -v
# Run with coverage report
uv run pytest tests/ -v --cov=src --cov-report=term-missing
# Run with HTML coverage report
uv run pytest tests/ --cov=src --cov-report=html
# Run with coverage and fail if below threshold
uv run pytest tests/ --cov=src --cov-fail-under=80
Test Coverage
Current test coverage focuses on:
- Configuration: Settings classes, environment variables, CLI arguments
- Cache: LRU eviction, key generation, hit/miss logic
- Model Loading: HF auth, GGUF detection, auto-quantization selection
- Image Generation: Pipeline calls, base64 conversion
- Audio: Transcription, text-to-speech, base64 conversion
- API Key Management: Key generation, bcrypt hashing, database operations, validation
- API Routes: Health check, model listing, error handling
Docker Testing
# Build and test locally
docker build -f docker/Dockerfile -t hagalaz:test .
docker run -p 8000:8000 -e AUTO_CREATE_ADMIN_KEY=true hagalaz:test
# Test with docker-compose
cd docker
docker-compose up --build
Coverage reports are generated in htmlcov/ when using --cov-report=html.
Troubleshooting
ImportError: cannot import name 'BaseStreamer'
The BaseStreamer class path changed in newer transformers versions. This is handled automatically in src/streaming/streamer.py.
Out of Memory
- Reduce --cache-size (default: 128)
- Use smaller models
- Enable CPU offload for image models (edit src/core/images.py)
- Run only specific models with --load llm, --load image, --load audio, or --load tts
Model Access Denied
Gated models require authentication:
- Get token from https://huggingface.co/settings/tokens
- Add to .env: HF_TOKEN=hf_...
- Or run: huggingface-cli login
API Key Authentication Issues
401 Unauthorized:
- Ensure Authorization: Bearer sk-... header is present in requests
- The /health endpoint is the only one that does not require authentication
- Create a key first: uv run python -m src.manage_keys add "My Key"
Key not recognized after creation:
- Hot-reload keys: uv run python -m src.manage_keys reload
- Or restart the server
Admin operations failing:
- Admin keys are created with --admin flag: uv run python -m src.manage_keys add --admin "Admin"
- Only admin keys can trigger POST /v1/admin/keys/reload
Slow Generation
- Use sd-turbo instead of full SD models for images
- Reduce max_tokens in requests
- Enable use_cache=True (already enabled by default)
License
MIT