feat: Add vLLM backend, pluggable architecture, and Docker build improvements

rdenadai commented

2026-05-28 11:24:08 +00:00

Owner

vLLM Backend

Added vLLM 0.12.0 backend with AsyncLLMEngine and continuous batching
Supports concurrent requests (no serialization lock, unlike the transformers backend)
Auto-detects tokenizer_mode="mistral" for Mistral models
Added system role normalization (merges into first user message)
Added context length pre-flight validation
Fixed tokenizer resolution via HF AutoTokenizer
Default model: microsoft/Phi-4-mini-instruct (128K context, function calling)
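
The system role normalization listed above could look roughly like the sketch below. It is an illustration rather than the code in this PR: the helper name `merge_system_into_first_user` and the OpenAI-style message dicts (`role`, `content`) are assumptions.

```python
from typing import Dict, List

Message = Dict[str, str]


def merge_system_into_first_user(messages: List[Message]) -> List[Message]:
    """Fold system messages into the first user message (hypothetical helper)."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest

    system_text = "\n\n".join(system_parts)
    for msg in rest:
        if msg["role"] == "user":
            # Prepend the system text so templates without a system role still see it.
            msg["content"] = f"{system_text}\n\n{msg['content']}"
            return rest

    # Assumption: with no user message to merge into, return the input unchanged.
    return [dict(m) for m in messages]
```

Returning copies keeps the caller's request object untouched, which matters if the same messages are also used for caching or dedup.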

Architecture Refactoring

Extracted ChatService from routes (response builder + cache service facade)
Created ModelLoader class for orchestrated model loading
Added LLMBackend Protocol with transformers and vLLM implementations
Extracted ReasoningDetector with unified _process() method
Created AsyncQueueIterator base class for streamers
Extracted image processing to src/utils/multimodal.py
Added processor_utils.py for shared template/multimodal logic
Added input validation module (src/api/validation.py) with bounds checking
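
A pluggable backend built on `typing.Protocol` might be shaped like the following sketch. Only the name `LLMBackend` comes from this PR; the method `generate_stream`, its parameters, and the `EchoBackend` stand-in are assumptions for illustration.

```python
import asyncio
from typing import AsyncIterator, List, Protocol, runtime_checkable


@runtime_checkable
class LLMBackend(Protocol):
    """Structural interface; the method name and signature are assumed."""

    def generate_stream(self, prompt: str, max_tokens: int) -> AsyncIterator[str]:
        ...


class EchoBackend:
    """Stand-in backend. It satisfies the protocol without inheriting from it."""

    async def generate_stream(self, prompt: str, max_tokens: int) -> AsyncIterator[str]:
        for token in prompt.split()[:max_tokens]:
            yield token


async def collect(backend: LLMBackend, prompt: str, max_tokens: int = 16) -> List[str]:
    """Drain a backend's token stream into a list."""
    return [tok async for tok in backend.generate_stream(prompt, max_tokens)]
```

Because the check is structural, a transformers or vLLM implementation only needs to expose the same methods; callers such as a chat service can depend on the protocol alone.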

Docker & Build

Migrated all images to pyenv with Python 3.13.13 (PGO+LTO compiled)
Separate dependency files per image: pyproject.{local,cuda,vllm}.toml + uv lockfiles
Optimized vLLM Dockerfile: CUDA 12.8 runtime base
vLLM image supports full stack (LOAD=all)
Added UV_HTTP_TIMEOUT=600 for large CUDA wheel downloads
Fixed TRANSFORMERS_CACHE deprecation warning

Dependencies

transformers: 4.52.4 → >=5.1.0 (root/local/cuda images)
transformers: pinned <5 in vLLM image (vLLM 0.12.0 incompatibility)
diffusers: >=0.32.0 → >=0.37.0 (v5 support)
huggingface-hub: >=0.27.0 → >=1.0.0 (v5 requirement)
Added protobuf>=3.20.0 to all images (fixes SD 3.5 tokenizer load)
Added orjson and xformers to docker local/cuda images

Transformers v5 Compatibility

processor_utils.py: normalizes BatchEncoding return from apply_chat_template
streamer.py: fallback import for BaseStreamer (v5 path change)
Qwen3.5 and Gemma-4 models require transformers>=5.1.0 (LOCAL/CUDA only)

Performance Optimizations

Replaced SHA256 with blake2b in request dedup (5-10x faster)
Added 5-second cache to health endpoint
Compiled regex patterns at module level in text.py
Added LRU cache (maxsize=32) for in-memory images
Added orjson for SSE streaming JSON serialization
Parallel model loading via asyncio.gather()
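
As a rough sketch of the blake2b-based request dedup key, assuming the key is derived from a canonical JSON encoding of the request body (the function name `request_fingerprint` and the 16-byte digest size are illustrative choices, not taken from the PR):

```python
import hashlib
import json
from typing import Any, Dict


def request_fingerprint(payload: Dict[str, Any]) -> str:
    """Return a stable hex key for a JSON-serializable request (hypothetical helper)."""
    # Sorting keys makes the key independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # A short digest is enough for a dedup key; it is not used for security here.
    return hashlib.blake2b(canonical, digest_size=16).hexdigest()
```

Two requests that differ only in key order map to the same fingerprint, so the second one can wait on the first instead of triggering another generation.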

Testing

Added 60+ new tests (212 total):
test_chat_service.py, test_middleware.py, test_manage_keys.py
test_vllm_streamer.py, test_base_iterator.py, test_reasoning_builder.py
test_model_loader.py
Docker-only markers for torch-dependent tests
Mocked auth, external services, and async timeouts

Documentation

Restructured MODELS.md with per-GPU columns and 30+ models
Added context length configuration table
Added Qwen3.5, Gemma4, GLM, MoonshotAI model listings
Marked vLLM-incompatible models (Qwen3.5, Gemma4)
Updated DEPLOY_DOCKER.md with vLLM instructions
Updated README.md with architecture diagram

Bug Fixes

Fixed reasoning_builder.py double-emission bug
Fixed vLLM silent exception swallowing
Fixed Docker startup VRAM check crash
Fixed non-existent model IDs in docs
Added torchvision for multimodal models
Fixed Bark TTS pad token handling

Backward Compatibility

Transformers backend remains default
vLLM is opt-in via --vllm flag
All existing configs and models work unchanged


rdenadai added 2 commits

2026-05-28 11:24:08 +00:00

Add docker/uv.lock ddad7de725

feat: Improvement v0.2.0 33e3a813d0

rdenadai changed title from ~~rdenadai/improvements-v0.2.0~~ to feat: Add vLLM backend, pluggable architecture, and Docker build improvements

2026-05-28 11:25:43 +00:00

rdenadai force-pushed rdenadai/improvements-v0.2.0 from 33e3a813d0 to 3d0afc7708

2026-05-28 13:04:21 +00:00


rdenadai added 1 commit

2026-06-03 20:25:10 +00:00

feat: vLLM backend integration and Docker optimization 5364b9b186

- Add vLLM 0.12.0 backend with AsyncLLMEngine for high-performance inference
- Fix vLLM tokenizer resolution (load HF tokenizer for chat template support)
- Add system role normalization for vLLM compatibility (Mistral/Phi models)
- Add context length pre-flight check with clear error messages
- Improve error logging across vLLM backend and chat service
- Optimize vLLM Dockerfile: CUDA 12.8 runtime, deadsnakes PPA Python 3.13
- Enable full stack support in vLLM image (LOAD=all for LLM+image+audio+tts)
- Update MODELS.md with context length configuration and vLLM examples
- Update DEPLOY_DOCKER.md with vLLM build instructions
- Add 38 new tests for vLLM backend, chat service, model loader
- Fix TRANSFORMERS_CACHE deprecation warning in Docker startup
- Update default model to Phi-4-mini-instruct (function calling, 128K context)

rdenadai added 1 commit

2026-06-04 03:33:31 +00:00

perf: optimize dedup/caching/streaming; test: add middleware and manage_keys coverage 0100cc9498

rdenadai added 1 commit

2026-06-05 00:04:38 +00:00

docs: Update for more models e8f79dac07

rdenadai added 1 commit

2026-06-05 00:24:32 +00:00

docs(readme): add hagalaz logo 51cdafda8d

rdenadai added 7 commits

2026-06-05 01:09:58 +00:00

fix(api): add utf-8 charset to json and sse responses 06721ea0ca

docs(models): fix unsloth Qwen3.5 IDs and add gated warnings to SD3 models eccdc340d5

docs(models): fix incorrect unsloth model IDs 99121afb41

fix: add torchvision for multimodal models, use_safetensors for SD pipelines 86db2ee69b

fix(audio): add low_cpu_mem_usage and pad token fix for Bark TTS e1e2345255

docs(readme): add multimodal feature, update model recommendations e8b894b625

docs(rules): remove detailed code layout section 53e30bb92a

rdenadai added 3 commits

2026-06-05 02:46:47 +00:00

docs(docker): add Docker Hub page with image tags and quick-start c5dfee9130

docs(docker): update repo links to GitForge e3b6184c53

chore: add MIT license 02657991a5

rdenadai added 5 commits

2026-06-06 16:45:24 +00:00

build(docker): migrate all images to pyenv with Python 3.13.13 (PGO+LTO) bfa746fd9b

- Replace deadsnakes PPA with pyenv across all Dockerfiles
- Compile Python 3.13.13 with --enable-optimizations --with-lto
- Use -mtune=generic for CPU compatibility
- Create separate dependency files per image:
  * pyproject.local.toml + uv.local.lock (CUDA 11.8)
  * pyproject.cuda.toml + uv.cuda.lock (CUDA 12.4)
  * pyproject.vllm.toml + uv.vllm.lock (CUDA 12.8)
- Update documentation (DEPLOY_DOCKER.md, RULES.md, README.md, DOCKERHUB.md)
- All Docker images now use Ubuntu 22.04 base with pyenv

feat(api): add input validation to chat and image endpoints 1017b65b44

Add bounds checking for API parameters:
- max_tokens: 1-128000
- temperature: 0.0-2.0
- num_inference_steps: 1-50
- guidance_scale: 0.0-30.0
- width/height: 64-2048, multiple of 8
- n: 1-4 concurrent images

New src/api/validation.py provides reusable validators.
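
Reusable validators for the bounds listed above might look like this sketch. The function names and the choice of `ValueError` are assumptions; the actual `src/api/validation.py` may be structured differently.

```python
def validate_range(name: str, value: float, low: float, high: float) -> float:
    """Return value if low <= value <= high, otherwise raise ValueError."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return value


def validate_dimension(
    name: str, value: int, low: int = 64, high: int = 2048, multiple: int = 8
) -> int:
    """Check an image dimension: within bounds and a multiple of 8."""
    validate_range(name, value, low, high)
    if value % multiple != 0:
        raise ValueError(f"{name} must be a multiple of {multiple}, got {value}")
    return value
```

For example, `validate_range("temperature", 0.7, 0.0, 2.0)` returns `0.7`, while `validate_dimension("width", 510)` raises because 510 is not a multiple of 8.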

docs(models): add Gemma 4 QAT models to MODELS.md 323c3cdabb

Add Unsloth Gemma 4 QAT (Quantization-Aware Training) variants:
- GGUF versions: E2B, E4B, 12B, 26B-A4B, 31B
- Unquantized BF16 versions for all sizes
- Updated VRAM estimates based on actual file sizes
- Added QAT models to Multimodal Models section
- Added QAT recommendations for RTX 4090 and L40S
- Added QAT-based model combinations

Verified all models exist on HuggingFace.

test: skip torch-dependent tests outside Docker containers 34b3a4d1ac

Add docker_only and requires_torch markers to pytest.ini.
Add module-level guards to 8 test files importing torch/CUDA.
Add Docker detection fixture to conftest.py for future use.
Remove redundant tests/unit/api/conftest.py.

Tests skip cleanly outside Docker with clear messaging about CUDA/cuDNN requirements.
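
One common way to detect a Docker container, shown here only as a sketch, is to look for the `/.dockerenv` marker file. The function name and the `RUNNING_IN_DOCKER` override variable are hypothetical and not taken from this commit.

```python
import os
from pathlib import Path
from typing import Mapping


def running_in_docker(
    marker: Path = Path("/.dockerenv"),
    env: Mapping[str, str] = os.environ,
) -> bool:
    """Heuristic container check: marker file or an explicit override variable."""
    # RUNNING_IN_DOCKER is a hypothetical opt-in flag for CI images without the marker.
    return marker.exists() or env.get("RUNNING_IN_DOCKER") == "1"
```

A `conftest.py` could use the result to decide whether tests carrying the `docker_only` marker should be skipped.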

deps: bump transformers to >=5.1.0, add v5 compatibility, document vLLM limitations 3e9faf2077

- transformers: 4.52.4 -> >=5.1.0 in root/local/cuda pyproject.toml
- transformers: pinned <5 in vLLM image (vLLM 0.12.0 incompatibility)
- diffusers: >=0.32.0 -> >=0.37.0 (v5 support added in 0.37.0)
- huggingface-hub: >=0.27.0 -> >=1.0.0 (v5 requirement)
- Add protobuf>=3.20.0 to all images (fix SD 3.5 tokenizer load)
- Add orjson and xformers to docker local/cuda images
- processor_utils.py: normalize BatchEncoding return from apply_chat_template for v5 compat
- streamer.py: add fallback import for BaseStreamer (v5 path change)
- MODELS.md: mark Qwen3.5/Gemma4 as incompatible with vLLM backend
- README.md: clarify BaseStreamer import error for v5
- Regenerate all uv lockfiles

Fixes runtime dependency failures for newer models (Qwen3.5, Gemma4)
while maintaining vLLM 0.12.0 backward compatibility.

rdenadai added 4 commits

2026-06-07 23:35:02 +00:00

fix(transformers): filter mm_token_type_ids from generate kwargs, replace torch_dtype with dtype 729d6e5ca2

- transformers.py: filter out multimodal keys (mm_token_type_ids) that
  v5 processors include but generate() rejects. Prevents 500 errors on
  Qwen3.5 and other multimodal models.
- Replace deprecated torch_dtype= with dtype= in all from_pretrained()
  calls across models.py, audio.py, images.py. Both v4.57.6 and v5.x
  support the dtype parameter.
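
The key filtering described in this commit can be pictured as below. This is a sketch under assumptions: the constant and function names are made up, and only `mm_token_type_ids` is known from the commit message to be filtered.

```python
from typing import Any, Dict, FrozenSet

# Keys a processor may emit that generate() does not accept (only this one is confirmed).
UNSUPPORTED_GENERATE_KEYS: FrozenSet[str] = frozenset({"mm_token_type_ids"})


def filter_generate_kwargs(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the processor output without keys generate() rejects."""
    return {k: v for k, v in inputs.items() if k not in UNSUPPORTED_GENERATE_KEYS}
```

The filtered dict can then be unpacked into the `generate()` call, leaving the original processor output intact for any other consumer.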

fix(runtime): diffusers torch_dtype, meta tensors, tokenizer detection, dedup exception b048c8b5ae

- images.py: revert diffusers pipeline params from dtype back to torch_dtype
  (diffusers pipelines ignore dtype keyword, causing float32/float16 mismatch)
- audio.py: use device_map=device instead of low_cpu_mem_usage + .to(device)
  to avoid meta tensor copy errors in transformers v5 for Bark and Whisper
- models.py: restrict AutoProcessor detection to explicit multimodal
  architectures/model_type. Prevents text-only models like Qwen3.5 from
  using AutoProcessor and producing malformed inputs.
- chat_service.py: pass actual exception object to fail_dedup instead of
  string, fixing secondary TypeError in error handling.

fix(audio): add low_cpu_mem_usage=True back to Bark and Whisper loading abd44acf5d

Restores low_cpu_mem_usage=True for v5 compatibility while keeping
device_map=device to avoid meta tensor copy errors. Both flags are
supported together in transformers >=4.56 and v5.x.

fix(text): remove 'Please continue.' user-first hack efa7bca528

Removes the defensive insertion of a dummy user message when the
conversation doesn't start with a user role. This was causing agent
tools (like opencode) to see artificial context and produce confused
reasoning. All models in MODELS.md are modern instruction-tuned models
that natively support system-first conversations via their chat templates.

rdenadai added 20 commits

2026-06-12 01:47:50 +00:00

fix: transformers v5 compatibility, CUDA stability, and Docker deployment dd598e5779

Streamer:
- Remove incorrect next_tokens_are_prompt skip that truncated first token

Audio:
- Bark float16→float32 for soundfile compatibility
- Add max_length=None override to avoid config default conflicts
- Create attention_mask when missing

Transformers backend:
- Add attn_implementation=eager for BNB 4-bit numerical stability
- Disable use_cache for quantized models during generation
- Clamp temperature to min 0.01 to prevent softmax division-by-zero
- Create attention_mask when missing

Models:
- Replace SystemExit with RuntimeError for gated model access checks
- Add attn_implementation=eager to all model loading paths

API:
- Fix model ID extraction for audio/tts in /v1/models endpoint

Docker:
- Rewrite default Dockerfile to use nvidia/cuda:11.8 base image
- Tie default Dockerfile to root pyproject.toml and uv.lock
- Remove redundant docker/pyproject.local.toml and uv.local.lock
- Update docker-compose with env var interpolation for all 3 services
- Remove hagalaz-runpod service
- Add GPU reservations to default hagalaz service

Frontend:
- Add JavaScript to auto-replace YOUR_HOST placeholder with actual host

Docs:
- Update .env.example, README, DEPLOY_DOCKER.md for new Docker setup

Tests: 72 passed, 8 skipped (1 pre-existing failure unrelated to changes)
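
To illustrate why the temperature clamp in the Transformers backend list matters: sampling divides logits by the temperature before the softmax, so a value of 0 would divide by zero. The pure-Python sketch below uses hypothetical helper names; the real backend works on tensors.

```python
import math
from typing import List

MIN_TEMPERATURE = 0.01  # lower bound stated in the commit message


def clamp_temperature(temperature: float) -> float:
    """Raise very small or zero temperatures to the minimum."""
    return max(temperature, MIN_TEMPERATURE)


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax with a max-subtraction for numerical stability."""
    t = clamp_temperature(temperature)
    scaled = [x / t for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With the clamp, `temperature=0` behaves like near-greedy decoding: almost all probability mass lands on the largest logit instead of raising an error.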

fix(index.html): remove extra closing brace causing JS syntax error 9ce4796eb0

The extra }); after DOMContentLoaded listener caused:
- Uncaught SyntaxError: expected expression, got '}'
- YOUR_HOST:8000 placeholder never replaced with actual host

1 line removed. Fixes regression from previous commit.

feat: support comma-separated --load values 41730a760f

Allow selective model loading via comma-separated lists:
  --load llm,image,audio
  LOAD=llm,image,audio

Also preserves legacy aliases:
  both -> llm+image
  all  -> llm+image+audio+tts

Changes:
- config.py: add _parse_load type with validation
- model_loader.py: add _normalize_load for backward compat
- validation updated to check set membership

Fixes Docker Compose default LOAD=llm,image,audio failing with
'invalid choice: llm,image,audio'.
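
A comma-separated `--load` parser with the legacy aliases could be sketched as follows. The public name `parse_load` is illustrative (the commit uses a private `_parse_load`), and the set of valid types is inferred from the `all` alias.

```python
import argparse
from typing import Dict, FrozenSet

VALID_TYPES: FrozenSet[str] = frozenset({"llm", "image", "audio", "tts"})
ALIASES: Dict[str, FrozenSet[str]] = {
    "both": frozenset({"llm", "image"}),
    "all": frozenset({"llm", "image", "audio", "tts"}),
}


def parse_load(value: str) -> FrozenSet[str]:
    """argparse `type=` callable: accept an alias or a comma-separated list."""
    value = value.strip().lower()
    if value in ALIASES:
        return ALIASES[value]
    parts = {p.strip() for p in value.split(",") if p.strip()}
    unknown = parts - VALID_TYPES
    if not parts or unknown:
        raise argparse.ArgumentTypeError(
            f"invalid load value {value!r}; expected any of {sorted(VALID_TYPES)}"
        )
    return frozenset(parts)


parser = argparse.ArgumentParser()
parser.add_argument("--load", type=parse_load)
```

Using a `type=` callable instead of `choices=` is what lets argparse accept `llm,image,audio` as one value; validation becomes a set-membership check on the split parts.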

fix: CUDA numerical stability for small pre-quantized models on limited VRAM 8056af9da8

A. Reorder quantization branch priority in _load_standard_model()
   - Native/pre-quantized models (Unsloth, GPTQ, AWQ) now load first
   - Fixes DeepSeek 1.5B being loaded in FP16 instead of native 4-bit
   - Prevents CUDA assert from numerical overflow in attention softmax

D. VRAM-based KV cache threshold (12GB)
   - >= 12GB: cache enabled (fast generation, RTX 3060+, L40S)
   - < 12GB: cache disabled (stable on GTX 1060 6GB)
   - Eliminates VRAM corruption on limited GPUs

E. Set max_length=None in generation kwargs
   - Resolves transformers warning about max_new_tokens vs max_length conflict
   - Prevents unnecessary memory allocation from model's default 131072 context

Impact:
- hagalaz (GTX 1060 6GB): Pre-quantized models load correctly, cache disabled
- hagalaz-cuda (L40S 48GB): Cache enabled, full speed
- hagalaz-vllm: No impact (uses vLLM's PagedAttention)
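
Points D and E together amount to choosing generation kwargs from the available VRAM. A minimal sketch, assuming the VRAM figure is already known in gigabytes (in practice it could come from something like `torch.cuda.get_device_properties(0).total_memory`); the function and constant names are hypothetical:

```python
from typing import Any, Dict

KV_CACHE_MIN_VRAM_GB = 12.0  # threshold stated in the commit message


def build_generation_kwargs(total_vram_gb: float, max_new_tokens: int) -> Dict[str, Any]:
    """Pick generate() kwargs based on total GPU memory."""
    return {
        "max_new_tokens": max_new_tokens,
        # Cleared so it does not conflict with max_new_tokens.
        "max_length": None,
        # KV cache only on GPUs at or above the threshold.
        "use_cache": total_vram_gb >= KV_CACHE_MIN_VRAM_GB,
    }
```

On a 6 GB card this yields `use_cache=False`, trading generation speed for stability; on a 48 GB L40S the cache stays on.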

fix: explicitly pass quantization_config for native BNB models + safe defaults ada43a6ec2

Fix 1: Native quantization loading
- Pass quantization_config explicitly in has_native_quant branch
- BNB models (e.g., unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit) were
  loading in full FP16 instead of 4-bit despite log saying 'native quant'
- Caused CUDA assert from numerical overflow in attention softmax
- VRAM usage now ~1.5GB instead of ~4GB for 1.5B quantized models

Fix 2: Default LOAD for local Docker
- Changed from 'llm,image,audio' to 'llm,image'
- 6GB VRAM cannot fit all three models simultaneously
- Audio fails with meta tensor error when VRAM exhausted
- Users with more VRAM can override via LOAD env var

Tests: 72 passed, 8 skipped

fix: remove explicit quantization_config to let transformers auto-detect 2b67eca797

The previous fix passed model_config.quantization_config (a plain dict)
which caused: 'model is quantized with BitsAndBytesConfig but you are
passing a dict config.'

Transformers auto-detects quantization_config from config.json and
instantiates the proper class internally. Explicit passing is redundant
and now causes type mismatch in v5.

Removed the parameter entirely from the native/pre-quantized branch.

fix: TTS endpoint returns raw audio file instead of base64 JSON 7eb4059681

Changed /v1/audio/speech to return a StreamingResponse with raw audio
bytes (Content-Type: audio/wav) matching OpenAI API specification.

Changes:
- Added audio_to_bytes() helper in src/core/audio.py
- Modified audio_to_base64() to reuse audio_to_bytes()
- Updated /v1/audio/speech endpoint to return StreamingResponse
- Added Content-Disposition header for file download

Previously returned JSON with base64-encoded audio string.
Now returns raw WAV file for direct playback/download.

Tests: 72 passed, 8 skipped
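
The split between `audio_to_bytes()` and `audio_to_base64()` can be sketched with the standard library alone. This is an assumption-laden illustration: the signatures are guessed, and the real helper in src/core/audio.py may use a different audio library and sample format (16-bit mono PCM is assumed here).

```python
import base64
import io
import struct
import wave
from typing import Sequence


def audio_to_bytes(samples: Sequence[float], sample_rate: int) -> bytes:
    """Encode float samples in [-1, 1] as a 16-bit mono WAV file in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buf.getvalue()


def audio_to_base64(samples: Sequence[float], sample_rate: int) -> str:
    """Base64 variant that reuses audio_to_bytes(), as the commit describes."""
    return base64.b64encode(audio_to_bytes(samples, sample_rate)).decode("ascii")
```

An endpoint can then stream the raw bytes with `Content-Type: audio/wav`, while any caller that still needs JSON can keep using the base64 wrapper.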

fix: add attention_mask and pad_token_id to TTS generation b906137b81

Eliminates transformers warnings:
- 'attention mask and the pad token id were not set'
- 'Setting pad_token_id to eos_token_id for open-end generation'

Changes:
- _generate_bark_speech(): add attention_mask + pad_token_id
- _generate_voxtral_speech(): add attention_mask + pad_token_id
- Both now create attention_mask when missing
- Both now set pad_token_id to eos_token_id when unset
- pad_token_id passed explicitly to model.generate()

Tests: syntax verified

feat: add single-file chat interface with streaming, markdown, and quick actions 30b31a2ff8

fix: docker-compose default LOAD to llm,image (was tts,image) 426a907c8a

docs: add VLLM.md, update TTS example, fix LOAD defaults, add chat interface 11190c6efe

feat: add /chat route and links from landing page aa914d38d8

refactor(main): split create_app into focused helper functions with tests ff37047b68

fix(middleware): add /chat to public endpoints 821bd480e4

docs: move DEPLOY_DOCKER, DOCKERHUB, MODELS, RULES to docs/ and update references ff29ad4c0f

fix(chat): fix auth button, replace emojis with SVGs, match index.html aesthetic 2d729b4c71

fix(chat): fix undefined chat bug, improve alignment, update robot icon ad1fc419a5

refactor(chat): rewrite with Alpine.js components, extract CSS/JS modules, add strip reasoning toggle 4e70174598

fix(chat): merge JS modules into single file, fix Alpine.js loading order 81f0ca7763

feat(chat): improve streaming and media workflows 422f1cf6ac

rdenadai added 2 commits

2026-06-12 02:34:37 +00:00

feat(chat): filter mode options by loaded model types daede47562

docs: update public endpoints, chat route, testing tree, and vLLM default 5dae549749


feat: Add vLLM backend, pluggable architecture, and Docker build improvements #1
