feat: Add vLLM backend, pluggable architecture, and Docker build improvements #1

Open
rdenadai wants to merge 47 commits from rdenadai/improvements-v0.2.0 into main
Owner

vLLM Backend

  • Added vLLM 0.12.0 backend with AsyncLLMEngine and continuous batching
  • Supports concurrent requests (no serialization lock like transformers)
  • Auto-detects tokenizer_mode="mistral" for Mistral models
  • Added system role normalization (merges into first user message)
  • Added context length pre-flight validation
  • Fixed tokenizer resolution via HF AutoTokenizer
  • Default model: microsoft/Phi-4-mini-instruct (128K context, function calling)

Architecture Refactoring

  • Extracted ChatService from routes (response builder + cache service facade)
  • Created ModelLoader class for orchestrated model loading
  • Added LLMBackend Protocol with transformers and vLLM implementations
  • Extracted ReasoningDetector with unified _process() method
  • Created AsyncQueueIterator base class for streamers
  • Extracted image processing to src/utils/multimodal.py
  • Added processor_utils.py for shared template/multimodal logic
  • Added input validation module (src/api/validation.py) with bounds checking

Docker & Build

  • Migrated all images to pyenv with Python 3.13.13 (PGO+LTO compiled)
  • Separate dependency files per image: pyproject.{local,cuda,vllm}.toml + uv lockfiles
  • Optimized vLLM Dockerfile: CUDA 12.8 runtime base
  • vLLM image supports full stack (LOAD=all)
  • Added UV_HTTP_TIMEOUT=600 for large CUDA wheel downloads
  • Fixed TRANSFORMERS_CACHE deprecation warning

Dependencies

  • transformers: 4.52.4 → >=5.1.0 (root/local/cuda images)
  • transformers: pinned <5 in vLLM image (vLLM 0.12.0 incompatibility)
  • diffusers: >=0.32.0 → >=0.37.0 (v5 support)
  • huggingface-hub: >=0.27.0 → >=1.0.0 (v5 requirement)
  • Added protobuf>=3.20.0 to all images (fixes SD 3.5 tokenizer load)
  • Added orjson and xformers to docker local/cuda images

Transformers v5 Compatibility

  • processor_utils.py: normalizes BatchEncoding return from apply_chat_template
  • streamer.py: fallback import for BaseStreamer (v5 path change)
  • Qwen3.5 and Gemma-4 models require transformers>=5.1.0 (LOCAL/CUDA only)

Performance Optimizations

  • Replaced SHA256 with blake2b in request dedup (5-10x faster)
  • Added 5-second cache to health endpoint
  • Compiled regex patterns at module level in text.py
  • Added LRU cache (maxsize=32) for in-memory images
  • Added orjson for SSE streaming JSON serialization
  • Parallel model loading via asyncio.gather()

Testing

  • Added 60+ new tests (212 total):
  • test_chat_service.py, test_middleware.py, test_manage_keys.py
  • test_vllm_streamer.py, test_base_iterator.py, test_reasoning_builder.py
  • test_model_loader.py
  • Docker-only markers for torch-dependent tests
  • Mocked auth, external services, and async timeouts

Documentation

  • Restructured MODELS.md with per-GPU columns and 30+ models
  • Added context length configuration table
  • Added Qwen3.5, Gemma4, GLM, MoonshotAI model listings
  • Marked vLLM-incompatible models (Qwen3.5, Gemma4)
  • Updated DEPLOY_DOCKER.md with vLLM instructions
  • Updated README.md with architecture diagram

Bug Fixes

  • Fixed reasoning_builder.py double-emission bug
  • Fixed vLLM silent exception swallowing
  • Fixed Docker startup VRAM check crash
  • Fixed non-existent model IDs in docs
  • Added torchvision for multimodal models
  • Fixed Bark TTS pad token handling

Backward Compatibility

  • Transformers backend remains default
  • vLLM is opt-in via --vllm flag
  • All existing configs and models work unchanged
## vLLM Backend - Added vLLM 0.12.0 backend with AsyncLLMEngine and continuous batching - Supports concurrent requests (no serialization lock like transformers) - Auto-detects tokenizer_mode="mistral" for Mistral models - Added system role normalization (merges into first user message) - Added context length pre-flight validation - Fixed tokenizer resolution via HF AutoTokenizer - Default model: microsoft/Phi-4-mini-instruct (128K context, function calling) ## Architecture Refactoring - Extracted ChatService from routes (response builder + cache service facade) - Created ModelLoader class for orchestrated model loading - Added LLMBackend Protocol with transformers and vLLM implementations - Extracted ReasoningDetector with unified _process() method - Created AsyncQueueIterator base class for streamers - Extracted image processing to src/utils/multimodal.py - Added processor_utils.py for shared template/multimodal logic - Added input validation module (src/api/validation.py) with bounds checking ## Docker & Build - Migrated all images to pyenv with Python 3.13.13 (PGO+LTO compiled) - Separate dependency files per image: pyproject.{local,cuda,vllm}.toml + uv lockfiles - Optimized vLLM Dockerfile: CUDA 12.8 runtime base - vLLM image supports full stack (LOAD=all) - Added UV_HTTP_TIMEOUT=600 for large CUDA wheel downloads - Fixed TRANSFORMERS_CACHE deprecation warning ## Dependencies - transformers: 4.52.4 → >=5.1.0 (root/local/cuda images) - transformers: pinned <5 in vLLM image (vLLM 0.12.0 incompatibility) - diffusers: >=0.32.0 → >=0.37.0 (v5 support) - huggingface-hub: >=0.27.0 → >=1.0.0 (v5 requirement) - Added protobuf>=3.20.0 to all images (fixes SD 3.5 tokenizer load) - Added orjson and xformers to docker local/cuda images ## Transformers v5 Compatibility - processor_utils.py: normalizes BatchEncoding return from apply_chat_template - streamer.py: fallback import for BaseStreamer (v5 path change) - Qwen3.5 and Gemma-4 models require transformers>=5.1.0 (LOCAL/CUDA only) ## Performance Optimizations - Replaced SHA256 with blake2b in request dedup (5-10x faster) - Added 5-second cache to health endpoint - Compiled regex patterns at module level in text.py - Added LRU cache (maxsize=32) for in-memory images - Added orjson for SSE streaming JSON serialization - Parallel model loading via asyncio.gather() ## Testing - Added 60+ new tests (212 total): - test_chat_service.py, test_middleware.py, test_manage_keys.py - test_vllm_streamer.py, test_base_iterator.py, test_reasoning_builder.py - test_model_loader.py - Docker-only markers for torch-dependent tests - Mocked auth, external services, and async timeouts ## Documentation - Restructured MODELS.md with per-GPU columns and 30+ models - Added context length configuration table - Added Qwen3.5, Gemma4, GLM, MoonshotAI model listings - Marked vLLM-incompatible models (Qwen3.5, Gemma4) - Updated DEPLOY_DOCKER.md with vLLM instructions - Updated README.md with architecture diagram ## Bug Fixes - Fixed reasoning_builder.py double-emission bug - Fixed vLLM silent exception swallowing - Fixed Docker startup VRAM check crash - Fixed non-existent model IDs in docs - Added torchvision for multimodal models - Fixed Bark TTS pad token handling ## Backward Compatibility - Transformers backend remains default - vLLM is opt-in via --vllm flag - All existing configs and models work unchanged
rdenadai changed title from rdenadai/improvements-v0.2.0 to feat: Add vLLM backend, pluggable architecture, and Docker build improvements 2026-05-28 11:25:43 +00:00
rdenadai force-pushed rdenadai/improvements-v0.2.0 from 33e3a813d0 to 3d0afc7708 2026-05-28 13:04:21 +00:00 Compare
- Add vLLM 0.12.0 backend with AsyncLLMEngine for high-performance inference
- Fix vLLM tokenizer resolution (load HF tokenizer for chat template support)
- Add system role normalization for vLLM compatibility (Mistral/Phi models)
- Add context length pre-flight check with clear error messages
- Improve error logging across vLLM backend and chat service
- Optimize vLLM Dockerfile: CUDA 12.8 runtime, deadsnakes PPA Python 3.13
- Enable full stack support in vLLM image (LOAD=all for LLM+image+audio+tts)
- Update MODELS.md with context length configuration and vLLM examples
- Update DEPLOY_DOCKER.md with vLLM build instructions
- Add 38 new tests for vLLM backend, chat service, model loader
- Fix TRANSFORMERS_CACHE deprecation warning in Docker startup
- Update default model to Phi-4-mini-instruct (function calling, 128K context)
- Replace deadsnakes PPA with pyenv across all Dockerfiles
- Compile Python 3.13.13 with --enable-optimizations --with-lto
- Use -mtune=generic for CPU compatibility
- Create separate dependency files per image:
  * pyproject.local.toml + uv.local.lock (CUDA 11.8)
  * pyproject.cuda.toml + uv.cuda.lock (CUDA 12.4)
  * pyproject.vllm.toml + uv.vllm.lock (CUDA 12.8)
- Update documentation (DEPLOY_DOCKER.md, RULES.md, README.md, DOCKERHUB.md)
- All Docker images now use Ubuntu 22.04 base with pyenv
Add bounds checking for API parameters:
- max_tokens: 1-128000
- temperature: 0.0-2.0
- num_inference_steps: 1-50
- guidance_scale: 0.0-30.0
- width/height: 64-2048, multiple of 8
- n: 1-4 concurrent images

New src/api/validation.py provides reusable validators.
Add Unsloth Gemma 4 QAT (Quantization-Aware Training) variants:
- GGUF versions: E2B, E4B, 12B, 26B-A4B, 31B
- Unquantized BF16 versions for all sizes
- Updated VRAM estimates based on actual file sizes
- Added QAT models to Multimodal Models section
- Added QAT recommendations for RTX 4090 and L40S
- Added QAT-based model combinations

Verified all models exist on HuggingFace.
Add docker_only and requires_torch markers to pytest.ini.
Add module-level guards to 8 test files importing torch/CUDA.
Add Docker detection fixture to conftest.py for future use.
Remove redundant tests/unit/api/conftest.py.

Tests skip cleanly outside Docker with clear messaging about CUDA/cuDNN requirements.
- transformers: 4.52.4 -> >=5.1.0 in root/local/cuda pyproject.toml
- transformers: pinned <5 in vLLM image (vLLM 0.12.0 incompatibility)
- diffusers: >=0.32.0 -> >=0.37.0 (v5 support added in 0.37.0)
- huggingface-hub: >=0.27.0 -> >=1.0.0 (v5 requirement)
- Add protobuf>=3.20.0 to all images (fix SD 3.5 tokenizer load)
- Add orjson and xformers to docker local/cuda images
- processor_utils.py: normalize BatchEncoding return from apply_chat_template for v5 compat
- streamer.py: add fallback import for BaseStreamer (v5 path change)
- MODELS.md: mark Qwen3.5/Gemma4 as incompatible with vLLM backend
- README.md: clarify BaseStreamer import error for v5
- Regenerate all uv lockfiles

Fixes runtime dependency failures for newer models (Qwen3.5, Gemma4)
while maintaining vLLM 0.12.0 backward compatibility.
- transformers.py: filter out multimodal keys (mm_token_type_ids) that
  v5 processors include but generate() rejects. Prevents 500 errors on
  Qwen3.5 and other multimodal models.
- Replace deprecated torch_dtype= with dtype= in all from_pretrained()
  calls across models.py, audio.py, images.py. Both v4.57.6 and v5.x
  support dtype parameter.
- images.py: revert diffusers pipeline params from dtype back to torch_dtype
  (diffusers pipelines ignore dtype keyword, causing float32/float16 mismatch)
- audio.py: use device_map=device instead of low_cpu_mem_usage + .to(device)
  to avoid meta tensor copy errors in transformers v5 for Bark and Whisper
- models.py: restrict AutoProcessor detection to explicit multimodal
  architectures/model_type. Prevents text-only models like Qwen3.5 from
  using AutoProcessor and producing malformed inputs.
- chat_service.py: pass actual exception object to fail_dedup instead of
  string, fixing secondary TypeError in error handling.
Restores low_cpu_mem_usage=True for v5 compatibility while keeping
device_map=device to avoid meta tensor copy errors. Both flags are
supported together in transformers >=4.56 and v5.x.
Removes the defensive insertion of a dummy user message when the
conversation doesn't start with a user role. This was causing agent
tools (like opencode) to see artificial context and produce confused
reasoning. All models in MODELS.md are modern instruction-tuned models
that natively support system-first conversations via their chat templates.
Streamer:
- Remove incorrect next_tokens_are_prompt skip that truncated first token

Audio:
- Bark float16→float32 for soundfile compatibility
- Add max_length=None override to avoid config default conflicts
- Create attention_mask when missing

Transformers backend:
- Add attn_implementation=eager for BNB 4-bit numerical stability
- Disable use_cache for quantized models during generation
- Clamp temperature to min 0.01 to prevent softmax division-by-zero
- Create attention_mask when missing

Models:
- Replace SystemExit with RuntimeError for gated model access checks
- Add attn_implementation=eager to all model loading paths

API:
- Fix model ID extraction for audio/tts in /v1/models endpoint

Docker:
- Rewrite default Dockerfile to use nvidia/cuda:11.8 base image
- Tie default Dockerfile to root pyproject.toml and uv.lock
- Remove redundant docker/pyproject.local.toml and uv.local.lock
- Update docker-compose with env var interpolation for all 3 services
- Remove hagalaz-runpod service
- Add GPU reservations to default hagalaz service

Frontend:
- Add JavaScript to auto-replace YOUR_HOST placeholder with actual host

Docs:
- Update .env.example, README, DEPLOY_DOCKER.md for new Docker setup

Tests: 72 passed, 8 skipped (1 pre-existing failure unrelated to changes)
The extra }); after DOMContentLoaded listener caused:
- Uncaught SyntaxError: expected expression, got '}'
- YOUR_HOST:8000 placeholder never replaced with actual host

1 line removed. Fixes regression from previous commit.
Allow selective model loading via comma-separated lists:
  --load llm,image,audio
  LOAD=llm,image,audio

Also preserves legacy aliases:
  both -> llm+image
  all  -> llm+image+audio+tts

Changes:
- config.py: add _parse_load type with validation
- model_loader.py: add _normalize_load for backward compat
- validation updated to check set membership

Fixes Docker Compose default LOAD=llm,image,audio failing with
'invalid choice: llm,image,audio'.
A. Reorder quantization branch priority in _load_standard_model()
   - Native/pre-quantized models (Unsloth, GPTQ, AWQ) now load first
   - Fixes DeepSeek 1.5B being loaded in FP16 instead of native 4-bit
   - Prevents CUDA assert from numerical overflow in attention softmax

D. VRAM-based KV cache threshold (12GB)
   - >= 12GB: cache enabled (fast generation, RTX 3060+, L40S)
   - < 12GB: cache disabled (stable on RTX 1060 6GB)
   - Eliminates VRAM corruption on limited GPUs

E. Set max_length=None in generation kwargs
   - Resolves transformers warning about max_new_tokens vs max_length conflict
   - Prevents unnecessary memory allocation from model's default 131072 context

Impact:
- hagalaz (RTX 1060 6GB): Pre-quantized models load correctly, cache disabled
- hagalaz-cuda (L40S 48GB): Cache enabled, full speed
- hagalaz-vllm: No impact (uses vLLM's PagedAttention)
Fix 1: Native quantization loading
- Pass quantization_config explicitly in has_native_quant branch
- BNB models (e.g., unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit) were
  loading in full FP16 instead of 4-bit despite log saying 'native quant'
- Caused CUDA assert from numerical overflow in attention softmax
- VRAM usage now ~1.5GB instead of ~4GB for 1.5B quantized models

Fix 2: Default LOAD for local Docker
- Changed from 'llm,image,audio' to 'llm,image'
- 6GB VRAM cannot fit all three models simultaneously
- Audio fails with meta tensor error when VRAM exhausted
- Users with more VRAM can override via LOAD env var

Tests: 72 passed, 8 skipped
The previous fix passed model_config.quantization_config (a plain dict)
which caused: 'model is quantized with BitsAndBytesConfig but you are
passing a dict config.'

Transformers auto-detects quantization_config from config.json and
instantiates the proper class internally. Explicit passing is redundant
and now causes type mismatch in v5.

Removed the parameter entirely from the native/pre-quantized branch.
Changed /v1/audio/speech to return a StreamingResponse with raw audio
bytes (Content-Type: audio/wav) matching OpenAI API specification.

Changes:
- Added audio_to_bytes() helper in src/core/audio.py
- Modified audio_to_base64() to reuse audio_to_bytes()
- Updated /v1/audio/speech endpoint to return StreamingResponse
- Added Content-Disposition header for file download

Previously returned JSON with base64-encoded audio string.
Now returns raw WAV file for direct playback/download.

Tests: 72 passed, 8 skipped
Eliminates transformers warnings:
- 'attention mask and the pad token id were not set'
- 'Setting pad_token_id to eos_token_id for open-end generation'

Changes:
- _generate_bark_speech(): add attention_mask + pad_token_id
- _generate_voxtral_speech(): add attention_mask + pad_token_id
- Both now create attention_mask when missing
- Both now set pad_token_id to eos_token_id when unset
- pad_token_id passed explicitly to model.generate()

Tests: syntax verified
This pull request can be merged automatically.
You are not authorized to merge this pull request.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin rdenadai/improvements-v0.2.0:rdenadai/improvements-v0.2.0
git switch rdenadai/improvements-v0.2.0

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

git switch main
git merge --no-ff rdenadai/improvements-v0.2.0
git switch rdenadai/improvements-v0.2.0
git rebase main
git switch main
git merge --ff-only rdenadai/improvements-v0.2.0
git switch rdenadai/improvements-v0.2.0
git rebase main
git switch main
git merge --no-ff rdenadai/improvements-v0.2.0
git switch main
git merge --squash rdenadai/improvements-v0.2.0
git switch main
git merge --ff-only rdenadai/improvements-v0.2.0
git switch main
git merge rdenadai/improvements-v0.2.0
git push origin main
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
rdenadai/hagalaz!1
No description provided.