# Transcribe Audio2Text + Resumo
This project exposes two entry points: a FastAPI web app for MP3 uploads and a CLI for local transcription. Whisper handles speech-to-text, and a Hugging Face chat model cleans up the transcript and produces the summary.
Audio must be in MP3 format. If you have a different format, convert it with ffmpeg:

```sh
ffmpeg -i audio.mp4 -vn -acodec libmp3lame -q:a 2 audio.mp3
ffmpeg -i audio.m4a -c:a libmp3lame -q:a 2 audio.mp3
```
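If you prefer to convert files from Python instead of the shell, the same ffmpeg invocation can be built programmatically. This is a minimal sketch, not part of the project; the `build_ffmpeg_cmd` and `to_mp3` helpers are hypothetical:

```python
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(src: Path, dst: Path) -> list[str]:
    """Mirror the README's ffmpeg flags: drop video, encode MP3 at quality 2."""
    return [
        "ffmpeg", "-i", str(src),
        "-vn",                    # discard any video stream
        "-acodec", "libmp3lame",  # encode audio as MP3
        "-q:a", "2",              # LAME VBR quality level 2
        str(dst),
    ]

def to_mp3(src: Path) -> Path:
    """Convert src to an MP3 next to it (requires ffmpeg on PATH)."""
    dst = src.with_suffix(".mp3")
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
    return dst
```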
## Install dependencies
If you don't have uv installed, you can install it by running:

```sh
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Then install the project dependencies with uv (a Python package and project manager):

```sh
uv sync
```
GGUF checkpoints require a recent Transformers release and a matching `.gguf` filename. CUDA is required; the app loads the audio and text models one at a time so each model stays on the GPU instead of relying on CPU offload.
P.S.: Take a look at the `pyproject.toml` file, in particular the `tool.uv.index` section: on my machine I had to add the `pytorch-cu118` index to be able to install the `torch` package. If you have a newer NVIDIA GPU, you might need to change the index URL to match your CUDA version (or remove it if you don't need it).
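The one-model-at-a-time behavior described above can be sketched as a small pattern. The loader functions below are placeholders, not the project's actual API; with CUDA available you would also free the GPU cache when a model is released:

```python
import gc
from contextlib import contextmanager

@contextmanager
def loaded(factory):
    """Load a model, yield it, then free it before the next model loads."""
    model = factory()
    try:
        yield model
    finally:
        del model
        gc.collect()
        # On a real GPU setup: torch.cuda.empty_cache() here as well.

def run_pipeline(load_whisper, load_llm, audio_path):
    # Placeholder loaders; in the real app these would build HF pipelines.
    with loaded(load_whisper) as whisper:
        transcript = whisper(audio_path)
    with loaded(load_llm) as llm:  # Whisper's memory is freed before this load
        return llm(transcript)
```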
Install ffmpeg if you don't already have it; it is required to convert audio files to MP3 format.

```sh
sudo apt install ffmpeg
```
## Create a `.env` file
From the project root, copy `.env.example` to `.env`. The application loads `.env` automatically on startup.

```sh
cp .env.example .env
```
This file contains the Hugging Face model names and other configuration. The values below are the project defaults; change them to use different models or settings as needed.

```sh
HUGGINGFACE_LLM_MODEL=unsloth/Qwen3.5-2B-GGUF
HUGGINGFACE_LLM_GGUF_FILE=Qwen3.5-2B-UD-IQ2_M.gguf
HUGGINGFACE_WHISPER_MODEL=openai/whisper-small
```
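For reference, loading a `.env` file like the one above takes only a few lines. The project most likely uses a library such as python-dotenv for this; here is a stdlib-only sketch of the same idea:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Parse KEY=VALUE lines into os.environ, skipping blanks and comments."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault: real environment variables win over .env values
        os.environ.setdefault(key.strip(), value.strip())
```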
## Supported environment variables
| Variable | Default | Purpose |
|---|---|---|
| `HUGGINGFACE_LLM_MODEL` | `unsloth/Qwen3.5-2B-GGUF` | Chat model used for cleanup and summary generation |
| `HUGGINGFACE_LLM_GGUF_FILE` | `Qwen3.5-2B-UD-IQ2_M.gguf` | GGUF filename used to switch the loader to Hugging Face GGUF support |
| `HUGGINGFACE_LLM_MAX_TOKENS` | `4096` | Maximum tokens generated for the summary |
| `HUGGINGFACE_LLM_TEMPERATURE` | `0.5` | Sampling temperature for the summary model |
| `HUGGINGFACE_WHISPER_MODEL` | `openai/whisper-small` | Whisper model used for transcription |
| `HUGGINGFACE_WHISPER_CHUNK_LENGTH_S` | `30` | Whisper chunk size in seconds |
| `HUGGINGFACE_WHISPER_STRIDE_LENGTH_S` | `5` | Whisper overlap in seconds |
| `SYSTEM_PROMPT` | project default | System prompt used by the summary model |
| `USER_RESUME_PROMPT` | project default | User prompt used to normalize and summarize the transcript |
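The defaults in the table map naturally onto environment lookups with fallbacks. A sketch of how such a config might be read; the `Settings` dataclass is illustrative, not the project's actual code:

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    llm_model: str
    max_tokens: int
    temperature: float
    chunk_length_s: int
    stride_length_s: int

def load_settings() -> Settings:
    """Read environment variables, falling back to the table's defaults."""
    env = os.environ.get
    return Settings(
        llm_model=env("HUGGINGFACE_LLM_MODEL", "unsloth/Qwen3.5-2B-GGUF"),
        max_tokens=int(env("HUGGINGFACE_LLM_MAX_TOKENS", "4096")),
        temperature=float(env("HUGGINGFACE_LLM_TEMPERATURE", "0.5")),
        chunk_length_s=int(env("HUGGINGFACE_WHISPER_CHUNK_LENGTH_S", "30")),
        stride_length_s=int(env("HUGGINGFACE_WHISPER_STRIDE_LENGTH_S", "5")),
    )
```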
## Run from the command line
From the project root, you can run the script to transcribe audio files directly from the command line:

```sh
uv run python -m src.main --sample audio/your-file.mp3 --output_file audio/transcribed.txt
```
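The flags above suggest an argparse-style interface. A minimal sketch of such a parser, for orientation only; the project's actual `src.main` may differ:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI mirroring the flags shown in the command above."""
    parser = argparse.ArgumentParser(description="Transcribe an MP3 file to text.")
    parser.add_argument("--sample", required=True,
                        help="Path to the input MP3 file")
    parser.add_argument("--output_file",
                        help="Where to write the transcript")
    return parser
```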
## Run the server
Another way to use the application is to run it as a server. This allows you to send audio files via HTTP requests and receive transcriptions in response.

```sh
uv run uvicorn src.app:app --host 0.0.0.0 --port 8000 --reload
```
The web UI is available at `/`. Uploads are accepted by `POST /upload`, and the response includes the rendered transcript and summary HTML plus timing data.
## Runtime notes
- CUDA is required. The app raises at startup if no GPU is available.
- The audio and text models are loaded one at a time so each can use GPU memory before the next one starts.
- The text model is loaded in 4-bit when CUDA is available, but Gemma 4 can still hit temporary VRAM spikes during initialization on small GPUs; if that happens, use a smaller model or more VRAM.
- If `HUGGINGFACE_LLM_GGUF_FILE` is set to a `.gguf` filename, the text model switches to the Hugging Face GGUF loader automatically.
- Uploaded files are written temporarily to `uploads/` and removed after processing.
- Generated transcripts and summaries are saved in `audio/` with timestamped filenames for reference.
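The timestamped output naming mentioned above can be sketched like this; the exact filename pattern the app uses is an assumption:

```python
from datetime import datetime
from pathlib import Path

def timestamped_path(kind: str, directory: str = "audio") -> Path:
    """Build e.g. audio/transcript_20240101_120000.txt (pattern is illustrative)."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(directory) / f"{kind}_{stamp}.txt"
```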
