# Transcribe Audio2Text + Resumo
This project exposes two entry points: a FastAPI web app for MP3 uploads and a CLI for local transcription. Whisper handles speech-to-text, and a Hugging Face chat model cleans up the transcript and produces the summary.
Audio must be in MP3 format. If you have a different format, convert it with ffmpeg:

```sh
ffmpeg -i audio.mp4 -vn -acodec libmp3lame -q:a 2 audio.mp3
ffmpeg -i audio.m4a -c:a libmp3lame -q:a 2 audio.mp3
```
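If you prefer to convert files from Python instead of the shell, the same ffmpeg invocation can be built programmatically. This is a minimal sketch, not part of the project; the `build_ffmpeg_cmd` and `to_mp3` helpers are hypothetical:

```python
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(src: Path, dst: Path) -> list[str]:
    """Mirror the README's ffmpeg flags: drop video, encode MP3 at quality 2."""
    return [
        "ffmpeg", "-i", str(src),
        "-vn",                    # discard any video stream
        "-acodec", "libmp3lame",  # encode audio as MP3
        "-q:a", "2",              # LAME VBR quality level 2
        str(dst),
    ]

def to_mp3(src: Path) -> Path:
    """Convert src to an MP3 next to it (requires ffmpeg on PATH)."""
    dst = src.with_suffix(".mp3")
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
    return dst
```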
## Install dependencies
If you don't have uv installed, you can install it by running:

```sh
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Then install the project dependencies with uv (a Python package and project manager):

```sh
uv sync
```
GGUF checkpoints require a recent Transformers release and a matching `.gguf` filename. CUDA is required; the app loads the audio and text models one at a time so each model stays on the GPU instead of relying on CPU offload.
P.S.: Take a look at the `pyproject.toml` file, in particular the `tool.uv.index` section: on my machine I had to add the `pytorch-cu118` index to be able to install the `torch` package. If you have a newer NVIDIA GPU, you might need to change the index URL to match your CUDA version (or remove it if you don't need it).
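The one-model-at-a-time behavior described above can be sketched as a small pattern. The loader functions below are placeholders, not the project's actual API; with CUDA available you would also free the GPU cache when a model is released:

```python
import gc
from contextlib import contextmanager

@contextmanager
def loaded(factory):
    """Load a model, yield it, then free it before the next model loads."""
    model = factory()
    try:
        yield model
    finally:
        del model
        gc.collect()
        # On a real GPU setup: torch.cuda.empty_cache() here as well.

def run_pipeline(load_whisper, load_llm, audio_path):
    # Placeholder loaders; in the real app these would build HF pipelines.
    with loaded(load_whisper) as whisper:
        transcript = whisper(audio_path)
    with loaded(load_llm) as llm:  # Whisper's memory is freed before this load
        return llm(transcript)
```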
Install ffmpeg if you don't already have it; it is required to convert audio files to MP3 format.

```sh
sudo apt install ffmpeg
```
## Create a `.env` file
From the project root, copy `.env.example` to `.env`. The application loads `.env` automatically on startup.

```sh
cp .env.example .env
```
This file contains the Hugging Face model names and other configuration. The values below are the project defaults; change them to use different models or settings as needed.

```sh
HUGGINGFACE_LLM_MODEL=unsloth/Qwen3.5-2B-GGUF
HUGGINGFACE_LLM_GGUF_FILE=Qwen3.5-2B-UD-IQ2_M.gguf
HUGGINGFACE_WHISPER_MODEL=openai/whisper-small
```
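For reference, loading a `.env` file like the one above takes only a few lines. The project most likely uses a library such as python-dotenv for this; here is a stdlib-only sketch of the same idea:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Parse KEY=VALUE lines into os.environ, skipping blanks and comments."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault: real environment variables win over .env values
        os.environ.setdefault(key.strip(), value.strip())
```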
## Supported environment variables
| Variable | Default | Purpose |
|---|---|---|
| `HUGGINGFACE_LLM_MODEL` | `unsloth/Qwen3.5-2B-GGUF` | Chat model used for cleanup and summary generation |
| `HUGGINGFACE_LLM_GGUF_FILE` | `Qwen3.5-2B-UD-IQ2_M.gguf` | GGUF filename used to switch the loader to Hugging Face GGUF support |
| `HUGGINGFACE_LLM_MAX_TOKENS` | `4096` | Maximum tokens generated for the summary |
| `HUGGINGFACE_LLM_TEMPERATURE` | `0.5` | Sampling temperature for the summary model |
| `HUGGINGFACE_WHISPER_MODEL` | `openai/whisper-small` | Whisper model used for transcription |
| `HUGGINGFACE_WHISPER_CHUNK_LENGTH_S` | `30` | Whisper chunk size in seconds |
| `HUGGINGFACE_WHISPER_STRIDE_LENGTH_S` | `5` | Whisper overlap in seconds |
| `SYSTEM_PROMPT` | project default | System prompt used by the summary model |
| `USER_RESUME_PROMPT` | project default | User prompt used to normalize and summarize the transcript |
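The defaults in the table map naturally onto environment lookups with fallbacks. A sketch of how such a config might be read; the `Settings` dataclass is illustrative, not the project's actual code:

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    llm_model: str
    max_tokens: int
    temperature: float
    chunk_length_s: int
    stride_length_s: int

def load_settings() -> Settings:
    """Read environment variables, falling back to the table's defaults."""
    env = os.environ.get
    return Settings(
        llm_model=env("HUGGINGFACE_LLM_MODEL", "unsloth/Qwen3.5-2B-GGUF"),
        max_tokens=int(env("HUGGINGFACE_LLM_MAX_TOKENS", "4096")),
        temperature=float(env("HUGGINGFACE_LLM_TEMPERATURE", "0.5")),
        chunk_length_s=int(env("HUGGINGFACE_WHISPER_CHUNK_LENGTH_S", "30")),
        stride_length_s=int(env("HUGGINGFACE_WHISPER_STRIDE_LENGTH_S", "5")),
    )
```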
## Run from the command line
From the project root, you can run the script to transcribe audio files directly from the command line:

```sh
uv run python -m src.main --sample audio/your-file.mp3 --output_file audio/transcribed.txt
```
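The flags above suggest an argparse-style interface. A minimal sketch of such a parser, for orientation only; the project's actual `src.main` may differ:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI mirroring the flags shown in the command above."""
    parser = argparse.ArgumentParser(description="Transcribe an MP3 file to text.")
    parser.add_argument("--sample", required=True,
                        help="Path to the input MP3 file")
    parser.add_argument("--output_file",
                        help="Where to write the transcript")
    return parser
```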
## Run the server
Another way to use the application is to run it as a server. This allows you to send audio files via HTTP requests and receive transcriptions in response.

```sh
uv run uvicorn src.app:app --host 0.0.0.0 --port 8000 --reload
```
The web UI is available at `/`. Uploads are accepted by `POST /upload`, and the response includes the rendered transcript and summary HTML plus timing data.
## Runtime notes
- CUDA is required. The app raises at startup if no GPU is available.
- The audio and text models are loaded one at a time so each can use GPU memory before the next one starts.
- The text model is loaded in 4-bit when CUDA is available, but Gemma 4 can still hit temporary VRAM spikes during initialization on small GPUs; if that happens, use a smaller model or more VRAM.
- If `HUGGINGFACE_LLM_GGUF_FILE` is set to a `.gguf` filename, the text model switches to the Hugging Face GGUF loader automatically.
- Uploaded files are written temporarily to `uploads/` and removed after processing.
- Generated transcripts and summaries are saved in `audio/` with timestamped filenames for reference.
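The timestamped output naming mentioned above can be sketched like this; the exact filename pattern the app uses is an assumption:

```python
from datetime import datetime
from pathlib import Path

def timestamped_path(kind: str, directory: str = "audio") -> Path:
    """Build e.g. audio/transcript_20240101_120000.txt (pattern is illustrative)."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(directory) / f"{kind}_{stamp}.txt"
```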
