
Transcribe Audio2Text + Summary

System screenshot: image.png

This project exposes two entry points: a FastAPI web app for MP3 uploads and a CLI for local transcription. Whisper handles speech-to-text, and a Hugging Face chat model cleans up the transcript and produces the summary.

Audio must be in MP3 format. If your file is in a different format, convert it with ffmpeg:

ffmpeg -i audio.mp4 -vn -acodec libmp3lame -q:a 2 audio.mp3

ffmpeg -i audio.m4a -c:a libmp3lame -q:a 2 audio.mp3

Install dependencies

If you don't have uv installed, you can install it by running the official install script:

curl -LsSf https://astral.sh/uv/install.sh | sh

Next, install the project dependencies with uv (a Python package and project manager):

uv sync

GGUF checkpoints require a recent Transformers release and a matching .gguf filename. CUDA is required; the app loads the audio and text models one at a time so each model stays on the GPU instead of relying on CPU offload.
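
To make that loading order concrete, here is a minimal sketch of the flow using the transformers API. It is an illustration, not the code in src/; it reuses the default model names from the configuration shown later, and the prompt string is a stand-in for the project's real prompts.

import gc
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# 1) Load Whisper first and transcribe while it has the GPU to itself.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device="cuda",
    chunk_length_s=30,
    stride_length_s=5,
)
transcript = asr("audio/your-file.mp3")["text"]

# 2) Free the audio model before the text model loads, so they never share VRAM.
del asr
gc.collect()
torch.cuda.empty_cache()

# 3) Load the GGUF chat model (gguf_file requires a recent transformers release).
repo = "unsloth/Qwen3.5-2B-GGUF"
gguf = "Qwen3.5-2B-UD-IQ2_M.gguf"
tok = AutoTokenizer.from_pretrained(repo, gguf_file=gguf)
llm = AutoModelForCausalLM.from_pretrained(repo, gguf_file=gguf).to("cuda")

# 4) Ask the model to clean up and summarize the transcript.
inputs = tok("Summarize this transcript:\n" + transcript, return_tensors="pt").to("cuda")
out = llm.generate(**inputs, max_new_tokens=4096, temperature=0.5, do_sample=True)
summary = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)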

P.S.: Take a look at the pyproject.toml file, in particular the tool.uv.index section; on my machine I had to add the pytorch-cu118 index to be able to install the torch package. If you have a newer NVIDIA GPU, you may need to change the index URL to match your CUDA version (or remove it if you don't need it).
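
For reference, a pinned PyTorch index in pyproject.toml can look like the following, using uv's documented [[tool.uv.index]] and [tool.uv.sources] syntax (the exact entry in this repository may differ; swap cu118 for your CUDA version):

[[tool.uv.index]]
name = "pytorch-cu118"
url = "https://download.pytorch.org/whl/cu118"
explicit = true

[tool.uv.sources]
torch = [{ index = "pytorch-cu118" }]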

Install ffmpeg if you don't already have it; it is required for converting audio files to MP3 format.

sudo apt install ffmpeg

Create a .env file

From the project root, copy .env.example to .env.

The application loads .env automatically on startup.

cp .env.example .env

This file contains the Hugging Face model names and other configuration. The values below are the project defaults; change them to use different models or settings as needed:

HUGGINGFACE_LLM_MODEL=unsloth/Qwen3.5-2B-GGUF
HUGGINGFACE_LLM_GGUF_FILE=Qwen3.5-2B-UD-IQ2_M.gguf
HUGGINGFACE_WHISPER_MODEL=openai/whisper-small

Supported environment variables

| Variable | Default | Purpose |
| --- | --- | --- |
| HUGGINGFACE_LLM_MODEL | unsloth/Qwen3.5-2B-GGUF | Chat model used for cleanup and summary generation |
| HUGGINGFACE_LLM_GGUF_FILE | Qwen3.5-2B-UD-IQ2_M.gguf | GGUF filename that switches the loader to Hugging Face GGUF support |
| HUGGINGFACE_LLM_MAX_TOKENS | 4096 | Maximum tokens generated for the summary |
| HUGGINGFACE_LLM_TEMPERATURE | 0.5 | Sampling temperature for the summary model |
| HUGGINGFACE_WHISPER_MODEL | openai/whisper-small | Whisper model used for transcription |
| HUGGINGFACE_WHISPER_CHUNK_LENGTH_S | 30 | Whisper chunk size in seconds |
| HUGGINGFACE_WHISPER_STRIDE_LENGTH_S | 5 | Whisper overlap (stride) in seconds |
| SYSTEM_PROMPT | project default | System prompt used by the summary model |
| USER_RESUME_PROMPT | project default | User prompt used to normalize and summarize the transcript |
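
For example, to trade speed for accuracy you could point transcription at a larger Whisper checkpoint in your .env (openai/whisper-large-v3 is an upstream Whisper model; it needs considerably more VRAM):

HUGGINGFACE_WHISPER_MODEL=openai/whisper-large-v3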

Run from the command line

From the project root, you can run the script to transcribe audio files directly from the command line.

uv run python -m src.main --sample audio/your-file.mp3 --output_file audio/transcribed.txt

Run the server

Another way to use the application is to run it as a server. This allows you to send audio files via HTTP requests and receive transcriptions and summaries in response.

uv run uvicorn src.app:app --host 0.0.0.0 --port 8000 --reload

The web UI is available at /. Uploads are accepted by POST /upload, and the response includes rendered transcript and summary HTML plus timing data.
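
For a quick smoke test without the browser, a multipart upload via curl should work. Note that the form field name file is an assumption here; check templates/ or src/app.py for the actual field name.

curl -X POST -F "file=@audio/your-file.mp3" http://localhost:8000/upload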

Runtime notes

  • CUDA is required. The app raises at startup if no GPU is available.
  • The audio and text models are loaded one at a time so each can use GPU memory before the next one starts.
  • The text model is loaded in 4-bit when CUDA is available, but it can still hit temporary VRAM spikes during initialization on small GPUs; if that happens, use a smaller model or a GPU with more VRAM (see the sketch after this list).
  • If HUGGINGFACE_LLM_GGUF_FILE is set to a .gguf filename, the text model switches to the Hugging Face GGUF loader automatically.
  • Uploaded files are written temporarily to uploads/ and removed after processing.
  • Generated transcripts and summaries are saved in audio/ with timestamped filenames for reference.
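
As a rough illustration of the 4-bit load mentioned above (a sketch under assumptions, not the project's actual loader): it applies to non-GGUF checkpoints, assumes bitsandbytes is installed, and the model name is a hypothetical stand-in.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hypothetical non-GGUF checkpoint, used only to illustrate the 4-bit path.
model_name = "Qwen/Qwen2.5-1.5B-Instruct"

quant = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NF4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quant, device_map="cuda"
)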