Demystifying automatic multi-speaker video transcription using AI & ML

1 minute

June 28, 2023

I recently built an automated solution for multi-speaker video transcription. It combines two machine learning models: one that works out who is speaking when, and one that turns the speech into text.

The project needs Python 3.8, ffmpeg, and libsndfile1. On a Debian or Ubuntu system, a single command installs all three:

$ apt-get install ffmpeg libsndfile1 python3.8-venv -y

The automatic video transcription tool relies on two powerful components:

Pyannote.audio’s speaker diarization pipeline (standalone version available), which segments the audio by speaker and matches speakers across different time frames using AgglomerativeClustering.
The Whisper model from OpenAI, which transcribes English speech to text.

The pipeline has three steps: decode the audio input, run speaker diarization on it, and finally transcribe it with Whisper.
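Diarization yields speaker turns and transcription yields timed text, so at some point the two outputs have to be joined. Here is a minimal sketch of one way to do that, by giving each transcribed segment the speaker whose turn overlaps it the most. The `Turn` and `Segment` classes, the `assign_speakers` function, and the sample timestamps are hypothetical names and data made up for illustration; this is not necessarily how the repository does it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """One diarization result: a speaker talking from start to end (seconds)."""
    start: float
    end: float
    speaker: str


@dataclass
class Segment:
    """One transcribed chunk of speech with its timestamps (seconds)."""
    start: float
    end: float
    text: str
    speaker: Optional[str] = None


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: List[Segment], turns: List[Turn]) -> List[Segment]:
    """Label each segment with the speaker whose turn overlaps it the most.

    Segments that overlap no turn keep speaker=None.
    """
    labelled = []
    for seg in segments:
        best_speaker = None
        best_overlap = 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_overlap = shared
                best_speaker = turn.speaker
        labelled.append(Segment(seg.start, seg.end, seg.text, best_speaker))
    return labelled


# Made-up sample data standing in for real diarization and transcription output.
turns = [Turn(0.0, 4.0, "SPEAKER_00"), Turn(4.0, 9.0, "SPEAKER_01")]
segments = [
    Segment(0.5, 3.5, "Hello and welcome."),
    Segment(3.8, 8.0, "Thanks for having me."),
]

labelled = assign_speakers(segments, turns)
for seg in labelled:
    print(f"[{seg.start:.1f}-{seg.end:.1f}] {seg.speaker}: {seg.text}")
```

Picking the turn with the largest overlap is a simple heuristic. It handles segments that straddle a speaker change reasonably well, but it assigns only one speaker per segment, so overlapping speech would need a finer-grained approach.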

You can dive deeper into the project’s details on my GitHub repository: https://github.com/guluarte/speaker-transcribe

To try it yourself, install yt-dlp, pick a video, download it, and run the prediction script:

$ yt-dlp https://www.youtube.com/watch?v=HKp68LX7fPA -f 251 -o HKp68LX7fPA.webm
$ python predict.py "./HKp68LX7fPA.webm" "./HKp68LX7fPA.json"
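The first pipeline step, decoding the audio, can also be done by hand with ffmpeg if you want to inspect the audio before transcribing. The sketch below builds and runs an ffmpeg command that converts a downloaded file to a mono 16 kHz WAV. The helper names `build_ffmpeg_command` and `decode_audio` are hypothetical, and 16 kHz mono is an assumption (a common input format for speech models), not necessarily what predict.py does internally.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(source: str, target: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that decodes `source` into a mono WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the target if it already exists
        "-i", source,             # input file (audio or video container)
        "-vn",                    # drop any video stream
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the requested rate
        target,
    ]


def decode_audio(source: str, target: str, sample_rate: int = 16000) -> Path:
    """Run ffmpeg and return the path of the decoded file.

    Raises subprocess.CalledProcessError if ffmpeg exits with an error.
    """
    subprocess.run(build_ffmpeg_command(source, target, sample_rate), check=True)
    return Path(target)


# Show the command for the file downloaded above without running ffmpeg.
command = build_ffmpeg_command("./HKp68LX7fPA.webm", "./HKp68LX7fPA.wav")
print(" ".join(command))
```

Calling `decode_audio("./HKp68LX7fPA.webm", "./HKp68LX7fPA.wav")` would then write the WAV file next to the download, provided ffmpeg is installed as described earlier.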

A free GPU from Paperspace speeds up transcription considerably, processing each video in about 10-20 minutes. You can get the free Paperspace GPU here: https://console.paperspace.com/signup?R=QI58Z1X

Unlock the power of AI & ML with automatic multi-speaker video transcription today!

Hi, I'm Rodo 👋

I'm a Software Engineer
