I needed to do this this week (transcribe an interview with multiple speakers) a...

adipasquale · on Aug 9, 2024

I have had very good results using Spectropic [1], a hosted Whisper Diarization API service as a platform. I found it cheap and way easier and faster than setting up and using whisper-diarization on my M1. Audiogest [2] is a web service built upon Spectropic, I have not yet used it.

disclaimer : I am not affiliated in any way, just a happy customer! I had some nice mail exchanges after bug reports with the (I believe solo-)developer behind these tools.

---

[1] https://spectropic.ai/

[2] https://audiogest.app/

thomasmol · on Aug 9, 2024

Thanks for the shout-out and kind words!

Thomas here, maker of Spectropic and Audiogest. I am indeed focused on building a simple and reliable Whisper + diarization API. Also working on providing fine-tuned versions of Whisper of non-English languages through the API.

Feel free to reach out to me if anyone is interested in this!

dchuk · on Aug 9, 2024

Great looking API. Are you able to, or do you have plans, for there to be automatic speaker identification based on labeled samples of their voices? It would be great to basically have a library of known speakers that are auto matched when transcribing

thomasmol · on Aug 9, 2024

Thanks! That is something I might offer in the future and is definitely possible with a library like pyannote. Would be really cool to add for sure.

I am also experimenting with post-processing transcripts with LLMs to infer speaker names from a transcript. It works pretty decent already but it's still a bit expensive. I have this feature available under the 'enhanced' model if you want to check it out: https://docs.spectropic.ai/models/transcribe/enhanced

ukuina · on Aug 9, 2024

Hi! Any plans to support streaming transcription with diarization?

thomasmol · on Aug 10, 2024

Streaming is definitely on the to-do list! Its quite complex to stream both transcription + diarization, but we will get there eventually

H8crilA · on Aug 9, 2024

I often subtitle old, obscure, foreign language movies with Whisper. Or random clips found on foreign Telegram/Twitter channels. Paired up with some GPT for translation it works great!

You can do this locally if you have enough (V)RAM, but I prefer the OpenAI API, as usually I don't have enough at hand. And the various Llamas aren't really quality on par with GPT-4. If you only need Whisper, and no translation, then local execution is indeed very viable. High quality Whisper fits in 4GB of (V)RAM.

wanderingmind · on Aug 9, 2024

The problem with using OpenAI whisper is that its too slow on CPU only machines. Whisper.CPP is blazing fast compared to Whisper and I wish people build better diarization on top of that.

aidenn0 · on Aug 9, 2024

Another advantage of Whisper.CPP is that it can use cublas to accelerate models too large for your GPU memory; I can run the medium and large models with cublas on my 1050, but only the small if I use the pure GPU mode.

stavros · on Aug 9, 2024

What's OpenAI Whisper vs whisper.cpp? Do you mean whisper-diarization uses the API?

Zambyte · on Aug 9, 2024

https://github.com/openai/whisper

vs

https://github.com/ggerganov/whisper.cpp

They are two inference engines for running the whisper ASR model, each with their own API AFAIK.

stavros · on Aug 9, 2024

Ah I see, thanks. Hm, I would imagine that it's not hard to make something that works with both (the surface area of the API should be fairly small, I imagine), odd that projects use the former and not the latter.

RamblingCTO · on Aug 9, 2024

I had better success with whisperx, as whisper-dia does sometimes have weird issues I couldn't resolve: https://github.com/m-bain/whisperX

cube2222 · on Aug 9, 2024

iirc whisper-diarization uses whisperx under the hood.

I’ll be honest, I haven’t dived much into this as I just needed something transcribed quickly, but when I was looking at WhisperX I couldn’t find a CLI that would just out of the box give me a text file with a line per speaker statement (not per word).

RamblingCTO · on Aug 9, 2024

I use it like this:

whisperx $file int8 --min_speakers 3 --max_speakers 3 --language de --hf_token $token --diarize

stavros · on Aug 9, 2024

> iirc whisper-diarization uses whisperx under the hood.

It seems like it does:

https://github.com/MahmoudAshraf97/whisper-diarization/blob/...

hubraumhugo · on Aug 9, 2024

Fascinating how traditionally very complex and hard ML problems are slowly becomming commodities with AI:

- transcription

- machine translation

- OCR

- image recognition

terribleperson · on Aug 9, 2024

Does it hallucinate when there's dead air?

jduckles · on Aug 10, 2024