After All Is Said and Indexed – Unlocking Information in Recorded Speech

rektide · on April 15, 2023

Hadn't heard of the thing they were putting their data into: Marqo, a "tensor search for humans", https://github.com/marqo-ai/marqo

jeadie · on April 15, 2023

It's a great tool. Unlike vector DBs alone, Marqo helps with the full process that a lot of people end up wanting to use vector DBs for (e.g. have structured data, use LLMs to create embeddings, and perform search/CRUD on embeddings + original data).

jeadie · on April 15, 2023

A really interesting blog post I found on using LLMs for audio search, which I think is a pretty nifty/new idea.

I've found it cumbersome using some of the new vector DBs (chroma, faiss, etc) to make end to end systems, but with Marqo it doesn't seem too hard.

thomasahle · on April 15, 2023

> I've found it cumbersome using some of the new vector DBs (chroma, faiss, etc) to make end to end systems

What parts are cumbersome?

jeadie · on April 15, 2023

Most people, like me, who end up needing vector DBs want to use LLMs on a specific, often private dataset/use case. Typically one starts with something like unstructured JSON data, then needs to pick and manage LLMs to create embeddings, then store these and the original JSON data in a vector DB. Then the application is some variety of CRUD operations + searching over both the original data and the embeddings.

Chroma, Pinecone, and I guess FAISS/HNSWlib/etc. only handle vector operations. Really what I'd want, which Marqo does, is something that handles everything end to end.
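
To make that workflow concrete, here is a rough sketch of the pieces one ends up wiring together by hand: embed each document, keep the embedding next to the original JSON, and expose CRUD plus filtered search over both. Everything here is hypothetical and for illustration only. `TinyIndex`, `embed`, and the field names are made up, this is not Marqo's API, and `embed` is a toy bag-of-words stand-in rather than an LLM embedding.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyIndex:
    """Hypothetical store that keeps each original document next to its embedding."""

    def __init__(self, text_field):
        self.text_field = text_field
        self.docs = {}     # doc_id -> original JSON-like dict
        self.vectors = {}  # doc_id -> embedding of doc[text_field]

    def upsert(self, doc_id, doc):
        """Create or update: re-embed whenever the document changes."""
        self.docs[doc_id] = doc
        self.vectors[doc_id] = embed(doc[self.text_field])

    def get(self, doc_id):
        """Read the original document back (None if missing)."""
        return self.docs.get(doc_id)

    def delete(self, doc_id):
        """Remove the document and its embedding together."""
        self.docs.pop(doc_id, None)
        self.vectors.pop(doc_id, None)

    def search(self, query, limit=3, where=None):
        """Rank by embedding similarity, optionally filtering on the original data."""
        q = embed(query)
        hits = []
        for doc_id, doc in self.docs.items():
            if where is not None and not where(doc):
                continue
            hits.append((cosine(q, self.vectors[doc_id]), doc_id))
        hits.sort(key=lambda h: (-h[0], h[1]))  # best score first, id breaks ties
        return [(doc_id, score) for score, doc_id in hits[:limit]]


index = TinyIndex(text_field="transcript")
index.upsert("a", {"speaker": "alice", "transcript": "the quarterly budget review went well"})
index.upsert("b", {"speaker": "bob", "transcript": "we should ship the search feature next week"})
index.upsert("c", {"speaker": "alice", "transcript": "search quality improved after the index rebuild"})

everyone = index.search("search feature")
only_alice = index.search("search feature", where=lambda d: d["speaker"] == "alice")
```

With a plain vector library, the `docs` dictionary, the re-embedding on update, and the `where` filter are all things the application has to manage itself, which is the cumbersome part.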

notjulianjaynes · on April 15, 2023

This is interesting, but what problem does it solve better than CTRL+F-ing a transcript? It seems like this would be a worse solution when the precise way someone says something could be important (e.g. journalists parsing an interview, students studying their recorded lectures), and that it would be most useful if you were working with a large volume of recorded audio, such as customer service calls. This makes me somewhat uncomfortable, but perhaps I am not fully understanding how it works.

Edit: wording

jeadie · on April 15, 2023

Being able to handle and ask questions of audio data is a pretty big field. https://www.assemblyai.com/, for example, is a company entirely dedicated to audio intelligence. They have some great example use cases on their page.

UncleEntity · on April 15, 2023

> This is interesting but what problem does it solve better than CTRL+F-ing a transcript?

Producing the transcript?

Being able to classify and search data seems like a pretty big deal these days too.

password4321 · on April 15, 2023

Both speaker and speech recognition are done in the article using huggingface.

Is there anything as good ready to use on-prem for the diarization (speaker recognition)?

I've heard good things about whisper(.cpp) for speech recognition and vosk used to be king of that hill...

rolisz · on April 15, 2023

Diarization can be done on premise using pyannote (what they use in the article). Huggingface offers a library to run things locally and an API to run things on their cloud. Pyannote is available under an MIT licence.

boredemployee · on April 15, 2023

vosk is really good, but also a good example of an open source project that has great potential but doesn't scale up because the person behind it is a douchebag.

documentation is poor, and what you find is sparse, outdated shit on the web, so it's really hard to find help.

moneywoes · on April 15, 2023

How does this compare to using Whisper and feeding that into a vector DB and querying with an LLM?

Pardon the dumb question; I only have an elementary understanding.

jeadie · on April 15, 2023

Not a dumb question at all! Essentially, what Marqo handles, and what this blog post shows, is that there is a lot of logic and work involved in doing what you said (i.e. pass raw data into an LLM, get embeddings, store them in a vector DB, then query both the embeddings and the original data).