I spent some time exploring Deepgram’s Next.js starter app. I was hoping I could use it to generate a transcription in realtime but it was more like real-time captions. The responses from the server were sometimes corrections of previous transcriptions. Maybe there is a way to make this a transcription but I wasn’t sure.
I also tried out vocode’s Python library for building voice-based LLM applications. By far the hardest part of this was getting an Azure account, which I believe is used to synthesize the LLM response as speech. I had the demo working end to end but it was a bit sensitive to background noise as they note. I haven’t had a chance to play around with any configurations they provide.