For several days now, I’ve been looking into recording audio in a browser and streaming it to a backend over a websocket, with the intent of doing speech-to-text conversion with an AI model.
I know the pieces are all there and I’ve done something like this before (streamed audio from a Twilio IVR to a Node backend, then sent that to a Google Dialogflow CX agent).
The current challenge is finding which pieces I want to connect.
I’ve used a lot of Next.js lately.
I like the developer experience.
It’s enjoyable to use for building frontends.
It also has route handlers, which are backend functions that get deployed as Lambdas if you deploy on Vercel.
These route handlers can’t really support a websocket backend because they aren’t designed to be long-lived, something I learned when I worked around it by creating a secondary route handler as an async function.
Apparently, these can now run for up to five minutes.¹

¹ Route handlers on Vercel can now run for a maximum of five minutes, up from the previous limit, which allows more complex operations to be handled directly within these functions.
Regardless, I would still need to stand up a separate backend for the websocket.
That seemed fair enough, so I started looking at Deno, which I’ve also used recently and enjoyed.
Deno supports websockets out of the box.
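The server side ends up being pleasantly small. Here’s a rough sketch of the shape I have in mind, with the port, the status code, and the logging all being my own placeholder choices:

```ts
// Minimal sketch of a Deno WebSocket server; port and logging are placeholders.
Deno.serve({ port: 8000 }, (req) => {
  if (req.headers.get("upgrade") !== "websocket") {
    return new Response("expected a websocket connection", { status: 426 });
  }

  const { socket, response } = Deno.upgradeWebSocket(req);
  socket.binaryType = "arraybuffer"; // deliver binary audio chunks as ArrayBuffer

  socket.onopen = () => console.log("browser connected");
  socket.onmessage = (event) => {
    // Each message should be a chunk of audio from the browser; this is where
    // it would get forwarded to the speech client.
    console.log(`received ${(event.data as ArrayBuffer).byteLength} bytes`);
  };
  socket.onclose = () => console.log("browser disconnected");

  return response;
});
```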
It also supports importing npm modules – I plan to use @google-cloud/speech to do the speech-to-text conversion.
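The streaming API there takes a config up front and then a stream of audio chunks. A sketch of what I expect that to look like, assuming WEBM/Opus audio (what MediaRecorder produces in Chrome), a 48 kHz sample rate, and default application credentials available in the environment:

```ts
// Sketch of the Google side; the encoding and sample rate are my assumptions
// until I confirm what the browser actually sends.
import speech from "npm:@google-cloud/speech";

const client = new speech.SpeechClient();

// streamingRecognize() returns a duplex stream: write audio chunks in,
// read interim and final transcripts out.
const recognizeStream = client
  .streamingRecognize({
    config: {
      encoding: "WEBM_OPUS",
      sampleRateHertz: 48000,
      languageCode: "en-US",
    },
    interimResults: true,
  })
  .on("error", console.error)
  .on("data", (data: any) => {
    console.log(data.results[0]?.alternatives[0]?.transcript);
  });

// Inside the websocket handler, each incoming audio chunk gets written here:
// recognizeStream.write(new Uint8Array(event.data));
```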
The remaining question is how to stream audio captured in the browser with navigator.mediaDevices.getUserMedia over a websocket so it can be forwarded to Google and converted to text.
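My current guess is that MediaRecorder is the simplest way to get chunks I can push down the socket, though I haven’t ruled out an AudioWorklet producing raw PCM instead. A browser-side sketch, assuming the Deno server above is listening on localhost:8000:

```ts
// Rough sketch of the browser side; the URL, mime type, and chunk interval
// are guesses I still need to validate.
const ws = new WebSocket("ws://localhost:8000");

ws.onopen = async () => {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, {
    mimeType: "audio/webm;codecs=opus",
  });

  recorder.ondataavailable = (event) => {
    // Each Blob is a chunk of encoded audio; send it straight down the socket.
    if (event.data.size > 0 && ws.readyState === WebSocket.OPEN) {
      ws.send(event.data);
    }
  };

  // Emit a chunk every 250 ms so the transcription stays close to real time.
  recorder.start(250);
};
```

Whether 250 ms chunks of webm/opus are the right trade-off for latency and accuracy is exactly the kind of thing I still need to test.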