For several days now, I’ve been looking into recording audio in a browser and streaming it to a backend over a websocket, with the intent of doing speech-to-text conversion with an AI model.
I know the pieces are all there and I’ve done something like this before (streamed audio from a Twilio IVR to a Node backend, then sent that to a Google Dialogflow CX agent).
The current challenge is finding which pieces I want to connect.
I’ve used a lot of Next.js lately.
I like the developer experience.
It’s enjoyable to use for building frontends.
It also has route handlers, which are backend functions that are deployed on Lambda if you deploy on Vercel.
These route handlers can’t really support a websocket backend because they aren’t designed to be long-lived, something I learned when I worked around it by creating a secondary route handler as an async function.
Apparently, these can now run for up to five minutes.1
Route handlers on Vercel can now run for a maximum of five minutes, which is an increase from the previous limit. This allows for more complex operations to be handled directly within these functions.
Even so, to hold a long-lived websocket connection, I would need to stand up a separate backend.
That seemed fine and fair enough, so I started looking at Deno, which I’ve also used recently and enjoyed.
Deno supports websockets out of the box.
It also supports importing npm modules – I plan to use @google-cloud/speech
to do speech to text conversion.
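As a rough sketch of what I have in mind for the server side (I haven’t verified that @google-cloud/speech’s gRPC transport runs cleanly under Deno’s npm compatibility, and the encoding and sample rate values below are assumptions that would need to match whatever the browser actually sends):

```ts
// server.ts — sketch of a Deno websocket server that forwards audio chunks
// to Google's streaming speech recognition and sends transcripts back.
import speech from "npm:@google-cloud/speech";

const client = new speech.SpeechClient();

Deno.serve((req) => {
  const { socket, response } = Deno.upgradeWebSocket(req);
  socket.binaryType = "arraybuffer";

  // One streaming recognize session per websocket connection.
  const recognizeStream = client
    .streamingRecognize({
      config: {
        encoding: "WEBM_OPUS", // assumption: the browser sends webm/opus chunks
        sampleRateHertz: 48000, // assumption: must match the captured audio
        languageCode: "en-US",
      },
      interimResults: true,
    })
    .on("data", (data: any) => {
      const transcript = data.results?.[0]?.alternatives?.[0]?.transcript;
      if (transcript) socket.send(transcript);
    })
    .on("error", (err: Error) => console.error(err));

  socket.onmessage = (event) => {
    // Each websocket message is an audio chunk from the browser.
    recognizeStream.write(new Uint8Array(event.data as ArrayBuffer));
  };
  socket.onclose = () => recognizeStream.end();

  return response;
});
```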
The remaining question is how I can stream audio captured in the browser with navigator.getUserMedia
over a websocket and forward it to Google to convert to text.
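On the browser side, my working plan looks something like the following minimal sketch (the current API is navigator.mediaDevices.getUserMedia, and the websocket URL and chunk interval are placeholders):

```ts
// Browser-side sketch (in a <script type="module">): capture microphone audio
// and stream compressed chunks over a websocket.
const socket = new WebSocket("ws://localhost:8000"); // placeholder URL

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const recorder = new MediaRecorder(stream, { mimeType: "audio/webm;codecs=opus" });

recorder.ondataavailable = (event) => {
  // Each chunk is a compressed Blob; forward it to the backend as it arrives.
  if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
    socket.send(event.data);
  }
};

socket.onmessage = (event) => {
  // Transcripts come back from the server as plain text.
  console.log("transcript:", event.data);
};

// Emit a chunk every 250ms so the backend gets a steady stream to transcribe.
socket.onopen = () => recorder.start(250);
```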
This hardly seemed worth a TIL post because it was too easy, but I learned gpt-4
is proficient at building working ffmpeg
commands.
I wrote the prompt
convert m4a to mp3 with ffmpeg
and it responded with
ffmpeg -i input.m4a -codec:v copy -codec:a libmp3lame -q:a 2 output.mp3
Since the problem at hand was low stakes, I just ran the command and, to my satisfaction, it worked. Language models can’t solve every problem but they can be absolutely delightful when they work.
I spent another hour playing around with different techniques to try to teach and convince gpt-4
to play Connections properly. After a bit of exploration and feedback, I incorporated two new techniques:
- Asking for one category at a time, then giving the model feedback (correct, incorrect, 3/4), as sketched below
- Using the chain of thought prompting technique
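The one-category-at-a-time loop looked roughly like the sketch below, using the openai npm client. The prompt wording here is illustrative, not my actual prompt.

```ts
// Sketch: ask for one Connections category at a time, feeding back the result of each guess.
import OpenAI from "openai";

const openai = new OpenAI();

type Turn = { guess: string; feedback: string }; // feedback: "correct" | "incorrect" | "3/4"

async function nextGuess(puzzleWords: string[], history: Turn[]): Promise<string> {
  const messages = [
    {
      role: "system" as const,
      content:
        "We are playing NYT Connections. Think step by step, then propose exactly one " +
        "group of four words from the list. Only use words from the list, and never " +
        "reuse a word from a correctly guessed category.",
    },
    { role: "user" as const, content: `The 16 words: ${puzzleWords.join(", ")}` },
    // Replay prior guesses and my feedback so the model (ideally) adjusts.
    ...history.flatMap((turn) => [
      { role: "assistant" as const, content: turn.guess },
      { role: "user" as const, content: `Feedback: ${turn.feedback}. Propose the next category.` },
    ]),
  ];

  const completion = await openai.chat.completions.create({ model: "gpt-4", messages });
  return completion.choices[0].message.content ?? "";
}
```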
Despite all sorts of shimming and instructions, I still struggled to get the model to
- only suggest each word once, even after it had already gotten a category correct
- only suggest words from the 16-word list
Even giving a follow-up message with feedback that the previous guess was invalid didn’t seem to help. This was the prompt I ended up with. It wasn’t all that effective.
After some experimentation with GitHub Copilot Chat, my review is mixed. I like the ability to copy from the sidebar chat to the editor a lot; it makes the chat more useful. However, the chat is pretty chatty and thus somewhat slow to finish responding. I’ve also found the inline generation doesn’t consistently respect instructions or highlighted context, which is probably the most common way I use Cursor, so that was a little disappointing. To get similar behavior with Copilot, I sometimes needed to run a generation for the whole file, but the lack of specific highlighted context meant I had to write more detailed instructions, which was more time-consuming than highlighting and giving shorter, more contextual instructions. On the plus side, it is easy to edit the prompt and resubmit it if the completion is close but not quite right, which is helpful.
I worked through a basic SwiftUI 2 tutorial to build a simple Mac app. Swift and SwiftUI are an alternative for accomplishing natively what JavaScript and React do for the web. I could also use something like Electron to build a cross-platform app using web technology, but after reading Mihhail’s article about using macOS native technology to develop Paper, I was curious to dip my toe in and see what the state of the ecosystem looked like. He opted to use Objective-C for performance reasons; I decided to try Swift, since I’d only written a bit of Objective-C years ago. I like the ergonomics of Swift as a language well enough. I can’t say I’m a huge fan of Xcode. My hardware is almost certainly too old, but Xcode feels sluggish and unpleasant to use in a way that my web development tools are not (at least on my machine). Seeing all the things PWAs can do today, I’m unsure whether it makes sense to invest in learning SwiftUI unless I want to build native Mac apps.
I enjoyed this article by Robin about writing software for yourself. I very much appreciate the reminder of how gratifying it can be to build tools for yourself.
I read Swyx’s article Learn in Public today and it’s inspired me to open source most of my projects on GitHub.
A beautifully written and thought-provoking piece by Henrik about world models, exploring vs. exploiting in life, among other things.
I finally had a chance to use GitHub Copilot Chat in VS Code. It has a function to chat inline, like Cursor does, which has worked quite well in my initial use. I’m looking forward to using this more. Unfortunately, it’s not available for all IDEs yet, but hopefully it will be soon!
I watched lesson 3 of the FastAI course. I’ve really enjoyed Jeremy Howard’s lectures so far.
I looked into 11ty
today to see if it could be worth migrating away from hugo
, which is how (at the time of this post) I build my blog.
After a bit of research and browsing, I set up this template and copied over some posts.
Some of my older posts were using Hugo’s markup for syntax highlighting.
I converted these to standard markdown code fences (which was worthwhile regardless).
I also needed to adjust linking between posts.
In Hugo, I use ref. In 11ty, these need to be relative links, e.g. /posts/2023/future-of-personal-knowledge. Since relative links work in Hugo as well, I may just move to them regardless.
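Roughly, the change for each internal link looks like this (the link text is just illustrative):

```markdown
<!-- Hugo: the ref shortcode resolves the target at build time -->
[the future of personal knowledge]({{< ref "future-of-personal-knowledge" >}})

<!-- 11ty (and also valid in Hugo): a plain relative link -->
[the future of personal knowledge](/posts/2023/future-of-personal-knowledge/)
```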
I would love it if OpenAI added support for presetting a max_tokens
URL parameter in the Playground.
Something as simple as this:
https://platform.openai.com/playground?mode=chat&model=gpt-4-1106-preview&max_tokens=1024
My most common workflow (mistake):
- Press my hotkey to open the playground
- Type in a prompt
- Submit with cmd+enter
- Cancel the request
- Increase the “Maximum Length” to something that won’t get truncated
- Submit the request again