Friends, Romans, countrymen, lend me your ears
Hi,
this is Rinat Abdullin, writing about the development of ML products.
My friend is building a prototype ML product. It is like Siri, but for a niche language and culture.
The prototype works like this: there is a 3D-printed black box with some circuitry inside. People can talk to the device in their native language. The device thinks really hard and then speaks out an answer in the same language.
The device itself is fairly simple at this point - a Raspberry Pi that records audio and sends it to a remote service for processing. The answer is streamed back to the speakers.
The backend is where the magic happens. First, it runs a fine-tuned speech-to-text engine. Afterwards, the text is translated to English via a transformer model. The English text, usually a question, is passed to a GPT model. The response is translated back into the native language before being split into sentences and processed through a text-to-speech model.
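To make the chain more concrete, here is a rough sketch of the round-trip in Python. Treat it as an illustration, not our production code: the checkpoints are generic placeholders (Finnish models stand in for the actual niche language), the GPT call uses the pre-1.0 openai client, and synthesize_to_wav is a stand-in for whatever TTS engine is used.

```python
# A rough sketch of the round-trip; model names and helpers are placeholders.
import whisper
import openai
from transformers import pipeline

# 1. Speech-to-text on the recorded question (fine-tuned checkpoint in reality)
stt = whisper.load_model("small")
native_question = stt.transcribe("question.wav")["text"]

# 2. Translate the question into English
to_english = pipeline("translation", model="Helsinki-NLP/opus-mt-fi-en")
english_question = to_english(native_question)[0]["translation_text"]

# 3. Ask a GPT model for an answer (pre-1.0 openai client shown)
english_answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": english_question}],
)["choices"][0]["message"]["content"]

# 4. Translate the answer back and synthesise speech sentence by sentence
to_native = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fi")
native_answer = to_native(english_answer)[0]["translation_text"]
for sentence in native_answer.split(". "):
    synthesize_to_wav(sentence)  # hypothetical TTS helper
```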
If you are interested in references to the models, I'll list them at the end.
Currently it takes ~40-60 seconds to run a full round-trip, which is very bad from a user experience standpoint. I’m also extremely proud of it!
Why?
You see, this is a personal learning project for me. I’m working on it in short bursts of time in the evenings, usually after the kids are put to sleep. The rest of the team runs on the same scrappy schedule: product development, hardware, social media management, ML research and everything else. With so little time, we have to be precise about our development approach.
The last sprint ran for 2 weeks of calendar time and ended today. The goal was very specific: a hardware prototype that can be asked any question in the native language and answers within a minute.
The native language is the key here. Anybody can assemble a Siri in English: rent a GPU-enabled machine and wire together Whisper with ChatGPT and some voice generation. However, when we are talking about niche native languages with only a few million speakers left, this approach does not work. Models are unusable or non-existent. There may not even be data to train them.
I'm not even talking about missing cultural context in ChatGPT.
We achieved that sprint goal 3 days early. My part was a small one - chain existing models and deploy them to a server without a GPU.
Image generation detour
In the middle of the sprint I wasted 3 evenings on rewriting the image-generation part of the previous pipeline from Python-powered imgkit (which depends on an ancient Qt app with an embedded webview) to Pillow. This code was part of the previous inference pipeline that we wanted to keep: it renders question-response pairs into screenshots of a chat conversation - well suited for sharing on social media.
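For the curious, here is roughly what the Pillow-only approach boils down to. The layout values, colours and texts are made up for illustration; the real renderer uses proper fonts and a chat-like design, and a newer Pillow (8.2+) is assumed for rounded_rectangle.

```python
# A simplified illustration of a Pillow-based chat renderer, not the real one.
from PIL import Image, ImageDraw, ImageFont

def render_chat(question: str, answer: str, path: str = "chat.png") -> None:
    img = Image.new("RGB", (800, 360), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # the real renderer would bundle a font

    # question bubble, aligned to the left
    draw.rounded_rectangle((20, 20, 560, 120), radius=16, fill="#ececec")
    draw.text((40, 45), question, fill="black", font=font)

    # answer bubble, aligned to the right
    draw.rounded_rectangle((240, 160, 780, 320), radius=16, fill="#d6e9ff")
    draw.text((260, 185), answer, fill="black", font=font)

    img.save(path)

render_chat("What is the weather like today?", "Sunny, with a light breeze.")
```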
In retrospect, it made no sense to work on image generation, because it wasn’t part of the sprint, even though it is part of the bigger product strategy. So it was a useful exercise, but still a waste of time that contributed nothing to the product at this stage.
After getting through the image-generation part, I finally focused on the core objective - making the pipeline run, so that the hardware prototype would be a real thing. While doing that I had to skip possible (and fairly easy) optimisations that could cut the response time in half or more. It was very tempting, though :)
From a high-level product perspective it was absolutely the right thing to do: a faster but unfinished ML inference pipeline doesn’t move the product forward. A slow, hacky, but functional pipeline could be plugged into the hardware prototype, shown, recorded and shared with the community for feedback.
And we've done it this sprint! For the next 2-week sprint, my tech objective is to finally make the ML inference faster, running the entire pipeline in under 15 seconds. Faster is better, but 15 seconds would be good enough for the next step.
I think this is doable in a couple of evenings. While helping to grow a DS/ML department at a transport organisation over the last 4 years, I've learned a lot about building orchestrators for custom models. This will be just another version of that, running in CPU-only mode on a resource-constrained platform.
The whole stack, aside from the GPT model, runs on a hosted server without a GPU. We can’t afford an always-on GPU at this stage. It also isn’t necessary: all the processing happens on the CPU.
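If you are wondering how the speech-recognition step can stay CPU-friendly, whisper.cpp (linked below) is the kind of tool that makes it practical. Here is a hypothetical way to call it from Python - the binary and model paths are illustrative, not necessarily what we run:

```python
# Hypothetical call into a whisper.cpp build; paths are illustrative only.
import subprocess

result = subprocess.run(
    ["./main", "-m", "models/ggml-small.bin", "-f", "question.wav"],
    check=True, capture_output=True, text=True,
)
print(result.stdout)  # whisper.cpp prints the transcription to stdout
```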
References
Here are a few links, if you are interested:
Whisper - general-purpose speech recognition model from OpenAI: https://github.com/openai/whisper
Fine-tuning Whisper for multi-language purposes: https://huggingface.co/blog/fine-tune-whisper
Whisper.cpp - deploying Whisper models in CPU-optimised workloads: https://github.com/ggerganov/whisper.cpp
OpenAI API for GPT models: https://platform.openai.com/docs/models/overview
Have a great weekend!
With best regards,
Rinat