Heard the news? LLaMA is open source now. In a way...
Hi,
this is Rinat Abdullin, writing to you with a newsletter on ML, engineering and product development.
So, have you heard the news? Meta has trained a new GPT-3-class model called LLaMA, using publicly available data. They published the paper and invited researchers to apply for access to the model. Eventually somebody just downloaded the model and shared it via torrent. Setting the question of copyright aside (are weights even copyrightable?), the model is open source now. It is all over the internet.
This feels like a Stable Diffusion moment. Once that image model was freely available, a wave of innovation followed. We’ll see how it turns out for LLaMA.
You can go and download the model and try running it on your local GPU. There are a few gotchas, though.
First, this is a plain GPT-3-class language model, not an instruction-tuned chat assistant. Don’t expect results that are as good as the answers from ChatGPT.
Second, even to run this model you need some decent hardware. LLaMA comes in several sizes, and one of the smaller ones - the model with 13 billion parameters at 8-bit quantization - can be squeezed into roughly 20 GB of VRAM. So if you have a workstation with an Nvidia RTX 3090 (24 GB VRAM) and 64 GB of system RAM, you can try running LLaMA.
However, all the interesting results start with the larger flavours of LLaMA. To run those you’d need a couple of A100 cards, or you’d have to link several consumer GPUs together and lower the model precision.
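To make the hardware point more concrete, here is a minimal sketch of loading the 13B model in 8-bit, assuming the weights have already been converted into the Hugging Face format and that the transformers, accelerate and bitsandbytes packages are installed. The local path is just a placeholder, not a real repository.

```python
# Minimal sketch: load a 13B LLaMA checkpoint in 8-bit so it fits in ~20 GB of VRAM.
# Assumes the weights were converted to the Hugging Face format beforehand.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./llama-13b-hf"  # placeholder path to the converted weights

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # 8-bit quantization via bitsandbytes
    device_map="auto",   # spread layers across available GPUs (and CPU, if needed)
)

prompt = "The three largest rivers in Europe are"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```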
The community is quite interested in this model, so I’d give it a few weeks before we see people managing to run even the larger LLaMA models on commodity hardware.
GPT-3.5 Turbo - 10x cheaper
Meanwhile, OpenAI is not standing still. They have recently announced a new ChatGPT model to replace davinci - GPT-3.5 Turbo. Not only does it feel better, it is also 10 times cheaper to use.
Migration between the models is pretty straightforward - I did it in about ten minutes for the “voice assistant” inference pipeline one evening. The best part - you now get more control over the model by giving it instructions at the “system” level. For example: “be laconic and assume that the speaker lives in a specific region and speaks a specific language”.
Given that, when a user asks questions like “How are rivers called nearby?” or “How is the capital called?”, GPT-3.5 Turbo will not quote a generic Wikipedia article about the capitals of the world, but will give a concrete answer.
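For reference, this is roughly what the call looks like with the openai Python package (the pre-1.0 chat interface). The system prompt below is only illustrative, not the actual assistant prompt - fill in the real region and language.

```python
# Sketch of the chat-style API: the "system" message pins down tone and context,
# the user question follows. Prompt text is illustrative only.
import openai

openai.api_key = "sk-..."  # your API key

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": "Be laconic. Assume the speaker lives in <region> "
                       "and speaks <language>.",  # replace with the actual region/language
        },
        {"role": "user", "content": "How are rivers called nearby?"},
    ],
)

print(response["choices"][0]["message"]["content"])
```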
ML Product update - running inference in 2-10 seconds
By the way, in a previous mail I mentioned the goal of running the ML inference pipeline for a native-language assistant in under 15 seconds. The sprint was planned for 2 weeks, but we have already hit that goal! The pipeline runs in under 10 seconds when we need to hit ChatGPT. And when a clear user intent is detected, the answer can come much faster - in 2 seconds or less - because we skip translation and the ChatGPT call entirely.
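The function names below are hypothetical stand-ins rather than the actual pipeline code, but they show the idea: answer from a local handler when the intent classifier is confident, and only fall back to translation plus ChatGPT otherwise.

```python
# Rough sketch of the fast path: answer directly when the intent classifier is
# confident, fall back to translation + ChatGPT otherwise. Every function here
# is a stub standing in for a real pipeline stage, so the control flow is runnable.

def detect_intent(text: str) -> tuple[str, float]:
    # stand-in for a small local intent model
    if "capital" in text.lower():
        return "ask_capital", 0.95
    return "unknown", 0.30

def local_answer(intent: str, text: str) -> str:
    return f"[fast local answer for intent {intent!r}]"

def translate(text: str) -> str:
    return text  # stand-in for the translation step

def ask_chatgpt(text: str) -> str:
    return f"[ChatGPT answer for: {text}]"  # stand-in for the API round-trip

def handle_utterance(text: str) -> str:
    intent, confidence = detect_intent(text)
    if confidence > 0.9:
        return local_answer(intent, text)   # ~2 s path: no translation, no LLM call
    return ask_chatgpt(translate(text))     # slower path: full ChatGPT round-trip

print(handle_utterance("How is the capital called?"))
print(handle_utterance("Tell me something about this region."))
```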
The next interesting step for me is to take this inference pipeline and: (1) make it feel faster by streaming answers and providing audio cues, (2) try deploying it as a GPU-powered function-as-a-service.
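The streaming part is already supported by the OpenAI API: with stream=True the tokens arrive as they are generated, so the assistant can start speaking (or showing text) after the first few words instead of waiting for the full completion. A small sketch with the pre-1.0 openai package, using an illustrative question:

```python
# Sketch of streaming a ChatGPT answer token by token (pre-1.0 openai package),
# so the assistant can start responding before the full completion is ready.
import openai

openai.api_key = "sk-..."  # your API key

stream = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Name three large rivers in Europe."}],
    stream=True,  # chunks arrive as tokens are generated
)

for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()
```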
We'll see how this works out. I still have a few spare evenings in this sprint to make this happen :)
Wishing you a great week!
Best regards,
Rinat