Running Llama 3 locally
Day 114 / 366

I’ve heard great things about Meta’s new model, Llama 3, so I decided to try it out today. I started by checking how much computing power I would need to run it, and was initially disappointed: most sources I looked at said that even the 8B model needs at least 48 GB of VRAM (that is, a 48 GB graphics card!).
However, thanks to model quantization, it is possible to shrink these models and reduce the memory required to run them. An LLM is essentially a huge collection of parameters, typically stored as 32-bit floats. Quantization reduces that precision to 8-bit or even 4-bit integers, which cuts the memory footprint dramatically, and in practice this does not cause a significant drop in the quality of the model’s output.
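To get a rough sense of why this matters, here is a quick back-of-the-envelope calculation. It is a simplified sketch that only counts the weights themselves and ignores runtime overhead such as activations and the KV cache:

```python
# Rough estimate of the memory needed just to hold the weights of an
# 8-billion-parameter model at different precisions.
# Real-world usage is higher (activations, KV cache, framework overhead).

NUM_PARAMS = 8e9  # Llama 3 8B

for label, bits in [("32-bit float", 32), ("8-bit int", 8), ("4-bit int", 4)]:
    gigabytes = NUM_PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{label}: ~{gigabytes:.0f} GB")

# Output:
# 32-bit float: ~32 GB
# 8-bit int: ~8 GB
# 4-bit int: ~4 GB
```

At 4 bits per weight, the 8B model comes down to roughly 4 to 5 GB, which is small enough to fit in the unified memory of an ordinary laptop.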
To run the 4-bit quantized version of this model on my MacBook Pro (which has no dedicated graphics card), I used a freely available program called Ollama. You can try it out here: https://ollama.com/library/llama3
It was pretty easy to set up, and within a few minutes I was able to talk to Llama 3.
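I mostly chatted through Ollama’s own command line, but if you prefer scripting, something like the following works too. This is a minimal sketch, assuming the Ollama server is running locally and the ollama Python client is installed (pip install ollama):

```python
import ollama  # pip install ollama; needs the Ollama app/server running locally

# Send a prompt to the locally running, 4-bit quantized Llama 3 model
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Explain model quantization in two sentences."}],
)

# Print the assistant's reply
print(response["message"]["content"])
```

Under the hood, the client just talks to Ollama’s local HTTP API, so the model never leaves your machine.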

It’s tough to judge how well an LLM is doing from just a few prompts. So for the next few days, whenever I reach for ChatGPT, I will run the same prompt through Llama 3 as well and compare the output. After that, I’ll share my findings in another blog post.