Running Llama 3 locally
Day 114 / 366

I’ve heard great things about Meta’s new model, Llama 3, so I decided to try it out today. I started by checking how much computing power I would need to run it, and was initially disappointed: most sources I looked at said that even the 8B model needs at least 48 GB of VRAM (that is, a 48 GB graphics card!).
However, thanks to model quantization, it is possible to shrink these models and reduce the memory required to run them. An LLM is essentially a huge collection of parameters, typically stored as 32-bit floats. Quantization reduces that precision to 8-bit or even 4-bit integers, which cuts the memory footprint dramatically, and in practice this does not cause a significant drop in the quality of the model’s output.
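To get a rough sense of why this matters, here is a quick back-of-the-envelope calculation. It is a simplified sketch that only counts the weights themselves and ignores runtime overhead such as activations and the KV cache:

```python
# Rough estimate of the memory needed just to hold the weights of an
# 8-billion-parameter model at different precisions.
# Real-world usage is higher (activations, KV cache, framework overhead).

NUM_PARAMS = 8e9  # Llama 3 8B

for label, bits in [("32-bit float", 32), ("8-bit int", 8), ("4-bit int", 4)]:
    gigabytes = NUM_PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{label}: ~{gigabytes:.0f} GB")

# Output:
# 32-bit float: ~32 GB
# 8-bit int: ~8 GB
# 4-bit int: ~4 GB
```

At 4 bits per weight, the 8B model comes down to roughly 4 to 5 GB, which is small enough to fit in the unified memory of an ordinary laptop.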
To run the 4-bit quantized version of this model on my MacBook Pro (which has no dedicated graphics card), I used a freely available program called Ollama. You can try it out here: https://ollama.com/library/llama3
It was pretty easy to set up, and within a few minutes I was able to talk to Llama 3.
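I mostly chatted through Ollama’s own command line, but if you prefer scripting, something like the following works too. This is a minimal sketch, assuming the Ollama server is running locally and the ollama Python client is installed (pip install ollama):

```python
import ollama  # pip install ollama; needs the Ollama app/server running locally

# Send a prompt to the locally running, 4-bit quantized Llama 3 model
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Explain model quantization in two sentences."}],
)

# Print the assistant's reply
print(response["message"]["content"])
```

Under the hood, the client just talks to Ollama’s local HTTP API, so the model never leaves your machine.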

It’s tough to judge how well an LLM is doing from just a few prompts. So for the next few days, whenever I reach for ChatGPT, I will run the same prompt through Llama 3 as well and compare the output. After that, I’ll share my findings in another blog post.