Running Llama 3 locally

Pranav Tiwari
2 min read · Apr 23, 2024


Day 114 / 366

I’ve heard great things about Meta’s new model, Llama 3, so I decided to try it out today. I first checked how much computing power I would need to run it, and I was initially disappointed: most sources I looked at said that even the 8B model needs at least 48 GB of VRAM (that means a 48 GB graphics card!).

However, thanks to model quantization, it is possible to shrink a model and reduce the memory required to run it. These models are essentially large collections of parameters, stored as 32-bit floats. Through quantization, we can reduce that precision to 8-bit or even 4-bit integers and cut the memory footprint by a factor of four to eight, without a significant drop in the quality of the output these LLMs generate.
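To put rough numbers on that, here is a quick back-of-the-envelope sketch. It counts the weights only and ignores activations and the KV cache, so real memory usage is a bit higher:

```python
# Rough weight-memory estimate for an 8-billion-parameter model
# at different precisions. Activations and KV cache are not counted.
PARAMS = 8e9  # Llama 3 8B

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gib = PARAMS * bits / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{name}: ~{gib:.1f} GiB for the weights alone")
```

At 4-bit precision the weights come to roughly 4 GiB, which is why the quantized model can run on a laptop with no dedicated GPU.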

To run the 4-bit quantized version of the model on my MacBook Pro (which has no dedicated graphics card), I used a freely available program called Ollama. You can try it out here: https://ollama.com/library/llama3

It was pretty easy to set up, and within a few minutes I was able to talk to Llama 3.
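For anyone who wants to reproduce this: once Ollama is installed, pulling and chatting with the model is a single terminal command, ollama run llama3. If you would rather script the conversation, here is a minimal sketch using the ollama Python client (a separate pip install ollama), assuming the Ollama app is already running locally and the llama3 model has already been pulled:

```python
import ollama  # pip install ollama; talks to the local Ollama server

# Ask the locally running 4-bit Llama 3 model a question.
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Explain model quantization in two sentences."}],
)
print(response["message"]["content"])
```

Everything here runs against the local model, so no internet connection is needed once the model has been downloaded.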

It’s tough to judge how well an LLM performs from just a few prompts. So for the next few days, whenever I would normally reach for ChatGPT, I will try Llama 3 as well and compare the output. After a few days, I will share my findings in another blog post.



Written by Pranav Tiwari

I write about life, happiness, work, mental health, and anything else that’s bothering me
