One challenge I’ve continued to have is figuring out how to use the models on Hugging Face. There are usually Python snippets to “run” the models, but they often seem to require GPUs and I always seem to run into some sort of issue installing the various Python dependencies. Today, I learned how to run model inference on a Mac with an M-series chip using llama-cpp and a gguf file built from the safetensors files on Hugging Face.

Download and convert the model#

For this example, we’ll be using the Phi-3-mini-4k-instruct model by Microsoft from Hugging Face. You will also need git-lfs to pull down the model weights, so install that first.
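
I installed git-lfs with Homebrew (assuming you have brew; any install method works, as long as git lfs install has been run once afterwards):

brew install git-lfs
git lfs install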

Llama-cpp generally needs a gguf file to run, so first we will build one from the safetensors files in the Hugging Face repo. We’ll clone the model into a models folder, since the later commands assume it sits next to the llama.cpp checkout. The clone will take a while, so do the next step in parallel.

mkdir models
cd models
git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
cd ..
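
If the clone finishes suspiciously fast, the *.safetensors files may be small LFS pointer stubs instead of the multi-gigabyte weights. In that case, fetch the real files from inside the cloned model folder:

git lfs pull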

Separately, pull down the code for llama-cpp and build it

git clone [email protected]:ggerganov/llama.cpp.git
cd llama.cpp
make
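
If the build succeeded, the llama-cli and llama-server binaries used below should be sitting in the repo root (newer checkouts build with CMake instead of make and put them under build/bin, so adjust the paths below if needed):

ls llama-cli llama-server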

Create a virtualenv and install the dependencies for llama-cpp’s conversion scripts.

python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

Now, let’s convert the safetensors files into a gguf. From the llama.cpp folder, run

python convert_hf_to_gguf.py ../models/Phi-3-mini-4k-instruct --outfile ../models/Phi-3-mini-4k-instruct-f16.gguf --outtype f16
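
As a rough sanity check, the f16 gguf for this model should come out to around 7.6 GB:

ls -lh ../models/Phi-3-mini-4k-instruct-f16.gguf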

Run inference#

We’ve now generated the gguf file. Let’s run it.

Again, from the llama.cpp folder, run

./llama-cli -m ../models/Phi-3-mini-4k-instruct-f16.gguf -cnv -ngl 80 -p "You are a helpful assistant"

This runs llama-cpp on the command line in conversation mode (-cnv), offloading the whole model to the GPU (-ngl 80 offloads up to 80 layers, which is more than this model has, so everything runs on the GPU), and with a system prompt (-p). When everything is loaded, you’ll be given a prompt to type into:

You are a helpful assistant

> Hi, who am I speaking with?
Hello! I'm Phi, your friendly AI assistant designed to help you with information and tasks. How can I assist you today?

> Who designed you?
I was created by Microsoft's team of engineers and researchers. They continuously work to improve my abilities and ensure I provide the best assistance possible.

Llama-cpp also ships with a server that hosts a local UI you can use to prompt the model.

./llama-server -m ../models/Phi-3-mini-4k-instruct-f16.gguf --port 8080
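
Besides the UI at http://localhost:8080, llama-server exposes an HTTP API, and recent builds include an OpenAI-compatible chat endpoint. Assuming your build has it, you can also prompt the model with curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hi, who am I speaking with?"}]}'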

Run the model from ollama#

Ollama is a user-friendly way to get up and running with local language models quickly. Using a Modelfile, we can expose the Phi-3 model we just downloaded and converted through ollama.

In the models folder, create a file called phi-3-mini-4k-instruct.modelfile containing

FROM ./Phi-3-mini-4k-instruct-f16.gguf
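
If you want, the Modelfile can also set sampling parameters and a default system prompt; something like this (optional, and the values here are just illustrative):

FROM ./Phi-3-mini-4k-instruct-f16.gguf
PARAMETER temperature 0.7
SYSTEM """You are a helpful assistant."""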

Then run:

ollama create phi-3-mini-4k-instruct -f phi-3-mini-4k-instruct.modelfile

which will output something like

transferring model data
using existing layer sha256:fe4c64522173db6b3dae84ae667f6a8ad9e6cbc767f37ef165addbed991b129d
using autodetected template zephyr
creating new layer sha256:86107d0be467af35235b1becfb7a10e9cd30cf88332325c66670d70c90ee82b1
writing manifest
success

You should see the new model show up in ollama

ollama list
NAME                            ID              SIZE    MODIFIED
phi-3-mini-4k-instruct:latest   945b0b26c95a    7.6 GB  35 seconds ago

Finally, run the model through ollama

โฏ ollama run phi-3-mini-4k-instruct
>>> Hi, who am I speaking with?
Hello! You're communicating with an AI assistant. How can I assist you today?

>>> Who created you?
I was crafted by Microsoft through a team of engineers and research scientists dedicated to
improving artificial intelligence technologies for various applications, including conversational
assistance like myself!
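
Ollama also runs a local HTTP API (on port 11434 by default), so the same model can be called programmatically. A minimal example, assuming the defaults:

curl http://localhost:11434/api/generate -d '{
  "model": "phi-3-mini-4k-instruct",
  "prompt": "Hi, who am I speaking with?",
  "stream": false
}'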

To clean things up (if you so choose), run

ollama rm phi-3-mini-4k-instruct:latest

Huge thanks to Vinny for his help getting this working!