Lab 2 - LLaMa.cpp, Ollama & Quantization
In this lab, we will:
- Download a model from huggingface.co and quantize it for llama.cpp
- Download a model from huggingface.co and infer it in llama.cpp
- Download a model from ollama.com
- Download a custom model from huggingface.co
- Import a custom model into Ollama.

*Explore* sections focus on investigation and comparison. *Execute* sections require running commands and producing output.
Objective 1: HuggingFace & LLaMa.cpp
1.1 What Is LLaMa.cpp?
LLaMa.cpp is an open-source project created to enable efficient running of Meta's LLaMA (Large Language Model Meta AI) family of large language models on consumer-grade hardware. It was initially developed by Georgi Gerganov in early March 2023, shortly after Meta released the weights of the LLaMA models to approved researchers.
The project’s original goal was to make LLaMA models accessible on systems without powerful GPUs, including laptops, desktops, and even mobile devices. LLaMa.cpp achieves this by implementing the LLaMA inference in pure C/C++ and introducing highly efficient quantization techniques, allowing models to run with drastically reduced memory requirements. LLaMa.cpp is also the underlying project behind a number of inference wrappers and technologies, such as Llamafile, LM Studio, and Ollama, amongst many others.
Key Features
| Capability | Why it matters |
|---|---|
| Efficient local inference | Runs large language models without a powerful GPU. |
| Quantization tools (llama-quantize) | Shrinks model size (down to 1-bit) while preserving usable performance. |
| Model conversion to .GGUF | Provides a compact, fast-loading format that works with Ollama, LM Studio, and other wrappers. |
| Cross-platform support | Works on Linux, macOS, Windows, Apple Silicon, and ARM devices. |
| CLI and debugging utilities (llama-cli, gguf-dump.py) | Enables quick interactive testing and inspection of model metadata. |
| Perplexity measurement (llama-perplexity) | Quantifies how confident the model is about its predictions. |
| Active community | Powers tools such as LM Studio, Llamafile, and Ollama. |
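Before the exercises below, it can help to confirm this tooling is available. A quick sanity check, assuming the llama.cpp binaries were installed on your PATH under the names used in this lab:

```
# Verify that each llama.cpp utility used in this lab resolves on PATH
for tool in llama-cli llama-quantize llama-perplexity gguf-dump; do
  command -v "$tool" >/dev/null && echo "found:   $tool" || echo "MISSING: $tool"
done
```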
1.2 Explore: HuggingFace - Model Cards
HuggingFace is the “GitHub” for LLMs, datasets, and more. The following steps walk you through locating Meta’s LLaMA‑3.2‑1B model card and its files.
1. Open the LLaMA‑3.2‑1B page: https://huggingface.co/meta-llama/Llama-3.2-1B
2. Read the model card – note the description, license, tags (e.g., Text Generation, SafeTensors, PyTorch), and links to fine‑tunes/quantizations.
3. Navigate to "Quantizations." This tab lists community‑created quantizations, including GGUF, GPTQ, AWQ, and EXL3 versions. Common providers include Bartowski, Unsloth, and NousResearch, although these players change periodically. Additionally, note that we can often download quantized versions without having agreed to the Meta license restrictions for the original model.

   *Model card "Quantizations" convenience link.*

4. Open "Files and versions." Here you see the raw .safetensors files (the un‑quantized checkpoint). For the model to successfully run, the full set of files needs to be loaded into system memory. Note how this 1B‑parameter model is small enough to fit comfortably in a phone's memory, even raw.

Distribution Restriction: Unless you've accepted Meta's EULA for this model, you'll be unable to download the model directly from Meta. This view may or may not appear based on your own HuggingFace account.
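The same model-card metadata is also exposed programmatically. A minimal sketch using HuggingFace's public model API (the endpoint path reflects the current API and may change; jq is optional pretty-printing and must be installed):

```
# Query HuggingFace's model API for the Llama-3.2-1B card metadata:
# tags, gating status, and the list of files in the repo
curl -s https://huggingface.co/api/models/meta-llama/Llama-3.2-1B \
  | jq '{tags: .tags, gated: .gated, files: [.siblings[].rfilename]}'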
1.3 Explore: HuggingFace - Find and Download WhiteRabbitNeo
For this lab we will work with WhiteRabbitNeo‑V3‑7B, a cybersecurity‑oriented fine‑tune of Qwen2.5‑Coder‑7B. This model is less popular than LLaMA-3.2, and if we'd like to run this model in Ollama, we'll need to perform our own quantization.

Note: this model has already been downloaded to /home/student/lab2/WhiteRabbitNeo to speed up lab execution.
1. Locate & download the model

- Go to https://huggingface.co/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.
- Points of interest on this model card:
  - This model appears to be a fine-tune of Qwen2.5-Coder-7B.
  - This model is openly licensed, and does not have any requirements to download and use for our purposes.
  - This model is in SafeTensors format, which is compatible with LLaMa.cpp's quantization tools.

  *WhiteRabbitNeo model card.*

- Click Files and versions → review the .safetensors checkpoints (≈ 15 GB at FP16).

  *Model safetensors (size ≈ 15 GB).*
2 Download the Model
To prepare this model, create a working folder anywhere on your system. Once created, perform the following:
- Ensure you have git & git-lfs installed to enable successful cloning from HuggingFace. If necessary, both can be installed on Debian-based distributions via:

```
sudo apt install git git-lfs
git lfs install
```

- Clone the model:

```
git clone https://huggingface.co/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B
```
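If you'd rather not pull via git-lfs, the huggingface_hub CLI offers an equivalent download path. A sketch, assuming huggingface-cli is installed (e.g., via pip install -U huggingface_hub):

```
# Download the same repo without git, directly into a local folder
huggingface-cli download WhiteRabbitNeo/WhiteRabbitNeo-V3-7B \
  --local-dir WhiteRabbitNeo-V3-7B
```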
3 Execute: Convert the Downloaded Model
LLaMa.cpp makes it easy for us to convert models downloaded in SafeTensors format to GGUF. We can convert the model with the following official project script command:

```
convert_hf_to_gguf.py /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B/WhiteRabbitNeo-V3-7B --outfile /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```
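If the conversion succeeds, the output should be a single GGUF roughly the size of the combined .safetensors shards. A quick sanity check, adjusting paths if your clone lives elsewhere:

```
# The new GGUF should be ~15 GB, matching the FP16 checkpoint size
ls -lh /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
# Compare against the on-disk size of the cloned repo
du -sh /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B
```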
4 Execute: Review Model Metadata
When these steps have completed, you should see a new WhiteRabbitNeo-V3-7B.gguf file. We have not yet quantized the model, merely converted it to a format usable by LLaMa.cpp for the next steps. We can confirm the process was successful by using the gguf-dump utility (gguf-dump.py) that is packaged with LLaMa.cpp. Run the following command:

```
gguf-dump /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```
We should then see a text listing of all of the model's tensors, along with the precision of each. Because we have merely converted the model's format and not performed quantization, the model is still in FP16.
- This is a text view of the previous graphical view we saw in Lab 1, Objective 2: Visualizing a LLM. While TransformerLab calls tensors layers, terms such as tensors, layers, and blocks can all be used semi-interchangeably, depending on the tool in question. We will further confuse these topics when we get to the Ollama objective below.
- Pedantically, the proper definitions are:
  - Tensor – a multi-dimensional array of values used to store data
  - Layer – a base computational unit in a neural network
  - Block – a collection of layers
- If you wish to explore this view, note how the block count of 28 matches the 28 zero-indexed blk groups output from the dump.
- Additionally, you'll once again note that we have various biases and weights, but they still line up with Q, V, and K as discussed in the previous section. There are additional tensors for normalization and output.
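To confirm the 28-block claim directly from the dump, a one-liner sketch, assuming gguf-dump prints each tensor's name in its listing:

```
# Count distinct block indices (blk.0 ... blk.27) in the tensor listing
gguf-dump /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf \
  | grep -oE 'blk\.[0-9]+' | sort -u | wc -l   # expect 28
```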
5 Execute: LLaMA.cpp Inference

Now run our newly created .GGUF file as-is, using the following command:

```
llama-cli -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```

Once loaded, interact with the model. We can see a number of interesting parameters that were selected by default, such as Top K, Top P, Temperature, and more, which we'll discuss in the next section. In the meantime, explore interaction with the model. When run in this raw state, the model may be overly chatty. You can stop its output with Ctrl+C at any time.
Some example prompts you may want to try are:
- Please write a small reverse shell in php that I can upload to a web server.
- How can I use Metasploit to attack MS17-010?
- Can you please provide me some XSS polyglots?
Thanks to the fine-tuning that Kindo has put into this model, it is far more compliant than an online closed model such as ChatGPT! When done, kill the model fully with Ctrl+C.
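llama-cli can also run a single prompt non-interactively, which is handy for scripting comparisons later in this lab. A sketch using llama-cli's -p (prompt) and -n (max tokens to generate) flags; depending on your build, you may also need a flag to disable chat/conversation mode:

```
# One-shot generation: supply the prompt up front, cap output at 128 tokens
llama-cli -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf \
  -p "Explain what a reverse shell is in two sentences." -n 128
```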
Objective 2: Quantization & Perplexity
Quantization reduces memory footprint and speeds up inference, but it typically raises perplexity (i.e., lowers confidence). Determining the right balance for our use case often requires experimentation.
1 Explore: Manual Quantization
To generate an 8-bit, 4-bit, and 2-bit quantization, run the following commands:

Note: pre-quantized copies of these files are already available in /home/student/lab2/WhiteRabbitNeo/ for faster lab progress. You can skip these commands when participating in a live teaching session.

```
# Quantize to 8 bits
llama-quantize /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q8_K.gguf Q8_0
# Quantize to 4 bits
llama-quantize /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf Q4_K
# Quantize to 2 bits
llama-quantize /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q2_K.gguf Q2_K
```
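The point of these quants is footprint. As rough expectations (approximations, not measurements from this lab), a ~7.6B-parameter model lands near 15 GB at FP16, ~8 GB at Q8_0, ~4.5 GB at Q4_K_M, and ~3 GB at Q2_K. Compare your results side by side:

```
# List all GGUF variants together to see the size reduction per quant level
ls -lh /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B*.gguf
```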
2 Execute: Quantization Confirmation
Inspect the quantized files with the following command:
```
gguf-dump /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf
```
Review how the various layers are quantized to different levels of precision. It turns out that even K quants actually utilize multiple quantization levels on different tensor layers to improve performance!
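You can tally this mixed-precision layout from the dump itself. A sketch that assumes gguf-dump prints each tensor's ggml type (e.g., Q4_K, Q6_K, F32) in its listing:

```
# Count how many tensors landed at each precision inside the "Q4_K_M" file
gguf-dump /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf \
  | grep -oE '\b(I?Q[0-9]+_[A-Z0-9_]+|F16|F32)\b' | sort | uniq -c | sort -rn
```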
Full explanation for the brave...
What each Tensor Layer does

1. Token Embeddings
- Tensor 1: token_embd.weight – Responsibility: Maps each token in the vocabulary to a dense vector of size 3584.

2. Layer Normalization
- Tensor 2: blk.0.attn_norm.weight – Responsibility: Scales the normalized inputs to the self-attention mechanism in the first block.
- Tensor 6: blk.0.ffn_norm.weight – Responsibility: Scales the normalized inputs to the feed-forward network (FFN) in the first block.

3. Feed-Forward Network (FFN)
- Tensor 3: blk.0.ffn_down.weight – Responsibility: Projects the FFN's intermediate activations from dimension 18944 back down to 3584 in the down-projection.
- Tensor 4: blk.0.ffn_gate.weight – Responsibility: Projects the input from dimension 3584 to 18944, producing the gating signal for the FFN's non-linear transformation.
- Tensor 5: blk.0.ffn_up.weight – Responsibility: Projects the input from dimension 3584 to 18944 in the FFN up-projection.
4. Self-Attention Mechanism

Key Projection
- Tensor 7: blk.0.attn_k.bias – Responsibility: Adds a learnable offset to the key vectors in the self-attention mechanism.
- Tensor 8: blk.0.attn_k.weight – Responsibility: Projects the input to dimension 512 for key vectors in the self-attention mechanism.

Query Projection
- Tensor 10: blk.0.attn_q.bias – Responsibility: Adds a learnable offset to the query vectors in the self-attention mechanism.
- Tensor 11: blk.0.attn_q.weight – Responsibility: Projects the input to dimension 3584 for query vectors in the self-attention mechanism.

Value Projection
- Tensor 12: blk.0.attn_v.bias – Responsibility: Adds a learnable offset to the value vectors in the self-attention mechanism.
- Tensor 13: blk.0.attn_v.weight – Responsibility: Projects the input to dimension 512 for value vectors in the self-attention mechanism.

Attention Output Projection
- Tensor 9: blk.0.attn_output.weight – Responsibility: Projects the concatenated attention outputs back to dimension 3584 before the residual connection.
Summary by Purpose
- Token Embeddings: Maps tokens to dense vectors.
- Layer Normalization: Scales normalized inputs/outputs in attention and FFN blocks.
- Feed-Forward Network (FFN): Handles down-projection, gating, and up-projection for non-linear transformations.
- Self-Attention Mechanism: Manages key, query, value projections, biases, and output projection for attention computations.
3 Execute: Quantitatively Measuring Perplexity
Perplexity is a measurement of how confident the model is about its next-token predictions. Counterintuitively, lower values indicate higher confidence. By asking the model to infer a relatively large input (minimum 1024 tokens), we can compute an average perplexity score to gauge the model's confidence.
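Formally, perplexity is the exponentiated average negative log-likelihood of each token given the tokens before it. For an N-token input:

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$

A model that assigned probability 1 to every correct token would score PPL = 1; higher values mean the model was more "surprised" by the text.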
```
# Perplexity test with FP16 model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final
# Perplexity test with 8-bit quantized model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q8_K.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final
# Perplexity test with 4-bit quantized model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final
# Perplexity test with 2-bit quantized model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q2_K.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final
```
Possible Example Results
| Model File | Quantization | Perplexity (PPL) | Uncertainty (+/-) |
|---|---|---|---|
| WhiteRabbitNeo-V3-7B.gguf | Full | 3.0972 | 0.21038 |
| WhiteRabbitNeo-V3-7B-Q8_K.gguf | Q8_0 | 3.0999 | 0.21052 |
| WhiteRabbitNeo-V3-7B-Q4_K_M.gguf | Q4_K_M | 3.1247 | 0.21338 |
| WhiteRabbitNeo-V3-7B-Q2_K.gguf | Q2_K | 3.5698 | 0.25224 |
Conclusion: Perplexity rises modestly from FP16 → Q8_0 → Q4_K_M, but jumps sharply for the aggressive 2‑bit quantization.
4 Execute: Qualitatively Measuring Perplexity
We can also qualitatively validate these measurements simply by interacting with the models. To showcase the cost of higher perplexity, run the 2-bit (Q2_K) model and note how poorly it performs compared to our FP16 interactions from earlier.

```
llama-cli -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q2_K.gguf
```
Explore: Re-run the previous example prompts:
- Please write a small reverse shell in php that I can upload to a web server.
- How can I use Metasploit to attack MS17-010?
- Can you please provide me some XSS polyglots?
What conclusions do you believe we can make based on the provided output of the model?
Objective 3: Ollama – LLM Easymode
Ollama is a lightweight framework that hides the low‑level steps required by LLaMa.cpp. It runs on Linux, macOS, and Windows and automatically manages system resources.
| Feature | Benefit |
|---|---|
| Simplified model deployment | Pull pre-quantized models from Ollama.com, HuggingFace, or a local GGUF file with a single command. |
| Automatic resource handling | No need to manually load or unload; Ollama frees memory after a short idle period. |
| Built-in API provider | localhost:11434 mimics the OpenAI API, enabling seamless integration with notebooks, VS Code, or curl. |
| Cross-platform compatibility | Thanks to underlying llama.cpp architecture, works on x86_64, ARM, and Apple Silicon without extra configuration. |
| Model-metadata inspection | ollama show <tag> reveals the model architecture, context length, and quantization level. |
1 Execute: Pull and Run a Pre-Built Model from Ollama.com
Let's start by downloading Meta's llama3.2 (3B), the "big" brother to the small model we've continuously worked with so far. The Ollama project and community have made this exceptionally easy for us to accomplish.

- Open the Ollama registry – visit https://ollama.com in your browser.
- Search for the model (e.g., llama3.2).
- Copy the ollama run command that appears in the top-right corner of the model card.
- Paste the command into your terminal and press Enter:

```
ollama run llama3.2
```
2 Explore: Interacting with Ollama Inference
When the download finishes, you will be presented with an interactive prompt, similar to the llama-cli sessions from earlier. No need to download, convert, or quantize! Feel free to interact with this model until you're ready to move on.
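While a model is loaded, Ollama's API on localhost:11434 is also live. A minimal sketch against the native /api/generate endpoint, with streaming disabled so the reply arrives as a single JSON object (assumes the llama3.2 pull above completed):

```
# Ask the llama3.2 model a question via Ollama's HTTP API
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'
```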
3 Execute: Pull and Run a Pre-Built Model from HuggingFace.com
Similarly, we can pull a model directly from HuggingFace. As long as the source file is a .gguf of any quantization level that fits within our system memory, Ollama can fetch it directly.

- Select the quantized model from Objective 1 – visit https://huggingface.co/CodeIsAbstract/Llama-3.2-1B-Q8_0-GGUF in your browser.
- Click Use this model → choose the Ollama tab. The page displays a ready-to-run command.
- Copy the command and execute it in your terminal:

```
ollama run hf.co/CodeIsAbstract/Llama-3.2-1B-Q8_0-GGUF:Q8
```

- Explore: Interact with the model as normal.
4 Execute: Load a Custom .gguf Model
We can also import our WhiteRabbitNeo .GGUF model into Ollama, without having to upload it to HuggingFace first. In order to do so, however, we need to create a Modelfile: a plain-text configuration file (similar in spirit to a Dockerfile) that describes to Ollama where the .GGUF is located, as well as any additional defaults we'd like Ollama to apply when performing inference (an extended example follows the steps below).
- Create a simple Modelfile – this tells Ollama where the model lives:

```
echo "FROM /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf" > Modelfile
```

- Register the model with Ollama:

```
ollama create WhiteRabbitNeo -f Modelfile
```

- Run the newly registered model:

```
ollama run WhiteRabbitNeo
```
- Explore: The model is now stored locally under the tag WhiteRabbitNeo and can be invoked just as any other model.
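A Modelfile can carry more than a FROM line. A sketch of an extended version using Ollama's PARAMETER and SYSTEM directives; the sampling values and system prompt here are illustrative assumptions, not tuned settings:

```
# Write a richer Modelfile via heredoc, then re-register the model under the same tag
cat > Modelfile <<'EOF'
FROM /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf
# Default sampling parameters applied on every run of this tag
PARAMETER temperature 0.7
PARAMETER top_p 0.9
# Default system prompt baked into the tag
SYSTEM "You are a concise cybersecurity assistant."
EOF
ollama create WhiteRabbitNeo -f Modelfile
```

Registering with the same tag replaces the earlier definition, so subsequent ollama run WhiteRabbitNeo invocations pick up these defaults.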
Additional Useful Ollama Commands
| Command | Description |
|---|---|
| ollama list | Shows all models currently registered with Ollama. |
| ollama rm <tag> | Deletes the specified model (freeing disk space). |
| ollama show <tag> | Prints model metadata (architecture, context length, quantization). |
| ollama show <tag> --modelfile | Prints an existing model's Modelfile. Often useful for templating our own. |
| ollama serve | Starts the OpenAI-compatible API server (runs automatically when you first use ollama run). |
Conclusion
Ollama bridges the gap between low-level LLaMa.cpp tools and high-level usability, making it an ideal choice for rapid deployment and educational labs. By leveraging its API, model registry, and automation features, you can focus on experimentation rather than infrastructure. However, understanding LLaMa.cpp’s underlying mechanics (e.g., quantization, perplexity) remains critical for optimizing performance, or going off the beaten path.