Lab 2 - LLaMa.cpp, Ollama & Quantization

In this lab, we will:

  • Download a model from huggingface.com and quantize it for llama.cpp
  • Download a model from huggingface.com and infer it in llama.cpp
  • Download a model from ollama.com
  • Download a custom model from huggingface.com
  • Import a custom model into Ollama.
Lab Flow Guide
Explore sections focus on investigation and comparison.
Execute sections require running commands and producing output.

Objective 1: HuggingFace & LLaMa.cpp

1. What Is LLaMa.cpp?

LLaMa.cpp is an open-source project created to enable efficient running of Meta's LLaMA (Large Language Model Meta AI) family of large language models on consumer-grade hardware. It was initially developed by Georgi Gerganov in early March 2023, shortly after Meta released the weights of the LLaMA models to approved researchers.

The project's original goal was to make LLaMA models accessible on systems without powerful GPUs, including laptops, desktops, and even mobile devices. LLaMa.cpp achieves this by implementing LLaMA inference in pure C/C++ and introducing highly efficient quantization techniques, allowing models to run with drastically reduced memory requirements. LLaMa.cpp is also the underlying project behind a number of inference wrappers and technologies, such as Llamafile, LM Studio, and Ollama, amongst many others.

Key Features

| Capability | Why it matters |
| --- | --- |
| Efficient local inference | Runs large language models without a powerful GPU. |
| Quantization tools (llama-quantize) | Shrinks model size (down to 1-bit) while preserving usable performance. |
| Model conversion to .GGUF | Provides a compact, fast-loading format that works with Ollama, LM Studio, and other wrappers. |
| Cross-platform support | Works on Linux, macOS, Windows, Apple Silicon, and ARM devices. |
| CLI and debugging utilities (llama-cli, gguf-dump.py) | Enables quick interactive testing and inspection of model metadata. |
| Perplexity measurement (llama-perplexity) | Quantifies how confident the model is about its predictions. |
| Active community | Powers tools such as LM Studio, Llamafile, and Ollama. |
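As a mental model for the quantization mentioned above, here is a toy symmetric 8-bit scheme in Python. llama.cpp's real formats (Q8_0, Q4_K, etc.) work block-wise and are considerably more sophisticated, but the principle of trading precision for size is the same:

```python
# Toy symmetric 8-bit quantization: store each FP32 weight as an int8
# plus one shared FP32 scale factor. Dequantizing recovers the weights
# approximately, with error bounded by about half the scale.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.98, 0.45, 0.03]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
errors = [abs(a - b) for a, b in zip(weights, restored)]
print(max(errors))  # small: bounded by roughly scale / 2
```

Instead of 4 bytes per weight we now store 1 byte per weight plus a single scale, at the cost of small rounding errors in every restored value.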

1.2 Explore: HuggingFace - Model Cards

HuggingFace is the “GitHub” for LLMs, datasets, and more. The following steps walk you through locating Meta's Llama-3.2-1B model card and its files.

  1. Open the Llama-3.2-1B page
    https://huggingface.co/meta-llama/Llama-3.2-1B

  2. Read the model card, noting the description, license, tags (e.g., Text Generation, SafeTensors, PyTorch), and links to fine-tunes/quantizations.

  3. Navigate to “Quantizations.”
    This tab lists community-created quantizations, including GGUF, GPTQ, AWQ, and EXL3 versions. Common providers include Bartowski, Unsloth, and NousResearch, although these players change periodically. Additionally, note that we can often download quantized versions without having agreed to the Meta license restrictions for the original model.

    Model Card Quantizations Convenience Link

Model Quantization Options
  1. Open “Files and versions.”
    Here you see the raw .safetensors files (the unquantized checkpoint). For the model to run successfully, the full set of files needs to be loaded into system memory. Note how this 1B-parameter model is small enough to fit comfortably in a phone's memory, even raw.

    Distribution Restriction

    Unless you've accepted Meta's EULA for this model, you'll be unable to download the model directly from Meta. This view may or may not appear based on your own HuggingFace account.

1.3 Explore: HuggingFace - Find and Download WhiteRabbitNeo

For this lab we will work with WhiteRabbitNeo-V3-7B, a cybersecurity-oriented fine-tune of Qwen2.5-Coder-7B. This model is less popular than Llama-3.2, and if we'd like to run this model in Ollama, we'll need to perform our own quantization.

Warning: Although the next two steps show how to find and download this model so you can replicate the process, support files are already provided in /home/student/lab2/WhiteRabbitNeo to speed up lab execution.

1. Locate & download the model

  1. Go to https://huggingface.co/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.

  2. Points of Interest on this model card:

    1. This model appears to be a fine-tune of Qwen2.5-Coder-7B.
    2. This model is openly licensed and does not have any requirements to download and use for our purposes.
    3. This model is in Safetensors format, which is compatible with LLaMa.cpp's quantization tools.
    WhiteRabbitNeo model card.
  3. Click Files and versions → review the .safetensors checkpoints (≈ 15 GB at FP16).

    Model safetensors (size ≈ 15GB).

2 Download the Model

To prepare this model, create a working folder anywhere on your system. Once created, perform the following:

  1. Ensure you have git & git-lfs installed to enable successful cloning from HuggingFace. If necessary, both can be installed on Debian-based distributions via:
sudo apt install git git-lfs
git lfs install
  2. Clone the model:
git clone https://huggingface.co/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B

3 Execute: Convert the Downloaded Model

LLaMa.cpp makes it easy to convert models downloaded in SafeTensors format to GGUF. We can convert the model with the following official project script command:

convert_hf_to_gguf.py /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B/WhiteRabbitNeo-V3-7B --outfile /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf

4 Execute: Review Model Metadata

When these steps have completed, you should see a new WhiteRabbitNeo-V3-7B.gguf file. We have not yet quantized the model, merely converted it to a format usable by LLaMa.cpp for the next steps. We can tell if this process was successful by using the included gguf-dump.py script that is packaged with LLaMa.cpp. Run the following command:

gguf-dump /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf

We should then see:

Model Metadata.

A text listing of all of the model's tensors, and the precision of each. Because we have merely converted the model's format, and not performed quantization, the model is still in FP16.
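Under the hood, gguf-dump starts by validating the file's fixed-size header before walking the tensor list. A minimal sketch of that check in Python, using a synthetic byte string in place of a real .gguf file:

```python
import struct

def read_gguf_header(data: bytes):
    """Parse the GGUF preamble: a 4-byte magic string, then a
    little-endian uint32 version, uint64 tensor count, and uint64
    metadata key/value count."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return version, n_tensors, n_kv

# Synthetic header standing in for a real file on disk; the counts
# here are made up for illustration.
fake = b"GGUF" + struct.pack("<IQQ", 3, 339, 25)
print(read_gguf_header(fake))  # (3, 339, 25)
```

Everything after this preamble is the metadata key/value section (architecture, block count, tokenizer data) followed by per-tensor descriptors, which is exactly what the dump prints.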

  • This is a text view of the previous graphical view we saw in Lab 1, Objective 2: Visualizing a LLM. While TransformerLab calls tensors layers, terms such as tensors, layers, and blocks can all be used semi-interchangeably, depending on the tool in question. We will further confuse these topics when we get to the Ollama objective below.
    • Pedantically, the proper definitions are:
      • Tensor - A multi-dimensional array of vectors to store data
      • Layer - A base computational unit in a neural network
      • Block - A collection of layers
  • If you wish to explore this view, note how the block count of 28 matches the 28 zero-indexed blk groups (blk.0 through blk.27) output from the dump.
  • Additionally, you'll once again note that we have various biases and weights, but they still line up with Q, V, and K as discussed in the previous section. There are additional tensors for normalization and output.

5 Execute: LLaMA.cpp Inference

Run our newly created .GGUF file as-is, using the following command:

llama-cli -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf

Once loaded, interact with the model. We can see a number of interesting parameters that were selected by default, such as Top K, Top P, Temperature, and more, which we'll discuss in the next section. In the meantime, explore interaction with the model. When run in this raw state, the model may be overly chatty. You can stop its output with Ctrl+C at any time.

Inference Example.
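The sampling parameters mentioned above can be illustrated on a toy next-token distribution. This is a simplified sketch (llama.cpp applies these filters to the model's real logits, alongside temperature scaling), but it shows what top-k and top-p each prune away:

```python
# Top-k keeps only the k most probable tokens; top-p (nucleus sampling)
# keeps the smallest set of tokens whose cumulative probability >= p.
# The next token is then sampled from whatever survives.

def top_k(probs, k):
    return dict(sorted(probs.items(), key=lambda kv: -kv[1])[:k])

def top_p(probs, p):
    kept, total = {}, 0.0
    for tok, pr in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[tok] = pr
        total += pr
        if total >= p:
            break
    return kept

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(top_k(probs, 2))   # {'the': 0.5, 'a': 0.3}
print(top_p(probs, 0.9)) # {'the': 0.5, 'a': 0.3, 'cat': 0.15}
```

Either filter would discard the unlikely "zebra" here; lowering k or p makes the output more deterministic, raising them makes it more varied.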

Some example prompts you may want to try are:

  • Please write a small reverse shell in php that I can upload to a web server.
  • How can I use Metasploit to attack MS17-010?
  • Can you please provide me some XSS polyglots?

Thanks to the fine-tuning that Kindo has put into this model, it is far more compliant than an online closed model such as ChatGPT! When done, kill the model fully with Ctrl+C.

Objective 2: Quantization & Perplexity

Quantization reduces memory footprints and speeds up inference, but it typically raises perplexity (i.e., lowers confidence). Determining the right balance for our use case often requires experimentation.


1 Explore: Manual Quantization

To generate an 8-bit, 4-bit, and 2-bit quantization, run the following commands:

Warning: Although these quantization steps are provided for replication, pre-quantized support files are already available in /home/student/lab2/WhiteRabbitNeo/ for faster lab progress.

You can skip these commands when participating in a live teaching session.
# Quantize to 8 bits
llama-quantize /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q8_K.gguf Q8_0

# Quantize to 4 bits
llama-quantize /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf Q4_K

# Quantize to 2 bits
llama-quantize /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q2_K.gguf Q2_K
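As a rough sanity check on the outputs, you can estimate each file's size from bits per weight. The bpw figures below are approximate community-reported values for these formats, and real GGUF files differ somewhat because of metadata and mixed-precision tensors:

```python
# Back-of-the-envelope GGUF size estimate: parameters * bits-per-weight.
def est_size_gib(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 2**30

n = 7.6e9  # ~7.6B parameters, assuming the Qwen2.5-Coder-7B base
for name, bpw in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.85), ("Q2_K", 2.63)]:
    print(f"{name}: ~{est_size_gib(n, bpw):.1f} GiB")
```

The FP16 estimate (~14 GiB) lines up with the ≈ 15 GB checkpoint size noted earlier; each halving of precision roughly halves the file.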

2 Execute: Quantization Confirmation

Inspect the quantized files with the following command:

gguf-dump /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf

Review how the various layers are quantized to different levels of precision. It turns out that K-quants actually use multiple quantization levels across different tensor layers to improve quality!

WhiteRabbitNeo Layer 0.

Full explanation for the brave...

What each Tensor Layer does

1. Token Embeddings

  • Tensor 1: token_embd.weight
    • Responsibility: Maps each token in the vocabulary to a dense vector of size 3584.

2. Layer Normalization

  • Tensor 2: blk.0.attn_norm.weight

    • Responsibility: Scales the normalized inputs to the self-attention mechanism in the first block.
  • Tensor 6: blk.0.ffn_norm.weight

    • Responsibility: Scales the normalized outputs of the feed-forward network (FFN) in the first block.

3. Feed-Forward Network (FFN)

  • Tensor 3: blk.0.ffn_down.weight

    • Responsibility: Projects the FFN's hidden activation from dimension 18944 back down to 3584 (the down-projection).
  • Tensor 4: blk.0.ffn_gate.weight

    • Responsibility: Projects the input from dimension 3584 up to 18944 to produce the gating values in the FFN's gated activation.
  • Tensor 5: blk.0.ffn_up.weight

    • Responsibility: Projects the input from dimension 3584 up to 18944; its output is multiplied element-wise by the gate before the down-projection.

4. Self-Attention Mechanism

Key Projection

  • Tensor 7: blk.0.attn_k.bias

    • Responsibility: Adds a learnable offset to the key vectors in the self-attention mechanism.
  • Tensor 8: blk.0.attn_k.weight

    • Responsibility: Projects the input to dimension 512 for key vectors in the self-attention mechanism.

Query Projection

  • Tensor 10: blk.0.attn_q.bias

    • Responsibility: Adds a learnable offset to the query vectors in the self-attention mechanism.
  • Tensor 11: blk.0.attn_q.weight

    • Responsibility: Projects the input to dimension 3584 for query vectors in the self-attention mechanism.

Value Projection

  • Tensor 12: blk.0.attn_v.bias

    • Responsibility: Adds a learnable offset to the value vectors in the self-attention mechanism.
  • Tensor 13: blk.0.attn_v.weight

    • Responsibility: Projects the input to dimension 512 for value vectors in the self-attention mechanism.

Attention Output Projection

  • Tensor 9: blk.0.attn_output.weight
    • Responsibility: Projects the concatenated attention outputs back to dimension 3584 before residual connection.

Summary by Purpose

  • Token Embeddings: Maps tokens to dense vectors.
  • Layer Normalization: Scales normalized inputs/outputs in attention and FFN blocks.
  • Feed-Forward Network (FFN): Handles down-projection, gating, and up-projection for non-linear transformations.
  • Self-Attention Mechanism: Manages key, query, value projections, biases, and output projection for attention computations.

3 Execute: Quantitatively Measuring Perplexity

Perplexity is a measurement of how confident the model is about its next-token predictions. Counterintuitively, lower values indicate higher confidence. By asking the model to infer a relatively large input (minimum 1024 tokens), we can generate an average perplexity score to gauge the model's confidence.
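Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to the actual next tokens. A small sketch with made-up per-token probabilities:

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/N) * sum(log p_i)). Lower PPL means the model
    assigned higher probability to the tokens that actually occurred."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = [0.9, 0.8, 0.95, 0.7]   # model usually right
uncertain = [0.2, 0.1, 0.3, 0.25]   # model often surprised
print(perplexity(confident))  # ≈ 1.20
print(perplexity(uncertain))  # ≈ 5.08
```

A perfectly confident model (probability 1.0 on every true token) would score exactly 1.0; llama-perplexity reports this same average over the evaluation text.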

# Perplexity test with FP16 model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final

# Perplexity test with  8-bit quantized model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q8_K.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final

# Perplexity test with  4-bit quantized model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final

# Perplexity test with 2-bit quantized model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q2_K.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final

Possible Example Results

| Model File | Quantization | Perplexity (PPL) | Uncertainty (+/-) |
| --- | --- | --- | --- |
| WhiteRabbitNeo-V3-7B.gguf | Full | 3.0972 | 0.21038 |
| WhiteRabbitNeo-V3-7B-Q8_K.gguf | Q8_K | 3.0999 | 0.21052 |
| WhiteRabbitNeo-V3-7B-Q4_K_M.gguf | Q4_K_M | 3.1247 | 0.21338 |
| WhiteRabbitNeo-V3-7B-Q2_K.gguf | Q2_K | 3.5698 | 0.25224 |

Conclusion: Perplexity rises modestly from FP16 → Q8_K → Q4_K_M, but jumps sharply for the aggressive 2-bit quantization.

4 Execute: Qualitatively Measuring Perplexity

We can also qualitatively validate these measurements by interacting with the models directly. To showcase the costs of quantization more clearly, infer with the 2-bit (Q2_K) model and observe how poorly it performs in comparison to our FP16 interactions from earlier.

llama-cli -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q2_K.gguf 

Explore: Re-run the previous example prompts:

  • Please write a small reverse shell in php that I can upload to a web server.
  • How can I use Metasploit to attack MS17-010?
  • Can you please provide me some XSS polyglots?
Q2_K Inference
FP16 Inference

What conclusions do you believe we can make based on the provided output of the model?


Objective 3: Ollama LLM Easymode

Ollama is a lightweight framework that hides the low-level steps required by LLaMa.cpp. It runs on Linux, macOS, and Windows and automatically manages system resources.

| Feature | Benefit |
| --- | --- |
| Simplified model deployment | Pull pre-quantized models from Ollama.com, HuggingFace, or a local GGUF file with a single command. |
| Automatic resource handling | No need to manually load or unload; Ollama frees memory after a short idle period. |
| Built-in API provider | localhost:11434 mimics the OpenAI API, enabling seamless integration with notebooks, VS Code, or curl. |
| Cross-platform compatibility | Thanks to the underlying llama.cpp architecture, works on x86_64, ARM, and Apple Silicon without extra configuration. |
| Model-metadata inspection | ollama show <tag> reveals the model architecture, context length, and quantization level. |
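Because Ollama exposes an OpenAI-compatible endpoint on localhost:11434, any HTTP client can drive it. A sketch of the request body (the payload shape follows the OpenAI chat format; llama3.2 is a stand-in for whichever model tag you have pulled):

```python
import json

# Build an OpenAI-style chat request for Ollama's compatibility endpoint.
# With a live server (`ollama serve`), POST this body to
# http://localhost:11434/v1/chat/completions with
# Content-Type: application/json.
def build_chat_request(model: str, prompt: str) -> str:
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    })

print(build_chat_request("llama3.2", "Why is the sky blue?"))
```

Sending that body with curl or any OpenAI client library should return a standard chat-completion response, assuming the named model has been pulled.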

1 Execute: Pull and Run a Pre-Built Model from Ollama.com

Let's start by downloading Meta's Llama 3.2 3B, the "big" brother to the small model we've continuously worked with so far. The Ollama project and community have made this exceptionally easy for us to accomplish.

  1. Open the Ollama registry: visit https://ollama.com in your browser.
  2. Search for the model
Ollama Search.

  3. Copy the ollama run command that appears in the top-right corner of the model card.
  4. Paste the command into your terminal and press Enter:
> ollama run llama3.2
Ollama Run command.

2 Explore: Interacting with Ollama Inference

When finished, you will be presented with a prompt, similar to the llama-cli commands. No need to download, convert, or quantize! Feel free to interact with this model until you're ready to move on.

Ollama Inference.

3 Execute: Pull and Run a Pre-Built Model from HuggingFace.com

Similarly, we can do the same by pulling a model directly from HuggingFace. As long as the source file is a .gguf of any quantization level that fits within our system memory, Ollama can fetch it directly.

  1. Select the quantized model from Objective 1: visit the CodeIsAbstract repository in your browser.
  2. Use this model: click Use this model → choose the Ollama tab. The page displays a ready-to-run command:
HuggingFace Direct Ollama Pull.

  3. Copy the command and execute it in your terminal.
ollama run hf.co/CodeIsAbstract/Llama-3.2-1B-Q8_0-GGUF:Q8
  4. Explore: Interact with the model as normal.

4 Execute: Load a Custom .gguf Model

We can also import our WhiteRabbitNeo .GGUF model into Ollama, without having to upload it to HuggingFace first. In order to do so, however, we need to create a Modelfile: a plain-text file (similar in spirit to a Dockerfile) that tells Ollama where the .GGUF is located, as well as any additional defaults we'd like Ollama to use when performing inference.

  1. Create a simple Modelfile. This tells Ollama where the model lives.
echo "FROM /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf" > Modelfile
  2. Register the model with Ollama:
ollama create WhiteRabbitNeo -f Modelfile
  3. Run the newly registered model:
ollama run WhiteRabbitNeo
  4. Explore: The model is now stored locally under the tag WhiteRabbitNeo and can be invoked just as any other model.
Importing WhiteRabbitNeo V3.
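Beyond the single FROM line used above, a Modelfile can also set inference defaults and a system prompt. A hypothetical expanded example (the parameter values here are illustrative, not tuned):

```
FROM /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a concise cybersecurity assistant."
```

Re-running ollama create with an updated Modelfile registers the new defaults; ollama show WhiteRabbitNeo --modelfile confirms what was stored.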


Additional Useful Ollama Commands

| Command | Description |
| --- | --- |
| ollama list | Shows all models currently registered with Ollama. |
| ollama rm <tag> | Deletes the specified model (freeing disk space). |
| ollama show <tag> | Prints model metadata (architecture, context length, quantization). |
| ollama show <tag> --modelfile | Prints an existing model's modelfile. Often useful for templating our own. |
| ollama serve | Starts the OpenAI-compatible API server (runs automatically when you first use ollama run). |

Conclusion

Ollama bridges the gap between low-level LLaMa.cpp tools and high-level usability, making it an ideal choice for rapid deployment and educational labs. By leveraging its API, model registry, and automation features, you can focus on experimentation rather than infrastructure. However, understanding LLaMa.cpp's underlying mechanics (e.g., quantization, perplexity) remains critical for optimizing performance, or going off the beaten path.