---
order: 3
title: Lab 3 - LLaMa.cpp and Ollama Workflows
description: Convert a Hugging Face checkpoint to GGUF, run it in llama.cpp, and load it into Ollama.
---
# Lab 3 - LLaMa.cpp and Ollama Workflows
In this lab, we will:
- Download a model from Hugging Face
- Convert a model to GGUF for `llama.cpp`
- Run a model directly in `llama.cpp`
- Download a model from Ollama.com
- Import a custom `.gguf` model into Ollama
**Lab Flow Guide:** *Explore* sections focus on investigation and comparison. *Execute* sections require running commands and producing output.
To start this lab, use the embedded terminal below. It connects to the same lab machine in your browser and should prompt you for any local username and password that already work on that host.
If the embedded terminal is unavailable, you can still fall back to:
- SSH to the lab host (port 22)
- A regular terminal session on the lab host
## Objective 1: HuggingFace & LLaMa.cpp
### 1. What Is LLaMa.cpp?
LLaMa.cpp is an open-source project created to enable efficient running of Meta's LLaMA (Large Language Model Meta AI) family of large language models on consumer-grade hardware. It was initially developed by **Georgi Gerganov** in early March 2023, shortly after Meta released the weights of the LLaMA models to approved researchers.
The project’s original goal was to make LLaMA models accessible on systems without powerful GPUs, including laptops, desktops, and even mobile devices. **LLaMa.cpp** achieves this by implementing the LLaMA inference in pure C/C++ and introducing highly efficient quantization techniques, allowing models to run with drastically reduced memory requirements. **LLaMa.cpp** is also the underlying project behind a number of inference wrappers and technologies, such as Llamafile, LM Studio, and Ollama, amongst many others.
### Key Features
| Capability | Why it matters |
| ------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| **Efficient local inference** | Runs large language models without a powerful GPU. |
| **Quantization tools** (`llama-quantize`) | Shrinks model size (down to 1-bit) while preserving usable performance. |
| **Model conversion to .GGUF** | Provides a compact, fast-loading format that works with Ollama, LM Studio, and other wrappers. |
| **Cross-platform support** | Works on Linux, macOS, Windows, Apple Silicon, and ARM devices. |
| **CLI and debugging utilities** (`llama-cli`, `gguf-dump.py`) | Enables quick interactive testing and inspection of model metadata. |
| **Perplexity measurement** (`llama-perplexity`) | Measures how well the model predicts held-out text (lower is better), useful for comparing quantizations. |
| **Active community** | Powers tools such as LM Studio, Llamafile, and Ollama. |
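To see a few of these utilities in action, here is a minimal sketch assuming a recent llama.cpp build with its binaries on your PATH; the model and corpus file names are placeholders, not files used in this lab.

```bash
# Quantize an FP16 GGUF down to a 4-bit preset (Q4_K_M is a common balance
# of size and quality); arguments: input, output, quantization type.
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Measure perplexity of the quantized model against a reference text corpus
llama-perplexity -m model-q4_k_m.gguf -f wiki.test.raw

# Quick interactive smoke test of the result
llama-cli -m model-q4_k_m.gguf
```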
---
## 1.2 Explore: HuggingFace - Model Cards
[HuggingFace](https://huggingface.co) is the “GitHub” for LLMs, datasets, and more. The following steps walk you through locating Meta’s **LLaMA‑3.2‑1B** model card and its files.
1. **Open the LLaMA‑3.2‑1B page**
2. **Read the model card** – note the description, license, tags (e.g., _Text Generation_, _SafeTensors_, _PyTorch_), and links to fine‑tunes/quantizations.
3. **Navigate to “Quantizations.”**
This tab lists community‑created quantizations, including GGUF, GPTQ, AWQ, and EXL3 versions. Common providers include **Bartowski**, **Unsloth**, and **NousResearch**, although these players change periodically. Additionally, note that we can often download quantized versions _without_ having agreed to the Meta license restrictions for the original model.
Model card Quantizations link and available quantization options.
4. **Open “Files and versions.”**
Here you see the raw `.safetensors` files (the un‑quantized checkpoint). For the model to successfully run, the full set of files needs to be loaded into system memory. Note how this 1 B‑parameter model is small enough to fit comfortably in a phone’s memory, even raw.
**Distribution Restriction:** Unless you've accepted Meta's EULA for this model, you'll be unable to download it directly from Meta. This view may or may not appear depending on your own HuggingFace account.
## 1.3 Explore: HuggingFace - Find and Download WhiteRabbitNeo
For this lab we will work with **WhiteRabbitNeo‑V3‑7B**, a cybersecurity‑oriented fine‑tune of Qwen2.5‑Coder‑7B. This model is less popular than LLaMA-3.2, and if we'd like to run it in `llama.cpp` or Ollama, we first need to convert it into a usable GGUF artifact.
Warning: The commands below assume you are working from ~/lab3. If you prefer another path, adjust the examples consistently as you go.
### 1 Locate the Model
1. Go to https://huggingface.co/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.
2. Points of interest on this model card:
1. This model appears to be a fine-tune of **Qwen2.5-Coder-7B**
2. This model is openly licensed and does not have any restrictions on downloading or using it for our purposes.
3. This model is in **Safetensors** format, which is compatible with **LLaMa.cpp**'s quantization tools.
WhiteRabbitNeo model card.
3. Click **Files and versions** → review the `.safetensors` checkpoints (≈ 15 GB at FP16).
Model safetensors (size ≈ 15 GB).
### 2 Download the Model
To prepare this model, create a working folder on your system (the examples below use `~/lab3`). Once chosen, perform the following:
1. Ensure you have git and git-lfs installed to enable successful cloning from HuggingFace. If necessary, they can be installed on Debian-based distributions via:
```bash
sudo apt install git git-lfs
git lfs install
```
2. Clone the model:
```bash
mkdir -p ~/lab3/WhiteRabbitNeo
cd ~/lab3/WhiteRabbitNeo
git clone https://huggingface.co/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B
```
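If the `git lfs install` hook was not active when you cloned, the `.safetensors` files may come down as tiny LFS pointer stubs instead of the real multi-gigabyte weights. A quick sanity check (a minimal sketch; the expected size is approximate):

```bash
cd ~/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B

# Expect several gigabytes in total, not a few hundred bytes per file
du -sh *.safetensors

# If the files are still pointers, fetch the actual weights
git lfs pull
```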
### 3 Execute: Convert the Downloaded Model
**LLaMa.cpp** makes it easy for us to convert models downloaded in SafeTensors format to GGUF. We can convert the model with the project's official conversion script:
```bash
convert_hf_to_gguf.py ~/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B --outfile ~/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```
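If `convert_hf_to_gguf.py` is not on your PATH, you can run it straight from a llama.cpp checkout instead. The sketch below assumes the repository is cloned at `~/llama.cpp` (adjust the path to wherever it lives on your host) and that its Python requirements are installed:

```bash
# Same conversion, invoked from a local llama.cpp checkout (~/llama.cpp is an
# assumed location). --outtype f16 keeps the weights unquantized for now.
python3 ~/llama.cpp/convert_hf_to_gguf.py \
  ~/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B \
  --outfile ~/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf \
  --outtype f16
```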
### 4 Execute: Review Model Metadata
When these steps have completed, you should see a new WhiteRabbitNeo-V3-7B.gguf file. We have not yet quantized the model, merely converted it to a format usable by **LLaMa.cpp** for the next steps. We can tell if this process was successful by using the **gguf-dump** tool (installed alongside **LLaMa.cpp**'s Python tooling as the `gguf-dump` command).
Run the following command:
```bash
gguf-dump ~/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```
We should then see:
Model Metadata.
A text listing of all of the model's tensors, and the precision of each. Because we have merely converted the model's format, and not performed quantization, the model is still in **FP16**.
- This is a text view of the previous graphical view we saw in **Lab 1, Objective 2: Visualizing a LLM**. While **TransformerLab** calls tensors **layers**, terms such as **tensors**, **layers**, and **blocks** can all be used semi-interchangeably, depending on the tool in question. We will further confuse these topics when we get to the Ollama objective below.
- Pedantically, the proper definitions are:
- Tensor - A multi-dimensional array of vectors to store data
- Layer - A base computational unit in a neural network
- Block - A collection of layers
- If you wish to explore this view, note how the block count of 28 matches the 28 zero-indexed `blk` groups (blk.0 through blk.27) in the dump output; a quick way to confirm this is sketched after this list.
- Additionally, you'll once again note that we have various biases and weights, but they still line up with **Q**, **V**, and **K** as discussed in the previous section. There are additional tensors for **normalization** and **output**.
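As a sanity check on that block count, the dump output can be filtered from the shell. A minimal sketch, assuming the same `gguf-dump` command used above:

```bash
# Count the distinct blk.N tensor groups in the dump; we expect 28 for this model
gguf-dump ~/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf \
  | grep -oE 'blk\.[0-9]+' | sort -u | wc -l
```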
### 5 Execute: LLaMa.cpp Inference
Now run our newly created **.GGUF** file as-is, using the following command:
```bash
llama-cli -m ~/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```
Once loaded, interact with the model. We can see a number of interesting parameters that were selected by default, such as **Top K**, **Top P**, **Temperature**, and more, which we'll discuss in the next section. In the meantime, explore interaction with the model. When run in this raw state, the model may be overly chatty. You can stop its output with `Ctrl+C` at any time.
Inference Example.
Some example prompts you may want to try are:
- Please write a small reverse shell in php that I can upload to a web server.
- How can I use Metasploit to attack MS17-010?
- Can you please provide me some XSS polyglots?
Thanks to the fine-tuning that Kindo has put into this model, it is far more compliant than an online closed model such as ChatGPT! When done, kill the model fully with `Ctrl+C`.
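The sampling parameters mentioned above (Top K, Top P, Temperature) can also be set explicitly at launch. A minimal sketch, assuming a recent llama.cpp build; flag defaults can vary between versions:

```bash
# Rerun with explicit sampling settings:
#   --temp   lower values make output less random
#   --top-k  sample only from the K most likely tokens
#   --top-p  nucleus-sampling probability cutoff
#   -n       cap the number of generated tokens
llama-cli -m ~/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf \
  --temp 0.7 --top-k 40 --top-p 0.9 -n 512
```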
Note: Dedicated quantization comparisons now live in Lab 2. This lab stays focused on format conversion, raw llama.cpp inference, and Ollama workflows.
## Objective 2: Ollama – LLM Easymode
Ollama is a lightweight framework that hides the low‑level steps required by LLaMa.cpp. It runs on **Linux, macOS, and Windows** and automatically manages system resources.
| Feature | Benefit |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| **Simplified model deployment** | Pull pre-quantized models from Ollama.com, HuggingFace, or a local GGUF file with a single command. |
| **Automatic resource handling** | No need to manually load or unload; Ollama frees memory after a short idle period. |
| **Built-in API provider** | The server at `localhost:11434` exposes a REST API, including an OpenAI-compatible endpoint, enabling seamless integration with notebooks, VS Code, or `curl`. |
| **Cross-platform compatibility** | Thanks to underlying llama.cpp architecture, works on x86_64, ARM, and Apple Silicon without extra configuration. |
| **Model-metadata inspection** | `ollama show <model>` reveals the model architecture, context length, and quantization level. |
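The built-in API row above can be exercised directly with `curl`. A minimal sketch against Ollama's native REST endpoint; it assumes a model such as `llama3.2` has already been pulled (which we do in the next step):

```bash
# Ask the local Ollama server for a single, non-streamed completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain what a GGUF file is in one sentence.",
  "stream": false
}'
```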
### 1 Execute: Pull and Run a Pre-Built Model from Ollama.com
Let's start by downloading Meta's Llama-3.2-3B, the "big" brother to the small model we've worked with so far. The Ollama project and community have made this exceptionally easy for us to accomplish.
1. **Open the Ollama registry** – visit https://ollama.com in your browser.
2. **Search for the model**
Ollama Search.
3. **Copy the `ollama run` command** that appears in the top‑right corner of the model card.
4. **Paste the command into your terminal** and press **Enter**:
```bash
ollama run llama3.2
```
Ollama Run command.
### 2 Explore: Interacting with Ollama Inference
Once the download finishes, you will be presented with an interactive prompt, similar to the `llama-cli` session earlier. No need to manually download, convert, or quantize! Feel free to interact with this model until you're ready to move on.
Ollama Inference.
### 3 Execute: Pull and Run a Pre-Built Model from HuggingFace.com
Similarly, we can do the same by pulling a model directly from **HuggingFace**. As long as the source file is a .gguf of any quantization level that fits within our system memory, Ollama can fetch it directly.
1. **Select a pre-quantized GGUF model** – visit [CodeIsAbstract](https://huggingface.co/CodeIsAbstract/Llama-3.2-1B-Q8_0-GGUF) in your browser.
2. **Use this model** - Click Use this model → choose the Ollama tab. The page displays a ready‑to‑run command:
HuggingFace Direct Ollama Pull.
3. **Copy the command** and execute it in your terminal.
```bash
ollama run hf.co/CodeIsAbstract/Llama-3.2-1B-Q8_0-GGUF:Q8
```
4. **Explore:** Interact with the model as normal.
### 4 Execute: Load a Custom `.gguf` Model
We can also import our WhiteRabbitNeo **.GGUF** model into Ollama without having to upload it to **HuggingFace** first. To do so, however, we need to create a **Modelfile**, a plain-text file that tells **Ollama** where the **.GGUF** is located, along with any additional defaults we'd like Ollama to use when performing inference (an expanded example is sketched after these steps).
1. **Create a simple modelfile** – This will tell Ollama where the model lives.
```bash
echo "FROM $HOME/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf" > Modelfile
```
2. **Register the model with Ollama**
```bash
ollama create WhiteRabbitNeo -f Modelfile
```
3. **Run the newly registered model**
```bash
ollama run WhiteRabbitNeo
```
4. **Explore:** The model is now stored locally under the tag _WhiteRabbitNeo_ and can be invoked just as any other model.
Importing WhiteRabbitNeo V3.
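As promised above, a Modelfile can carry more than the `FROM` line. The sketch below adds a couple of inference defaults and a system prompt; the parameter values and prompt text are illustrative choices, not requirements:

```bash
# Rewrite the Modelfile with extra defaults, then re-register the model
# (re-running `ollama create` with the same name replaces the earlier entry).
cat > Modelfile <<EOF
FROM $HOME/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM """You are a concise cybersecurity assistant."""
EOF

ollama create WhiteRabbitNeo -f Modelfile
```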
---
#### Additional Useful Ollama Commands
| Command | Description |
| ------------------------------- | --------------------------------------------------------------------------------------------- |
| `ollama list` | Shows all models currently registered with Ollama. |
| `ollama rm <model>`               | Deletes the specified model (freeing disk space).                                              |
| `ollama show <model>`             | Prints model metadata (architecture, context length, quantization).                            |
| `ollama show <model> --modelfile` | Prints an existing model's Modelfile. Often useful for templating our own.                     |
| `ollama serve` | Starts the OpenAI-compatible API server (runs automatically when you first use `ollama run`). |
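Because `ollama serve` also speaks an OpenAI-compatible dialect, existing OpenAI client code can usually be pointed at it just by swapping the base URL. A minimal `curl` sketch; the model tag assumes the WhiteRabbitNeo import from earlier:

```bash
# OpenAI-style chat completion served by the local Ollama instance
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "WhiteRabbitNeo",
    "messages": [{"role": "user", "content": "Summarize what a GGUF file is."}]
  }'
```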
---
## Conclusion
Ollama bridges the gap between low-level LLaMa.cpp tools and high-level usability, making it an ideal choice for rapid deployment and educational labs. By leveraging its API, model registry, and automation features, you can focus on experimentation rather than infrastructure. Quantization tradeoffs still matter, but they now have a dedicated home in Lab 2 so this lab can stay centered on conversion and deployment workflows.
---