<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
# Lab 2 - LLaMa.cpp, Ollama & Quantization

In this lab, we will:

* Download a model from huggingface.co and quantize it for llama.cpp
* Download a model from huggingface.co and infer it in llama.cpp
* Download a model from ollama.com
* Download a custom model from huggingface.co
* Import a custom model into Ollama
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on investigation and comparison.<br />
<strong>Execute</strong> sections require running commands and producing output.
</div>

To start this lab, you'll need CLI access:

* SSH - `<IP>:22`
* All necessary artifacts are in the `lab2` folder
## Objective 1: HuggingFace & LLaMa.cpp

### 1. What Is LLaMa.cpp?

LLaMa.cpp is an open-source project created to enable efficient inference of Meta's LLaMA (Large Language Model Meta AI) family of large language models on consumer-grade hardware. It was initially developed by **Georgi Gerganov** in early March 2023, shortly after Meta released the weights of the LLaMA models to approved researchers.

The project's original goal was to make LLaMA models accessible on systems without powerful GPUs, including laptops, desktops, and even mobile devices. **LLaMa.cpp** achieves this by implementing LLaMA inference in pure C/C++ and introducing highly efficient quantization techniques, allowing models to run with drastically reduced memory requirements. **LLaMa.cpp** is also the underlying engine behind a number of inference wrappers and technologies, such as Llamafile, LM Studio, and Ollama, amongst many others.

### Key Features

| Capability | Why it matters |
|------------|----------------|
| **Efficient local inference** | Runs large language models without a powerful GPU. |
| **Quantization tools** (`llama-quantize`) | Shrinks model size (down to 1-bit) while preserving usable performance. |
| **Model conversion to .GGUF** | Provides a compact, fast-loading format that works with Ollama, LM Studio, and other wrappers. |
| **Cross-platform support** | Works on Linux, macOS, Windows, Apple Silicon, and ARM devices. |
| **CLI and debugging utilities** (`llama-cli`, `gguf-dump.py`) | Enables quick interactive testing and inspection of model metadata. |
| **Perplexity measurement** (`llama-perplexity`) | Quantifies how confident the model is about its predictions. |
| **Active community** | Powers tools such as LM Studio, Llamafile, and Ollama. |

---
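The **Quantization tools** row above is easier to appreciate with numbers. The back-of-the-envelope sketch below estimates weight memory for a 7B-parameter model at several bit widths. The bits-per-weight figures are rough community approximations (real GGUF files mix quantization types per tensor and add metadata), so treat the results as ballpark figures, not exact file sizes:

```python
# Rough weight-memory estimate for a 7B-parameter model at several
# quantization levels. Bits-per-weight values are approximate: real GGUF
# files mix quant types per tensor and include metadata overhead.

def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / (1024 ** 3)

for label, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"{label:7s} ~{weight_memory_gib(7e9, bits):.1f} GiB")
```

The FP16 estimate lines up with the raw checkpoint sizes you'll see on HuggingFace later in this lab, while a 4-bit quantization of the same model fits comfortably under 5 GiB.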
## 1.2 Explore: HuggingFace - Model Cards

[HuggingFace](https://huggingface.co) is the “GitHub” for LLMs, datasets, and more. The following steps walk you through locating Meta’s **LLaMA‑3.2‑1B** model card and its files.

1. **Open the LLaMA‑3.2‑1B page**

   <https://huggingface.co/meta-llama/Llama-3.2-1B>
   <br>
2. **Read the model card** – note the description, license, tags (e.g., *Text Generation*, *SafeTensors*, *PyTorch*), and links to fine‑tunes/quantizations.
   <br>
3. **Navigate to “Quantizations.”**

   This tab lists community‑created quantizations, including GGUF, GPTQ, AWQ, and EXL3 versions. Common providers include **Bartowski**, **Unsloth**, and **NousResearch**, although these players change periodically. Additionally, note that we can often download quantized versions *without* having agreed to the Meta license restrictions for the original model.

<figure style="text-align:center;">
<a href="https://i.imgur.com/Po0Ll3o.png" target="_blank">
<img src="https://i.imgur.com/Po0Ll3o.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Model Card Quantizations Convenience Link</figcaption>
</figure>
<br>

<figure style="text-align:center;">
<a href="https://i.imgur.com/NM1rbXV.png" target="_blank">
<img src="https://i.imgur.com/NM1rbXV.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Model Quantization Options</figcaption>
</figure>

4. **Open “Files and versions.”**

   Here you see the raw `.safetensors` files (the un‑quantized checkpoint). For the model to run successfully, the full set of files needs to be loaded into system memory. Note how this 1B‑parameter model is small enough to fit comfortably in a phone’s memory, even unquantized.

<figure style="text-align:center;">
<a href="https://i.imgur.com/6I9zkeu.png" target="_blank">
<img src="https://i.imgur.com/6I9zkeu.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Distribution Restriction</figcaption>
</figure>

Unless you've accepted Meta's EULA for this model, you'll be unable to download the model directly from Meta. This view may or may not appear based on your own HuggingFace account.
## 1.3 Explore: HuggingFace - Find and Download WhiteRabbitNeo

For this lab we will work with **WhiteRabbitNeo‑V3‑7B**, a cybersecurity‑oriented fine‑tune of Qwen2.5‑Coder‑7B. This model is less popular than LLaMA-3.2, and if we'd like to run this model in Ollama, we'll need to perform our own quantization.

<div class="lab-callout lab-callout--warning">
<strong>Warning:</strong> Although the next two steps show how to find and download this model so you can replicate the process, support files are already provided in <code>/home/student/lab2/WhiteRabbitNeo</code> to speed up lab execution.
</div>

### 1. Locate & download the model

1. Go to <https://huggingface.co/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B>.
2. Points of interest on this model card:
   1. This model appears to be a fine-tune of **Qwen2.5-Coder-7B**.
   2. This model is openly licensed, and does not have any requirements to download and use for our purposes.
   3. This model is in **Safetensors** format, which is compatible with **LLaMa.cpp**'s quantization tools.

<figure style="text-align:center;">
<a href="https://i.imgur.com/9GrHRuh.png" target="_blank">
<img src="https://i.imgur.com/9GrHRuh.png" width="800" style="border:5px solid black;">
</a>
<figcaption>WhiteRabbitNeo model card.</figcaption>
</figure>

3. Click **Files and versions** → review the `.safetensors` checkpoints (≈ 15 GB @ **FP16**).

<figure style="text-align:center;">
<a href="https://i.imgur.com/Emx97nL.png" target="_blank">
<img src="https://i.imgur.com/Emx97nL.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Model safetensors (size ≈ 15 GB).</figcaption>
</figure>
### 2 Download the Model

To prepare this model, create a working folder anywhere you like on your system. Once chosen, perform the following:

1. Ensure you have git & git-lfs installed to enable successful cloning from HuggingFace. If necessary, both can be installed on Debian-based distributions via:

```bash
sudo apt install git git-lfs
git lfs install
```

2. Clone the model:

```bash
git clone https://huggingface.co/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B
```

### 3 Execute: Convert the Downloaded Model

**LLaMa.cpp** makes it easy for us to convert models downloaded in SafeTensors format to GGUF. We can convert the model with the following official project script command:

```bash
convert_hf_to_gguf.py /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B/WhiteRabbitNeo-V3-7B --outfile /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```
### 4 Execute: Review Model Metadata

When these steps have completed, you should see a new WhiteRabbitNeo-V3-7B.gguf file. We have not yet quantized the model, merely converted it to a format usable by **LLaMa.cpp** for the next steps. We can tell if this process was successful by using the **gguf-dump.py** script (invoked as `gguf-dump`) that is packaged with **LLaMa.cpp**.

Run the following command:

```bash
gguf-dump /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```

We should then see:

<figure style="text-align: center;">
<a href="https://i.imgur.com/JiX2fJM.png" target="_blank">
<img
src="https://i.imgur.com/JiX2fJM.png"
width="800"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Model Metadata.
</figcaption>
</figure>
<br>

You should see a text listing of all of the model's tensors, and the precision of each. Because we have merely converted the model's format, and not performed quantization, the model is still in **FP16**.

* This is a text view of the previous graphical view we saw in **Lab 1, Objective 2: Visualizing a LLM**. While **TransformerLab** calls tensors **layers**, terms such as **tensors**, **layers**, and **blocks** can all be used semi-interchangeably, depending on the tool in question. We will further confuse these topics when we get to the Ollama objective below.
* Pedantically, the proper definitions are:
  * Tensor - A multi-dimensional array of values used to store data
  * Layer - A base computational unit in a neural network
  * Block - A collection of layers
* If you wish to explore this view, note how the block count of 28 matches the 28 zero-indexed `blk` groups (blk.0 through blk.27) in the dump output.
* Additionally, you'll once again note that we have various biases and weights, but they still line up with the **Q**, **K**, and **V** projections discussed in the previous section. There are additional tensors for **normalization** and **output**.
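As a peek under the hood, the sketch below parses just the fixed-size header that `gguf-dump` reads first. The layout follows the GGUF specification (little-endian: 4-byte magic `GGUF`, a uint32 version, then uint64 tensor and metadata key-value counts); the function itself is our own illustrative code, not part of llama.cpp:

```python
# Minimal GGUF header check: confirms a file really is GGUF and reports
# how many tensors and metadata key-value pairs it declares.
import struct

def read_gguf_header(path: str) -> dict:
    with open(path, "rb") as f:
        # <4sIQQ = 4-byte magic, uint32 version, uint64 tensor count,
        # uint64 metadata KV count (24 bytes total, little-endian)
        magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", f.read(24))
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

Pointing this at the converted `WhiteRabbitNeo-V3-7B.gguf` should report the same tensor count that `gguf-dump` lists.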
### 5 Execute: LLaMA.cpp Inference

Now run our newly created **.GGUF** file as-is, using the following command:

```bash
llama-cli -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```

Once loaded, interact with the model. We can see a number of interesting parameters that were selected by default, such as **Top K**, **Top P**, **Temperature**, and more, which we'll discuss in the next section. In the meantime, explore interaction with the model. When run in this raw state, the model may be overly chatty. You can stop its output with `Ctrl+C` at any time.

<figure style="text-align: center;">
<a href="https://i.imgur.com/H3ISWS8.png" target="_blank">
<img
src="https://i.imgur.com/H3ISWS8.png"
width="800"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Inference Example.
</figcaption>
</figure>

Some example prompts you may want to try are:

* Please write a small reverse shell in php that I can upload to a web server.
* How can I use Metasploit to attack MS17-010?
* Can you please provide me some XSS polyglots?

Thanks to the fine-tuning that Kindo has put into this model, it is far more compliant than an online closed model such as ChatGPT! When done, kill the model fully with `Ctrl+C`.
## Objective 2: Quantization & Perplexity

Quantization reduces memory footprints and speeds up inference, but it typically raises perplexity (i.e., lowers confidence). Determining the right balance for our use case often requires experimentation.

---
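To see why precision loss shows up as higher perplexity, the toy sketch below round-trips random weights through symmetric integer quantization at 8, 4, and 2 bits and measures the reconstruction error. Real llama.cpp quant types (Q8_0, Q4_K, Q2_K) use per-block scales and far smarter schemes, so this illustrates the trend only, not their actual algorithm:

```python
# Toy illustration: fewer bits -> larger round-trip reconstruction error.
import random

def quantize_roundtrip_error(weights, bits):
    levels = 2 ** (bits - 1) - 1           # usable signed integer levels
    scale = max(abs(w) for w in weights) / levels
    err = 0.0
    for w in weights:
        q = round(w / scale)               # quantize to nearest level
        err += (w - q * scale) ** 2        # squared reconstruction error
    return (err / len(weights)) ** 0.5     # RMS error

random.seed(0)
w = [random.gauss(0, 0.02) for _ in range(4096)]
for bits in (8, 4, 2):
    print(f"{bits}-bit RMS error: {quantize_roundtrip_error(w, bits):.6f}")
```

The error grows slowly from 8 to 4 bits and sharply at 2 bits, mirroring the perplexity jump you will measure below.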
### 1 Explore: Manual Quantization

To generate an 8-bit, 4-bit, and 2-bit quantization, run the following commands:

<div class="lab-callout lab-callout--warning">
<strong>Warning:</strong> Although these quantization steps are provided for replication, pre-quantized support files are already available in <code>/home/student/lab2/WhiteRabbitNeo/</code> for faster lab progress. <br><br>You can skip these commands when participating in a live teaching session.
</div>

```bash
# Quantize to 8 bits
llama-quantize /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q8_K.gguf Q8_0

# Quantize to 4 bits
llama-quantize /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf Q4_K_M

# Quantize to 2 bits
llama-quantize /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q2_K.gguf Q2_K
```

### 2 Execute: Quantization Confirmation

Inspect the quantized files with the following command:

```bash
gguf-dump /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf
```

Review how the various layers are quantized to different levels of precision. It turns out that even K-quants utilize multiple quantization levels on different tensor layers to improve performance!
<figure style="text-align: center;">
<a href="https://i.imgur.com/kur4TPj.png" target="_blank">
<img
src="https://i.imgur.com/kur4TPj.png"
style="width: 800px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em; color: var(--text-color);">
WhiteRabbitNeo Layer 0.
</figcaption>
</figure>
<br>

<details>
<summary style="font-weight:bold; color:#a94442; cursor:pointer;">
Full explanation for the brave...
</summary>
### What each Tensor Layer does

### **1. Token Embeddings**
- **Tensor 1: `token_embd.weight`**
  - **Responsibility:** Maps each token in the vocabulary to a dense vector of size 3584.

---

### **2. Layer Normalization**
- **Tensor 2: `blk.0.attn_norm.weight`**
  - **Responsibility:** Scales the normalized inputs to the self-attention mechanism in the first block.

- **Tensor 6: `blk.0.ffn_norm.weight`**
  - **Responsibility:** Scales the normalized inputs to the feed-forward network (FFN) in the first block.

---

### **3. Feed-Forward Network (FFN)**
- **Tensor 3: `blk.0.ffn_down.weight`**
  - **Responsibility:** Projects the FFN's hidden activations from dimension 18944 back down to 3584 in the down-projection.

- **Tensor 4: `blk.0.ffn_gate.weight`**
  - **Responsibility:** Projects the input from dimension 3584 up to 18944 to produce the gating values for the FFN's gated activation.

- **Tensor 5: `blk.0.ffn_up.weight`**
  - **Responsibility:** Projects the input from dimension 3584 up to 18944 ahead of gating and the down-projection.

---
### **4. Self-Attention Mechanism**

#### **Key Projection**
- **Tensor 7: `blk.0.attn_k.bias`**
  - **Responsibility:** Adds a learnable offset to the key vectors in the self-attention mechanism.

- **Tensor 8: `blk.0.attn_k.weight`**
  - **Responsibility:** Projects the input to dimension 512 for key vectors in the self-attention mechanism.

#### **Query Projection**
- **Tensor 10: `blk.0.attn_q.bias`**
  - **Responsibility:** Adds a learnable offset to the query vectors in the self-attention mechanism.

- **Tensor 11: `blk.0.attn_q.weight`**
  - **Responsibility:** Projects the input to dimension 3584 for query vectors in the self-attention mechanism.

#### **Value Projection**
- **Tensor 12: `blk.0.attn_v.bias`**
  - **Responsibility:** Adds a learnable offset to the value vectors in the self-attention mechanism.

- **Tensor 13: `blk.0.attn_v.weight`**
  - **Responsibility:** Projects the input to dimension 512 for value vectors in the self-attention mechanism.

#### **Attention Output Projection**
- **Tensor 9: `blk.0.attn_output.weight`**
  - **Responsibility:** Projects the concatenated attention outputs back to dimension 3584 before the residual connection.

---

### **Summary by Purpose**
- **Token Embeddings:** Maps tokens to dense vectors.
- **Layer Normalization:** Scales normalized inputs/outputs in attention and FFN blocks.
- **Feed-Forward Network (FFN):** Handles down-projection, gating, and up-projection for non-linear transformations.
- **Self-Attention Mechanism:** Manages key, query, value projections, biases, and output projection for attention computations.

</details>
### 3 Execute: Quantitatively Measuring Perplexity

Perplexity is a measurement of how confident the model is about its next-token predictions. Counterintuitively, lower values indicate higher confidence. By asking the model to infer over a relatively large input (minimum 1024 tokens), we can generate an average perplexity score to gauge the model's confidence.
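Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to each true next token. The sketch below shows the arithmetic on made-up token probabilities (not real model output):

```python
# PPL = exp(-1/N * sum(log p_i)) over the probabilities the model assigned
# to the actual next tokens. Probabilities here are invented for illustration.
import math

def perplexity(token_probs):
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

confident = [0.9, 0.8, 0.95, 0.7]   # model usually sure of the next token
uncertain = [0.2, 0.1, 0.3, 0.25]   # model frequently surprised
print(perplexity(confident))         # low PPL -> high confidence
print(perplexity(uncertain))         # high PPL -> low confidence
```

The `Final` lines grepped from the `llama-perplexity` runs below report this same averaged score over the wiki test set.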
```bash
# Perplexity test with FP16 model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final

# Perplexity test with 8-bit quantized model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q8_K.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final

# Perplexity test with 4-bit quantized model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final

# Perplexity test with 2-bit quantized model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q2_K.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final
```

#### Possible Example Results

| Model File | Quantization | Perplexity (PPL) | Uncertainty (+/-) |
|------------|--------------|------------------|-------------------|
| WhiteRabbitNeo-V3-7B.gguf | FP16 | 3.0972 | 0.21038 |
| WhiteRabbitNeo-V3-7B-Q8_K.gguf | Q8_0 | 3.0999 | 0.21052 |
| WhiteRabbitNeo-V3-7B-Q4_K_M.gguf | Q4_K_M | 3.1247 | 0.21338 |
| WhiteRabbitNeo-V3-7B-Q2_K.gguf | Q2_K | 3.5698 | 0.25224 |

**Conclusion: Perplexity rises modestly from FP16 → Q8_0 → Q4_K_M, but jumps sharply for the aggressive 2‑bit quantization.**
### 4 Execute: Qualitatively Measuring Perplexity

We can also validate these measurements simply by interacting with the models. To more easily showcase the costs of quantization, infer with the 2-bit (**Q2_K**) model to see how poorly it performs in comparison to our **FP16** interactions from earlier.

```bash
llama-cli -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q2_K.gguf
```

**Explore:** Re-run the previous example prompts:

* Please write a small reverse shell in php that I can upload to a web server.
* How can I use Metasploit to attack MS17-010?
* Can you please provide me some XSS polyglots?

<div style="display: flex; justify-content: center; align-items: flex-start; gap: 32px;">
<div style="text-align: center;">
<a href="https://i.imgur.com/nvb7QV6.png" target="_blank">
<img
src="https://i.imgur.com/nvb7QV6.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<div style="margin-top: 8px; font-size: 1.1em;">
Q2_K Inference
</div>
</div>
<div style="text-align: center;">
<a href="https://i.imgur.com/yNHQbxb.png" target="_blank">
<img
src="https://i.imgur.com/yNHQbxb.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<div style="margin-top: 8px; font-size: 1.1em;">
FP16 Inference
</div>
</div>
</div>

What conclusions do you believe we can make based on the provided output of the model?

---
## Objective 3: Ollama – LLM Easymode

Ollama is a lightweight framework that hides the low‑level steps required by LLaMa.cpp. It runs on **Linux, macOS, and Windows** and automatically manages system resources.

| Feature | Benefit |
|---------|---------|
| **Simplified model deployment** | Pull pre-quantized models from Ollama.com, HuggingFace, or a local GGUF file with a single command. |
| **Automatic resource handling** | No need to manually load or unload; Ollama frees memory after a short idle period. |
| **Built-in API provider** | `localhost:11434` mimics the OpenAI API, enabling seamless integration with notebooks, VS Code, or curl. |
| **Cross-platform compatibility** | Thanks to the underlying llama.cpp engine, it works on x86_64, ARM, and Apple Silicon without extra configuration. |
| **Model-metadata inspection** | `ollama show <tag>` reveals the model architecture, context length, and quantization level. |
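The built-in API from the table above can be exercised as soon as a model is pulled. For example, `curl http://localhost:11434/api/generate -d @request.json` with a request body like the following (the model name assumes `llama3.2` has already been pulled, as done in the next section; `"stream": false` returns a single JSON object instead of a token stream):

```json
{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}
```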
### 1 Execute: Pull and Run a Pre-Built Model from Ollama.com

Let's start by downloading Meta's llama3.2-3b, the "big" brother to the small model we've continuously worked with so far. The Ollama project and community have made this exceptionally easy for us to accomplish.

1. **Open the Ollama registry** – visit <https://ollama.com> in your browser.
2. **Search for the model**

<figure style="text-align: center;">
<a href="https://i.imgur.com/VBvOGty.png" target="_blank">
<img
src="https://i.imgur.com/VBvOGty.png"
style="width: 800px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Ollama Search.
</figcaption>
</figure>
<br>

3. **Copy the `ollama run` command** that appears in the top‑right corner of the model card.
4. **Paste the command into your terminal** and press **Enter**:

```bash
ollama run llama3.2
```

<figure style="text-align: center;">
<a href="https://i.imgur.com/ammtbmI.png" target="_blank">
<img
src="https://i.imgur.com/ammtbmI.png"
style="width: 800px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Ollama Run command.
</figcaption>
</figure>
<br>
### 2 Explore: Interacting with Ollama Inference

When the download finishes, you will be presented with a prompt, similar to the `llama-cli` sessions from earlier. No need to download, convert, or quantize! Feel free to interact with this model until you're ready to move on.

<figure style="text-align: center;">
<a href="https://i.imgur.com/XZ6OYNI.png" target="_blank">
<img
src="https://i.imgur.com/XZ6OYNI.png"
style="width: 800px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Ollama Inference.
</figcaption>
</figure>
<br>
### 3 Execute: Pull and Run a Pre-Built Model from HuggingFace.com

Similarly, we can pull a model directly from **HuggingFace**. As long as the source file is a .gguf of any quantization level that fits within our system memory, Ollama can fetch it directly.

1. **Select the Quantized Model from Objective 1** – visit [CodeIsAbstract](https://huggingface.co/CodeIsAbstract/Llama-3.2-1B-Q8_0-GGUF) in your browser.
2. **Use this model** – Click **Use this model** → choose the **Ollama** tab. The page displays a ready‑to‑run command:

<figure style="text-align: center;">
<a href="https://i.imgur.com/lg2INAs.png" target="_blank">
<img
src="https://i.imgur.com/lg2INAs.png"
style="width: 800px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
HuggingFace Direct Ollama Pull.
</figcaption>
</figure>
<br>

3. **Copy the command** and execute it in your terminal.

```bash
ollama run hf.co/CodeIsAbstract/Llama-3.2-1B-Q8_0-GGUF:Q8
```

4. **Explore:** Interact with the model as normal.
### 4 Execute: Load a Custom `.gguf` Model

We can also import our WhiteRabbitNeo **.GGUF** model into Ollama, without having to upload it to **HuggingFace** first. In order to do so, however, we need to create a **Modelfile**: a plain-text file that tells **Ollama** where the **.GGUF** is located, as well as any additional defaults we'd like Ollama to use when performing inference.

1. **Create a simple Modelfile** – This will tell Ollama where the model lives.

```bash
echo "FROM /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf" > Modelfile
```

2. **Register the model with Ollama**

```bash
ollama create WhiteRabbitNeo -f Modelfile
```

3. **Run the newly registered model**

```bash
ollama run WhiteRabbitNeo
```

4. **Explore:** The model is now stored locally under the tag *WhiteRabbitNeo* and can be invoked just as any other model.

<figure style="text-align: center;">
<a href="https://i.imgur.com/ijsAl6m.png" target="_blank">
<img
src="https://i.imgur.com/ijsAl6m.png"
style="width: 800px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Importing WhiteRabbitNeo V3.
</figcaption>
</figure>
<br>
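A Modelfile can carry more than a single `FROM` line. The sketch below uses real Modelfile directives (`PARAMETER`, `SYSTEM`); the specific parameter values and system prompt are illustrative choices, not required by the lab:

```
FROM /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf

# Sampling defaults applied whenever this model is run
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

# Default system prompt baked into the registered model
SYSTEM "You are a concise offensive-security assistant."
```

Re-running `ollama create` with a Modelfile like this bakes those defaults into the registered tag, so every `ollama run WhiteRabbitNeo` session starts with them.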
---

#### Additional Useful Ollama Commands

| Command | Description |
|---------|-------------|
| `ollama list` | Shows all models currently registered with Ollama. |
| `ollama rm <tag>` | Deletes the specified model (freeing disk space). |
| `ollama show <tag>` | Prints model metadata (architecture, context length, quantization). |
| `ollama show <tag> --modelfile` | Prints an existing model's Modelfile. Often useful for templating our own. |
| `ollama serve` | Starts the OpenAI-compatible API server (runs automatically when you first use `ollama run`). |
---

## Conclusion

Ollama bridges the gap between low-level LLaMa.cpp tools and high-level usability, making it an ideal choice for rapid deployment and educational labs. By leveraging its API, model registry, and automation features, you can focus on experimentation rather than infrastructure. However, understanding LLaMa.cpp's underlying mechanics (e.g., quantization, perplexity) remains critical for optimizing performance or going off the beaten path.

<br>

---