New Lab 2

2026-04-07 16:02:48 -06:00
parent 6bcebd55ee
commit 9f3af49845
65 changed files with 6650 additions and 1553 deletions
@@ -1,11 +1,19 @@
---
order: 1
title: Lab 1 - Visualizing LLMs in TransformerLab
description: Explore model structure, tokenization, and next-token prediction inside TransformerLab.
---
<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
# Lab 1 - Visualizing LLMs in TransformerLab
In this lab, we will:
- Download and Visualize Llama-3.2-1B-Instruct
- Visualize Tokenization & Prediction with Llama-3.2-1B-Instruct
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
@@ -13,14 +21,13 @@ In this lab, we will:
<strong>Execute</strong> steps require performing actions in the lab environment.
</div>
## Objective 1: Starting TransformerLab
### Execute: Access the Lab Environment
To start Lab 1, ensure you've received a WireGuard configuration and system IP from your instructor. If you're unfamiliar with WireGuard, assistance will be provided to ensure you can access the lab environment for the duration of class.
All systems use the default username and password of `student`. All labs are located in the student home folder. To start Lab 1, run
```bash
~/lab1/lab1_start.sh
@@ -28,7 +35,7 @@ All systems use the default username and password of `student`. All labs are loc
using the `lab1_start.sh` script in the `lab1` folder.
Lastly, if necessary, you can `su -` to root at any time. No password will be required.
Once started, you can reach TransformerLab on port 8338 of your Lab VM (http://<IP>:8338).
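If the page does not load right away, a quick reachability check from your workstation can help rule out WireGuard or firewall issues before you start troubleshooting TransformerLab itself. A minimal sketch, assuming `curl` is available and `<IP>` is the lab VM address you were given:

```bash
# Replace <IP> with your assigned lab VM address
curl -sSf -o /dev/null http://<IP>:8338 && echo "TransformerLab is reachable" || echo "Not reachable yet"
```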
@@ -54,11 +61,11 @@ Navigate to **Plugins**, and in the search bar type `Fastchat`. Note that it has
Plugins
</figcaption>
</figure>
<br>
### Execute: Find and Load `Llama-3.2-1B-Instruct`
Next, navigate to **Model Registry**. You should see `Llama-3.2-1B-Instruct` right away on your screen; if not, search for the model using the search bar.
<figure style="text-align: center;">
<a href="https://i.imgur.com/UyWdnMR.png" target="_blank">
@@ -86,7 +93,7 @@ Once downloaded, Select **Foundation** & our newly downloaded `LLama-3.2-1B-Inst
</figure>
<br>
Once selected, click **Run**. Give TransformerLab a moment to successfully load the model.
<figure style="text-align: center;">
<a href="https://i.imgur.com/f4YcA8P.png" target="_blank">
@@ -101,8 +108,8 @@ Once selected, click **Run**. Give TransformerLab a moment to successfully load
<br>
### Explore: Inspect the Architecture View
To start, let's navigate to the **Interact** page, and then select **Model Architecture** from the Chat drop-down.
<figure style="text-align: center;">
<a href="https://i.imgur.com/X0CM31h.png" target="_blank">
@@ -117,8 +124,9 @@ To start, lets navigate to the **Interact** page, and then select **Model Archit
<br>
This page allows us to visualize the actively loaded model, in this case our downloaded `Llama-3.2-1B-Instruct`. This interactive view is equivalent to the greatly simplified version shown on the slide “Transformation: Multilayer Perceptron” from our lecture. We can explore this view as follows:
- Holding down both right and left mouse buttons and dragging will move the entire model.
- Holding down just the left mouse button will allow you to rotate the view.
<figure style="text-align: center;">
<a href="https://i.imgur.com/8hXTGlt.png" target="_blank">
@@ -134,10 +142,11 @@ This page allows us to visualize the actively loaded model, in this case our dow
### Explore: Interpret Layers, Blocks, and Parameters
Each layer of the model performs a specific task, taking the input provided and transforming it into the statistically most likely completion of the text, token by token. This version of Llama 3.2 1B is made up of 372 **layers**. Each layer transforms the output of the layer above it until, eventually, we end up with the statistically most likely completion.
You have likely also noticed that the colors repeat. Each set of repeating **layers** is organized into **blocks**. Each **block** is a grouping of **layers** that perform the same functions, but with a slightly different focus. For example, one **block** may focus on nouns, another on adjectives, and so on.
The **layers** within Llama 3.2 1B are as follows:
<ul class="concept-pill-list">
<li>
<span class="concept-pill-label">Attention:</span>
@@ -157,11 +166,10 @@ The **layers** within Llama 3.1 1B are as follows:
</li>
</ul>
Each of these **layers** also has a different type, corresponding to Q, K, V, and much more. The **layers** between the small “Attention” **layers** are all considered to make up a single **block**.
To the side, we can see the actual number values of each weight within each layer.
Fundamentally, the LLM itself is this stack of numbers. Those numbers allow us to transform tokenized input (such as English text) into a useful output. The more **layers** and **blocks** a model has, the bigger it is, and the more accurately and “intelligently” it will behave. This 1B-parameter model is incredibly small, however, so the “truthfulness” of its generated predictions is likely to be suspect (i.e., hallucinated). The model will at least sound very confident!
<br>
@@ -185,7 +193,7 @@ Lets next move on to active conversation with the model. Navigate to the **Chat*
</figure>
<br>
Once loaded, feel free to type any message and interact with the model in any way. To speed up the pace of our lab, I recommend setting your maximum output length to 64 tokens.
<figure style="text-align: center;">
<a href="https://i.imgur.com/MdAIKLn.png" target="_blank">
@@ -203,7 +211,7 @@ If text generation fails, or acts weird (such as merely repeating your input bac
### Execute: View Tokenization
If everything is in working order, review the **Tokenize** view. This allows us to visually see how Llama 3.2 will convert our input text into “tokens,” or numbers that represent the input English. Feel free to input any sentence into the box to review what the final tokenized version will be.
<figure style="text-align: center;">
<a href="https://i.imgur.com/I9tU8jK.png" target="_blank">
@@ -219,7 +227,7 @@ If everything is in working order, review the **Tokenize** view. This allows us
### Execute: Visualize Next-Token Activations
Next, select **Model Activations**. By entering “The quick brown fox” and selecting **Visualize**, we can see how the model selects the next word, along with the model's level of confidence. Feel free to repeat this process with alternative sentences.
<figure style="text-align: center;">
<a href="https://i.imgur.com/JeWpoqV.png" target="_blank">
@@ -235,7 +243,7 @@ Next, select Model Activations. By entering “The quick brown fox” and selec
### Execute: Compare Confidence Views
Note how confident the model is about the word “jumps” in this famous phrase. For an alternative view of the same output, you can also select the **Visualize Logprobs** option from the menu, which shows the same information by color.
<figure style="text-align: center;">
<a href="https://i.imgur.com/PvkgQUr.png" target="_blank">
@@ -251,12 +259,13 @@ Note how confident the model is about the word jumps in this famous phrase. For
### Explore: Continue Exploring TransformerLab Features
Please continue to explore TransformerLab until you're ready to move on. While we will utilize many different tools other than TransformerLab throughout this course (due to its beta nature), this software is improving all the time and is worth watching! TransformerLab supports many advanced features, in various stages of development, such as:
- Batch Text Generation
- LLM Fine Tuning
- LLM Evaluation
- Retrieval Augmented Generation (RAG)
We will discuss these topics and more throughout the course.
<br>
@@ -264,6 +273,6 @@ We will discuss these topics and more throughout the course.
## Conclusion
In this lab, we observed the foundational concepts of all LLMs in action using TransformerLab. Through hands-on exploration, we examined the process of tokenization (how text is converted into numerical representations for the model) and visualized the model's prediction process, including its confidence levels for different token selections. By navigating the model's layers and blocks, we gained an appreciation for the sheer scale and complexity inherent in modern LLMs.
This initial experience provides a crucial stepping stone for further exploration of LLMs, laying the groundwork for future labs focused on fine-tuning, evaluation, and advanced techniques like Retrieval Augmented Generation.
@@ -1,547 +0,0 @@
<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
# Lab 2 - LLaMa.cpp, Ollama & Quantization
In this lab, we will:
* Download a model from huggingface.com and quantize it for llama.cpp
* Download a model from huggingface.com and infer it in llama.cpp
* Download a model from ollama.com
* Download a custom model from huggingface.com
* Import a custom model into Ollama.
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on investigation and comparison.<br />
<strong>Execute</strong> sections require running commands and producing output.
</div>
To start this lab, you'll need CLI access:
* SSH - <IP>:22
* All necessary artifacts are in the lab2 folder
## Objective 1: HuggingFace & LLaMa.cpp
### 1. What Is LLaMa.cpp?
LLaMa.cpp is an open-source project created to enable efficient running of Meta's LLaMA (Large Language Model Meta AI) family of large language models on consumer-grade hardware. It was initially developed by **Georgi Gerganov** in early March 2023, shortly after Meta released the weights of the LLaMA models to approved researchers.
The project's original goal was to make LLaMA models accessible on systems without powerful GPUs, including laptops, desktops, and even mobile devices. **LLaMa.cpp** achieves this by implementing the LLaMA inference in pure C/C++ and introducing highly efficient quantization techniques, allowing models to run with drastically reduced memory requirements. **LLaMa.cpp** is also the underlying project behind a number of inference wrappers and technologies, such as Llamafile, LM Studio, and Ollama, amongst many others.
### Key Features
| Capability | Why it matters |
|------------|----------------|
| **Efficient local inference** | Runs large language models without a powerful GPU. |
| **Quantization tools** (`llama-quantize`) | Shrinks model size (down to 1-bit) while preserving usable performance. |
| **Model conversion to .GGUF** | Provides a compact, fast-loading format that works with Ollama, LM Studio, and other wrappers. |
| **Cross-platform support** | Works on Linux, macOS, Windows, Apple Silicon, and ARM devices. |
| **CLI and debugging utilities** (`llama-cli`, `gguf-dump.py`) | Enables quick interactive testing and inspection of model metadata. |
| **Perplexity measurement** (`llama-perplexity`) | Quantifies how confident the model is about its predictions. |
| **Active community** | Powers tools such as LM Studio, Llamafile, and Ollama. |
---
## 1.2 Explore: HuggingFace - Model Cards
[HuggingFace](https://huggingface.com) is the “GitHub” for LLMs, datasets, and more. The following steps walk you through locating Meta's **LLaMA 3.2 1B** model card and its files.
1. **Open the LLaMA 3.2 1B page**
<https://huggingface.co/meta-llama/Llama-3.2-1B>
<br>
2. **Read the model card** - note the description, license, tags (e.g., *Text Generation*, *SafeTensors*, *PyTorch*), and links to fine-tunes/quantizations.
<br>
3. **Navigate to “Quantizations.”**
This tab lists community-created quantizations, including GGUF, GPTQ, AWQ, and EXL3 versions. Common providers include **Bartowski**, **Unsloth**, and **NousResearch**, although these players change periodically. Additionally, note that we can often download quantized versions *without* having agreed to the Meta license restrictions for the original model.
<figure style="text-align:center;">
<a href="https://i.imgur.com/Po0Ll3o.png" target="_blank">
<img src="https://i.imgur.com/Po0Ll3o.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Model Card Quantizations Convenience Link</figcaption>
</figure>
<br>
<figure style="text-align:center;">
<a href="https://i.imgur.com/NM1rbXV.png" target="_blank">
<img src="https://i.imgur.com/NM1rbXV.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Model Quantization Options</figcaption>
</figure>
4. **Open “Files and versions.”**
Here you see the raw `.safetensors` files (the unquantized checkpoint). For the model to successfully run, the full set of files needs to be loaded into system memory. Note how this 1B-parameter model is small enough to fit comfortably in a phone's memory, even raw.
<figure style="text-align:center;">
<a href="https://i.imgur.com/6I9zkeu.png" target="_blank">
<img src="https://i.imgur.com/6I9zkeu.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Distribution Restriction</figcaption>
</figure>
Unless you've accepted Meta's EULA for this model, you'll be unable to download the model directly from Meta. This view may or may not appear based on your own HuggingFace account.
## 1.3 Explore: HuggingFace - Find and Download WhiteRabbitNeo
For this lab we will work with **WhiteRabbitNeo-V3-7B**, a cybersecurity-oriented fine-tune of Qwen2.5-Coder-7B. This model is less popular than LLaMA-3.2, and if we'd like to run this model in Ollama, we'll need to perform our own quantization.
<div class="lab-callout lab-callout--warning">
<strong>Warning:</strong> Although the next two steps show how to find and download this model so you can replicate the process, support files are already provided in <code>/home/student/lab2/WhiteRabbitNeo</code> to speed up lab execution.
</div>
### 1. Locate & download the model
1. Go to <https://huggingface.co/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B>.
2. Points of Interest on this model card:
1. This model appears to be a fine-tune of **Qwen2.5-Coder-7B**
2. This model is openly licensed, and does not have any restrictions on downloading and using it for our purposes.
3. This model is in **Safetensors** format, which is compatible with **LLaMa.cpp**'s quantization tools.
<figure style="text-align:center;">
<a href="https://i.imgur.com/9GrHRuh.png" target="_blank">
<img src="https://i.imgur.com/9GrHRuh.png" width="800" style="border:5px solid black;">
</a>
<figcaption>WhiteRabbitNeo model card.</figcaption>
</figure>
3. Click **Files and versions** → review the `.safetensors` checkpoints (≈ 15 GB at **FP16**).
<figure style="text-align:center;">
<a href="https://i.imgur.com/Emx97nL.png" target="_blank">
<img src="https://i.imgur.com/Emx97nL.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Model safetensors (size ≈ 15 GB).</figcaption>
</figure>
### 2 Download the Model
To prepare this model, create a working folder anywhere you like on your system. Once created, perform the following:
1. Ensure you have git & git-lfs installed to enable successful cloning from HuggingFace. If necessary, git can be installed on Debian-based distributions via:
```bash
sudo apt install git git-lfs
git lfs install
```
2. Clone the model:
```bash
git clone https://huggingface.co/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B
```
### 3 Execute: Convert the Downloaded Model
**LLaMa.cpp** makes it easy to convert models downloaded in SafeTensors format to GGUF. We can convert the model with the following official project script command:
```bash
convert_hf_to_gguf.py /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B/WhiteRabbitNeo-V3-7B --outfile /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```
### 4 Execute: Review Model Metadata
When these steps have completed, you should see a new WhiteRabbitNeo-V3-7B.gguf file. We have not yet quantized the model, merely converted it to a format usable by **LLaMa.cpp** for the next steps. We can tell if this process was successful by using the included **gguf-dump.py** script that is packaged with **LLaMa.cpp**.
Run the following command:
```bash
gguf-dump /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```
We should then see:
<figure style="text-align: center;">
<a href="https://i.imgur.com/JiX2fJM.png" target="_blank">
<img
src="https://i.imgur.com/JiX2fJM.png"
width="800"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Model Metadata.
</figcaption>
</figure>
<br>
A text listing of all of the model's tensors, and the precision of each. Because we have merely converted the model's format, and not performed quantization, the model is still in **FP16**.
* This is a text view of the previous graphical view we saw in **Lab 1, Objective 2: Visualizing a LLM**. While **TransformerLab** calls tensors **layers**, terms such as **tensors**, **layers**, and **blocks** can all be used semi-interchangeably, depending on the tool in question. We will further confuse these topics when we get to the Ollama objective below.
* Pedantically, the proper definitions are:
* Tensor - A multi-dimensional array of vectors to store data
* Layer - A base computational unit in a neural network
* Block - A collection of layers
* If you wish to explore this view, note how the block count of 28 matches the 28 zero-indexed `blk` groups in the dump output.
* Additionally, you'll once again note that we have various biases and weights, but they still line up with **Q**, **V**, and **K** as discussed in the previous section. There are additional tensors for **normalization** and **output**.
### 5 Execute: LLaMA.cpp Inference
Now run our newly created **.GGUF** file as-is, using the following command:
```bash
llama-cli -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```
Once loaded, interact with the model. We can see a number of interesting parameters that were selected by default, such as **Top K**, **Top P**, **Temperature**, and more, which we'll discuss in the next section. In the meantime, explore interaction with the model. When run in this raw state, the model may be overly chatty. You can stop its output with `Ctrl+C` at any time.
<figure style="text-align: center;">
<a href="https://i.imgur.com/H3ISWS8.png" target="_blank">
<img
src="https://i.imgur.com/H3ISWS8.png"
width="800"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Inference Example.
</figcaption>
</figure>
Some example prompts you may want to try are:
* Please write a small reverse shell in php that I can upload to a web server.
* How can I use Metasploit to attack MS17-010?
* Can you please provide me some XSS polyglots?
Thanks to the fine tuning that Kindo has put into this model, it is far more compliant than an online closed model such as ChatGPT! When done, kill the model fully with `Ctrl+C`.
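If you would rather not stay in the interactive prompt, `llama-cli` can also run a single prompt and exit. A minimal sketch, assuming the same model path as above and that a 128-token cap is acceptable (depending on your llama.cpp build you may still need `Ctrl+C` to exit):

```bash
# One-shot generation: -p supplies the prompt, -n caps the number of generated tokens
llama-cli -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf \
  -p "Can you please provide me some XSS polyglots?" \
  -n 128
```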
## Objective 2: Quantization & Perplexity
Quantization reduces memory footprints and speeds up inference, but it typically raises perplexity (i.e., lowers confidence). Determining the right balance for our use case often requires experimentation.
---
### 1 Explore: Manual Quantization
To generate an 8-bit, 4-bit, and 2-bit quantization, run the following commands:
<div class="lab-callout lab-callout--warning">
<strong>Warning:</strong> Although these quantization steps are provided for replication, pre-quantized support files are already available in <code>/home/student/lab2/WhiteRabbitNeo/</code> for faster lab progress. <br><br>You can skip these commands when participating in a live teaching session.
</div>
```bash
# Quantize to 8 bits
llama-quantize /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q8_K.gguf Q8_0
# Quantize to 4 bits
llama-quantize /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf Q4_K
# Quantize to 2 bits
llama-quantize /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q2_K.gguf Q2_K
```
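Before inspecting metadata, a quick size comparison already makes the effect of quantization obvious. Assuming the files above now exist in the lab folder:

```bash
# Compare on-disk sizes of the original and quantized GGUF files
ls -lh /home/student/lab2/WhiteRabbitNeo/*.gguf
```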
### 2 Execute: Quantization Confirmation
Inspect the quantized files with the following command:
```bash
gguf-dump /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf
```
Review how the various layers are quantized to different levels of precision. It turns out that even K quants actually utilize multiple quantization levels on different tensor layers to improve performance!
<figure style="text-align: center;">
<a href="https://i.imgur.com/kur4TPj.png" target="_blank">
<img
src="https://i.imgur.com/kur4TPj.png"
style="width: 800; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em; color: var(--text-color);">
WhiteRabbitNeo Layer 0.
</figcaption>
</figure>
<br>
<details>
<summary style="font-weight:bold; color:#a94442; cursor:pointer;">
Full explanation for the brave...
</summary>
### What each Tensor Layer does
### **1. Token Embeddings**
- **Tensor 1: `token_embd.weight`**
- **Responsibility:** Maps each token in the vocabulary to a dense vector of size 3584.
---
### **2. Layer Normalization**
- **Tensor 2: `blk.0.attn_norm.weight`**
- **Responsibility:** Scales the normalized inputs to the self-attention mechanism in the first block.
- **Tensor 6: `blk.0.ffn_norm.weight`**
- **Responsibility:** Scales the normalized outputs of the feed-forward network (FFN) in the first block.
---
### **3. Feed-Forward Network (FFN)**
- **Tensor 3: `blk.0.ffn_down.weight`**
- **Responsibility:** Projects the input from dimension 3584 to 18944 in the FFN down-projection.
- **Tensor 4: `blk.0.ffn_gate.weight`**
- **Responsibility:** Projects the output back to dimension 3584 after the non-linear transformation in the FFN gate mechanism.
- **Tensor 5: `blk.0.ffn_up.weight`**
- **Responsibility:** Projects the output of the non-linear transformation back to dimension 3584 in the FFN up-projection.
---
### **4. Self-Attention Mechanism**
#### **Key Projection**
- **Tensor 7: `blk.0.attn_k.bias`**
- **Responsibility:** Adds a learnable offset to the key vectors in the self-attention mechanism.
- **Tensor 8: `blk.0.attn_k.weight`**
- **Responsibility:** Projects the input to dimension 512 for key vectors in the self-attention mechanism.
#### **Query Projection**
- **Tensor 10: `blk.0.attn_q.bias`**
- **Responsibility:** Adds a learnable offset to the query vectors in the self-attention mechanism.
- **Tensor 11: `blk.0.attn_q.weight`**
- **Responsibility:** Projects the input to dimension 3584 for query vectors in the self-attention mechanism.
#### **Value Projection**
- **Tensor 12: `blk.0.attn_v.bias`**
- **Responsibility:** Adds a learnable offset to the value vectors in the self-attention mechanism.
- **Tensor 13: `blk.0.attn_v.weight`**
- **Responsibility:** Projects the input to dimension 512 for value vectors in the self-attention mechanism.
#### **Attention Output Projection**
- **Tensor 9: `blk.0.attn_output.weight`**
- **Responsibility:** Projects the concatenated attention outputs back to dimension 3584 before residual connection.
---
### **Summary by Purpose**
- **Token Embeddings:** Maps tokens to dense vectors.
- **Layer Normalization:** Scales normalized inputs/outputs in attention and FFN blocks.
- **Feed-Forward Network (FFN):** Handles down-projection, gating, and up-projection for non-linear transformations.
- **Self-Attention Mechanism:** Manages key, query, value projections, biases, and output projection for attention computations.
</details>
### 3 Execute: Quantitatively Measuring Perplexity
Perplexity is a measurement of how confident the model is about its next-token prediction. Somewhat counterintuitively, lower values indicate higher confidence. By asking the model to infer a relatively large input (minimum 1024 tokens), we can generate an average perplexity score to gauge the model's confidence.
```bash
# Perplexity test with FP16 model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final
# Perplexity test with 8-bit quantized model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q8_K.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final
# Perplexity test with 4-bit quantized model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final
# Perplexity test with 2-bit quantized model
llama-perplexity -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q2_K.gguf -f /home/student/lab2/wiki.test.raw 2>&1 | grep Final
```
#### Possible Example Results
| Model File | Quantization | Perplexity (PPL) | Uncertainty (+/-) |
|------------|--------------|------------------|-------------------|
| WhiteRabbitNeo-V3-7B.gguf | Full | 3.0972 | 0.21038 |
| WhiteRabbitNeo-V3-7B-Q8_K.gguf | Q8_K | 3.0999 | 0.21052 |
| WhiteRabbitNeo-V3-7B-Q4_K_M.gguf | Q4_K_M | 3.1247 | 0.21338 |
| WhiteRabbitNeo-V3-7B-Q2_K.gguf | Q2_K | 3.5698 | 0.25224 |
**Conclusion: Perplexity rises modestly from FP16 → Q8_K → Q4_K_M, but jumps sharply for the aggressive 2-bit quantization.**
### 4 Execute: Qualitatively Measuring Perplexity
We can also qualitatively validate these measurements by interacting with the models directly. To showcase the cost of higher perplexity, run inference on the 2-bit (**Q2_K**) model and note how poorly it performs in comparison to our **FP16** interactions from earlier.
```bash
llama-cli -m /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q2_K.gguf
```
**Explore:** Re-run the previous example prompts:
* Please write a small reverse shell in php that I can upload to a web server.
* How can I use Metasploit to attack MS17-010?
* Can you please provide me some XSS polyglots?
<div style="display: flex; justify-content: center; align-items: flex-start; gap: 32px;">
<div style="text-align: center;">
<a href="https://i.imgur.com/nvb7QV6.png" target="_blank">
<img
src="https://i.imgur.com/nvb7QV6.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<div style="margin-top: 8px; font-size: 1.1em;">
Q2_K Inference
</div>
</div>
<div style="text-align: center;">
<a href="https://i.imgur.com/yNHQbxb.png" target="_blank">
<img
src="https://i.imgur.com/yNHQbxb.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<div style="margin-top: 8px; font-size: 1.1em;">
FP16 Inference
</div>
</div>
</div>
What conclusions do you believe we can draw based on the provided output of the model?
---
## Objective 3: Ollama LLM Easymode
Ollama is a lightweight framework that hides the low-level steps required by LLaMa.cpp. It runs on **Linux, macOS, and Windows** and automatically manages system resources.
| Feature | Benefit |
|---------|---------|
| **Simplified model deployment** | Pull pre-quantized models from Ollama.com, HuggingFace, or a local GGUF file with a single command. |
| **Automatic resource handling** | No need to manually load or unload; Ollama frees memory after a short idle period. |
| **Built-in API provider** | `localhost:11434` mimics the OpenAI API, enabling seamless integration with notebooks, VS Code, or curl. |
| **Cross-platform compatibility** | Thanks to underlying llama.cpp architecture, works on x86_64, ARM, and Apple Silicon without extra configuration. |
| **Model-metadata inspection** | `ollama show <tag>` reveals the model architecture, context length, and quantization level. |
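Because Ollama exposes an OpenAI-compatible endpoint on port 11434, you can also talk to a model without the interactive REPL. A minimal sketch, assuming Ollama is running locally and that a model such as `llama3.2` (which we pull in the next step) is already available:

```bash
# Chat completion via Ollama's OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "In one sentence, what is quantization?"}]
      }'
```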
### 1 Execute: Pull and Run a Pre-Built Model from Ollama.com
Let's start by downloading Meta's Llama 3.2 3B, the "big" brother of the small model we've continuously worked with so far. The Ollama project and community have made this exceptionally easy for us to accomplish.
1. **Open the Ollama registry** - visit <https://ollama.com> in your browser.
2. **Search for the model**
<figure style="text-align: center;">
<a href="https://i.imgur.com/VBvOGty.png" target="_blank">
<img
src="https://i.imgur.com/VBvOGty.png"
style="width: 800; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Ollama Search.
</figcaption>
</figure>
<br>
3. **Copy the `ollama run` command** that appears in the top-right corner of the model card.
4. **Paste the command into your terminal** and press **Enter**:
```bash
ollama run llama3.2
```
<figure style="text-align: center;">
<a href="https://i.imgur.com/ammtbmI.png" target="_blank">
<img
src="https://i.imgur.com/ammtbmI.png"
style="width: 800; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Ollama Run command.
</figcaption>
</figure>
<br>
### 2 Explore: Interacting with Ollama Inference
When the pull finishes, you will be presented with a prompt similar to the `llama-cli` one. No separate download, conversion, or quantization needed! Feel free to interact with this model until you're ready to move on.
<figure style="text-align: center;">
<a href="https://i.imgur.com/XZ6OYNI.png" target="_blank">
<img
src="https://i.imgur.com/XZ6OYNI.png"
style="width: 800; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Ollama Inference.
</figcaption>
</figure>
<br>
### 3 Execute: Pull and Run a Pre-Built Model from HuggingFace.com
Similarly, we can pull a model directly from **HuggingFace**. As long as the source file is a .gguf at any quantization level that fits within our system memory, Ollama can fetch it directly.
1. **Select a pre-quantized GGUF model** - visit [CodeIsAbstract](https://huggingface.co/CodeIsAbstract/Llama-3.2-1B-Q8_0-GGUF) in your browser.
2. **Use this model** - Click **Use this model** → choose the **Ollama** tab. The page displays a ready-to-run command:
<figure style="text-align: center;">
<a href="https://i.imgur.com/lg2INAs.png" target="_blank">
<img
src="https://i.imgur.com/lg2INAs.png"
style="width: 800; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
HuggingFace Direct Ollama Pull.
</figcaption>
</figure>
<br>
3. **Copy the command** and execute it in your terminal.
```bash
ollama run hf.co/CodeIsAbstract/Llama-3.2-1B-Q8_0-GGUF:Q8
```
4. **Explore:** Interact with the model as normal.
### 4 Execute: Load a Custom `.gguf` Model
We can also import our WhiteRabbitNeo **.GGUF** model into Ollama without having to upload it to **HuggingFace** first. In order to do so, however, we need to create a **Modelfile**, a plain-text file that tells **Ollama** where the **.GGUF** is located, as well as any additional defaults we'd like Ollama to use when performing inference (see the extended example after these steps).
1. **Create a simple Modelfile** - this will tell Ollama where the model lives.
```bash
echo "FROM /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf" > Modelfile
```
2. **Register the model with Ollama**
```bash
ollama create WhiteRabbitNeo -f Modelfile
```
3. **Run the newly registered model**
```bash
ollama run WhiteRabbitNeo
```
4. **Explore:** The model is now stored locally under the tag *WhiteRabbitNeo* and can be invoked just as any other model.
<figure style="text-align: center;">
<a href="https://i.imgur.com/ijsAl6m.png" target="_blank">
<img
src="https://i.imgur.com/ijsAl6m.png"
style="width: 800; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Importing WhiteRabbitNeo V3.
</figcaption>
</figure>
<br>
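As promised above, the Modelfile format accepts more than a single `FROM` line. The sketch below is illustrative only: the temperature value, system prompt, and the tag `WhiteRabbitNeo-tuned` are assumptions, not lab requirements, while `PARAMETER` and `SYSTEM` are standard Modelfile directives:

```bash
cat > Modelfile <<'EOF'
FROM /home/student/lab2/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf
# Optional defaults applied whenever this model is run
PARAMETER temperature 0.7
SYSTEM "You are a concise cybersecurity assistant."
EOF
ollama create WhiteRabbitNeo-tuned -f Modelfile
```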
---
#### Additional Useful Ollama Commands
| Command | Description |
|---------|-------------|
| `ollama list` | Shows all models currently registered with Ollama. |
| `ollama rm <tag>` | Deletes the specified model (freeing disk space). |
| `ollama show <tag>` | Prints model metadata (architecture, context length, quantization). |
| `ollama show <tag> --modelfile` | Prints an existing model's modelfile. Often useful for templating our own. |
| `ollama serve` | Starts the OpenAI-compatible API server (runs automatically when you first use `ollama run`). |
---
## Conclusion
Ollama bridges the gap between low-level LLaMa.cpp tools and high-level usability, making it an ideal choice for rapid deployment and educational labs. By leveraging its API, model registry, and automation features, you can focus on experimentation rather than infrastructure. However, understanding LLaMa.cpp's underlying mechanics (e.g., quantization, perplexity) remains critical for optimizing performance, or going off the beaten path.
<br>
---
@@ -0,0 +1,124 @@
---
order: 2
title: "Lab 2 - Quantization Tradeoffs: Comparing 2-bit, 4-bit, and 8-bit"
description: Download Gemma 4 E2B in three GGUF quantizations and compare size, metadata, and output quality.
---
<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
# Lab 2 - Quantization Tradeoffs: Comparing 2-bit, 4-bit, and 8-bit
In this lab, we will:
- Download the same Gemma model in `UD-IQ2_M`, `Q4_K_M`, and `Q8_0`
- Compare file size and GGUF metadata across those quantizations
- Observe how lower precision changes the model's behavior
- Build intuition for when a smaller quant may or may not be worth it
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on comparison and trade-off analysis.<br />
<strong>Execute</strong> sections require collecting evidence from each quantized model.
</div>
## Objective 1: Understand the Model and the Quantizations
For this lab, we will use the Hugging Face repository for **Unsloth's GGUF release of Gemma 4 E2B Instruct**:
<https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF>
This repository currently exposes multiple GGUF variants of the same base model. We will focus on one file from each of these precision bands:
| Precision band | GGUF file | Why we are using it | File Size |
| -------------- | ------------------------------ | --------------------------------------- |-----------|
| 2-bit | `gemma-4-E2B-it-UD-IQ2_M.gguf` | Most aggressive compression in this lab | 2.4 GB |
| 4-bit | `gemma-4-E2B-it-Q4_K_M.gguf` | Common middle-ground quant | 3.17 GB |
| 8-bit | `gemma-4-E2B-it-Q8_0.gguf` | Highest-quality quant in this lab | 5.05 GB |
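If you would like to fetch one of these files yourself rather than rely on copies staged by your instructor, the Hugging Face CLI can download a single GGUF from the repository. A minimal sketch, using the repository and filename listed above and assuming `huggingface_hub` (which provides `huggingface-cli`) is installed:

```bash
# Download just the 4-bit variant into the current directory
huggingface-cli download unsloth/gemma-4-E2B-it-GGUF \
  gemma-4-E2B-it-Q4_K_M.gguf --local-dir .
```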
Even though the filenames differ, these are all the same underlying instruction-tuned Gemma 4 E2B model. The main variable we are changing is how the weights are stored.
When we say these files are the same model, we mean that the overall neural network is still the same:
- The same architecture
- The same layer count
- The same tokenizer
- The same training and instruction tuning
- The same general behavior the model learned during training
What changes is the numeric representation of the learned weights.
Imagine one learned weight in the original model is:
```text
0.156347
```
That number came from training. It is one of many values the model uses while computing each next token. Quantization does not invent a new model from scratch. Instead, it takes that trained value and asks:
```text
How can we store a close-enough version of this number using fewer bits?
```
If we use a simplified integer-style quantization scheme, the math looks like this:
```text
scale = max(|w|) / (2^(bits - 1) - 1)
q = round(w / scale)
w_hat = q * scale
```
Where:
- `w` is the original weight
- `q` is the stored integer bucket
- `scale` maps integers back into the original numeric range
- `w_hat` is the reconstructed approximation used at inference time
So if the original trained value was `0.156347`, a lower-bit quantized file may not store that exact number anymore. It may store an integer bucket like `1`, `5`, or `22`, plus a scale, and reconstruct an approximation such as:
- `0.000000`
- `0.130029`
- `0.146806`
- `0.157782`
Those are not identical to the original weight, but they may still be close enough for useful inference.
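To make this concrete, here is one hypothetical worked example. Assume the layer's largest-magnitude weight is `0.5` (an assumption chosen only for illustration) and apply the simplified scheme above to the weight `0.156347` at three bit widths:

```text
8-bit: scale = 0.5 / 127 ≈ 0.003937   q = round(0.156347 / 0.003937) = 40   w_hat ≈ 0.157480
4-bit: scale = 0.5 / 7   ≈ 0.071429   q = round(0.156347 / 0.071429) = 2    w_hat ≈ 0.142857
2-bit: scale = 0.5 / 1   = 0.500000   q = round(0.156347 / 0.500000) = 0    w_hat = 0.000000
```

Notice how the 8-bit reconstruction stays very close to the original value, while the 2-bit version collapses this particular weight to zero. That is the intuition behind the quality differences you will observe later in this lab.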
<div data-quantization-explorer></div>
### Explore: Interactive precision viewer
The viewer below zooms out from one weight and instead shows a toy layer with 16 stored values. Real GGUF schemes such as `Q4_K_M` and `UD-IQ2_M` are more sophisticated than this toy example, but the core idea is the same:
- Fewer bits means fewer representable values
- More weights get pushed into the same small set of stored buckets
- The layer becomes more compressed as precision drops
<div data-quantization-grid-explorer></div>
### Explore: Compare the same prompts through the hosted chat widget
If your instructor provides an OpenAI-compatible endpoint, you can compare the same prompts through the embedded chat tool below:
- Paste the lab endpoint and API key into the settings row
- Switch between `Q8_0`, `Q4_K_M`, and `UD-IQ2_M`
- Re-run the same prompt so you can compare coherence, stability, and SVG output
- Try a visual prompt such as `Draw a pelican riding a bicycle.`
The widget keeps the transcript in your browser so you can switch models without losing your place. Refresh the page to clear the chat history.
<div data-objective5-chat></div>
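If you prefer to compare from the command line instead of the embedded widget, one option is to run the same prompt through each local GGUF with `llama-cli`. This is a hedged sketch, assuming you have downloaded all three files into the current directory and have `llama-cli` on your path:

```bash
# Run an identical prompt against each quantization and compare the outputs by eye
for quant in UD-IQ2_M Q4_K_M Q8_0; do
  echo "=== ${quant} ==="
  llama-cli -m "gemma-4-E2B-it-${quant}.gguf" \
    -p "Explain quantization to a new analyst in two sentences." \
    -n 128
done
```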
## Objective 6: Reflect on the Tradeoff
By this point, you should have:
- Compared three quantized versions of the same model
- Measured the storage savings directly
- Verified that the core model metadata remains largely the same
- Observed where output quality begins to degrade
The important takeaway is not that one quant is always "best." The important takeaway is that quantization is a deployment decision. The right choice depends on your hardware limits, acceptable quality loss, and the task you need the model to perform.
## Conclusion
This lab isolates quantization as the main variable. By downloading **Gemma 4 E2B Instruct** in `UD-IQ2_M`, `Q4_K_M`, and `Q8_0`, you can directly observe one of the most important tradeoffs in local inference: balancing model quality against disk usage and resource constraints.
@@ -0,0 +1,367 @@
---
order: 3
title: Lab 3 - LLaMa.cpp and Ollama Workflows
description: Convert a Hugging Face checkpoint to GGUF, run it in llama.cpp, and load it into Ollama.
---
<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
# Lab 3 - LLaMa.cpp and Ollama Workflows
In this lab, we will:
- Download a model from Hugging Face
- Convert a model to GGUF for `llama.cpp`
- Run a model directly in `llama.cpp`
- Download a model from Ollama.com
- Import a custom `.gguf` model into Ollama
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on investigation and comparison.<br />
<strong>Execute</strong> sections require running commands and producing output.
</div>
To start this lab, you'll need CLI access:
- SSH - <IP>:22
- All necessary artifacts are in the `lab3` folder
## Objective 1: HuggingFace & LLaMa.cpp
### 1. What Is LLaMa.cpp?
LLaMa.cpp is an open-source project created to enable efficient running of Meta's LLaMA (Large Language Model Meta AI) family of large language models on consumer-grade hardware. It was initially developed by **Georgi Gerganov** in early March 2023, shortly after Meta released the weights of the LLaMA models to approved researchers.
The project's original goal was to make LLaMA models accessible on systems without powerful GPUs, including laptops, desktops, and even mobile devices. **LLaMa.cpp** achieves this by implementing the LLaMA inference in pure C/C++ and introducing highly efficient quantization techniques, allowing models to run with drastically reduced memory requirements. **LLaMa.cpp** is also the underlying project behind a number of inference wrappers and technologies, such as Llamafile, LM Studio, and Ollama, amongst many others.
### Key Features
| Capability | Why it matters |
| ------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| **Efficient local inference** | Runs large language models without a powerful GPU. |
| **Quantization tools** (`llama-quantize`) | Shrinks model size (down to 1-bit) while preserving usable performance. |
| **Model conversion to .GGUF** | Provides a compact, fast-loading format that works with Ollama, LM Studio, and other wrappers. |
| **Cross-platform support** | Works on Linux, macOS, Windows, Apple Silicon, and ARM devices. |
| **CLI and debugging utilities** (`llama-cli`, `gguf-dump.py`) | Enables quick interactive testing and inspection of model metadata. |
| **Perplexity measurement** (`llama-perplexity`) | Quantifies how confident the model is about its predictions. |
| **Active community** | Powers tools such as LM Studio, Llamafile, and Ollama. |
---
## 1.2 Explore: HuggingFace - Model Cards
[HuggingFace](https://huggingface.com) is the “GitHub” for LLMs, datasets, and more. The following steps walk you through locating Meta's **LLaMA 3.2 1B** model card and its files.
1. **Open the LLaMA 3.2 1B page**
<https://huggingface.co/meta-llama/Llama-3.2-1B>
<br>
2. **Read the model card** - note the description, license, tags (e.g., _Text Generation_, _SafeTensors_, _PyTorch_), and links to fine-tunes/quantizations.
<br>
3. **Navigate to “Quantizations.”**
This tab lists community-created quantizations, including GGUF, GPTQ, AWQ, and EXL3 versions. Common providers include **Bartowski**, **Unsloth**, and **NousResearch**, although these players change periodically. Additionally, note that we can often download quantized versions _without_ having agreed to the Meta license restrictions for the original model.
<figure style="text-align:center;">
<a href="https://i.imgur.com/Po0Ll3o.png" target="_blank">
<img src="https://i.imgur.com/Po0Ll3o.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Model Card Quantizations Convenience Link</figcaption>
</figure>
<br>
<figure style="text-align:center;">
<a href="https://i.imgur.com/NM1rbXV.png" target="_blank">
<img src="https://i.imgur.com/NM1rbXV.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Model Quantization Options</figcaption>
</figure>
4. **Open “Files and versions.”**
Here you see the raw `.safetensors` files (the unquantized checkpoint). For the model to successfully run, the full set of files needs to be loaded into system memory. Note how this 1B-parameter model is small enough to fit comfortably in a phone's memory, even raw.
<figure style="text-align:center;">
<a href="https://i.imgur.com/6I9zkeu.png" target="_blank">
<img src="https://i.imgur.com/6I9zkeu.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Distribution Restriction</figcaption>
</figure>
Unless you've accepted Meta's EULA for this model, you'll be unable to download the model directly from Meta. This view may or may not appear based on your own HuggingFace account.
## 1.3 Explore: HuggingFace - Find and Download WhiteRabbitNeo
For this lab we will work with **WhiteRabbitNeo-V3-7B**, a cybersecurity-oriented fine-tune of Qwen2.5-Coder-7B. This model is less popular than LLaMA-3.2, and if we'd like to run it in `llama.cpp` or Ollama, we first need to convert it into a usable GGUF artifact.
<div class="lab-callout lab-callout--warning">
<strong>Warning:</strong> Although the next two steps show how to find and download this model so you can replicate the process, support files are already provided in <code>/home/student/lab3/WhiteRabbitNeo</code> to speed up lab execution.
</div>
### 1. Locate & download the model
1. Go to <https://huggingface.co/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B>.
2. Points of Interest on this model card:
1. This model appears to be a fine-tune of **Qwen2.5-Coder-7B**
2. This model is openly licensed, and does not have any restrictions on downloading and using it for our purposes.
3. This model is in **Safetensors** format, which is compatible with **LLaMa.cpp**'s quantization tools.
<figure style="text-align:center;">
<a href="https://i.imgur.com/9GrHRuh.png" target="_blank">
<img src="https://i.imgur.com/9GrHRuh.png" width="800" style="border:5px solid black;">
</a>
<figcaption>WhiteRabbitNeo model card.</figcaption>
</figure>
3. Click **Files and versions** → review the `.safetensors` checkpoints (≈ 15 GB at **FP16**).
<figure style="text-align:center;">
<a href="https://i.imgur.com/Emx97nL.png" target="_blank">
<img src="https://i.imgur.com/Emx97nL.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Model safetensors (size ≈ 15 GB).</figcaption>
</figure>
### 2 Download the Model
To prepare this model, create a working folder anywhere you like on your system. Once created, perform the following:
1. Ensure you have git & git-lfs installed to enable successful cloning from HuggingFace. If necessary, git can be installed on Debian-based distributions via:
```bash
sudo apt install git git-lfs
git lfs install
```
2. Clone the model:
```bash
git clone https://huggingface.co/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B
```
### 3 Execute: Convert the Downloaded Model
**LLaMa.cpp** makes it easy to convert models downloaded in SafeTensors format to GGUF. We can convert the model with the following official project script command:
```bash
convert_hf_to_gguf.py /home/student/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B/WhiteRabbitNeo-V3-7B --outfile /home/student/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```
### 4 Execute: Review Model Metadata
When these steps have completed, you should see a new WhiteRabbitNeo-V3-7B.gguf file. We have not yet quantized the model, merely converted it to a format usable by **LLaMa.cpp** for the next steps. We can tell if this process was successful by using the included **gguf-dump.py** script that is packaged with **LLaMa.cpp**.
Run the following command:
```bash
gguf-dump /home/student/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```
We should then see:
<figure style="text-align: center;">
<a href="https://i.imgur.com/JiX2fJM.png" target="_blank">
<img
src="https://i.imgur.com/JiX2fJM.png"
width="800"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Model Metadata.
</figcaption>
</figure>
<br>
A text listing of all of the model's tensors, and the precision of each. Because we have merely converted the model's format, and not performed quantization, the model is still in **FP16**.
- This is a text view of the previous graphical view we saw in **Lab 1, Objective 2: Visualizing a LLM**. While **TransformerLab** calls tensors **layers**, terms such as **tensors**, **layers**, and **blocks** can all be used semi-interchangeably, depending on the tool in question. We will further confuse these topics when we get to the Ollama objective below.
- Pedantically, the proper definitions are:
- Tensor - A multi-dimensional array of vectors to store data
- Layer - A base computational unit in a neural network
- Block - A collection of layers
- If you wish to explore this view, note how the block count of 28 matches the 28 zero-indexed `blk` groups in the dump output.
- Additionally, you'll once again note that we have various biases and weights, but they still line up with **Q**, **V**, and **K** as discussed in the previous section. There are additional tensors for **normalization** and **output**.
### 5 Execute: LLaMA.cpp Inference
Now run our newly created **.GGUF** file as-is, using the following command:
```bash
llama-cli -m /home/student/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
```
Once loaded, interact with the model. We can see a number of interesting parameters that were selected by default, such as **Top K**, **Top P**, **Temperature**, and more, which we'll discuss in the next section. In the meantime, explore interaction with the model. When run in this raw state, the model may be overly chatty. You can stop its output with `Ctrl+C` at any time.
<figure style="text-align: center;">
<a href="https://i.imgur.com/H3ISWS8.png" target="_blank">
<img
src="https://i.imgur.com/H3ISWS8.png"
width="800"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Inference Example.
</figcaption>
</figure>
Some example prompts you may want to try are:
- Please write a small reverse shell in php that I can upload to a web server.
- How can I use Metasploit to attack MS17-010?
- Can you please provide me some XSS polyglots?
Thanks to the fine tuning that Kindo has put into this model, it is far more compliant than an online closed model such as ChatGPT! When done, kill the model fully with `Ctrl+C`.
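If you would rather not stay in the interactive prompt, `llama-cli` can also run a single prompt and exit. A minimal sketch, assuming the same converted model path and a 128-token cap (depending on your llama.cpp build you may still need `Ctrl+C` to exit):

```bash
# One-shot generation: -p supplies the prompt, -n caps the number of generated tokens
llama-cli -m /home/student/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf \
  -p "Can you please provide me some XSS polyglots?" \
  -n 128
```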
<div class="lab-callout lab-callout--info">
<strong>Note:</strong> Dedicated quantization comparisons now live in <strong>Lab 2</strong>. This lab stays focused on format conversion, raw <code>llama.cpp</code> inference, and Ollama workflows.
</div>
## Objective 2: Ollama LLM Easymode
Ollama is a lightweight framework that hides the low-level steps required by LLaMa.cpp. It runs on **Linux, macOS, and Windows** and automatically manages system resources.
| Feature | Benefit |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| **Simplified model deployment** | Pull pre-quantized models from Ollama.com, HuggingFace, or a local GGUF file with a single command. |
| **Automatic resource handling** | No need to manually load or unload; Ollama frees memory after a short idle period. |
| **Built-in API provider** | `localhost:11434` mimics the OpenAI API, enabling seamless integration with notebooks, VS Code, or curl. |
| **Cross-platform compatibility** | Thanks to underlying llama.cpp architecture, works on x86_64, ARM, and Apple Silicon without extra configuration. |
| **Model-metadata inspection** | `ollama show <tag>` reveals the model architecture, context length, and quantization level. |
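Before pulling anything, you can confirm the API server is up and see which models are already registered on the lab VM. A minimal sketch, assuming Ollama is running locally on its default port:

```bash
# List models known to the local Ollama instance via its native API
curl -s http://localhost:11434/api/tags
```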
### 1 Execute: Pull and Run a Pre-Built Model from Ollama.com
Let's start by downloading Meta's Llama 3.2 3B, the "big" brother of the small model we've continuously worked with so far. The Ollama project and community have made this exceptionally easy for us to accomplish.
1. **Open the Ollama registry** - visit <https://ollama.com> in your browser.
2. **Search for the model**
<figure style="text-align: center;">
<a href="https://i.imgur.com/VBvOGty.png" target="_blank">
<img
src="https://i.imgur.com/VBvOGty.png"
style="width: 800; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Ollama Search.
</figcaption>
</figure>
<br>
3. **Copy the `ollama run` command** that appears in the top-right corner of the model card.
4. **Paste the command into your terminal** and press **Enter**:
```bash
ollama run llama3.2
```
<figure style="text-align: center;">
<a href="https://i.imgur.com/ammtbmI.png" target="_blank">
<img
src="https://i.imgur.com/ammtbmI.png"
style="width: 800; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Ollama Run command.
</figcaption>
</figure>
<br>
### 2 Explore: Interacting with Ollama Inference
When the pull finishes, you will be presented with a prompt similar to the `llama-cli` one. No separate download, conversion, or quantization needed! Feel free to interact with this model until you're ready to move on.
<figure style="text-align: center;">
<a href="https://i.imgur.com/XZ6OYNI.png" target="_blank">
<img
src="https://i.imgur.com/XZ6OYNI.png"
style="width: 800; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Ollama Inference.
</figcaption>
</figure>
<br>
### 3 Execute: Pull and Run a Pre-Built Model from HuggingFace.com
Similarly, we can pull a model directly from **HuggingFace**. As long as the source file is a .gguf at any quantization level that fits within our system memory, Ollama can fetch it directly.
1. **Select a pre-quantized GGUF model** - visit [CodeIsAbstract](https://huggingface.co/CodeIsAbstract/Llama-3.2-1B-Q8_0-GGUF) in your browser.
2. **Use this model** - Click **Use this model** → choose the **Ollama** tab. The page displays a ready-to-run command:
<figure style="text-align: center;">
<a href="https://i.imgur.com/lg2INAs.png" target="_blank">
<img
src="https://i.imgur.com/lg2INAs.png"
style="width: 800; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
HuggingFace Direct Ollama Pull.
</figcaption>
</figure>
<br>
3. **Copy the command** and execute it in your terminal.
```bash
ollama run hf.co/CodeIsAbstract/Llama-3.2-1B-Q8_0-GGUF:Q8
```
4. **Explore:** Interact with the model as normal.
### 4 Execute: Load a Custom `.gguf` Model
We can also import our WhiteRabbitNeo **.GGUF** model into Ollama without having to upload it to **HuggingFace** first. To do so, however, we need to create a **Modelfile**, a plain-text file (similar in spirit to a Dockerfile) that tells **Ollama** where the **.GGUF** is located, as well as any additional defaults we'd like Ollama to apply when performing inference; a slightly richer Modelfile example follows the steps below.
1. **Create a simple Modelfile** - this tells Ollama where the model lives.
```bash
echo "FROM /home/student/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf" > Modelfile
```
2. **Register the model with Ollama**
```bash
ollama create WhiteRabbitNeo -f Modelfile
```
3. **Run the newly registered model**
```bash
ollama run WhiteRabbitNeo
```
4. **Explore:** The model is now stored locally under the tag _WhiteRabbitNeo_ and can be invoked just as any other model.
<figure style="text-align: center;">
<a href="https://i.imgur.com/ijsAl6m.png" target="_blank">
<img
src="https://i.imgur.com/ijsAl6m.png"
style="width: 800; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Importing WhiteRabbitNeo V3.
</figcaption>
</figure>
<br>
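As promised above, here is a hedged sketch of a slightly richer Modelfile. The `PARAMETER` and `SYSTEM` directives are standard Modelfile syntax; the specific values and system prompt are illustrative only, not recommendations.

```bash
# Write a richer Modelfile: the GGUF path from step 1 plus a few inference
# defaults that Ollama will apply whenever this tag is run.
cat > Modelfile <<'EOF'
FROM /home/student/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a concise security assistant."
EOF

# Register it under a new tag so the original import stays untouched.
ollama create WhiteRabbitNeo-tuned -f Modelfile
ollama run WhiteRabbitNeo-tuned
```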
---
#### Additional Useful Ollama Commands
| Command | Description |
| ------------------------------- | --------------------------------------------------------------------------------------------- |
| `ollama list` | Shows all models currently registered with Ollama. |
| `ollama rm <tag>` | Deletes the specified model (freeing disk space). |
| `ollama show <tag>` | Prints model metadata (architecture, context length, quantization). |
| `ollama show <tag> --modelfile` | Prints an existing model's modelfile. Often useful for templating our own. |
| `ollama serve` | Starts the OpenAI-compatible API server (runs automatically when you first use `ollama run`). |
---
## Conclusion
Ollama bridges the gap between low-level llama.cpp tools and high-level usability, making it an ideal choice for rapid deployment and educational labs. By leveraging its API, model registry, and automation features, you can focus on experimentation rather than infrastructure. Quantization trade-offs still matter, but they now have a dedicated home in Lab 2, so this lab can stay centered on conversion and deployment workflows.
<br>
---
@@ -1,355 +1,363 @@
---
order: 4
title: Lab 4 - Open WebUI and Prompting
description: Use Open WebUI to run local models and experiment with prompting and inference parameters.
---
<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
# Lab 4 - Open WebUI & Prompting
In this lab, we will:
- Run Open WebUI
- Use an Ollama Model within Open WebUI
- Experiment with Inference Parameters
- Experiment with Prompting Techniques
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on investigation and comparison.<br />
<strong>Execute</strong> sections require running steps and validating output.
</div>
To start this lab, one web service has been preconfigured:
- Open WebUI - http://<IP>:8080
## Objective 1 Execute: Accessing Open WebUI
Your lab machine has been pre-installed with Open WebUI. It is accessible on your provided system IP at port 8080 (http://<IP>:8080). You can log in or register with the following default credentials:
Username: student@openwebui.com
Password: student
<figure style="text-align: center;">
<a href="https://i.imgur.com/nwk73eW.png" target="_blank">
<img
src="https://i.imgur.com/nwk73eW.png"
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Initial Registration
</figcaption>
</figure>
<br>
## Objective 2 Execute: Downloading Our First Model through Open WebUI (OUI)
In this objective we will locate, pull, and run **Qwen3.5 4B** using **Open WebUI**. By default, Open WebUI comes pre-configured to talk to a local install of Ollama, a legacy of the project's origins (it was originally released as Ollama-WebUI). By the end of this section you should be able to start a model with a single click and generate a response in the UI.
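As an optional aside: if the model list in Open WebUI ever comes up empty, one quick sanity check (assuming the default local Ollama backend) is to ask Ollama directly which models it has:

```bash
# List the models Ollama currently has available on its default port.
curl http://localhost:11434/api/tags
```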
### Execute: Download Qwen 3.5 4B
1. **Open the Ollama model registry**
- Go to <https://ollama.com> in your web browser.
- Locate the search box at the top of the page.
<figure style="text-align:center;">
<a href="https://i.imgur.com/btkT9IH.png" target="_blank">
<img src="https://i.imgur.com/btkT9IH.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
</a>
<figcaption>Ollama homepage - use the search bar to look for “Qwen3.5”.</figcaption>
</figure>
2. **Find the Qwen 3.5 family**
- Type **`Qwen 3.5`** and press **Enter**.
- The results page lists several parameter sizes (1B → 27B).
3. **Navigate to the list of tags**
- Click the **`Tags`** link beneath the model description.
<figure style="text-align:center;">
<a href="https://i.imgur.com/TuUbK7O.png" target="_blank">
<img src="https://i.imgur.com/TuUbK7O.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
</a>
<figcaption>Tag view - each entry shows the model size and a short description.</figcaption>
</figure>
4. **Select the 4B variant**
- Locate **`Qwen3.5:4b`** in the table.
- The size column reads **`3.4GB`**, indicating roughly the VRAM required for inference.
<figure style="text-align:center;">
<a href="https://i.imgur.com/eaRaqnq.png" target="_blank">
<img src="https://i.imgur.com/eaRaqnq.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
</a>
<figcaption>Model size for `Qwen3.5:4b` (≈ 3.4GB VRAM).</figcaption>
</figure>
5. **Copy the model tag**
- Click the **copy-to-clipboard** icon next to the tag (or highlight the text and press **Ctrl+C**).
6. **Open the OpenWebUI interface**
- In a new browser tab, navigate to the URL where your OpenWebUI instance is running (e.g., `http://localhost:8080`).
7. **Pull the model through the UI**
- In the **“Select a model”** dropdown, paste the copied tag into the text field.
- Click **`Pull`**. The UI will display a progress bar while Ollama downloads the GGUF file.
<figure style="text-align:center;">
<a href="https://i.imgur.com/Sf8sSs3.png" target="_blank">
<img src="https://i.imgur.com/Sf8sSs3.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
</a>
<figcaption>Open WebUI - paste the tag and press “Pull”.</figcaption>
</figure>
8. **Verify the model works**
- Once the download finishes, type a prompt in the chat window (e.g., “Tell me a short, funny story about a cat that learns to code”).
- Press **Enter** and watch the response appear.
<figure style="text-align:center;">
<a href="https://i.imgur.com/30OMNsk.png" target="_blank">
<img src="https://i.imgur.com/30OMNsk.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
</a>
<figcaption>Successful inference - the model returns a coherent answer.</figcaption>
</figure>
9. **Download Gemma3n e2B**
- While we're downloading models, let us download one more. You can either repeat the process from the previous steps to find and download **Gemma3n e2B**, or just use the following model tag to download the model via the Open WebUI search bar:
```bash
ollama pull gemma3n:e2b
```
Google designed the Gemma 3n models for efficient execution on resource-constrained devices such as laptops, tablets, phones, or Nvidia 2080 Super GPUs.
---
## Objective 3: Inference Settings
### Explore: OUI Inference Parameter Valves
Prior to this lab, we discussed inference settings such as Top K, Top P, and Temperature. Let's quickly review the most common settings to customize:
- `Context Length` - The number of tokens the model is allowed to keep in active memory
- `Temperature` - Scales how likely low-probability tokens are to be generated
- `Top K` - Limits token selection during inference to the `K` most likely candidates
- `Top P` - Limits token selection to the smallest set of tokens whose cumulative probability exceeds `P` (nucleus sampling)
Open WebUI allows us to easily modify these parameters on the fly through the chat controls, found on the right-hand side next to your user icon.
<figure style="text-align: center;">
<a href="https://i.imgur.com/Tp4LqGs.png" target="_blank">
<img
src="https://i.imgur.com/Tp4LqGs.png"
style="width: 600px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Chat Controls
</figcaption>
</figure>
<br>
By default, Open WebUI selects the following generically sound options, with the expectation that users have access to modest hardware:
- `Context Length` - 2048
- `Temperature` - .8
- `Top K` - 40
- `Top P` - .9
While we won't play with `Context Length`, this parameter is critical for successfully accomplishing more complicated tasks using local models. With only the small default context length value, the model will quickly forget your instructions and interactions, rendering the results the model generates less useful. Unfortunately, just increasing this value is not always an option, as your selected model + `Context Length` must fit within your available memory. As with many challenges in AI, a key to solving issues with `Context Length` is often scaling your hardware to meet the demands of the task. This generally means utilizing hardware with larger amounts of VRAM or unified memory either by purchasing it or renting access.
Additionally, these defaults can be overridden by the Ollama model file, which can specify its own "preferred" default hyperparameters. Below are the defaults that ship with the model we've downloaded; alternatively, feel free to interactively explore the `params` page for the model at this link: [qwen3.5:4b-q4_K_M](https://ollama.com/library/qwen3.5:4b-q4_K_M/blobs/9371364b27a5).
<br>
<figure style="text-align: center;">
<a href="https://i.imgur.com/HfnH17e.png" target="_blank">
<img
src="https://i.imgur.com/HfnH17e.png"
style="width: 600px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Modelfile Defaults
</figcaption>
</figure>
<br>
The best model makers often override the generic defaults with their own preferred values, as we've just seen. These Qwen-selected defaults are the values the team found to produce the best outputs for most tasks. You'll generally want to stick with them unless you have a very good reason to change them.
Thankfully, our lab gives us just such a reason! We can manually modify these options with the aforementioned chat controls. Depending on our end goal, we can nudge the model to write more "creatively" or more "precisely" by setting `Temperature`, `Top K`, and `Top P`.
Let's test this with a series of interactions themed around Magic the Gathering. Qwen is considered a multi-modal model, meaning we're not just limited to inputting text! Input the following image, and ask `What is this? What does it do?`
Next, set our inference parameters to the following:
- `Temperature` - 1.1
- `Top K` - 100
- `Top P` - .95
Repeat your first interaction, noting the differences in model output. Less "likely" or common words were hopefully selected!
When satisfied, let's next set our inference parameters to the following:
- `Temperature` - 2
- `Top K` - 400
- `Top P` - .95
This time the model has likely gone off the rails, answering for an extended period and trailing off incoherently. This is because we increased the likelihood of improbable tokens far beyond the thresholds the model makers tuned for. Let's next test the opposite:
- `Temperature` - Default
- `Top K` - 1
- `Top P` - Default
Feel free to continue to explore with other topics or images. Note how each time we restart our conversation, the model gives us the exact same answer. This is because a Top K of 1 limits the model to selecting only the single most likely token at each step! Even with this restriction, however, the model can still occasionally produce different answers due to GPU differences, floating-point non-determinism, or other similarly rare events. Never forget that LLMs are not perfectly deterministic, and even when highly restricted they can output unexpected results.
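For the curious, here is a rough sketch of how these Chat Controls map onto Ollama's API. Open WebUI sets them for us, but an equivalent direct call would pass them in the `options` object (the prompt and values below are only examples):

```bash
# The same sampling knobs, passed straight to Ollama's generate endpoint.
# Values mirror the "creative" settings experimented with above.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:4b",
  "prompt": "Describe a Magic: The Gathering card in one sentence.",
  "stream": false,
  "options": {
    "temperature": 1.1,
    "top_k": 100,
    "top_p": 0.95,
    "num_ctx": 2048
  }
}'
```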
<br>
---
## Objective 4: Prompting Techniques
### Explore: Prompt Engineering & System Prompting
<div class="lab-callout lab-callout--warning">
<strong>Warning:</strong> As you explore chat via Open WebUI, ensure you turn <code>think (Ollama)</code> to OFF. <strong>Qwen3.5 4B</strong> is likely to enter an infinite thinking loop for these tasks otherwise, which will require a VM reboot.
<br><br>
Alternatively, choose to perform these steps with **Gemma3n e2B**, which can handle tight environments more gracefully.
</div>
Next, let's review different ways we can coax a model to perform better without having to perform fine-tuning or parameter customization. We can do this by "priming" the model with our first prompt in a number of ways:
<br>
- Few Shot Prompting - Providing examples of our desired outcome up front
- Meta Prompting - Providing a guide to reach the desired outcome
- Chain of Thought - Providing the model guidance to think through its response
- Self Criticism - Asking the model to play "devil's advocate" against itself
<br>
Each of these tools can be combined to achieve a greater effect. Below is a suggested list of Magic the Gathering game design challenges which we can task Qwen 3.5 with, but each will require either some luck or great prompt engineering. If you have a different topic you're more familiar with, feel free to first use Qwen 3.5 to adapt these challenges to a more familiar theme:
<br>
- Design a black rare creature card that fits thematically and mechanically into a Graveyard Matters Magic the Gathering set. Provide a few existing cards to help give the model a template.
- Design the same card, but this time outline the type, mechanics, tone, and identity
- Invent a new keyword. Have the model reason step by step how the keyword will work within the game
- Review your new keyword for game balance. Have the model challenge its decisions.
<br>
There is one final prompting tool that we have yet to dive into: system prompting. While the `chat controls` menu lets you override the default system prompt per chat, Open WebUI also provides a powerful flow for "creating" new models with saved system prompts and inference parameters. This is a great convenience feature, since changing hyperparameters via Chat Controls for every chat becomes tedious; it is especially useful once you have a system prompt you prefer, or want to set inference parameters once and reuse them many times.
Let's create a new model by selecting the `Workspace` link, and then selecting the `+` button to create a new model:
<figure style="text-align: center;">
<a href="https://i.imgur.com/TjNyWNa.png" target="_blank">
<img
src="https://i.imgur.com/TjNyWNa.png"
style="width: 600px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Create custom model
</figcaption>
</figure>
<br>
In the new model window, we can customize many different options for our model, even beyond the previously used chat specific controls. Create a new model named `Qwen 3.5 LLM Demo` by performing the following steps:
1. Set the name to `Qwen 3.5 LLM Demo`
2. Set the Base Model to `Qwen3.5:4b`
3. Provide a system prompt. You can set this to be any task you'd like the model to focus on, or we can stick with our Magic the Gathering theme. Utilize the following prompt, or for bonus points, have Qwen 3.5 generate one for you.
```text
"You are a creative designer for Magic: The Gathering, tasked with generating new Sliver creature cards. Follow these guidelines to ensure the cards align with the game's mechanics and lore:
Card Outline Structure:
* Name: Give the Sliver a unique name that reflects its abilities or traits (e.g., 'Predatory Sliver', 'Aetherwing Sliver').
* Mana Cost: Assign a mana cost appropriate for the card's power level and complexity. Use standard Magic symbols (e.g., {1}{G}{U}).
* Type Line: Always include 'Creature — Sliver' in the type line.
* Power/Toughness: Set values that balance the card's abilities.
* Abilities: Include one or more keyword abilities, triggered abilities, or static effects. Ensure they synergize with existing Sliver mechanics.
* Flavor Text (optional): Add a short, thematic quote or description to enhance the card's lore.
Sliver Mechanics:
Slivers are a tribe of creatures that share abilities among themselves. Include the phrase 'All Slivers have...' in the ability text to reflect this tribal synergy.
Abilities should be consistent with existing Sliver themes, such as combat enhancements, adaptability, or swarm tactics.
Balance and Creativity:
Ensure the card is balanced for gameplay while introducing innovative mechanics or flavor.
Example:
Name: Swiftwing Sliver
Mana Cost: {2}{W}
Type Line: Creature — Sliver
Power/Toughness: 2/2
Abilities: Flying, All Slivers have flying.
Flavor Text: 'The skies belong to the swift and the bold.'
When provided a name, generate a new Sliver card following this structure."
```
<figure style="text-align: center;">
<a href="https://i.imgur.com/ZtLpw9y.png" target="_blank">
<img
src="https://i.imgur.com/ZtLpw9y.png"
style="width: 600px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
System Prompt Creation
</figcaption>
</figure>
<br>
4. To encourage only the best card generation, expand the `Advanced Params` section and set the following to add creativity:
- `Temperature` - 1.1
- `Top K` - 100
- `Top P` - .95
- `Ollama (Think)` - Off
Note: While we haven't actively discussed them as a part of this lab, as you play with more advanced inference problems, you may also find the following parameters of interest:
- `Max Tokens` - Limit the possible length of a response to the desired number of tokens
- `num_gpu` - Manually override Ollama's built in layer offload determination. Useful for increasing performance on mixed GPU setups.
- `use_mlock` - Manually force Ollama to ensure all model components are kept within active memory. Useful for smaller systems.
<figure style="text-align: center;">
<a href="https://i.imgur.com/9RcJVjK.png" target="_blank">
<img
src="https://i.imgur.com/9RcJVjK.png"
style="width: 600px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Custom Parameters
</figcaption>
</figure>
<br>
5. When done, hit save. We can now test creating new Sliver cards! Select our newly created model from the chat drop-down, and try inventing a few names.
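Open WebUI handles all of this for us, but conceptually the saved Workspace model boils down to attaching the system prompt and sampling options to every request. Below is a hedged sketch of the equivalent direct Ollama call (system prompt abbreviated; the card name is invented purely for illustration):

```bash
# Roughly what the custom Workspace model does on each request: a system
# message plus our chosen sampling options, sent to the base model.
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:4b",
  "stream": false,
  "messages": [
    {"role": "system", "content": "You are a creative designer for Magic: The Gathering, tasked with generating new Sliver creature cards..."},
    {"role": "user", "content": "Generate a card named Emberhive Sliver."}
  ],
  "options": {"temperature": 1.1, "top_k": 100, "top_p": 0.95}
}'
```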
<br>
---
## Conclusion
Throughout this lab, we've explored the fascinating world of Open WebUI and prompt engineering. Let's summarize the key topics we've covered:
1. **Model Selection and Management**: We explored how to download and manage models like Qwen 3.5, understanding their resource requirements and capabilities. This taught us about the practical considerations of working with different model sizes.
2. **Inference Parameters**: We experimented with critical inference parameters including:
- Temperature: Controls randomness in output
- Top K: Limits token selection to top K most likely options
- Top P: Uses nucleus sampling based on cumulative probability
3. **Prompting Techniques**: We examined various prompting strategies:
- Few Shot Prompting: Providing examples of desired outputs
- Meta Prompting: Giving guidance to reach outcomes
- Chain of Thought: Encouraging step-by-step reasoning
- Self Criticism: Having the model evaluate its own responses
4. **System Prompting**: We created custom models with specific system prompts and parameter settings, learning how to tailor LLM behavior for specialized tasks.
These concepts are foundational for effectively working with large language models in real-world applications. Remember that prompt engineering is both an art and a science - it requires understanding both the capabilities of the model and the nuances of human language. As you continue your journey with LLMs, don't hesitate to experiment with different approaches and parameters to find what works best for your specific use cases.
@@ -1,475 +0,0 @@
<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
# Lab 5 - Dataset Generation and Fine Tuning
In this lab, we will:
* Explore public datasets
* Generate a dataset with Kiln.ai
* Fine-tune Gemma3 with Unsloth Studio
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on understanding dataset choices and trade-offs.<br />
<strong>Execute</strong> sections focus on building, reviewing, and preparing data for fine-tuning workflows.
</div>
To start this lab, one web service has been preconfigured:
* Unsloth - http://<IP>:8888
You'll need to install Kiln from the following URL - https://github.com/Kiln-AI/Kiln/releases/tag/v0.18.1
## Objective 1 Explore: Public Datasets
While fine-tunes may not have the same level of impact as in the early days of LLMs, they can still provide hyper-specialized capabilities that let small LLMs, such as those we've used throughout the course, compete with large, closed LLMs such as ChatGPT and Gemini. This is especially true for use cases where data needs to stay private, where the costs of a closed model are too high, or where we want a model focused on a specific RAG dataset.
There are multiple ways to generate a useful dataset, including but not limited to:
| # | Method | Typical use case | Key advantage |
|---|--------|------------------|----------------|
| 1 | **Manual data collection** | Surveys, interviews, domain-expert annotation | Highest specificity; fully controlled quality |
| 2 | **Web scraping** | Harvesting public articles, forum posts, code snippets | Scalable; leverages existing web content |
| 3 | **APIs & databases** | Accessing structured resources (e.g., Wikipedia API, PubMed) | Structured data; often well-documented |
| 4 | **Crowdsourcing** | Large-scale labeling (e.g., image bounding boxes) | Cost-effective for repetitive tasks |
| 5 | **Data augmentation** | Expanding a small set of images or text | Improves diversity without new collection |
| 6 | **Public datasets** | Ready-made corpora from repositories like HuggingFace | Immediate availability; often preprocessed |
| 7 | **Synthetic data generation** | Simulated sensor readings, procedurally generated text | Useful when real data is scarce or sensitive |
Let's at least quickly touch on option 6, **Public Datasets**. While they may vary in quality, they're a great way to jump-start a particular focus for a fine-tune. Many are found on https://huggingface.co/datasets, where over 400k datasets are readily accessible for many different tasks, from many different providers, including [OpenAI](https://huggingface.co/datasets/openai/gsm8k), [Nvidia](https://huggingface.co/datasets/nvidia/Nemotron-CrossThink), and more. Much like with models, there are numerous tools we can use to filter these datasets, such as by format, modality, or license.
<figure style="text-align: center;">
<img
src="https://i.imgur.com/kdnBCyL.png"
width="600"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Example Datasets.
</figcaption>
</figure>
#### Explore a dataset (GSM8K)
Navigate to [GSM8K](https://huggingface.co/datasets/openai/gsm8k). Much like how models have **model cards**, datasets have **dataset cards**. These perform a similar job, providing:
1. Tags
2. Example data & a *Data Studio* button for interacting with the dataset on **HuggingFace** directly.
3. Easy Download Links (although we can also use `git clone`)
4. The Description
<figure style="text-align: center;">
<img
src="https://i.imgur.com/Y55FAPV.png"
width="600"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Dataset Model Card Contents.
</figcaption>
</figure>
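As item 3 of the list above hints, each dataset card is backed by an ordinary git repository, so a dataset can also be pulled locally (large data files are stored with Git LFS, so expect pointer files unless git-lfs is installed):

```bash
# Clone the GSM8K dataset repository from Hugging Face.
git clone https://huggingface.co/datasets/openai/gsm8k
```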
At the heart of each dataset is the pairing of *input* and *result*. In the case of math, this is relatively easy, as these are quite literally *question* and *answer* pairs for math problems.
Larger datasets, such as [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), utilize more complicated structures, but all still fundamentally follow this same principle. In the case of [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), the inputs are titles and summaries of web pages, with links to the precise web page as scraped from the internet.
<div class="lab-callout lab-callout--info">
<strong>Explore:</strong> Open the <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/viewer/sample-10BT/train" target="_blank" rel="noreferrer">Fineweb sample viewer</a> in a new tab and inspect a subset of this <strong>15 trillion token</strong> dataset directly on Hugging Face.
</div>
#### Open-weight vs. open-source
One last note on public datasets. A common misconception is that *open weight* models are **open source**.
<br>
- *Open-weight* models (e.g., Gemma, DeepSeek-R1, Qwen) provide publicly released checkpoints but **do not** include permissive source-code licenses.
- True **open-source** LLMs remain rare; there are very few models that freely share their dataset and training pipeline. Examples are **INTELLECT-2**, which was built via a distributed "SETI@Home-style" effort, or Nvidia's **Nemotron 3** family of models.
<br>
Unfortunately, **INTELLECT-2** does not compare favorably to existing *open-weight* models such as **Gemma**, **DeepSeek R1**, **Qwen**, or other bleeding-edge models. **Nemotron 3** also trails the state-of-the-art (SOTA) models, but instead serves as a showcase of how anyone can train models using Nvidia hardware.
Regardless of model type, though, when using any *open-weight* model for corporate purposes, review the license for allowed use!
<br>
---
## Objective 2: Synthetic Dataset Generation
If you can, I strongly encourage you to find ready-made, or easily massaged, datasets that do not require synthetic data. You'll often obtain better results with less effort this way. After all, the original frontier ChatGPT family of models simply scraped the entire internet, books, scientific papers, and other "pre-made" raw data to help build their first dataset. However, this is often unrealistic, as at minimum we need **1000** input-output pairs in order to begin fine-tuning, so...
### Why Use Synthetic Data?
| Reason | Explanation |
|--------|-------------|
| **Data scarcity** | Niche domains (e.g., MITRE ATT&CK classification) often lack ≥ 1000 labeled examples. |
| **Scalability** | A single large model can produce thousands of examples in minutes, saving manual effort. |
| **Quality control** | By generating with a *larger* model than the target (e.g., Gemma 12B QAT → Gemma 4B), you can distill richer responses within specific domains. |
| **Iterative refinement** | Kiln lets you rate or repair each pair, turning noisy outputs into a clean training set. |
<div class="lab-callout lab-callout--warning">
<strong>Rule of Thumb:</strong> Never generate data with a model that is smaller than the model you plan to fine-tune.
</div>
---
### 1. Execute: Install & Launch Kiln AI
If you haven't yet, download [Kiln AI](https://github.com/Kiln-AI/Kiln/releases/tag/v0.18.1) and run the installer for your OS.
<div class="lab-callout lab-callout--info">
<strong>Tip:</strong> These steps were designed for <strong>Kiln v0.18</strong>. While compatible with newer versions, v0.18 features a polished, simplified UI ideal for this lab. Note that Kiln undergoes active development with frequent UI changes across versions.
</div>
1. **Open Kiln**. It should automatically go to `http://localhost:3000` in your machine's browser.
2. Click **`Get Started`**.
<figure style="text-align:center;">
<img src="https://i.imgur.com/hJNehuE.png" width="400"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Welcome screen - click "Get Started".</figcaption>
</figure>
3. Choose **`Continue`** (or **`Skip Tour`** if you prefer).
4. Dismiss the newsletter prompt (optional).
Kiln is now ready for configuration.
### 2. Connect Kiln to Ollama
1. In Kiln's left-hand **Providers** panel, click **`Connect`** under the Ollama entry.
<div class="lab-callout lab-callout--warning">
Use your Ollama instance IP to connect (i.e., http://<STUDENT IP>:11434). You must be connected to the VPN for this to work.
</div>
<figure style="text-align:center;">
<img src="https://i.imgur.com/vEwUszl.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Connect to a local or remote Ollama instance.</figcaption>
</figure>
2. Click **`Continue`** to confirm the connection.
<div class="lab-callout lab-callout--info">
<strong>Tip:</strong> If you have access to a commercial LLM (for example, OpenAI GPT-4o), you can point Kiln to that endpoint for higher-quality synthetic data by replacing the Ollama URL in <strong>Providers → Connect</strong>.
</div>
---
### 3. Create a Kiln Project
1. Kiln will prompt you to **Create a Project**. Enter any descriptive name (e.g., `MITREATTACKFineTune`).
<figure style="text-align:center;">
<img src="https://i.imgur.com/8CLEp9s.png" width="400"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Name your project.</figcaption>
</figure>
2. Press **`Create`**. You are now inside the project workspace.
---
### 4. Define the FineTuning Task
1. Click **`Add Task`** and fill out the form with the details below.
* **Task name:** `ATT&CK Classification`
* **Goal:** "Given a description of an attack technique, tactic, or procedure, return only an accurate MITRE ATT&CK ID and Name in the format: "ID# - Technique". "
* **System prompt (autofilled):** Kiln will prepend this text to every generation request.
<figure style="text-align:center;">
<img src="https://i.imgur.com/43o2s0Y.png" width="400"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Task definition screen.</figcaption>
</figure>
2. Click **`Save Task`**. The task now appears in the lefthand **Tasks** list.
---
### 5. Kiln Main Interface Overview
| Sidebar item | Primary use |
|--------------|------------|
| **Run** | Manually generate one input-output pair at a time (useful for quick checks). |
| **Dataset** | View, edit, export, or import the entire collection of pairs. |
| **Synthetic Data** | Bulk-generate pairs using a model of your choice. |
| **Evals** | Run automatic evaluation against a held-out test set. |
| **Settings** | Project-level configuration (e.g., default model, output format). |
When you first open a project, Kiln lands on the **Run** page.
---
### 6. Manual Generation (Run Page)
1. In the **Run** view, set the parameters as shown below (you may substitute a larger model if your hardware permits).
<figure style="text-align:center;">
<img src="https://i.imgur.com/vvW0wjk.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Configure the Run settings.</figcaption>
</figure>
2. Type a **scenario description** (e.g., "An attacker dumps LSASS memory using Mimikatz") and click **`Run`**.
3. Kiln sends the prompt to the selected Ollama model (by default `gemma3:12b-it-qat`).
4. When the model returns an answer, you can **rate** it from 1 ★ to 5 ★.
*5 ★* → Accept and click **`Next`**.
*< 5 ★* → Click **`Attempt Repair`**, edit the response, then **`Accept Repair`** or **`Reject`**.
<figure style="text-align:center;">
<img src="https://i.imgur.com/wqVsYMk.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Rate a correct response with 5 ★.</figcaption>
</figure>
5. Repeat until you have a handful of high-quality pairs. This manual step is optional but useful for seeding the dataset with "gold-standard" examples.
---
### 7 Bulk Synthetic Data Generation
#### 7.1 Open the Generator
1. In the sidebar, click **`Synthetic Data`** → **`Generate Fine-Tuning Data`**.
<figure style="text-align:center;">
<img src="https://i.imgur.com/l6OiUeP.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Enter the bulk-generation workflow.</figcaption>
</figure>
#### 7.2 Generate Top-Level Topics
1. Click **`Add Topics`**. This will generate top level topics that follow broad MITRE ATT&CK categories.
2. Choose **`Gemma-3n-2B`**.
3. Set **Number of topics** to **8** and click **`Generate`**.
<figure style="text-align:center;">
<img src="https://i.imgur.com/SHh8v0y.png" width="400"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Select model & number of topics.</figcaption>
</figure>
4. Review the generated list. Delete any unsatisfactory topics (hover → click the trash icon) or click **`Add Topics`** again to generate more. Alternatively, if additional depth is required, click **`Add Subtopics`** to drill down deeper into any of the high-level topics created by Gemma initially.
<figure style="text-align:center;">
<img src="https://i.imgur.com/wHNv3Om.png" width="800"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Final set of 8 topics.</figcaption>
</figure>
#### 7.3 Create Input Scenarios for All Topics
1. With the topics selected, click **`Generate Model Inputs`**. Ensure **`Gemma-3n-2B`** is still chosen, and then affirm your selection.
Kiln now asks the model to produce a short *scenario description* for each topic.
2. After the model finishes, review the generated inputs. You may edit any that look off.
#### 7.4 Generate Corresponding Outputs
1. Click **`Save All Model Outputs`**. Kiln now runs the model a second time—this time using each generated input as the prompt—to produce the *output* (the ATT&CK technique label).
<figure style="text-align:center;">
<img src="https://i.imgur.com/A47GRVr.png" width="800"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Produce the "output" side and store the pair.</figcaption>
</figure>
2. The full input-output pairs are automatically added to the project's dataset.
#### 7.5 Review the Completed Dataset
1. Switch to the **`Dataset`** tab.
2. You should see a table of 64 (8 topics × 8 samples) pairs. Clicking any row opens the same **Run** view, where you can **rate**, **repair**, or **delete** the pair.
<figure style="text-align:center;">
<img src="https://i.imgur.com/DnyXYJO.png" width="800"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Dataset overview with generated pairs.</figcaption>
</figure>
---
### 8. Dataset Export (Create a Fine-Tune)
1. Once you are satisfied with the dataset, you can export it to numerous forms of JSONL via the **Fine Tune → Create a Fine Tune** button.
2. Kiln will first ask what format we would like our data exported to. We can leave the default setting of *Download: OpenAI chat format (JSONL)*. Next, select *Create a New Fine-Tuning Dataset*.
3. Kiln supports splitting our generated data into a number of buckets, including *`Training`*, *`Test`*, and *`Validation`*. Each of these dataset segments is critical to a great fine-tune, but with only 64 generated examples we don't have the luxury of creating a split. As such, under **`Advanced Options`**, select *100% training*, and click *Create Dataset*.
<figure style="text-align:center;">
<img src="https://i.imgur.com/vp6jobS.png" width="400"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Creating the fine-tuning dataset split.</figcaption>
</figure>
4. We can ignore all further options, and select *Download Split*. A new .jsonl file will be saved!
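For reference, each line of the exported OpenAI chat format is a single JSON object. The sketch below is illustrative only; the exact field contents depend on your task definition and the Kiln version:

```bash
# Peek at the first record of the exported file (the filename will vary).
head -n 1 *.jsonl
# Roughly: {"messages": [
#   {"role": "system", "content": "...the ATT&CK Classification task prompt..."},
#   {"role": "user", "content": "An attacker dumps LSASS memory using Mimikatz"},
#   {"role": "assistant", "content": "T1003.001 - LSASS Memory"}]}
```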
---
## Objective 3: Fine Tuning with Unsloth Studio
There are many popular options for performing fine tunes, although many have their drawbacks:
* [Unsloth](https://unsloth.ai) is the most popular solution, but it currently does not support multi-GPU setups without a commercial license.
* [Axolotl](https://axolotl.ai) is built on top of Unsloth and does support multi-GPU setups, but it often lags behind Unsloth in features and capability, and does not offer any web UI.
* [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is the most flexible of these options, supporting both Unsloth and Axolotl as well as additional backends. However, this tool is daunting for a beginner approaching fine-tuning, and is best left for later experimentation.
<br>
While I encourage you to explore all of these tools, they are unfortunately out of scope for this lab. Instead, we're going to focus on **Unsloth**, as it provides the best web UI for easily navigating the fine-tuning process.
### Explore: Touring Unsloth Studio
Although Unsloth Studio does its best to simplify the fine tuning process, there are still many dials and knobs to turn! Let's take a brief tour of the most important options:
1. Model Selection - This area allows us to select any model that we're interested in fine tuning. Unsloth Studio will handle downloading the FP16 version of the model from **HuggingFace** for us.
2. Quantization Selection - Without much better hardware, we will usually be training **LoRA**s (Low-Rank Adapters). These will slightly nudge the parameters of the model in the direction we're interested in. If we need additional headroom, we can instead **quantize the base model** (e.g., reduce its precision from 16-bit to 4-bit) and then apply **LoRA** to the quantized model, generating a **QLoRA** (Quantized LoRA). This approach combines the efficiency of quantization with the parameter-efficiency of LoRA. Unsloth will conveniently tell us its estimate for how well a given combination of *Model* & **QLoRA** will fit in our system's available VRAM.
<figure style="text-align: center;">
<img
src="https://i.imgur.com/XwAdaKJ.png"
width="800"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Model & LoRA Type Selections. Note how models are labeled "OOM" or "Tight" based on hardware.
</figcaption>
</figure>
3. Dataset Selection - This is where we can utilize our custom-made dataset. Unfortunately, while we've gone through the process of making a dataset, we had to use a very small model to simulate the process. Conveniently, Unsloth allows us to search for any dataset available publicly on HuggingFace. We can conveniently select `sarahwei/cyber_MITRE_CTI_dataset_v15` for our purposes. You can select "View Dataset" if you'd like to see some of the raw contents of this data.
<figure style="text-align: center;">
<img
src="https://i.imgur.com/8xBdcnd.png"
width="400"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Dataset Selection
</figcaption>
</figure>
4. Train Settings - This is where we can configure exactly how our model will be trained. The majority of these settings can stay at their defaults until you have a specific need that pushes you down the rabbit hole. In particular, we'll be interested in:
* **Learning Rate** - Controls how large an adjustment to the model's weights is made during each step
* **Epochs** - Determines the number of times the training algorithm iterates over the entire dataset (the default of 3 repeats training three times). Critical to help avoid under- or over-fitting.
* **Cutoff length** - Equivalent to Ollama's context. As always, larger context training requires more memory.
* **Batch Size** - Can speed up training, as long as we have the hardware to support it.
* **Warmup Steps** - The number of initial training steps during which the learning rate gradually increases to the set target. Helps with stability.
<figure style="text-align: center;">
<img
src="https://i.imgur.com/fzSvggY.png"
width="400"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Fine Tuning Settings
</figcaption>
</figure>
### Execute: Unsloth Studio Fine Tuning
Set the following before we start to fine tune Gemma:
1. **Model**: `unsloth/gemma-3-270m-it`
2. **Max Steps**: `100` (NOTE: For real fine tuning, use Epochs, not Steps.)
3. **Learning Rate**: `0.00005`
4. **Dataset**: `sarahwei/cyber_MITRE_CTI_dataset_v15`
5. **Warmup Steps**: `100`
* Scroll to the bottom of the page, and click `Preview command`. The WebUI is merely a front end for constructing `llamafactory-cli` commands, and this shows exactly what will be run.
* When done reviewing, click `Start`. It will take some time for Unsloth Studio to start its process, as it first needs to download the full `FP16` model files for the selected base model from HuggingFace.
<figure style="text-align: center;">
<img
src="https://i.imgur.com/fzSvggY.png"
width="400"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Setting Max Steps, Learning Rate, and Warmup Steps
</figcaption>
</figure>
**Monitor the loss graph.** The graph plots **loss** per **training step** (for a full run, roughly 7,500 steps: 2.5k examples × 3 epochs), or put simply, how different the model's predicted answer is from our data. It should slope gradually, roughly logarithmically, downwards if training is stable.
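For reference, here is the quick back-of-the-envelope arithmetic behind that step count; a per-device batch size of 1 and no gradient accumulation are assumptions, and different defaults would scale the number down accordingly.

```python
# Back-of-the-envelope optimizer-step estimate.
# Batch size 1 and no gradient accumulation are assumptions; larger
# effective batches would reduce the step count proportionally.
import math

examples, epochs = 2500, 3
batch_size, grad_accum = 1, 1

steps_per_epoch = math.ceil(examples / (batch_size * grad_accum))
print(steps_per_epoch * epochs)  # 7500 -- roughly the full-run step count
```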
#### What to Look For
- **Training Loss:** Decreasing smoothly → model is learning effectively and training is stable
- **Gradient Norm:** Drops then stabilizes → gradients are well-behaved (no major spikes)
- **Learning Rate:** Gradually increasing, then eventually decreasing → expected warmup behavior helping stable early training
<figure style="text-align: center;">
<img
src="https://i.imgur.com/Cue7afQ.png"
width="600"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Typical Training Run
</figcaption>
</figure>
Unfortunately, due to the time constraints of a live classroom, we'll be unable to pursue this training run to completion. On the lab provided GPUs, a full Epoch could take up to two hours! Feel free to cancel it at your leisure.
We can, however, chat with a version of Gemma 3 4B that was trained before this class. It was trained on roughly 60,000 examples, partially generated using Kiln and partially harvested from various datasets on HuggingFace. While not perfect, we can see that the model is significantly better than the default.
<figure style="text-align: center;">
<img
src="https://i.imgur.com/FKZXaV3.png"
width="600"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Load Model for Chat
</figcaption>
</figure>
To test this ourselves, select:
1. The chat button at the very top of the screen
2. Download our model. It's under my personal HuggingFace account name, `c4ch3c4d3`.
3. Set the system prompt to the one we selected when using **Kiln.ai**: "Given a description of an attack technique, tactic, or procedure, return only an accurate MITRE ATT&CK ID and Name in the format: 'ID# - Technique'."
<figure style="text-align: center;">
<img
src="https://i.imgur.com/GHExjE3.png"
width="600"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Test prompt
</figcaption>
</figure>
| Test Prompt | Expected Output Format |
|------------|------------------------|
| "A malicious actor uses PowerShell to download a file from a remote server." | `T1059.001 PowerShell` |
| "The adversary exfiltrates data via a compressed archive sent over HTTP." | `T1567.001 Exfiltration Over Web Services` |
| "Credential dumping is performed using Mimikatz." | `T1003.001 LSASS Memory` |
The Unsloth chat view is relatively simplistic, but it does provide options for changing inference parameters, such as Top-P or Temperature, as well as a location for us to input our system prompt. If we're looking to test the model's accuracy with our fine tune, we normally need to ensure these values match the desired end-state values as closely as possible.
### Export the Fine-Tuned Model
<div class="lab-callout lab-callout--warning">
<strong>Skippable:</strong> These steps are provided for reference as we never successfully finished a fine tune within the lab time period.
</div>
1. Switch to the **Export** tab.
2. Select the training run of the model you've performed.
3. Select the latest checkpoint, or, if you'd like to explore an alternative, whichever checkpoint you prefer.
4. We can export in a number of formats:
- **Merged Model** - A BF16 `.safetensors` version of the model, which can be utilized in other projects
- **LoRA** - Export only the LoRA adapter layers generated during training. Useful if we wish to share just our new weights with other users who already have the base model downloaded, but not our fine tune.
- **GGUF** - A compact file ready for import into **Ollama** or other GGUF-compatible runtimes.
<br>
---
## Conclusion
In this lab, we completed a LoRA fine-tuning workflow:
1. **Dataset Generation** - We explored public datasets on HuggingFace and used Kiln AI to generate a synthetic dataset for MITRE ATT&CK classification.
2. **Fine Tuning** - We used Unsloth Studio to start a LoRA fine tune of a Gemma 3 model on a MITRE ATT&CK dataset.
3. **Validation & Export** - We tested a previously fine-tuned model with sample prompts and reviewed the options for exporting a fine tune as a merged model, a LoRA adapter, or a GGUF file.
If all has gone well, the model should be much more accurate at identifying MITRE ATT&CK codes from user input scenarios. If not, additional experimentation may be necessary to produce a good fine tune. Playing with the parameters we've discussed, improving and expanding our dataset, or even fine tuning a larger or better base model can all improve our results.
@@ -1,13 +1,20 @@
---
order: 5
title: Lab 5 - Embedding and Chunking
description: Explore chunking strategies and embeddings, then connect them to retrieval workflows.
---
<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
# Lab 4 - Embedding and Chunking
# Lab 5 - Embedding and Chunking
In this lab, we will:
* Explore various chunking strategies
* Explore how embeddings and vectors allow similar concepts to cluster together within n-dimensional spaces
* Connect chunking and embedding concepts to a functional RAG workflow
- Explore various chunking strategies
- Explore how embeddings and vectors allow similar concepts to cluster together within n-dimensional spaces
- Connect chunking and embedding concepts to a functional RAG workflow
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
@@ -17,8 +24,8 @@ In this lab, we will:
To start this lab, two web services have been preconfigured:
* ChunkViz - http://<IP>:3000
* Embedding Atlas - http://<IP>:5055
- ChunkViz - http://<IP>:3000
- Embedding Atlas - http://<IP>:5055
## Objective 1 Explore: Chunking Strategy
@@ -42,8 +49,8 @@ In a web browser, navigate to http://<STUDENT ASSIGNED SYSTEM IP>:3000. Once loa
ChunkViz starts with example text that has already been split using a default character-based strategy. In this view, every 200 characters is treated as a chunk. Modify the sliders to set the following values:
* `Chunk Size` - `256`
* `Chunk Overlap` - `20`
- `Chunk Size` - `256`
- `Chunk Overlap` - `20`
<figure style="text-align: center;">
<a href="https://i.imgur.com/9SDyh7I.png" target="_blank">
@@ -61,11 +68,11 @@ Notice how the colors in the text below dynamically change. Each color represent
Next, explore the major chunking strategies available in ChunkViz:
| Strategy | Description |
|---|---|
| Character Splitter | Splits text into chunks based on a fixed number of characters. |
| Token Splitter | Splits chunks based on tokenization values using **tiktoken**. |
| Sentence Splitter | Splits chunks into rough sizes based on what the tool interprets as a sentence. |
| Strategy | Description |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| Character Splitter | Splits text into chunks based on a fixed number of characters. |
| Token Splitter | Splits chunks based on tokenization values using **tiktoken**. |
| Sentence Splitter | Splits chunks into rough sizes based on what the tool interprets as a sentence. |
| Recursive Character | Splits chunks using multiple separators, such as new lines (`\n`), periods (`.`), commas (`,`), or other language-aware section boundaries. |
Select each option and observe the different ways ChunkViz breaks text into chunks.
@@ -84,7 +91,7 @@ Select each option and observe the different ways ChunkViz breaks text into chun
Each strategy comes with its own benefits and drawbacks. Character-based splitting is often one of the easiest strategies to implement because OCR and text extraction ultimately produce characters. Token-based splitting is useful when keeping chunk sizes consistent for a specific model matters most. Sentence and recursive strategies are often better at preserving complete thoughts, although real-world documents do not always follow clean sentence boundaries.
Explore one more chunking example using a larger document. Open your provided copy of *Blindsight* by Peter Watts in `.txt` format, paste its contents into ChunkViz, and then continue experimenting with chunk sizes from `64` up to `1024` using different strategies. Notice how different chunk sizes and separators change the resulting structure.
Explore one more chunking example using a larger document. Open your provided copy of _Blindsight_ by Peter Watts in `.txt` format, paste its contents into ChunkViz, and then continue experimenting with chunk sizes from `64` up to `1024` using different strategies. Notice how different chunk sizes and separators change the resulting structure.
<figure style="text-align: center;">
<a href="https://i.imgur.com/M51ASNK.png" target="_blank">
@@ -205,10 +212,10 @@ At this point, you have seen the two major stages that make retrieval-augmented
Use what you observed in ChunkViz and Embedding Atlas to reason through the following questions:
* How would a chunk that is too small affect retrieval quality?
* How would a chunk that is too large dilute the meaning of an embedding?
* Why might a semantically similar result appear visually distant on a 2D projection?
* How do chunking strategy and embedding quality work together to improve downstream answers?
- How would a chunk that is too small affect retrieval quality?
- How would a chunk that is too large dilute the meaning of an embedding?
- Why might a semantically similar result appear visually distant on a 2D projection?
- How do chunking strategy and embedding quality work together to improve downstream answers?
This objective is meant to connect the lab tools back to the full RAG workflow. The better your chunking choices and embeddings are, the more useful the retrieved context will be for the model that answers the user.
@@ -0,0 +1,482 @@
---
order: 6
title: Lab 6 - Dataset Generation and Fine Tuning
description: Review dataset options, generate examples with Kiln.ai, and fine-tune a model in Unsloth.
---
<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
# Lab 6 - Dataset Generation and Fine Tuning
In this lab, we will:
- Explore public datasets
- Generate a dataset with Kiln.ai
- Fine-tune Gemma3 with Unsloth Studio
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on understanding dataset choices and trade-offs.<br />
<strong>Execute</strong> sections focus on building, reviewing, and preparing data for fine-tuning workflows.
</div>
To start this lab, one web service has been preconfigured:
- Unsloth - http://<IP>:8888
You'll need to install Kiln from the following URL - https://github.com/Kiln-AI/Kiln/releases/tag/v0.18.1
## Objective 1 Explore: Public Datasets
While fine tunes may not have the same level of impact as in the early days of LLMs, they can still provide hyper-specialized capabilities that let small LLMs, such as those we've used throughout the course, compete with large, closed LLMs such as ChatGPT and Gemini. This matters for use cases where data needs to stay private, where the costs of a closed model are too high, or where we want a model focused on a specific RAG dataset.
There are multiple ways to generate a useful dataset, including but not limited to:
| # | Method | Typical use case | Key advantage |
| --- | ----------------------------- | ------------------------------------------------------------ | ---------------------------------------------- |
| 1 | **Manual data collection** | Surveys, interviews, domain-expert annotation | Highest specificity; fully controlled quality |
| 2 | **Web scraping** | Harvesting public articles, forum posts, code snippets | Scalable; leverages existing web content |
| 3 | **APIs & databases** | Accessing structured resources (e.g., Wikipedia API, PubMed) | Structured data; often well-documented |
| 4 | **Crowdsourcing** | Large-scale labeling (e.g., image bounding boxes) | Cost-effective for repetitive tasks |
| 5 | **Data augmentation** | Expanding a small set of images or text | Improves diversity without new collection |
| 6 | **Public datasets** | Ready-made corpora from repositories like HuggingFace | Immediate availability; often preprocessed |
| 7 | **Synthetic data generation** | Simulated sensor readings, procedurally generated text | Useful when real data is scarce or sensitive |
Let's at least quickly touch on option 6, **Public Datasets**. While they may vary in quality, they're a great way to jumpstart a particular focus for a fine tune. Many are found on https://huggingface.co/datasets, where over 400k datasets are readily accessible for many different tasks, from many different providers, including [OpenAI](https://huggingface.co/datasets/openai/gsm8k), [Nvidia](https://huggingface.co/datasets/nvidia/Nemotron-CrossThink), and more. Much like with models, there are numerous tools we can use to filter these datasets, such as by format, modality, or license.
<figure style="text-align: center;">
<img
src="https://i.imgur.com/kdnBCyL.png"
width="600"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Example Datasets.
</figcaption>
</figure>
#### Explore a dataset (GSM8K)
Navigate to [GSM8K](https://huggingface.co/datasets/openai/gsm8k). Much like how models have **model cards**, datasets have **dataset cards**. These perform a similar job, providing:
1. Tags
2. Example data & a _Data Studio_ button for interacting with the dataset on **HuggingFace** directly.
3. Easy Download Links (although we can also use `git clone`)
4. The Description
<figure style="text-align: center;">
<img
src="https://i.imgur.com/Y55FAPV.png"
width="600"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Dataset Model Card Contents.
</figcaption>
</figure>
At the heart of each dataset is the pairing of _input_ and _result_. In the case of math, this is relatively easy, as these are quite literally _question_ and _answer_ pairs for math problems.
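To make that pairing concrete, here is a minimal sketch, outside the lab steps, that pulls a couple of GSM8K records with the Hugging Face `datasets` library; it assumes `datasets` is installed locally and that the `openai/gsm8k` repo exposes its usual `main` configuration.

```python
# Minimal sketch: peek at the input/result pairs of a public dataset.
# Assumes `pip install datasets` and internet access to Hugging Face.
from datasets import load_dataset

# GSM8K ships a "main" configuration with question/answer columns.
ds = load_dataset("openai/gsm8k", "main", split="train")

print(ds)  # column names and row count
for row in ds.select(range(2)):
    print("QUESTION:", row["question"])
    print("ANSWER:  ", row["answer"])
    print("-" * 40)
```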
Larger datasets, such as [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), utilize more complicated structures, but all still fundamentally follow this same principle. In the case of [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), the inputs are titles and summaries of web pages, with links to the precise web page as scraped from the internet.
<div class="lab-callout lab-callout--info">
<strong>Explore:</strong> Open the <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/viewer/sample-10BT/train" target="_blank" rel="noreferrer">Fineweb sample viewer</a> in a new tab and inspect a subset of this <strong>15 trillion token</strong> dataset directly on Hugging Face.
</div>
#### Open-weight vs. open-source
One last note on public datasets. A common misconception is that _open weight_ models are **open source**.
<br>
- _Open-weight_ models (e.g., Gemma, DeepSeek-R1, Qwen) provide publicly released checkpoints but **do not** include permissive source-code licenses.
- True **open-source** LLMs remain rare; very few models freely share their dataset and training pipeline. Examples are **INTELLECT-2**, which was built via a distributed "SETI@Home-style" effort, and Nvidia's **Nemotron 3** family of models.
<br>
Unfortunately, **INTELLECT-2** does not compare favorably to existing _open-weight_ models such as **Gemma**, **DeepSeek-R1**, **Qwen**, or other bleeding-edge models. **Nemotron 3** is also behind the state-of-the-art (SOTA) models, but it instead serves as a showcase of how anyone can train models using Nvidia hardware.
Regardless of model type, when using any _open-weight_ model for corporate purposes, review the license for allowed use!
<br>
---
## Objective 2: Synthetic Dataset Generation
If you can, I strongly encourage you to find ready-made, or easily massaged, datasets that do not require synthetic data. You'll often obtain better results with less effort this way. After all, the original frontier ChatGPT family of models simply scraped the entire internet, every book, scientific papers, and other "pre-made" raw data to help generate their first dataset. However, this is often unrealistic, as at minimum we need roughly **1,000** input-output pairs in order to begin fine tuning, so...
### Why Use Synthetic Data?
| Reason | Explanation |
| ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| **Data scarcity** | Niche domains (e.g., MITRE ATT&CK classification) often lack ≥ 1,000 labeled examples. |
| **Scalability** | A single large model can produce thousands of examples in minutes, saving manual effort. |
| **Quality control** | By generating with a _larger_ model than the target (e.g., Gemma-12B-qat → Gemma-4B), you can distill richer responses within specific domains. |
| **Iterative refinement** | Kiln lets you rate or repair each pair, turning noisy outputs into a clean training set. |
<div class="lab-callout lab-callout--warning">
<strong>Rule of Thumb:</strong> Never generate data with a model that is smaller than the model you plan to fine-tune.
</div>
---
### Execute: Install & Launch KilnAI
### 1. Install & Launch KilnAI
If you haven't yet, download [Kiln AI](https://github.com/Kiln-AI/Kiln/releases/tag/v0.18.1) and run the installer for your OS.
<div class="lab-callout lab-callout--info">
<strong>Tip:</strong> These steps were designed for <strong>Kiln v0.18</strong>. While compatible with newer versions, v0.18 features a polished, simplified UI ideal for this lab. Note that Kiln undergoes active development with frequent UI changes across versions.
</div>
1. **Open Kiln**. It should automatically go to `http://localhost:3000` in your machine's browser.
2. Click **`Get Started`**.
<figure style="text-align:center;">
<img src="https://i.imgur.com/hJNehuE.png" width="400"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Welcome screen: click "Get Started".</figcaption>
</figure>
3. Choose **`Continue`** (or **`Skip Tour`** if you prefer).
4. Dismiss the newsletter prompt (optional).
Kiln is now ready for configuration.
### 2. Connect Kiln to Ollama
1. In Kiln's left-hand **Providers** panel, click **`Connect`** under the Ollama entry.
<div class="lab-callout lab-callout--warning">
Use your Ollama instance IP to connect (i.e., http://<STUDENT IP>:11434). You must be connected to the VPN for this to work.
</div>
<figure style="text-align:center;">
<img src="https://i.imgur.com/vEwUszl.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Connect to a local or remote Ollama instance.</figcaption>
</figure>
2. Click **`Continue`** to confirm the connection.
<div class="lab-callout lab-callout--info">
<strong>Tip:</strong> If you have access to a commercial LLM (for example, OpenAI GPT-4o), you can point Kiln to that endpoint for higher-quality synthetic data by replacing the Ollama URL in <strong>Providers → Connect</strong>.
</div>
---
### 3. Create a Kiln Project
1. Kiln will prompt you to **Create a Project**. Enter any descriptive name (e.g., `MITRE-ATTACK-FineTune`).
<figure style="text-align:center;">
<img src="https://i.imgur.com/8CLEp9s.png" width="400"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Name your project.</figcaption>
</figure>
2. Press **`Create`**. You are now inside the project workspace.
---
### 4. Define the FineTuning Task
1. Click **`Add Task`** and fill out the form with the details below.
- **Task name:** `ATT&CK Classification`
- **Goal:** "Given a description of an attack technique, tactic, or procedure, return only an accurate MITRE ATT&CK ID and Name in the format: 'ID# - Technique'."
- **System prompt (autofilled):** Kiln will prepend this text to every generation request.
<figure style="text-align:center;">
<img src="https://i.imgur.com/43o2s0Y.png" width="400"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Task definition screen.</figcaption>
</figure>
2. Click **`Save Task`**. The task now appears in the lefthand **Tasks** list.
---
### 5. Kiln Main Interface Overview
| Sidebar item | Primary use |
| ------------------ | ---------------------------------------------------------------------------- |
| **Run** | Manually generate one input-output pair at a time (useful for quick checks). |
| **Dataset** | View, edit, export, or import the entire collection of pairs. |
| **Synthetic Data** | Bulk-generate pairs using a model of your choice. |
| **Evals** | Run automatic evaluation against a held-out test set. |
| **Settings** | Project-level configuration (e.g., default model, output format). |
When you first open a project, Kiln lands on the **Run** page.
---
### 6. Manual Generation (Run Page)
1. In the **Run** view, set the parameters as shown below (you may substitute a larger model if your hardware permits).
<figure style="text-align:center;">
<img src="https://i.imgur.com/vvW0wjk.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Configure the Run settings.</figcaption>
</figure>
2. Type a **scenario description** (e.g., "An attacker dumps LSASS memory using Mimikatz") and click **`Run`**.
3. Kiln sends the prompt to the selected Ollama model (by default `gemma3:12b-it-qat`).
4. When the model returns an answer, you can **rate** it from 1 ★ to 5 ★.
_5 ★_ → Accept and click **`Next`**.
_< 5 ★_ → Click **`Attempt Repair`**, edit the response, then **`Accept Repair`** or **`Reject`**.
<figure style="text-align:center;">
<img src="https://i.imgur.com/wqVsYMk.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Rate a correct response with 5 ★.</figcaption>
</figure>
5. Repeat until you have a handful of high-quality pairs. This manual step is optional but useful for seeding the dataset with "gold-standard" examples.
---
### 7. Bulk Synthetic Data Generation
#### 7.1 Open the Generator
1. In the sidebar, click **`Synthetic Data`** → **`Generate Fine-Tuning Data`**.
<figure style="text-align:center;">
<img src="https://i.imgur.com/l6OiUeP.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Enter the bulk-generation workflow.</figcaption>
</figure>
#### 7.2 Generate Top-Level Topics
1. Click **`Add Topics`**. This will generate top-level topics that follow broad MITRE ATT&CK categories.
2. Choose **`Gemma-3n-2B`**.
3. Set **Number of topics** to **8** and click **`Generate`**.
<figure style="text-align:center;">
<img src="https://i.imgur.com/SHh8v0y.png" width="400"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Select model & number of topics.</figcaption>
</figure>
4. Review the generated list. Delete any unsatisfactory topics (hover → click the trash icon) or click **`Add Topics`** again to generate more. Alternatively, if additional depth is required, click **`Add Subtopics`** to drill down deeper into any of the high-level topics created by Gemma initially.
<figure style="text-align:center;">
<img src="https://i.imgur.com/wHNv3Om.png" width="800"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Final set of 8 topics.</figcaption>
</figure>
#### 7.3 Create Input Scenarios for All Topics
1. With the topics selected, click **`Generate Model Inputs`**. Ensure **`Gemma-3n-2B`** is still chosen, and then affirm your selection.
Kiln now asks the model to produce a short _scenario description_ for each topic.
2. After the model finishes, review the generated inputs. You may edit any that look off.
#### 7.4 Generate Corresponding Outputs
1. Click **`Save All Model Outputs`**. Kiln now runs the model a second time—this time using each generated input as the prompt—to produce the _output_ (the ATT&CK technique label).
<figure style="text-align:center;">
<img src="https://i.imgur.com/A47GRVr.png" width="800"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Produce the "output" side and store the pair.</figcaption>
</figure>
2. The full input-output pairs are automatically added to the project's dataset.
#### 7.5 Review the Completed Dataset
1. Switch to the **`Dataset`** tab.
2. You should see a table of 64 (8 topics × 8 samples) pairs. Clicking any row opens the same **Run** view, where you can **rate**, **repair**, or **delete** the pair.
<figure style="text-align:center;">
<img src="https://i.imgur.com/DnyXYJO.png" width="800"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Dataset overview with generated pairs.</figcaption>
</figure>
---
### 8. Dataset Export (Create a Fine-Tune)
1. Once you are satisfied with the dataset, you can export it to numerous forms of JSONL via the **Fine Tune → Create a Fine Tune** button.
2. Kiln will first ask what format we would like our data exported to. We can leave the default setting of _Download: OpenAI chat format (JSONL)_. Next, select _Create a New Fine-Tuning Dataset_.
3. Kiln supports splitting our generated data into a number of buckets, including _`Training`_, _`Test`_, and _`Validation`_. Each of these dataset segments is critical to a great fine tune, but with only 64 generated examples we don't have the luxury of creating a split. As such, under **`Advanced Options`**, select _100% training_, and click _Create Dataset_.
<figure style="text-align:center;">
<img src="https://i.imgur.com/vp6jobS.png" width="400"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
<figcaption>Advanced Options: selecting a 100% training split.</figcaption>
</figure>
4. We can ignore all further options, and select _Download Split_. A new `.jsonl` file will be saved; a quick way to sanity-check its contents is sketched below.
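As a quick sanity check after the download, a minimal sketch like this prints the first exported record. The filename is a placeholder, and the layout noted in the comments assumes the standard OpenAI chat format (a `messages` list of role/content entries); compare it against what Kiln actually produced.

```python
# Minimal sketch: read back the exported JSONL file.
# "dataset.jsonl" is a placeholder -- use the filename Kiln downloaded.
import json

with open("dataset.jsonl", "r", encoding="utf-8") as f:
    first = json.loads(f.readline())

# OpenAI chat format is expected to store each pair as a "messages" list
# of role/content dicts (system, user, assistant).
print(json.dumps(first, indent=2))
```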
---
## Objective 3: Fine Tuning with Unsloth Studio
There are many popular options for performing fine tunes, although many have their drawbacks:
- [Unsloth](https://unsloth.ai) is the most popular solution, but it currently does not support multi-GPU setups without a commercial license.
- [Axolotl](https://axolotl.ai) is built on top of Unsloth, and it does support multi-GPU setups, but it often lags behind Unsloth in features and capability, and it does not offer a web UI.
- [LLaMaFactory](https://github.com/hiyouga/LLaMA-Factory) is the most flexible of these options, supporting both Unsloth and Axolotl as well as additional backends. However, this tool is daunting for a beginner approaching fine tuning, and it is best left for later experimentation.
<br>
While I encourage you to explore all of these tools, they are unfortunately out of the scope for this lab. Instead, we're going to focus on **Unsloth**, as it provides the best web UI to easily navigate the fine tuning process.
### Explore: Touring Unsloth Studio
Although Unsloth Studio does its best to simplify the fine tuning process, there are still many dials and knobs to turn! Let's take a brief tour of the most important options:
1. Model Selection - This area allows us to select any model that we're interested in fine tuning. Unsloth Studio will handle downloading the FP16 version of the model from **HuggingFace** for us.
2. Quantization Selection - Without much better hardware, we will usually be training **LoRA**s (Low-Rank Adapters). These will slightly nudge the parameters of the model in the direction we're interested in. If we need additional headroom, we can instead **quantize the base model** (e.g., reduce its precision from 16-bit to 4-bit) and then apply **LoRA** to the quantized model, generating a **QLoRA** (Quantized LoRA). This approach combines the efficiency of quantization with the parameter-efficiency of LoRA (a rough code sketch of the idea follows this tour, for reference only). Unsloth will conveniently tell us its estimate for how well a given combination of _Model_ & **QLoRA** will fit in our system's available VRAM.
<figure style="text-align: center;">
<img
src="https://i.imgur.com/XwAdaKJ.png"
width="800"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Model & LoRA Type Selections. Note how models are labeled "OOM" or "Tight" based on hardware.
</figcaption>
</figure>
3. Dataset Selection - This is where we can utilize our custom-made dataset. Unfortunately, while we've gone through the process of making a dataset, we had to use a very small model to simulate the process. Conveniently, Unsloth allows us to search for any dataset available publicly on HuggingFace. We can conveniently select `sarahwei/cyber_MITRE_CTI_dataset_v15` for our purposes. You can select "View Dataset" if you'd like to see some of the raw contents of this data.
<figure style="text-align: center;">
<img
src="https://i.imgur.com/8xBdcnd.png"
width="400"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Dataset Selection
</figcaption>
</figure>
4. Train Settings - This is where we can configure exactly how our model will be trained. The majority of these settings can stay at their defaults until you have a specific need that pushes you down the rabbit hole. In particular, we'll be interested in:
- **Learning Rate** - Controls how large an adjustment to the model's weights is made during each step
- **Epochs** - Determines the number of times the training algorithm iterates over the entire dataset (the default of 3 repeats training three times). Critical to help avoid under- or over-fitting.
- **Cutoff length** - Equivalent to Ollama's context. As always, larger context training requires more memory.
- **Batch Size** - Can speed up training, as long as we have the hardware to support it.
- **Warmup Steps** - The number of initial training steps during which the learning rate gradually increases to the set target. Helps with stability.
<figure style="text-align: center;">
<img
src="https://i.imgur.com/fzSvggY.png"
width="400"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Fine Tuning Settings
</figcaption>
</figure>
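As promised above, here is a rough sketch of the QLoRA idea using the Hugging Face `transformers`, `bitsandbytes`, and `peft` libraries. Unsloth Studio wires this up for you with its own choices; the model id, LoRA rank, and target modules below are illustrative assumptions, not the values Unsloth actually uses.

```python
# Rough sketch of QLoRA: 4-bit quantized base model + trainable LoRA adapter.
# Model id, rank, and target modules are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "unsloth/gemma-3-270m-it",              # same base model as the lab settings
    quantization_config=bnb_config,
    device_map="auto",
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)          # only the adapter weights will train
model.print_trainable_parameters()
```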
### Execute: Unsloth Studio Fine Tuning
Set the following before we start to fine tune Gemma:
1. **Model**: `unsloth/gemma-3-270m-it`
2. **Max Steps**: `100` (NOTE: For real fine tuning, use Epochs, not Steps.)
3. **Learning Rate**: `0.00005`
4. **Dataset**: `sarahwei/cyber_MITRE_CTI_dataset_v15`
5. **Warmup Steps**: `100`
- Scroll to the bottom of the page, and click `Preview command`. The WebUI is merely a front end for constructing `llamafactory-cli` commands, and this shows exactly what will be run.
- When done reviewing, click `Start`. It will take some time for Unsloth Studio to start its process, as it first needs to download the full `FP16` model files for the selected base model from HuggingFace.
<figure style="text-align: center;">
<img
src="https://i.imgur.com/fzSvggY.png"
width="400"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Setting Max Steps, Learning Rate, and Warmup Steps
</figcaption>
</figure>
**Monitor the loss graph.** The graph plots **loss** per **training step** (for a full run, roughly 7,500 steps: 2.5k examples × 3 epochs), or put simply, how different the model's predicted answer is from our data. It should slope gradually, roughly logarithmically, downwards if training is stable.
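For reference, the back-of-the-envelope arithmetic behind that step count is below; a per-device batch size of 1 and no gradient accumulation are assumptions, and a larger effective batch would shrink the number proportionally.

```python
# Rough optimizer-step estimate for a full run.
# Batch size 1 and no gradient accumulation are assumptions.
import math

examples, epochs = 2500, 3
batch_size, grad_accum = 1, 1

steps_per_epoch = math.ceil(examples / (batch_size * grad_accum))
print(steps_per_epoch * epochs)  # 7500 steps for the full run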
#### What to Look For
- **Training Loss:** Decreasing smoothly → model is learning effectively and training is stable
- **Gradient Norm:** Drops then stabilizes → gradients are well-behaved (no major spikes)
- **Learning Rate:** Gradually increasing, then eventually decreasing → expected warmup behavior helping stable early training
<figure style="text-align: center;">
<img
src="https://i.imgur.com/Cue7afQ.png"
width="600"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Typical Training Run
</figcaption>
</figure>
Unfortunately, due to the time constraints of a live classroom, we'll be unable to pursue this training run to completion. On the lab provided GPUs, a full Epoch could take up to two hours! Feel free to cancel it at your leisure.
We can, however, chat with a version of Gemma 3 4B that was trained before this class. It was trained on roughly 60,000 examples, partially generated using Kiln and partially harvested from various datasets on HuggingFace. While not perfect, we can see that the model is significantly better than the default.
<figure style="text-align: center;">
<img
src="https://i.imgur.com/FKZXaV3.png"
width="600"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Load Model for Chat
</figcaption>
</figure>
To test this ourselves, select:
1. The chat button at the very top of the screen
2. Download our model. It's under my personal HuggingFace account name, `c4ch3c4d3`.
3. Set the system prompt to the one we selected when using **Kiln.ai**: "Given a description of an attack technique, tactic, or procedure, return only an accurate MITRE ATT&CK ID and Name in the format: 'ID# - Technique'."
<figure style="text-align: center;">
<img
src="https://i.imgur.com/GHExjE3.png"
width="600"
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
Test prompt
</figcaption>
</figure>
| Test Prompt | Expected Output Format |
| ---------------------------------------------------------------------------- | -------------------------------------------- |
| "A malicious actor uses PowerShell to download a file from a remote server." | `T1059.001 PowerShell` |
| "The adversary exfiltrates data via a compressed archive sent over HTTP." | `T1567.001 Exfiltration Over Web Services` |
| "Credential dumping is performed using Mimikatz." | `T1003.001 LSASS Memory` |
The Unsloth chat view is relatively simplistic, but it does provide options for changing inference parameters, such as Top-P or Temperature, as well as a location for us to input our system prompt. If we're looking to test the model's accuracy with our fine tune, we normally need to ensure these values match the desired end-state values as closely as possible.
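If you later serve a fine-tuned model behind an OpenAI-compatible endpoint (for example via Ollama after a GGUF export, which is outside this lab's timed steps), a minimal sketch for matching the system prompt and sampling settings might look like this; the URL and model name are placeholders, not lab-provided values.

```python
# Minimal sketch: query a fine-tuned model over an OpenAI-compatible
# chat endpoint. The URL and model name are placeholders.
import requests

SYSTEM_PROMPT = (
    "Given a description of an attack technique, tactic, or procedure, "
    "return only an accurate MITRE ATT&CK ID and Name in the format: "
    "'ID# - Technique'."
)

resp = requests.post(
    "http://<YOUR STUDENT IP>:11434/v1/chat/completions",  # placeholder endpoint
    json={
        "model": "mitre-finetune",   # placeholder model name
        "temperature": 0.1,          # keep sampling tight for classification
        "top_p": 0.9,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "A malicious actor uses PowerShell "
                                        "to download a file from a remote server."},
        ],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```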
### Export the Fine-Tuned Model
<div class="lab-callout lab-callout--warning">
<strong>Skippable:</strong> These steps are provided for reference as we never successfully finished a fine tune within the lab time period.
</div>
1. Switch to the **Export** tab.
2. Select the training run of the model you've performed.
3. Select the latest checkpoint, or, if you'd like to explore an alternative, whichever checkpoint you prefer.
4. We can export in a number of formats:
- **Merged Model** - A BF16 `.safetensors` version of the model, which can be utilized in other projects
- **LoRA** - Export only the LoRA adapter layers generated during training. Useful if we wish to share just our new weights with other users who already have the base model downloaded, but not our fine tune.
- **GGUF** - A compact file ready for import into **Ollama** or other GGUF-compatible runtimes.
<br>
---
## Conclusion
In this lab, we completed a LoRA fine-tuning workflow:
1. **Dataset Generation** - We explored public datasets on HuggingFace and used Kiln AI to generate a synthetic dataset for MITRE ATT&CK classification.
2. **Fine Tuning** - We used Unsloth Studio to start a LoRA fine tune of a Gemma 3 model on a MITRE ATT&CK dataset.
3. **Validation & Export** - We tested a previously fine-tuned model with sample prompts and reviewed the options for exporting a fine tune as a merged model, a LoRA adapter, or a GGUF file.
If all has gone well, the model should be much more accurate at identifying MITRE ATT&CK codes from user input scenarios. If not, additional experimentation may be necessary to produce a good fine tune. Playing with the parameters we've discussed, improving and expanding our dataset, or even fine tuning a larger or better base model can all improve our results.
@@ -1,12 +1,19 @@
---
order: 7
title: Lab 7 - Evaluation and Red Teaming
description: Probe model defenses manually and with Promptfoo to evaluate security controls.
---
<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
# Lab 6 - Evaluation and Red Teaming
# Lab 7 - Evaluation and Red Teaming
In this lab, we will:
* Perform prompt injection against three layers of model protection
* Use Promptfoo to programmatically evaluate a model's security protections
- Perform prompt injection against three layers of model protection
- Use Promptfoo to programmatically evaluate a model's security protections
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
@@ -15,10 +22,12 @@ In this lab, we will:
</div>
To start this lab, one web service has been preconfigured:
* Promptfoo - http://<IP>:15500
- Promptfoo - http://<IP>:15500
You'll also need to access:
* Open WebUI - https://ai.zuccaro.me/
- Open WebUI - https://ai.zuccaro.me/
## Objective 1 Explore: Direct Prompt Injection
@@ -38,8 +47,8 @@ Each level will be more difficult than the last, based on how the protection int
To access the lab, navigate to https://ai.zuccaro.me and log in with the following credentials:
* `Username` - `student@zuccaro.me`
* `Password` - `Student9205!`
- `Username` - `student@zuccaro.me`
- `Password` - `Student9205!`
<br>
@@ -88,7 +97,7 @@ Promptfoo is available on our lab machine at http://<YOUR STUDENT IP>:15500. We
Promptfoo is designed to be approachable for both beginners and practitioners. Its wizard guides you through configuring the target, selecting datasets and mutation strategies, and tracking execution.
<div class="lab-callout lab-callout--info">
<strong>Tip:</strong> Although the Promptfoo WebUI is convenient, it hides a critical configuration option for this lab inside the YAML file. Please use the provided configuration file: [lab-6-evaluation-and-red-teaming/promptfoo.yaml](content/labs/lab-6-evaluation-and-red-teaming/promptfoo.yaml). Upload it with the <strong>Load Config</strong> button in the lower-left corner, then proceed with the following screenshot steps.
<strong>Tip:</strong> Although the Promptfoo WebUI is convenient, it hides a critical configuration option for this lab inside the YAML file. Please use the provided configuration file: [lab-7-evaluation-and-red-teaming/promptfoo.yaml](content/labs/lab-7-evaluation-and-red-teaming/promptfoo.yaml). Upload it with the <strong>Load Config</strong> button in the lower-left corner, then proceed with the following screenshot steps.
</div>
<figure style="text-align: center;">
@@ -139,7 +148,6 @@ Promptfoo is designed to be approachable for both beginners and practitioners. I
</figure>
<br>
Once we select `Start`, Promptfoo handles the rest. Mutations, tests, and results are all tracked by the WebUI. Promptfoo runs can take a significant amount of time, but when they finish you will be presented with a new results screen.
<figure style="text-align: center;">
@@ -159,10 +167,9 @@ Promptfoo is highly flexible. Anything that involves mass evaluation of prompts
### Explore: Promptfoo evaluation workflow
<div class="lab-callout lab-callout--info">
<strong>Tip:</strong> Please use the provided evaluation configuration file: [lab-6-evaluation-and-red-teaming/mmlu-promptfoo-config.yaml](content/labs/lab-6-evaluation-and-red-teaming/mmlu-promptfoo-config.yaml). Upload it with the <strong>Load Config</strong> button in the lower-left corner, then proceed with the following screenshot steps.
<strong>Tip:</strong> Please use the provided evaluation configuration file: [lab-7-evaluation-and-red-teaming/mmlu-promptfoo-config.yaml](content/labs/lab-7-evaluation-and-red-teaming/mmlu-promptfoo-config.yaml). Upload it with the <strong>Load Config</strong> button in the lower-left corner, then proceed with the following screenshot steps.
</div>
<figure style="text-align: center;">
<a href="https://i.imgur.com/23iFYNo.png" target="_blank">
<img