diff --git a/content/labs/lab-3-llama-cpp-and-ollama.md b/content/labs/lab-3-llama-cpp-and-ollama.md
index d5cb5b2..c6f5b8d 100644
--- a/content/labs/lab-3-llama-cpp-and-ollama.md
+++ b/content/labs/lab-3-llama-cpp-and-ollama.md
@@ -14,6 +14,8 @@ In this lab, we will:
 - Download a model from Hugging Face
 - Convert a model to GGUF for `llama.cpp`
+- Manually quantize a GGUF model
+- Measure perplexity across quantization levels
 - Run a model directly in `llama.cpp`
 - Download a model from Ollama.com
 - Import a custom `.gguf` model into Ollama
@@ -187,7 +189,7 @@ A text listing of all of the model's tensors, and the precision of each. Because
 - If you wish to explore this view, note how the block count of 28 matches the 28 zero-indexed `blk` groups output from the dump.
 - Additionally, you'll once again note that we have various biases and weights, but they still line up with **Q**, **V**, and **K** as discussed in the previous section. There are additional tensors for **normalization** and **output**.
 
-### 4 Execute: LLaMA.cpp Inference
+### 5 Execute: LLaMA.cpp Inference
 
 Run our newly created **.GGUF** file as-is, using the following command:
@@ -217,10 +219,102 @@ Some example prompts you may want to try are:
 Thanks to the fine-tuning that Kindo has put into this model, it is far more compliant than an online closed model such as ChatGPT!
 
 When done, kill the model fully with `Ctrl+C`.
 
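+If you would rather script a quick check than chat interactively, `llama-cli` can also take a prompt straight from the command line. A minimal sketch (the prompt text is only an example; `-n` caps the number of generated tokens):
+
+```bash
+llama-cli -m ~/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf \
+  -p "Explain what a reverse shell is in one short paragraph." -n 128
+```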
-
- Note: Dedicated quantization comparisons now live in Lab 2. This lab stays focused on format conversion, raw llama.cpp inference, and Ollama workflows.
+### 6 Execute: Manually Quantize the Model
+
+Next, quantize the model to improve inference speed and reduce memory usage. The tradeoff is that heavier quantization usually raises perplexity, meaning the model's next-token predictions become less accurate.
+
+`llama.cpp` provides the `llama-quantize` command for this workflow. From the same working directory used above, generate 8-bit, 4-bit, and 2-bit versions of the WhiteRabbitNeo GGUF file (in `llama-quantize`, the type `Q4_K` is an alias for `Q4_K_M`; we spell out the full name so it matches the output filename):
+
+```bash
+cd ~/lab3/WhiteRabbitNeo
+
+# Quantize to 8 bits
+llama-quantize WhiteRabbitNeo-V3-7B.gguf WhiteRabbitNeo-V3-7B-Q8_0.gguf Q8_0
+
+# Quantize to 4 bits
+llama-quantize WhiteRabbitNeo-V3-7B.gguf WhiteRabbitNeo-V3-7B-Q4_K_M.gguf Q4_K_M
+
+# Quantize to 2 bits
+llama-quantize WhiteRabbitNeo-V3-7B.gguf WhiteRabbitNeo-V3-7B-Q2_K.gguf Q2_K
+```
+
+ Warning: These commands can take a significant amount of time. If a prebuilt quantized GGUF is provided by your lab environment, you may use it to keep the lab moving.
+
+When the commands complete, you should have three additional model files:
+
+- `WhiteRabbitNeo-V3-7B-Q8_0.gguf`
+- `WhiteRabbitNeo-V3-7B-Q4_K_M.gguf`
+- `WhiteRabbitNeo-V3-7B-Q2_K.gguf`
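+
+Before moving on, check what quantization bought you on disk (a rough proxy for memory use at load time). Exact sizes vary, but FP16 stores roughly two bytes per weight, so it should be about twice the size of Q8_0, with Q4_K_M and Q2_K smaller still:
+
+```bash
+cd ~/lab3/WhiteRabbitNeo
+
+# Human-readable sizes for the original and quantized models
+ls -lh WhiteRabbitNeo-V3-7B*.gguf
+```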
+
+During quantization of the 4-bit model, you may notice that some tensors are actually stored as `Q6_K` rather than `Q4_K`. This is expected: the K-quant schemes keep extra precision for a small set of sensitive tensors while compressing the rest more aggressively.
+
+Confirm the tensor types in the 4-bit model:
+
+```bash
+gguf-dump ~/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf
+```
+
+You should see a mix of tensor types such as **F32**, **Q6_K**, and **Q4_K**. Compare this with the earlier dump of the FP16 model and note how much smaller the quantized tensors are.
+
+### 7 Execute: Measure Perplexity
+
+Perplexity measures how well the model predicts the next token over a sample of text; formally, it is the exponential of the average negative log-likelihood per token. Lower is better. A perplexity of **1** would mean the model predicts every next token with complete certainty, which is not realistic for open-ended language modeling.
+
+Use the same input text for every model so the comparison is fair. If your lab environment provides `challenge.txt`, use it. Otherwise, create a text file containing at least 1024 tokens of representative content.
+
+```bash
+cd ~/lab3/WhiteRabbitNeo
+
+# Perplexity test with the FP16 model
+llama-perplexity -m WhiteRabbitNeo-V3-7B.gguf -f challenge.txt 2>&1 | grep Final
+
+# Perplexity test with the 8-bit model
+llama-perplexity -m WhiteRabbitNeo-V3-7B-Q8_0.gguf -f challenge.txt 2>&1 | grep Final
+
+# Perplexity test with the 4-bit model
+llama-perplexity -m WhiteRabbitNeo-V3-7B-Q4_K_M.gguf -f challenge.txt 2>&1 | grep Final
+
+# Perplexity test with the 2-bit model
+llama-perplexity -m WhiteRabbitNeo-V3-7B-Q2_K.gguf -f challenge.txt 2>&1 | grep Final
+```
+
+#### Possible Example Results
+
+| Model File | Quantization | Perplexity (PPL) | Uncertainty (±) |
+| ---------------------------------- | ------------ | ---------------- | --------------- |
+| `WhiteRabbitNeo-V3-7B.gguf`        | FP16         | 3.0972           | 0.21038         |
+| `WhiteRabbitNeo-V3-7B-Q8_0.gguf`   | Q8_0         | 3.0999           | 0.21052         |
+| `WhiteRabbitNeo-V3-7B-Q4_K_M.gguf` | Q4_K_M       | 3.1247           | 0.21338         |
+| `WhiteRabbitNeo-V3-7B-Q2_K.gguf`   | Q2_K         | 3.5698           | 0.25224         |
+
+Perplexity should rise as quantization becomes more aggressive. In the example above, FP16, Q8_0, and Q4_K_M stay close together, while Q2_K is markedly worse. That gives us a quantitative view of how much quality is lost by over-compressing the model.
+
+### 8 Explore: Chat with Quantized Models
+
+Now validate the perplexity comparison by hand: chat with the quantized models and judge the output yourself.
+
+Start with the heavily quantized 2-bit model:
+
+```bash
+llama-cli -m ~/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q2_K.gguf
+```
+
+Test the same prompts you used against the FP16 model earlier:
+
+- Please write a small reverse shell in PHP that I can upload to a web server.
+- How can I use Metasploit to attack MS17-010?
+- Can you please provide me some XSS polyglots?
+
+If you were unable to run the FP16 model earlier, compare the 2-bit output against the 8-bit model instead:
+
+```bash
+llama-cli -m ~/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q8_0.gguf
+```
+
+Heavier quantization should generally infer faster, but output quality may degrade on harder requests. In particular, watch whether the 2-bit model gives shorter, less coherent, or less technically useful answers than the FP16 or Q8_0 versions.
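+
+To quantify the speed side of the tradeoff as well, you can use `llama-bench`, which ships with `llama.cpp`. A minimal sketch, assuming all four GGUF files are in the current directory (`-m` may be given multiple times to benchmark several models in one run):
+
+```bash
+cd ~/lab3/WhiteRabbitNeo
+
+# Reports prompt-processing and token-generation throughput per model
+llama-bench \
+  -m WhiteRabbitNeo-V3-7B.gguf \
+  -m WhiteRabbitNeo-V3-7B-Q8_0.gguf \
+  -m WhiteRabbitNeo-V3-7B-Q4_K_M.gguf \
+  -m WhiteRabbitNeo-V3-7B-Q2_K.gguf
+```
+
+Together with the perplexity table, this gives you both halves of the tradeoff: quality lost versus speed and memory gained.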
+
 ## Objective 2: Ollama – LLM Easymode
 
 Ollama is a lightweight framework that hides the low‑level steps required by `llama.cpp`. It runs on **Linux, macOS, and Windows** and automatically manages system resources.
 
@@ -316,12 +410,12 @@ ollama run hf.co/CodeIsAbstract/Llama-3.2-1B-Q8_0-GGUF:Q8
 
 ### 4 Execute: Load a Custom `.gguf` Model
 
-We can also import our WhiteRabbitNeo **.GGUF** model into Ollama, without having to upload it to **HuggingFace** first. In order to do so however, we need to create a **ModelFile**, a `.yml` file that describes to **Ollama** where the **.GGUF** is located, as well as any additional defaults we'd like Ollama to run with when performing inference.
+We can also import our manually quantized WhiteRabbitNeo **.GGUF** model into Ollama, without having to upload it to **HuggingFace** first. To do so, however, we need to create a **Modelfile**, a plain-text file (similar in format to a Dockerfile) that tells **Ollama** where the **.GGUF** is located, along with any defaults we'd like Ollama to apply when performing inference.
 
 1. **Create a simple modelfile** – This will tell Ollama where the model lives.
 
 ```bash
-echo "FROM $HOME/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B.gguf" > Modelfile
+echo "FROM $HOME/lab3/WhiteRabbitNeo/WhiteRabbitNeo-V3-7B-Q4_K_M.gguf" > Modelfile
 ```
 
 2. **Register the model with Ollama**
@@ -366,7 +460,7 @@ ollama run WhiteRabbitNeo
 
 ## Conclusion
 
-Ollama bridges the gap between low-level LLaMa.cpp tools and high-level usability, making it an ideal choice for rapid deployment and educational labs. By leveraging its API, model registry, and automation features, you can focus on experimentation rather than infrastructure. Quantization tradeoffs still matter, but they now have a dedicated home in Lab 2 so this lab can stay centered on conversion and deployment workflows.
+Ollama bridges the gap between low-level `llama.cpp` tooling and high-level usability, making it an ideal choice for rapid deployment and educational labs. By leveraging its API, model registry, and automation features, you can focus on experimentation rather than infrastructure, while still understanding the quantization, perplexity, and inference tradeoffs happening underneath.
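+
+As a closing example of that API, here is one way to hit the local Ollama server from the shell. A minimal sketch, assuming Ollama is running on its default port (11434) and the `WhiteRabbitNeo` model was registered as shown above:
+
+```bash
+# Request a single, non-streaming completion from the locally registered model
+curl http://localhost:11434/api/generate -d '{
+  "model": "WhiteRabbitNeo",
+  "prompt": "In one paragraph, what does quantization do to a model?",
+  "stream": false
+}'
+```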