---
order: 2
title: "Lab 2 - Quantization Tradeoffs: Comparing 2-bit, 4-bit, and 8-bit"
description: Download Gemma 4 E2B in three GGUF quantizations and compare size, metadata, and output quality.
---

<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
In this lab, we will:

- Download the same Gemma model in `UD-IQ2_M`, `Q4_K_M`, and `Q8_0`
- Compare file size and GGUF metadata across those quantizations
- Observe how lower precision changes the model's behavior
- Build intuition for when a smaller quant may or may not be worth it

<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on comparison and trade-off analysis.<br />
<strong>Execute</strong> sections require collecting evidence from each quantized model.
</div>

## Objective 1: Understand the Model and the Quantizations

For this lab, we will use the Hugging Face repository for **Unsloth's GGUF release of Gemma 4 E2B Instruct**:

<https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF>

This repository currently exposes multiple GGUF variants of the same base model. We will focus on one file from each of these precision bands:

| Precision band | GGUF file                      | Why we are using it                     | File size |
| -------------- | ------------------------------ | --------------------------------------- | --------- |
| 2-bit          | `gemma-4-E2B-it-UD-IQ2_M.gguf` | Most aggressive compression in this lab | 2.4 GB    |
| 4-bit          | `gemma-4-E2B-it-Q4_K_M.gguf`   | Common middle-ground quant              | 3.17 GB   |
| 8-bit          | `gemma-4-E2B-it-Q8_0.gguf`     | Highest-quality quant in this lab       | 5.05 GB   |

Even though the filenames differ, these are all the same underlying instruction-tuned Gemma 4 E2B model. The main variable we are changing is how the weights are stored.
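
Before downloading anything, it is worth putting numbers on the gap between these files. A minimal sketch, using only the sizes from the table above, expresses each quant's disk footprint as savings relative to the `Q8_0` baseline:

```python
# File sizes (GB) taken from the table above.
sizes_gb = {"UD-IQ2_M": 2.4, "Q4_K_M": 3.17, "Q8_0": 5.05}

# Express each file's size as disk savings versus the 8-bit baseline.
baseline = sizes_gb["Q8_0"]
savings = {name: round(1 - gb / baseline, 2) for name, gb in sizes_gb.items()}
print(savings)  # the 2-bit file saves roughly half the disk space of Q8_0
```

Keep these ratios in mind: the question for the rest of the lab is whether the quality you give up is worth the space you get back.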

When we say these files are the same model, we mean that the overall neural network is still the same:

- The same architecture
- The same layer count
- The same tokenizer
- The same training and instruction tuning
- The same general behavior the model learned during training

What changes is the numeric representation of the learned weights.

Imagine one learned weight in the original model is:

```text
0.156347
```

That number came from training. It is one of many values the model uses while computing each next token. Quantization does not invent a new model from scratch. Instead, it takes that trained value and asks:

```text
How can we store a close-enough version of this number using fewer bits?
```

If we use a simplified integer-style quantization scheme, the math looks like this:

```text
scale = max(|w|) / (2^(bits - 1) - 1)
q = round(w / scale)
w_hat = q * scale
```

Where:

- `w` is the original weight
- `q` is the stored integer bucket
- `scale` maps integers back into the original numeric range
- `w_hat` is the reconstructed approximation used at inference time

So if the original trained value was `0.156347`, a lower-bit quantized file may not store that exact number anymore. It may store an integer bucket like `1`, `5`, or `22`, plus a scale, and reconstruct an approximation such as:

- `0.000000`
- `0.130029`
- `0.146806`
- `0.157782`

Those are not identical to the original weight, but they may still be close enough for useful inference.
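
The simplified scheme above fits in a few lines of Python. This is a toy sketch, not a real GGUF codec: it assumes a single symmetric scale, and the hypothetical `max(|w|)` of `0.9` is chosen here just for illustration. Notice that at 2 bits the weight collapses all the way to `0.0`, matching the first reconstruction listed above, while at 8 bits the integer bucket `22` lands very close to the trained value:

```python
def quantize(w, w_max, bits):
    """Toy symmetric quantization: map w to an integer bucket and back."""
    scale = w_max / (2 ** (bits - 1) - 1)  # scale = max(|w|) / (2^(bits-1) - 1)
    q = round(w / scale)                   # the stored integer bucket
    w_hat = q * scale                      # the reconstructed approximation
    return q, w_hat

w = 0.156347            # the trained weight from the text
for bits in (2, 4, 8):  # the three precision bands in this lab
    q, w_hat = quantize(w, w_max=0.9, bits=bits)  # 0.9 is an assumed max(|w|)
    print(f"{bits}-bit: bucket {q}, reconstructed {w_hat:.6f}")
```

Because the scale depends on the largest weight in the tensor, a different `w_max` would shift every reconstruction, which is one reason real formats store scales per block rather than per file.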

<div data-quantization-explorer></div>

### Explore: Interactive precision viewer

The viewer below zooms out from one weight and instead shows a toy layer with 16 stored values. Real GGUF schemes such as `Q4_K_M` and `UD-IQ2_M` are more sophisticated than this toy example, but the core idea is the same:

- Fewer bits means fewer representable values
- More weights get pushed into the same small set of stored buckets
- The layer becomes more compressed as precision drops

<div data-quantization-grid-explorer></div>
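
The bucket-crowding effect the viewer illustrates can also be sketched numerically. The 16 "trained" weights below are made up for illustration; the sketch counts how many distinct stored values survive at each bit width using the same toy scheme as before:

```python
# A toy "layer" of 16 trained weights (hypothetical values).
weights = [0.91, -0.42, 0.13, 0.07, -0.88, 0.55, -0.21, 0.34,
           0.02, -0.67, 0.48, -0.05, 0.76, -0.33, 0.19, -0.59]

def bucket_count(ws, bits):
    """How many distinct integer buckets do these weights occupy?"""
    scale = max(abs(w) for w in ws) / (2 ** (bits - 1) - 1)
    return len({round(w / scale) for w in ws})

for bits in (8, 4, 2):
    print(f"{bits}-bit: {bucket_count(weights, bits)} distinct buckets")
```

At 8 bits every weight keeps its own bucket; at 2 bits the whole layer is squeezed into just three stored values, which is exactly the crowding the viewer shows.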

### Explore: Compare the same prompts through the hosted chat widget

If your instructor provides an OpenAI-compatible endpoint, you can compare the same prompts through the embedded chat tool below:

- Paste the lab endpoint and API key into the settings row
- Switch between `Q8_0`, `Q4_K_M`, and `UD-IQ2_M`
- Re-run the same prompt so you can compare coherence, stability, and SVG output
- Try a visual prompt such as `Draw a pelican riding a bicycle.`

The widget keeps the transcript in your browser so you can switch models without losing your place. Refresh the page to clear the chat history.

<div data-objective5-chat></div>
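
If you prefer scripting the comparison instead of using the widget, the same idea works against any OpenAI-compatible chat-completions endpoint. This sketch only builds the requests; the endpoint URL, API key, and model names are placeholders that your lab environment would supply, and actually sending them is left as the final commented step:

```python
import json
from urllib import request

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # placeholder lab endpoint
API_KEY = "sk-placeholder"                              # placeholder key

def build_request(model, prompt):
    """Build an identical chat request for each quantized model."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # reduce sampling noise so differences come from the quant
    }
    return request.Request(
        ENDPOINT,
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
    )

prompt = "Draw a pelican riding a bicycle."
reqs = [build_request(m, prompt) for m in ("Q8_0", "Q4_K_M", "UD-IQ2_M")]
# To actually send one: request.urlopen(reqs[0])  (requires a live endpoint)
```

Holding the prompt and temperature fixed across models is what makes the outputs comparable: the only variable left is the quantization.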

## Objective 6: Reflect on the Tradeoff

By this point, you should have:

- Compared three quantized versions of the same model
- Measured the storage savings directly
- Verified that the core model metadata remains largely the same
- Observed where output quality begins to degrade

The important takeaway is not that one quant is always "best." It is that quantization is a deployment decision: the right choice depends on your hardware limits, your acceptable quality loss, and the task you need the model to perform.

## Conclusion

This lab isolates quantization as the main variable. By downloading **Gemma 4 E2B Instruct** in `UD-IQ2_M`, `Q4_K_M`, and `Q8_0`, you can directly observe one of the most important tradeoffs in local inference: balancing model quality against disk usage and resource constraints.