---
order: 2
title: "Lab 2 - Quantization Tradeoffs: Comparing 2-bit, 4-bit, and 8-bit"
description: Download Gemma 4 E2B in three GGUF quantizations and compare size, metadata, and output quality.
---

<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
In this lab, we will:

- Download the same Gemma model in `UD-IQ2_M`, `Q4_K_M`, and `Q8_0`
- Compare file size and GGUF metadata across those quantizations
- Observe how lower precision changes the model's behavior
- Build intuition for when a smaller quant may or may not be worth it

<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on comparison and trade-off analysis.<br />
<strong>Execute</strong> sections require collecting evidence from each quantized model.
</div>

## Objective 1: Understand the Model and the Quantizations

For this lab, we will use the Hugging Face repository for **Unsloth's GGUF release of Gemma 4 E2B Instruct**:

<https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF>

This repository currently exposes multiple GGUF variants of the same base model. We will focus on one file from each of these precision bands:

| Precision band | GGUF file                      | Why we are using it                     | File size |
| -------------- | ------------------------------ | --------------------------------------- | --------- |
| 2-bit          | `gemma-4-E2B-it-UD-IQ2_M.gguf` | Most aggressive compression in this lab | 2.4 GB    |
| 4-bit          | `gemma-4-E2B-it-Q4_K_M.gguf`   | Common middle-ground quant              | 3.17 GB   |
| 8-bit          | `gemma-4-E2B-it-Q8_0.gguf`     | Highest-quality quant in this lab       | 5.05 GB   |

Even though the filenames differ, these are all the same underlying instruction-tuned Gemma 4 E2B model. The main variable we are changing is how the weights are stored.
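
Before downloading anything, it is worth putting numbers on the gap between these files. A minimal sketch, using only the sizes from the table above, expresses each quant's disk footprint as savings relative to the `Q8_0` baseline:

```python
# File sizes (GB) taken from the table above.
sizes_gb = {"UD-IQ2_M": 2.4, "Q4_K_M": 3.17, "Q8_0": 5.05}

# Express each file's size as disk savings versus the 8-bit baseline.
baseline = sizes_gb["Q8_0"]
savings = {name: round(1 - gb / baseline, 2) for name, gb in sizes_gb.items()}
print(savings)  # the 2-bit file saves roughly half the disk space of Q8_0
```

Keep these ratios in mind: the question for the rest of the lab is whether the quality you give up is worth the space you get back.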

When we say these files are the same model, we mean that the overall neural network is still the same:

- The same architecture
- The same layer count
- The same tokenizer
- The same training and instruction tuning
- The same general behavior the model learned during training

What changes is the numeric representation of the learned weights.

Imagine one learned weight in the original model is:

```text
0.156347
```

That number came from training. It is one of many values the model uses while computing each next token. Quantization does not invent a new model from scratch. Instead, it takes that trained value and asks:

```text
How can we store a close-enough version of this number using fewer bits?
```

If we use a simplified integer-style quantization scheme, the math looks like this:

```text
scale = max(|w|) / (2^(bits - 1) - 1)
q = round(w / scale)
w_hat = q * scale
```

Where:

- `w` is the original weight
- `q` is the stored integer bucket
- `scale` maps integers back into the original numeric range
- `w_hat` is the reconstructed approximation used at inference time

So if the original trained value was `0.156347`, a lower-bit quantized file may not store that exact number anymore. It may store an integer bucket like `1`, `5`, or `22`, plus a scale, and reconstruct an approximation such as:

- `0.000000`
- `0.130029`
- `0.146806`
- `0.157782`

Those are not identical to the original weight, but they may still be close enough for useful inference.
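
The simplified scheme above fits in a few lines of Python. This is a toy sketch, not a real GGUF codec: it assumes a single symmetric scale, and the hypothetical `max(|w|)` of `0.9` is chosen here just for illustration. Notice that at 2 bits the weight collapses all the way to `0.0`, matching the first reconstruction listed above, while at 8 bits the integer bucket `22` lands very close to the trained value:

```python
def quantize(w, w_max, bits):
    """Toy symmetric quantization: map w to an integer bucket and back."""
    scale = w_max / (2 ** (bits - 1) - 1)  # scale = max(|w|) / (2^(bits-1) - 1)
    q = round(w / scale)                   # the stored integer bucket
    w_hat = q * scale                      # the reconstructed approximation
    return q, w_hat

w = 0.156347            # the trained weight from the text
for bits in (2, 4, 8):  # the three precision bands in this lab
    q, w_hat = quantize(w, w_max=0.9, bits=bits)  # 0.9 is an assumed max(|w|)
    print(f"{bits}-bit: bucket {q}, reconstructed {w_hat:.6f}")
```

Because the scale depends on the largest weight in the tensor, a different `w_max` would shift every reconstruction, which is one reason real formats store scales per block rather than per file.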

<div data-quantization-explorer></div>

### Explore: Interactive precision viewer

The viewer below zooms out from one weight and instead shows a toy layer with 16 stored values. Real GGUF schemes such as `Q4_K_M` and `UD-IQ2_M` are more sophisticated than this toy example, but the core idea is the same:

- Fewer bits means fewer representable values
- More weights get pushed into the same small set of stored buckets
- The layer becomes more compressed as precision drops

<div data-quantization-grid-explorer></div>
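
The bucket-crowding effect the viewer illustrates can also be sketched numerically. The 16 "trained" weights below are made up for illustration; the sketch counts how many distinct stored values survive at each bit width using the same toy scheme as before:

```python
# A toy "layer" of 16 trained weights (hypothetical values).
weights = [0.91, -0.42, 0.13, 0.07, -0.88, 0.55, -0.21, 0.34,
           0.02, -0.67, 0.48, -0.05, 0.76, -0.33, 0.19, -0.59]

def bucket_count(ws, bits):
    """How many distinct integer buckets do these weights occupy?"""
    scale = max(abs(w) for w in ws) / (2 ** (bits - 1) - 1)
    return len({round(w / scale) for w in ws})

for bits in (8, 4, 2):
    print(f"{bits}-bit: {bucket_count(weights, bits)} distinct buckets")
```

At 8 bits every weight keeps its own bucket; at 2 bits the whole layer is squeezed into just three stored values, which is exactly the crowding the viewer shows.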

### Explore: Compare the same prompts through the hosted chat widget

If your instructor provides an OpenAI-compatible endpoint, you can compare the same prompts through the embedded chat tool below:

- Paste the lab endpoint and API key into the settings row
- Switch between `Q8_0`, `Q4_K_M`, and `UD-IQ2_M`
- Re-run the same prompt so you can compare coherence, stability, and SVG output
- Try a visual prompt such as `Draw a pelican riding a bicycle.`

The widget keeps the transcript in your browser so you can switch models without losing your place. Refresh the page to clear the chat history.

<div data-objective5-chat></div>
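
If you prefer scripting the comparison instead of using the widget, the same idea works against any OpenAI-compatible chat-completions endpoint. This sketch only builds the requests; the endpoint URL, API key, and model names are placeholders that your lab environment would supply, and actually sending them is left as the final commented step:

```python
import json
from urllib import request

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # placeholder lab endpoint
API_KEY = "sk-placeholder"                              # placeholder key

def build_request(model, prompt):
    """Build an identical chat request for each quantized model."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # reduce sampling noise so differences come from the quant
    }
    return request.Request(
        ENDPOINT,
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
    )

prompt = "Draw a pelican riding a bicycle."
reqs = [build_request(m, prompt) for m in ("Q8_0", "Q4_K_M", "UD-IQ2_M")]
# To actually send one: request.urlopen(reqs[0])  (requires a live endpoint)
```

Holding the prompt and temperature fixed across models is what makes the outputs comparable: the only variable left is the quantization.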

## Objective 6: Reflect on the Tradeoff

By this point, you should have:

- Compared three quantized versions of the same model
- Measured the storage savings directly
- Verified that the core model metadata remains largely the same
- Observed where output quality begins to degrade

The important takeaway is not that one quant is always "best." It is that quantization is a deployment decision: the right choice depends on your hardware limits, your acceptable quality loss, and the task you need the model to perform.

## Conclusion

This lab isolates quantization as the main variable. By downloading **Gemma 4 E2B Instruct** in `UD-IQ2_M`, `Q4_K_M`, and `Q8_0`, you can directly observe one of the most important tradeoffs in local inference: balancing model quality against disk usage and resource constraints.