---
order: 2
title: "Lab 2 - Quantization Tradeoffs: Comparing 2-bit, 4-bit, and 8-bit"
description: Download Gemma 4 E2B in three GGUF quantizations and compare size, metadata, and output quality.
---

<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->

In this lab, we will:

- Download the same Gemma model in `UD-IQ2_M`, `Q4_K_M`, and `Q8_0`
- Compare file size and GGUF metadata across those quantizations
- Observe how lower precision changes the model's behavior
- Build intuition for when a smaller quant may or may not be worth it

<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on comparison and trade-off analysis.<br />
<strong>Execute</strong> sections require collecting evidence from each quantized model.
</div>

## Objective 1: Understand the Model and the Quantizations

For this lab, we will use the Hugging Face repository for **Unsloth's GGUF release of Gemma 4 E2B Instruct**:

<https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF>

This repository currently exposes multiple GGUF variants of the same base model. We will focus on one file from each of these precision bands:

| Precision band | GGUF file                      | Why we are using it                      | File size |
| -------------- | ------------------------------ | ---------------------------------------- | --------- |
| 2-bit          | `gemma-4-E2B-it-UD-IQ2_M.gguf` | Most aggressive compression in this lab  | 2.4 GB    |
| 4-bit          | `gemma-4-E2B-it-Q4_K_M.gguf`   | Common middle-ground quant               | 3.17 GB   |
| 8-bit          | `gemma-4-E2B-it-Q8_0.gguf`     | Highest-quality quant in this lab        | 5.05 GB   |

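If you prefer to script the downloads rather than click through the website, a minimal sketch using the `huggingface_hub` library (`pip install huggingface_hub`) is shown below. The repository ID and filenames come straight from the table above; note that fetching all three files needs a little over 10 GB of free disk space.

```python
# Sketch: download the three quantized GGUF files and compare their sizes on disk.
# Assumes `pip install huggingface_hub`; filenames are taken from the table above.
import os

from huggingface_hub import hf_hub_download

REPO_ID = "unsloth/gemma-4-E2B-it-GGUF"
FILENAMES = [
    "gemma-4-E2B-it-UD-IQ2_M.gguf",  # 2-bit
    "gemma-4-E2B-it-Q4_K_M.gguf",    # 4-bit
    "gemma-4-E2B-it-Q8_0.gguf",      # 8-bit
]

for name in FILENAMES:
    # hf_hub_download returns the local path of the cached file.
    path = hf_hub_download(repo_id=REPO_ID, filename=name)
    size_gb = os.path.getsize(path) / 1e9
    print(f"{name}: {size_gb:.2f} GB at {path}")
```
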
Even though the filenames differ, these are all the same underlying instruction-tuned Gemma 4 E2B model. The main variable we are changing is how the weights are stored.

When we say these files are the same model, we mean that the overall neural network is still the same:

- The same architecture
- The same layer count
- The same tokenizer
- The same training and instruction tuning
- The same general behavior the model learned during training

What changes is the numeric representation of the learned weights.

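If you want to confirm that the three files really do share this metadata, a quick way is to read it directly from the GGUF headers. The sketch below assumes the `gguf` Python package that ships with llama.cpp (`pip install gguf`) and that the files are already on disk at the paths shown; the exact keys printed will vary, but the architecture and tokenizer entries should match across quantizations.

```python
# Sketch: list GGUF metadata keys and tensor counts for each downloaded file.
# Assumes `pip install gguf` and that the three files sit in the current directory.
from gguf import GGUFReader

paths = [
    "gemma-4-E2B-it-UD-IQ2_M.gguf",
    "gemma-4-E2B-it-Q4_K_M.gguf",
    "gemma-4-E2B-it-Q8_0.gguf",
]

for path in paths:
    reader = GGUFReader(path)
    print(f"\n{path}")
    print(f"  metadata keys: {len(reader.fields)}")
    print(f"  tensors:       {len(reader.tensors)}")
    # Architecture- and tokenizer-related keys should be identical across quants.
    for key in sorted(reader.fields):
        if key.startswith(("general.", "tokenizer.")):
            print(f"  {key}")
```
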
Imagine one learned weight in the original model is:

```text
0.156347
```

That number came from training. It is one of many values the model uses while computing each next token. Quantization does not invent a new model from scratch. Instead, it takes that trained value and asks:

```text
How can we store a close-enough version of this number using fewer bits?
```

If we use a simplified integer-style quantization scheme, the math looks like this:

```text
scale = max(|w|) / (2^(bits - 1) - 1)
q = round(w / scale)
w_hat = q * scale
```

Where:

- `w` is the original weight
- `q` is the stored integer bucket
- `scale` maps integers back into the original numeric range
- `w_hat` is the reconstructed approximation used at inference time

So if the original trained value was `0.156347`, a lower-bit quantized file may not store that exact number anymore. It may store an integer bucket like `1`, `5`, or `22`, plus a scale, and reconstruct an approximation such as:

- `0.000000`
- `0.130029`
- `0.146806`
- `0.157782`

Those are not identical to the original weight, but they may still be close enough for useful inference.

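To make the arithmetic concrete, here is a small sketch of that simplified scheme applied to the single weight `0.156347`. The surrounding maximum of `1.0` is an assumed value chosen for illustration; real GGUF formats quantize blocks of weights with per-block scales, so the exact reconstructions differ.

```python
# Sketch: symmetric integer quantization of a single weight at 2, 4, and 8 bits.
# The layer maximum of 1.0 is an assumed value for illustration only; real GGUF
# formats quantize blocks of weights with per-block scales and extra tricks.
w = 0.156347          # original trained weight
max_abs = 1.0         # assumed max(|w|) over the surrounding layer/block

for bits in (2, 4, 8):
    levels = 2 ** (bits - 1) - 1       # largest representable integer bucket
    scale = max_abs / levels           # scale = max(|w|) / (2^(bits - 1) - 1)
    q = round(w / scale)               # stored integer bucket
    w_hat = q * scale                  # reconstructed approximation
    error = abs(w - w_hat)
    print(f"{bits}-bit: q={q:>3}  w_hat={w_hat:.6f}  |error|={error:.6f}")
```

With this assumed maximum, the 2-bit case collapses the weight all the way to `0.000000` (the first value in the list above), while the 8-bit case reconstructs a value within roughly a thousandth of the original.
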
<div data-quantization-explorer></div>

### Explore: Interactive precision viewer

The viewer below zooms out from one weight and instead shows a toy layer with 16 stored values. Real GGUF schemes such as `Q4_K_M` and `UD-IQ2_M` are more sophisticated than this toy example, but the core idea is the same:

- Fewer bits means fewer representable values
- More weights get pushed into the same small set of stored buckets
- The layer becomes more compressed as precision drops

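If you want to reproduce the idea behind the viewer offline, the sketch below pushes 16 made-up weights (invented for this example, not taken from Gemma) through the same simplified scheme and counts how many distinct buckets survive at each precision.

```python
# Sketch: quantize a toy layer of 16 made-up weights and count surviving buckets.
# The weights below are invented for illustration; they are not from Gemma.
weights = [
    0.156, -0.872, 0.034, 0.611, -0.245, 0.998, -0.503, 0.077,
    0.432, -0.118, 0.765, -0.689, 0.021, 0.340, -0.927, 0.204,
]

max_abs = max(abs(w) for w in weights)

for bits in (2, 4, 8):
    levels = 2 ** (bits - 1) - 1
    scale = max_abs / levels
    buckets = [round(w / scale) for w in weights]
    unique = len(set(buckets))
    print(f"{bits}-bit: {unique:>2} distinct buckets used for 16 weights")
```

With these particular values, the 2-bit pass collapses all 16 weights into just three distinct buckets, while the 8-bit pass keeps every weight in its own bucket.
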
<div data-quantization-grid-explorer></div>

### Explore: Compare the same prompts through the hosted chat widget

If your instructor provides an OpenAI-compatible endpoint, you can compare the same prompts through the embedded chat tool below:

- Paste the lab endpoint and API key into the settings row
- Switch between `Q8_0`, `Q4_K_M`, and `UD-IQ2_M`
- Re-run the same prompt so you can compare coherence, stability, and SVG output
- Try a visual prompt such as `Draw a pelican riding a bicycle.`

The widget keeps the transcript in your browser so you can switch models without losing your place. Refresh the page to clear the chat history.

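If you would rather script the comparison than click through the widget, a minimal sketch using the `openai` Python client against an OpenAI-compatible endpoint could look like the following. The endpoint URL, API key, and model names are placeholders; substitute whatever your instructor's server actually exposes (the model identifiers often match the GGUF filenames, but that depends on how the server is configured).

```python
# Sketch: send the same prompt to each quantized model over an OpenAI-compatible API.
# The base_url, api_key, and model names below are placeholders for your lab endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://lab-endpoint.example/v1",  # placeholder: your instructor's endpoint
    api_key="sk-your-lab-key",                  # placeholder: your lab API key
)

# Placeholder model identifiers; check what the lab server actually registers.
models = ["gemma-4-E2B-it-Q8_0", "gemma-4-E2B-it-Q4_K_M", "gemma-4-E2B-it-UD-IQ2_M"]
prompt = "Draw a pelican riding a bicycle."

for model in models:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"\n=== {model} ===")
    print(response.choices[0].message.content)
```

Keeping the prompt fixed and changing only the model makes it easier to attribute differences in the output to quantization rather than to prompt wording.
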
<div data-objective5-chat></div>

## Objective 6: Reflect on the Tradeoff

By this point, you should have:

- Compared three quantized versions of the same model
- Measured the storage savings directly
- Verified that the core model metadata remains largely the same
- Observed where output quality begins to degrade

The important takeaway is not that one quant is always "best," but that quantization is a deployment decision. The right choice depends on your hardware limits, acceptable quality loss, and the task you need the model to perform.

## Conclusion

This lab isolates quantization as the main variable. By downloading **Gemma 4 E2B Instruct** in `UD-IQ2_M`, `Q4_K_M`, and `Q8_0`, you can directly observe one of the most important tradeoffs in local inference: balancing model quality against disk usage and resource constraints.