Lab 2 - Quantization Tradeoffs: Comparing 2-bit, 4-bit, and 8-bit

Download Gemma 4 E2B in three GGUF quantizations and compare size, metadata, and output quality.

In this lab, we will:

  • Download the same Gemma model in UD-IQ2_M, Q4_K_M, and Q8_0
  • Compare file size and GGUF metadata across those quantizations
  • Observe how lower precision changes the model's behavior
  • Build intuition for when a smaller quant may or may not be worth it
Lab Flow Guide
Explore sections focus on comparison and trade-off analysis.
Execute sections require collecting evidence from each quantized model.

Objective 1: Understand the Model and the Quantizations

For this lab, we will use the Hugging Face repository for Unsloth's GGUF release of Gemma 4 E2B Instruct:

https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF

This repository currently exposes multiple GGUF variants of the same base model. We will focus on one file from each of these precision bands:

| Precision band | GGUF file | Why we are using it | File size |
| --- | --- | --- | --- |
| 2-bit | gemma-4-E2B-it-UD-IQ2_M.gguf | Most aggressive compression in this lab | 2.4 GB |
| 4-bit | gemma-4-E2B-it-Q4_K_M.gguf | Common middle-ground quant | 3.17 GB |
| 8-bit | gemma-4-E2B-it-Q8_0.gguf | Highest-quality quant in this lab | 5.05 GB |
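If you prefer the command line to the Hugging Face web UI, one way to fetch all three files is with the `huggingface-cli` tool (this assumes `huggingface_hub` is installed; the `./models` directory is an arbitrary choice):

```shell
# Download one GGUF file per precision band from the Unsloth repo.
# Filenames are taken from the table above.
for f in gemma-4-E2B-it-UD-IQ2_M.gguf \
         gemma-4-E2B-it-Q4_K_M.gguf \
         gemma-4-E2B-it-Q8_0.gguf; do
  huggingface-cli download unsloth/gemma-4-E2B-it-GGUF "$f" --local-dir ./models
done
```

Expect the downloads to take a while; together the three files total roughly 10.6 GB.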

Even though the filenames differ, these are all the same underlying instruction-tuned Gemma 4 E2B model. The main variable we are changing is how the weights are stored.
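As a quick back-of-the-envelope check, the sizes in the table work out to roughly half and two-thirds of the 8-bit file. A minimal Python sketch (numbers copied from the table above):

```python
# File sizes in GB, taken from the table above, normalized to Q8_0.
sizes_gb = {"UD-IQ2_M": 2.4, "Q4_K_M": 3.17, "Q8_0": 5.05}

for name, size in sizes_gb.items():
    print(f"{name}: {size / sizes_gb['Q8_0']:.0%} of the Q8_0 file size")
```

So the 2-bit file is about 48% of the 8-bit file, and the 4-bit file about 63%, even though all three encode the same network.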

When we say these files are the same model, we mean that the overall neural network is still the same:

  • The same architecture
  • The same layer count
  • The same tokenizer
  • The same training and instruction tuning
  • The same general behavior the model learned during training

What changes is the numeric representation of the learned weights.

Imagine one learned weight in the original model is:

0.156347

That number came from training. It is one of many values the model uses while computing each next token. Quantization does not invent a new model from scratch. Instead, it takes that trained value and asks:

How can we store a close-enough version of this number using fewer bits?

If we use a simplified integer-style quantization scheme, the math looks like this:

scale = max(|w|) / (2^(bits - 1) - 1)
q = round(w / scale)
w_hat = q * scale

Where:

  • w is the original weight
  • q is the stored integer bucket
  • scale maps integers back into the original numeric range
  • w_hat is the reconstructed approximation used at inference time

So if the original trained value was 0.156347, a lower-bit quantized file may not store that exact number anymore. It may store an integer bucket like 1, 5, or 22, plus a scale, and reconstruct an approximation such as:

  • 0.000000
  • 0.130029
  • 0.146806
  • 0.157782

Those are not identical to the original weight, but they may still be close enough for useful inference.
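The round-trip above can be sketched in a few lines of Python. This follows the simplified integer scheme from the formulas, not the actual GGUF math, and the per-tensor maximum of 1.0 is an assumed value for illustration:

```python
def quantize(w, bits, w_max):
    """Simplified symmetric integer quantization of one weight."""
    # scale maps the integer range back into the original numeric range
    scale = w_max / (2 ** (bits - 1) - 1)
    q = round(w / scale)   # the stored integer bucket
    w_hat = q * scale      # the reconstructed approximation used at inference
    return q, w_hat

w = 0.156347    # one trained weight, from the example above
w_max = 1.0     # assumed max |w| in this tensor (hypothetical)

for bits in (2, 4, 8):
    q, w_hat = quantize(w, bits, w_max)
    print(f"{bits}-bit: q={q:>3}, reconstructed={w_hat:.6f}")
```

Under these assumptions, 2 bits collapse the weight all the way to 0.0, 4 bits reconstruct roughly 0.143, and 8 bits land within about 0.001 of the original value, which mirrors the pattern in the list above.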

Explore: Interactive precision viewer

The viewer below zooms out from one weight and instead shows a toy layer with 16 stored values. Real GGUF schemes such as Q4_K_M and UD-IQ2_M are more sophisticated than this toy example, but the core idea is the same:

  • Fewer bits means fewer representable values
  • More weights get pushed into the same small set of stored buckets
  • The layer becomes more compressed as precision drops
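The same collapse can be simulated in code. The sketch below builds a toy 16-weight layer from random values (an assumption for illustration, not real model weights) and counts how many distinct buckets survive at each bit width:

```python
import random

def distinct_buckets(weights, bits):
    """Count how many distinct integer buckets the weights collapse into."""
    w_max = max(abs(w) for w in weights)
    scale = w_max / (2 ** (bits - 1) - 1)
    return len({round(w / scale) for w in weights})

random.seed(0)
layer = [random.uniform(-1.0, 1.0) for _ in range(16)]  # toy 16-weight "layer"

for bits in (2, 4, 8):
    print(f"{bits}-bit: {distinct_buckets(layer, bits)} distinct buckets for 16 weights")
```

At 2 bits there are only three representable buckets (-1, 0, 1 times the scale), so most of the 16 weights share a bucket; at 8 bits nearly every weight keeps its own value.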

Explore: Compare the same prompts through the hosted chat widget

If your instructor provides an OpenAI-compatible endpoint, you can compare the same prompts through the embedded chat tool below:

  • Paste the lab endpoint and API key into the settings row
  • Switch between Q8_0, Q4_K_M, and UD-IQ2_M
  • Re-run the same prompt so you can compare coherence, stability, and SVG output
  • Try a visual prompt such as "Draw a pelican riding a bicycle."

The widget keeps the transcript in your browser so you can switch models without losing your place. Refresh the page to clear the chat history.

Objective 6: Reflect on the Tradeoff

By this point, you should have:

  • Compared three quantized versions of the same model
  • Measured the storage savings directly
  • Verified that the core model metadata remains largely the same
  • Observed where output quality begins to degrade
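One way to spot-check the metadata claim programmatically is to read the fixed-size header at the start of each GGUF file. This sketch follows the published GGUF layout (magic bytes, format version, tensor count, metadata key-value count); the file path in the comment is a placeholder for wherever you saved the files:

```python
import struct

def read_gguf_header(path):
    """Read the fixed GGUF header: magic, version, tensor count, metadata KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)                               # b"GGUF" for a valid file
        version, = struct.unpack("<I", f.read(4))       # format version
        tensor_count, = struct.unpack("<Q", f.read(8))  # number of tensors
        kv_count, = struct.unpack("<Q", f.read(8))      # number of metadata entries
    return magic, version, tensor_count, kv_count

# Example (placeholder path):
# print(read_gguf_header("models/gemma-4-E2B-it-Q4_K_M.gguf"))
```

If the three quantizations really are the same underlying model, the tensor counts should match across files even though the stored bytes per tensor differ.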

The important takeaway is not that one quant is always "best." The important takeaway is that quantization is a deployment decision. The right choice depends on your hardware limits, acceptable quality loss, and the task you need the model to perform.

Conclusion

This lab isolates quantization as the main variable. By downloading Gemma 4 E2B Instruct in UD-IQ2_M, Q4_K_M, and Q8_0, you can directly observe one of the most important tradeoffs in local inference: balancing model quality against disk usage and resource constraints.