Lab 2 - Quantization Tradeoffs: Comparing 2-bit, 4-bit, and 8-bit

Download Gemma 4 E2B in three GGUF quantizations and compare size, metadata, and output quality.

In this lab, we will:

  • Download the same Gemma model in UD-IQ2_M, Q4_K_M, and Q8_0
  • Compare file size and GGUF metadata across those quantizations
  • Observe how lower precision changes the model's behavior
  • Build intuition for when a smaller quant may or may not be worth it
Lab Flow Guide
Explore sections focus on comparison and trade-off analysis.
Execute sections require collecting evidence from each quantized model.

Objective 1: Understand the Model and the Quantizations

For this lab, we will use the Hugging Face repository for Unsloth's GGUF release of Gemma 4 E2B Instruct:

https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF

This repository currently exposes multiple GGUF variants of the same base model. We will focus on one file from each of these precision bands:

| Precision band | GGUF file | Why we are using it | File size |
| --- | --- | --- | --- |
| 2-bit | gemma-4-E2B-it-UD-IQ2_M.gguf | Most aggressive compression in this lab | 2.4 GB |
| 4-bit | gemma-4-E2B-it-Q4_K_M.gguf | Common middle-ground quant | 3.17 GB |
| 8-bit | gemma-4-E2B-it-Q8_0.gguf | Highest-quality quant in this lab | 5.05 GB |
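If you prefer the command line to the Hugging Face web UI, one way to fetch all three files is with the `huggingface-cli` tool (this assumes `huggingface_hub` is installed; the `./models` directory is an arbitrary choice):

```shell
# Download one GGUF file per precision band from the Unsloth repo.
# Filenames are taken from the table above.
for f in gemma-4-E2B-it-UD-IQ2_M.gguf \
         gemma-4-E2B-it-Q4_K_M.gguf \
         gemma-4-E2B-it-Q8_0.gguf; do
  huggingface-cli download unsloth/gemma-4-E2B-it-GGUF "$f" --local-dir ./models
done
```

Expect the downloads to take a while; together the three files total roughly 10.6 GB.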

Even though the filenames differ, these are all the same underlying instruction-tuned Gemma 4 E2B model. The main variable we are changing is how the weights are stored.
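As a quick back-of-the-envelope check, the sizes in the table work out to roughly half and two-thirds of the 8-bit file. A minimal Python sketch (numbers copied from the table above):

```python
# File sizes in GB, taken from the table above, normalized to Q8_0.
sizes_gb = {"UD-IQ2_M": 2.4, "Q4_K_M": 3.17, "Q8_0": 5.05}

for name, size in sizes_gb.items():
    print(f"{name}: {size / sizes_gb['Q8_0']:.0%} of the Q8_0 file size")
```

So the 2-bit file is about 48% of the 8-bit file, and the 4-bit file about 63%, even though all three encode the same network.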

When we say these files are the same model, we mean that the overall neural network is still the same:

  • The same architecture
  • The same layer count
  • The same tokenizer
  • The same training and instruction tuning
  • The same general behavior the model learned during training

What changes is the numeric representation of the learned weights.

Imagine one learned weight in the original model is:

0.156347

That number came from training. It is one of many values the model uses while computing each next token. Quantization does not invent a new model from scratch. Instead, it takes that trained value and asks:

How can we store a close-enough version of this number using fewer bits?

If we use a simplified integer-style quantization scheme, the math looks like this:

scale = max(|w|) / (2^(bits - 1) - 1)
q = round(w / scale)
w_hat = q * scale

Where:

  • w is the original weight
  • q is the stored integer bucket
  • scale maps integers back into the original numeric range
  • w_hat is the reconstructed approximation used at inference time

So if the original trained value was 0.156347, a lower-bit quantized file may not store that exact number anymore. It may store an integer bucket like 1, 5, or 22, plus a scale, and reconstruct an approximation such as:

  • 0.000000
  • 0.130029
  • 0.146806
  • 0.157782

Those are not identical to the original weight, but they may still be close enough for useful inference.
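The round-trip above can be sketched in a few lines of Python. This follows the simplified integer scheme from the formulas, not the actual GGUF math, and the per-tensor maximum of 1.0 is an assumed value for illustration:

```python
def quantize(w, bits, w_max):
    """Simplified symmetric integer quantization of one weight."""
    # scale maps the integer range back into the original numeric range
    scale = w_max / (2 ** (bits - 1) - 1)
    q = round(w / scale)   # the stored integer bucket
    w_hat = q * scale      # the reconstructed approximation used at inference
    return q, w_hat

w = 0.156347    # one trained weight, from the example above
w_max = 1.0     # assumed max |w| in this tensor (hypothetical)

for bits in (2, 4, 8):
    q, w_hat = quantize(w, bits, w_max)
    print(f"{bits}-bit: q={q:>3}, reconstructed={w_hat:.6f}")
```

Under these assumptions, 2 bits collapse the weight all the way to 0.0, 4 bits reconstruct roughly 0.143, and 8 bits land within about 0.001 of the original value, which mirrors the pattern in the list above.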

Explore: Interactive precision viewer

The viewer below zooms out from one weight and instead shows a toy layer with 16 stored values. Real GGUF schemes such as Q4_K_M and UD-IQ2_M are more sophisticated than this toy example, but the core idea is the same:

  • Fewer bits means fewer representable values
  • More weights get pushed into the same small set of stored buckets
  • The layer becomes more compressed as precision drops
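The same collapse can be simulated in code. The sketch below builds a toy 16-weight layer from random values (an assumption for illustration, not real model weights) and counts how many distinct buckets survive at each bit width:

```python
import random

def distinct_buckets(weights, bits):
    """Count how many distinct integer buckets the weights collapse into."""
    w_max = max(abs(w) for w in weights)
    scale = w_max / (2 ** (bits - 1) - 1)
    return len({round(w / scale) for w in weights})

random.seed(0)
layer = [random.uniform(-1.0, 1.0) for _ in range(16)]  # toy 16-weight "layer"

for bits in (2, 4, 8):
    print(f"{bits}-bit: {distinct_buckets(layer, bits)} distinct buckets for 16 weights")
```

At 2 bits there are only three representable buckets (-1, 0, 1 times the scale), so most of the 16 weights share a bucket; at 8 bits nearly every weight keeps its own value.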

Explore: Compare the same prompts through the hosted chat widget

If your instructor provides an OpenAI-compatible endpoint, you can compare the same prompts through the embedded chat tool below:

  • Paste the lab endpoint and API key into the settings row
  • Switch between Q8_0, Q4_K_M, and UD-IQ2_M
  • Re-run the same prompt so you can compare coherence, stability, and SVG output
  • Try a visual prompt such as "Draw a pelican riding a bicycle."

The widget keeps the transcript in your browser so you can switch models without losing your place. Refresh the page to clear the chat history.

Objective 6: Reflect on the Tradeoff

By this point, you should have:

  • Compared three quantized versions of the same model
  • Measured the storage savings directly
  • Verified that the core model metadata remains largely the same
  • Observed where output quality begins to degrade
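One way to spot-check the metadata claim programmatically is to read the fixed-size header at the start of each GGUF file. This sketch follows the published GGUF layout (magic bytes, format version, tensor count, metadata key-value count); the file path in the comment is a placeholder for wherever you saved the files:

```python
import struct

def read_gguf_header(path):
    """Read the fixed GGUF header: magic, version, tensor count, metadata KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)                               # b"GGUF" for a valid file
        version, = struct.unpack("<I", f.read(4))       # format version
        tensor_count, = struct.unpack("<Q", f.read(8))  # number of tensors
        kv_count, = struct.unpack("<Q", f.read(8))      # number of metadata entries
    return magic, version, tensor_count, kv_count

# Example (placeholder path):
# print(read_gguf_header("models/gemma-4-E2B-it-Q4_K_M.gguf"))
```

If the three quantizations really are the same underlying model, the tensor counts should match across files even though the stored bytes per tensor differ.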

The important takeaway is not that one quant is always "best." The important takeaway is that quantization is a deployment decision. The right choice depends on your hardware limits, acceptable quality loss, and the task you need the model to perform.

Conclusion

This lab isolates quantization as the main variable. By downloading Gemma 4 E2B Instruct in UD-IQ2_M, Q4_K_M, and Q8_0, you can directly observe one of the most important tradeoffs in local inference: balancing model quality against disk usage and resource constraints.