---
order: 2
title: "Lab 2 - Quantization Tradeoffs: Comparing 4-bit and 6-bit"
description: "Compare Gemma 4 E2B in two Ollama quantizations and study how lower precision changes behavior."
---

In this lab, we will:

  • Pull the same Gemma model in Q4 and Q6 Ollama variants
  • Compare the quantization labels and model behavior across those variants
  • Observe how lower precision changes the model's behavior
  • Build intuition for when a smaller quant may or may not be worth it
## Lab Flow Guide
Explore sections focus on comparison and trade-off analysis.
Execute sections require collecting evidence from each quantized model.

## Objective 1: Understand the Model and the Quantizations

For this lab, we will use two Ollama-published variants of Gemma 4 E2B that represent distinct precision bands:

| Precision band | Ollama model tag | Why we are using it |
| --- | --- | --- |
| Q4 | `batiai/gemma4-e2b:q4` | Faster, smaller quant |
| Q6 | `batiai/gemma4-e2b:q6` | Higher-quality reference quant for this lab |

Even though the Ollama tags differ, these are both variants of the same underlying Gemma 4 E2B model family. The main variable we are changing is how the weights are stored.

When we say these files are the same model, we mean that the overall neural network is still the same:

  • The same architecture
  • The same layer count
  • The same tokenizer
  • The same training and instruction tuning
  • The same general behavior the model learned during training

What changes is the numeric representation of the learned weights.

Imagine one learned weight in the original model is:

0.156347

That number came from training. It is one of many values the model uses while computing each next token. Quantization does not invent a new model from scratch. Instead, it takes that trained value and asks:

How can we store a close-enough version of this number using fewer bits?

If we use a simplified integer-style quantization scheme, the math looks like this:

```
scale = max(|w|) / (2^(bits - 1) - 1)
q     = round(w / scale)
w_hat = q * scale
```

Where:

  • w is the original weight
  • q is the stored integer bucket
  • scale maps integers back into the original numeric range
  • w_hat is the reconstructed approximation used at inference time

So if the original trained value was 0.156347, a lower-bit quantized file may not store that exact number anymore. It may store an integer bucket like 1, 5, or 22, plus a scale, and reconstruct an approximation such as:

  • 0.000000
  • 0.130029
  • 0.146806
  • 0.157782

Those are not identical to the original weight, but they may still be close enough for useful inference.
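The simplified scheme above can be sketched in a few lines of Python. This is a toy illustration, not the actual GGUF algorithm: the weight values other than the worked example 0.156347 are made up, and real schemes quantize in small groups with per-group scales.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantize then reconstruct weights with the simplified symmetric scheme."""
    qmax = 2 ** (bits - 1) - 1         # largest stored integer, e.g. 7 for 4-bit
    scale = np.max(np.abs(w)) / qmax   # maps integers back to the weight range
    q = np.round(w / scale)            # the stored integer buckets
    return q * scale                   # w_hat: the approximation used at inference

# A made-up group of weights; 0.156347 is the worked example from the text.
w = np.array([0.156347, -0.52, 0.91, 0.07, -0.33])

for bits in (8, 6, 4):
    w_hat = fake_quantize(w, bits)
    err = np.max(np.abs(w_hat - w))
    print(f"{bits}-bit: w_hat[0] = {w_hat[0]:.6f}, max abs error = {err:.6f}")
```

Running this shows the pattern the lab is built around: the reconstruction of each weight drifts further from the trained value as the bit width drops.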

### Explore: Interactive precision viewer

The viewer below zooms out from one weight and instead shows a toy layer with 16 stored values. Real GGUF schemes such as Q4_K_M and Q6_K are more sophisticated than this toy example, but the core idea is the same:

  • Fewer bits means fewer representable values
  • More weights get pushed into the same small set of stored buckets
  • The layer becomes more compressed as precision drops

### Explore: Compare the same prompts through the hosted chat widget

By default, the widget below points to the courseware-managed Ollama service and the Lab 2 model tags above. You can still switch to another endpoint if your instructor provides one.

  • Use the preloaded managed endpoint or replace it with another compatible endpoint
  • Optionally add an API key if your chosen endpoint requires one
  • Switch between the configured Q4 and Q6 Gemma variants
  • Re-run the same prompt so you can compare coherence, stability, and SVG output
  • Try a visual prompt such as "Draw a pelican riding a bicycle."

The widget keeps the transcript in your browser so you can switch models without losing your place. Refresh the page to clear the chat history.
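If you prefer a script to the widget, the same comparison can be sketched against an Ollama server's `/api/generate` endpoint. This is an illustrative sketch, not the widget's implementation: it assumes a locally running Ollama server at the default `localhost:11434` (substitute your managed endpoint's URL if your instructor provides one) and uses the lab's Q4 and Q6 model tags.

```python
import json
import urllib.request

# Default local Ollama endpoint; swap in the courseware-managed URL if needed.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODELS = ["batiai/gemma4-e2b:q4", "batiai/gemma4-e2b:q6"]

def build_request(model: str, prompt: str) -> dict:
    """Non-streaming request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model and return its full response text."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (needs a running server with both models pulled):
#   for model in MODELS:
#       print(model, "->", ask(model, "Draw a pelican riding a bicycle as SVG.")[:200])
```

Sending an identical prompt to both tags and diffing the replies is the scripted equivalent of switching models in the widget.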

## Objective 6: Reflect on the Tradeoff

By this point, you should have:

  • Compared two quantized versions of the same model
  • Measured the storage savings directly
  • Verified that the core model metadata remains largely the same
  • Observed where output quality begins to degrade

The important takeaway is not that one quant is always "best." The important takeaway is that quantization is a deployment decision. The right choice depends on your hardware limits, acceptable quality loss, and the task you need the model to perform.

## Conclusion

This lab isolates quantization as the main variable. By comparing Gemma 4 E2B in Q4 and Q6 Ollama variants, you can directly observe one of the most important tradeoffs in local inference: balancing model quality against efficiency and resource constraints.