---
order: 2
title: "Lab 2 - Quantization Tradeoffs: Comparing 4-bit and 6-bit"
description: "Compare Gemma 4 E2B in two Ollama quantizations and study how lower precision changes behavior."
---

In this lab, we will:

  • Pull the same Gemma model in Q4 and Q6 Ollama variants
  • Compare the quantization labels and model behavior across those variants
  • Observe how lower precision changes the model's behavior
  • Build intuition for when a smaller quant may or may not be worth it
## Lab Flow Guide
Explore sections focus on comparison and trade-off analysis.
Execute sections require collecting evidence from each quantized model.

## Objective 1: Understand the Model and the Quantizations

For this lab, we will use two Ollama-published variants of Gemma 4 E2B that represent distinct precision bands:

| Precision band | Ollama model tag | Why we are using it |
| --- | --- | --- |
| Q4 | `batiai/gemma4-e2b:q4` | Faster, smaller quant |
| Q6 | `batiai/gemma4-e2b:q6` | Higher-quality reference quant for this lab |

Even though the Ollama tags differ, these are both variants of the same underlying Gemma 4 E2B model family. The main variable we are changing is how the weights are stored.

When we say these files are the same model, we mean that the overall neural network is still the same:

  • The same architecture
  • The same layer count
  • The same tokenizer
  • The same training and instruction tuning
  • The same general behavior the model learned during training

What changes is the numeric representation of the learned weights.

Imagine one learned weight in the original model is:

0.156347

That number came from training. It is one of many values the model uses while computing each next token. Quantization does not invent a new model from scratch. Instead, it takes that trained value and asks:

How can we store a close-enough version of this number using fewer bits?

If we use a simplified integer-style quantization scheme, the math looks like this:

```
scale = max(|w|) / (2^(bits - 1) - 1)
q     = round(w / scale)
w_hat = q * scale
```

Where:

  • w is the original weight
  • q is the stored integer bucket
  • scale maps integers back into the original numeric range
  • w_hat is the reconstructed approximation used at inference time

So if the original trained value was 0.156347, a lower-bit quantized file may not store that exact number anymore. It may store an integer bucket like 1, 5, or 22, plus a scale, and reconstruct an approximation such as:

  • 0.000000
  • 0.130029
  • 0.146806
  • 0.157782

Those are not identical to the original weight, but they may still be close enough for useful inference.
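The simplified scheme above can be sketched in a few lines of Python. This is a toy illustration, not the actual GGUF algorithm: the weight values other than the worked example 0.156347 are made up, and real schemes quantize in small groups with per-group scales.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantize then reconstruct weights with the simplified symmetric scheme."""
    qmax = 2 ** (bits - 1) - 1         # largest stored integer, e.g. 7 for 4-bit
    scale = np.max(np.abs(w)) / qmax   # maps integers back to the weight range
    q = np.round(w / scale)            # the stored integer buckets
    return q * scale                   # w_hat: the approximation used at inference

# A made-up group of weights; 0.156347 is the worked example from the text.
w = np.array([0.156347, -0.52, 0.91, 0.07, -0.33])

for bits in (8, 6, 4):
    w_hat = fake_quantize(w, bits)
    err = np.max(np.abs(w_hat - w))
    print(f"{bits}-bit: w_hat[0] = {w_hat[0]:.6f}, max abs error = {err:.6f}")
```

Running this shows the pattern the lab is built around: the reconstruction of each weight drifts further from the trained value as the bit width drops.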

### Explore: Interactive precision viewer

The viewer below zooms out from one weight and instead shows a toy layer with 16 stored values. Real GGUF schemes such as Q4_K_M and Q6_K are more sophisticated than this toy example, but the core idea is the same:

  • Fewer bits means fewer representable values
  • More weights get pushed into the same small set of stored buckets
  • The layer becomes more compressed as precision drops

### Explore: Compare the same prompts through the hosted chat widget

By default, the widget below points to the courseware-managed Ollama service and the Lab 2 model tags above. You can still switch to another endpoint if your instructor provides one.

  • Use the preloaded managed endpoint or replace it with another compatible endpoint
  • Optionally add an API key if your chosen endpoint requires one
  • Switch between the configured Q4 and Q6 Gemma variants
  • Re-run the same prompt so you can compare coherence, stability, and SVG output
  • Try a visual prompt such as "Draw a pelican riding a bicycle."

The widget keeps the transcript in your browser so you can switch models without losing your place. Refresh the page to clear the chat history.
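If you prefer a script to the widget, the same comparison can be sketched against an Ollama server's `/api/generate` endpoint. This is an illustrative sketch, not the widget's implementation: it assumes a locally running Ollama server at the default `localhost:11434` (substitute your managed endpoint's URL if your instructor provides one) and uses the lab's Q4 and Q6 model tags.

```python
import json
import urllib.request

# Default local Ollama endpoint; swap in the courseware-managed URL if needed.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODELS = ["batiai/gemma4-e2b:q4", "batiai/gemma4-e2b:q6"]

def build_request(model: str, prompt: str) -> dict:
    """Non-streaming request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model and return its full response text."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (needs a running server with both models pulled):
#   for model in MODELS:
#       print(model, "->", ask(model, "Draw a pelican riding a bicycle as SVG.")[:200])
```

Sending an identical prompt to both tags and diffing the replies is the scripted equivalent of switching models in the widget.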

## Objective 6: Reflect on the Tradeoff

By this point, you should have:

  • Compared two quantized versions of the same model
  • Measured the storage savings directly
  • Verified that the core model metadata remains largely the same
  • Observed where output quality begins to degrade

The important takeaway is not that one quant is always "best." The important takeaway is that quantization is a deployment decision. The right choice depends on your hardware limits, acceptable quality loss, and the task you need the model to perform.

## Conclusion

This lab isolates quantization as the main variable. By comparing Gemma 4 E2B in Q4 and Q6 Ollama variants, you can directly observe one of the most important tradeoffs in local inference: balancing model quality against efficiency and resource constraints.