commit 431e667c5e
parent f74575277a
Date: 2026-04-23 14:48:07 -06:00

9 changed files with 505 additions and 228 deletions
@@ -1,7 +1,7 @@
 ---
 order: 2
 title: "Lab 2 - Quantization Tradeoffs: Comparing 2-bit, 4-bit, and 8-bit"
-description: Download Gemma 4 E2B in three GGUF quantizations and compare size, metadata, and output quality.
+description: Compare Gemma 4 E2B in three Ollama quantizations and study how lower precision changes behavior.
 ---
 
 <!-- breakout-style: instruction-rails -->
@@ -10,8 +10,8 @@ description: Download Gemma 4 E2B in three GGUF quantizations and compare size,
 In this lab, we will:
 
-- Download the same Gemma model in `UD-IQ2_M`, `Q4_K_M`, and `Q8_0`
-- Compare file size and GGUF metadata across those quantizations
+- Pull the same Gemma model in Q2, Q4, and Q8 Ollama variants
+- Compare the quantization labels and model behavior across those variants
 - Observe how lower precision changes the model's behavior
 - Build intuition for when a smaller quant may or may not be worth it
@@ -23,19 +23,15 @@ In this lab, we will:
 ## Objective 1: Understand the Model and the Quantizations
 
-For this lab, we will use the Hugging Face repository for **Unsloth's GGUF release of Gemma 4 E2B Instruct**:
+For this lab, we will use three Ollama-published variants of **Gemma 4 E2B** that represent distinct precision bands:
 
-<https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF>
+| Precision band | Ollama model tag                 | Why we are using it                      |
+| -------------- | -------------------------------- | ---------------------------------------- |
+| Q2             | `cajina/gemma4_e2b-q2_k_xl:v01`  | Most aggressive compression in this lab  |
+| Q4             | `batiai/gemma4-e2b:q4`           | Common middle-ground quant               |
+| Q8             | `bjoernb/gemma4-e2b-fast:latest` | Highest-quality quant in this lab        |
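+
+As a quick sketch of this setup (assuming Ollama is installed and these community tags are still published), you can pull all three variants from a terminal and compare their on-disk sizes:
+
+```bash
+# Pull each Lab 2 variant; the tags come from the table above.
+ollama pull cajina/gemma4_e2b-q2_k_xl:v01    # Q2: most aggressive compression
+ollama pull batiai/gemma4-e2b:q4             # Q4: common middle ground
+ollama pull bjoernb/gemma4-e2b-fast:latest   # Q8: highest quality in this lab
+
+# List downloaded models to confirm the pulls and compare sizes.
+ollama list
+```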
-
-This repository currently exposes multiple GGUF variants of the same base model. We will focus on one file from each of these precision bands:
-
-| Precision band | GGUF file                      | Why we are using it                      | File Size |
-| -------------- | ------------------------------ | ---------------------------------------- | --------- |
-| 2-bit          | `gemma-4-E2B-it-UD-IQ2_M.gguf` | Most aggressive compression in this lab  | 2.4 GB    |
-| 4-bit          | `gemma-4-E2B-it-Q4_K_M.gguf`   | Common middle-ground quant               | 3.17 GB   |
-| 8-bit          | `gemma-4-E2B-it-Q8_0.gguf`     | Highest-quality quant in this lab        | 5.05 GB   |
-
-Even though the filenames differ, these are all the same underlying instruction-tuned Gemma 4 E2B model. The main variable we are changing is how the weights are stored.
+
+Even though the Ollama tags differ, these are all variants of the same underlying Gemma 4 E2B model family. The main variable we are changing is how the weights are stored.
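+
+As a minimal demonstration of that point, you can send one identical prompt through each tag and compare the three answers (the prompt below is only an example):
+
+```bash
+# Run the same example prompt through each quantization and compare outputs.
+for tag in cajina/gemma4_e2b-q2_k_xl:v01 batiai/gemma4-e2b:q4 bjoernb/gemma4-e2b-fast:latest; do
+  echo "=== $tag ==="
+  ollama run "$tag" "Summarize what quantization does to a model in one sentence."
+done
+```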
 
 When we say these files are the same model, we mean that the overall neural network is still the same:
@@ -97,10 +93,11 @@ The viewer below zooms out from one weight and instead shows a toy layer with 16
 ### Explore: Compare the same prompts through the hosted chat widget
 
-If your instructor provides an OpenAI-compatible endpoint, you can compare the same prompts through the embedded chat tool below:
+By default, the widget below points to the courseware-managed Ollama service and the three Lab 2 model tags above. You can still switch to another endpoint if your instructor provides one.
 
-- Paste the lab endpoint and API key into the settings row
-- Switch between `Q8_0`, `Q4_K_M`, and `UD-IQ2_M`
+- Use the preloaded managed endpoint or replace it with another compatible endpoint
+- Optionally add an API key if your chosen endpoint requires one
+- Switch between the configured Q2, Q4, and Q8 Gemma variants (a terminal equivalent is sketched after this list)
 - Re-run the same prompt so you can compare coherence, stability, and SVG output
 - Try a visual prompt such as `Draw a pelican riding a bicycle.`
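+
+If you prefer a terminal to the widget, a rough equivalent is to call the endpoint's chat-completions route directly. The sketch below assumes the endpoint speaks the OpenAI-compatible API that Ollama exposes; `LAB_ENDPOINT` and `LAB_API_KEY` are placeholders for whatever your instructor or the courseware actually gives you:
+
+```bash
+# Placeholders: substitute the endpoint and key you were given for this lab.
+export LAB_ENDPOINT="http://localhost:11434"   # Ollama's default local address
+export LAB_API_KEY="unused"                    # many local endpoints ignore the key
+
+curl -s "$LAB_ENDPOINT/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $LAB_API_KEY" \
+  -d '{
+    "model": "batiai/gemma4-e2b:q4",
+    "messages": [
+      {"role": "user", "content": "Draw a pelican riding a bicycle."}
+    ]
+  }'
+```
+
+Swapping the `"model"` field between the three Lab 2 tags reproduces the same comparison the widget makes.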
@@ -121,4 +118,4 @@ The important takeaway is not that one quant is always "best." The important tak
 ## Conclusion
 
-This lab isolates quantization as the main variable. By downloading **Gemma 4 E2B Instruct** in `UD-IQ2_M`, `Q4_K_M`, and `Q8_0`, you can directly observe one of the most important tradeoffs in local inference: balancing model quality against disk usage and resource constraints.
+This lab isolates quantization as the main variable. By comparing **Gemma 4 E2B** in Q2, Q4, and Q8 Ollama variants, you can directly observe one of the most important tradeoffs in local inference: balancing model quality against efficiency and resource constraints.