Update
@@ -1,7 +1,7 @@
 ---
 order: 2
 title: "Lab 2 - Quantization Tradeoffs: Comparing 2-bit, 4-bit, and 8-bit"
-description: Download Gemma 4 E2B in three GGUF quantizations and compare size, metadata, and output quality.
+description: Compare Gemma 4 E2B in three Ollama quantizations and study how lower precision changes behavior.
 ---
 
 <!-- breakout-style: instruction-rails -->
@@ -10,8 +10,8 @@ description: Download Gemma 4 E2B in three GGUF quantizations and compare size,
 
 In this lab, we will:
 
-- Download the same Gemma model in `UD-IQ2_M`, `Q4_K_M`, and `Q8_0`
-- Compare file size and GGUF metadata across those quantizations
+- Pull the same Gemma model in Q2, Q4, and Q8 Ollama variants
+- Compare the quantization labels and model behavior across those variants
 - Observe how lower precision changes the model's behavior
 - Build intuition for when a smaller quant may or may not be worth it
 
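The two kept bullets above turn on a single idea: the same weights stored at lower precision. As an illustrative aside (not part of this commit), the Python/NumPy sketch below snaps a toy layer of 16 random weights onto symmetric k-bit grids and measures the round-trip error, which is the gap the lab asks students to observe in model behavior:

```python
# Toy illustration only: round-trip weights through a symmetric k-bit grid
# to show why 2-bit storage loses more information than 8-bit.
import numpy as np

def quantize_roundtrip(w: np.ndarray, bits: int) -> np.ndarray:
    """Snap weights to a symmetric grid with 2**(bits-1) - 1 positive levels
    (one negative level is dropped for simplicity), then map them back."""
    levels = 2 ** (bits - 1) - 1          # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = np.abs(w).max() / levels      # one shared scale for the whole tensor
    return np.round(w / scale) * scale    # nearest grid point, rescaled to floats

weights = np.random.default_rng(0).normal(size=16).astype(np.float32)
for bits in (8, 4, 2):
    err = np.abs(weights - quantize_roundtrip(weights, bits)).mean()
    print(f"{bits}-bit mean absolute round-trip error: {err:.4f}")
```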
@@ -23,19 +23,15 @@ In this lab, we will:
 
 ## Objective 1: Understand the Model and the Quantizations
 
-For this lab, we will use the Hugging Face repository for **Unsloth's GGUF release of Gemma 4 E2B Instruct**:
+For this lab, we will use three Ollama-published variants of **Gemma 4 E2B** that represent distinct precision bands:
 
-<https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF>
-
-This repository currently exposes multiple GGUF variants of the same base model. We will focus on one file from each of these precision bands:
-
-| Precision band | GGUF file                      | Why we are using it                      | File Size |
-| -------------- | ------------------------------ | ---------------------------------------- | --------- |
-| 2-bit          | `gemma-4-E2B-it-UD-IQ2_M.gguf` | Most aggressive compression in this lab  | 2.4 GB    |
-| 4-bit          | `gemma-4-E2B-it-Q4_K_M.gguf`   | Common middle-ground quant               | 3.17 GB   |
-| 8-bit          | `gemma-4-E2B-it-Q8_0.gguf`     | Highest-quality quant in this lab        | 5.05 GB   |
-
-Even though the filenames differ, these are all the same underlying instruction-tuned Gemma 4 E2B model. The main variable we are changing is how the weights are stored.
+| Precision band | Ollama model tag                 | Why we are using it                      |
+| -------------- | -------------------------------- | ---------------------------------------- |
+| Q2             | `cajina/gemma4_e2b-q2_k_xl:v01`  | Most aggressive compression in this lab  |
+| Q4             | `batiai/gemma4-e2b:q4`           | Common middle-ground quant               |
+| Q8             | `bjoernb/gemma4-e2b-fast:latest` | Highest-quality quant in this lab        |
+
+Even though the Ollama tags differ, these are all variants of the same underlying Gemma 4 E2B model family. The main variable we are changing is how the weights are stored.
 
 When we say these files are the same model, we mean that the overall neural network is still the same:
 
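For readers reproducing the new Objective 1 outside the courseware, the three tags from the table can also be pulled and inspected programmatically. Below is a minimal sketch with the `ollama` Python client; the `details` field names assume a recent client version and a running local Ollama server:

```python
# Sketch, not part of the commit: pull the three Lab 2 variants and print
# the quantization label Ollama reports for each local copy.
import ollama

TAGS = [
    "cajina/gemma4_e2b-q2_k_xl:v01",   # Q2: most aggressive compression
    "batiai/gemma4-e2b:q4",            # Q4: common middle ground
    "bjoernb/gemma4-e2b-fast:latest",  # Q8: highest quality in this lab
]

for tag in TAGS:
    ollama.pull(tag)                      # downloads the variant if not already local
    details = ollama.show(tag).details    # reported metadata for the local copy
    print(f"{tag}: {details.quantization_level} ({details.parameter_size})")
```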
@@ -97,10 +93,11 @@ The viewer below zooms out from one weight and instead shows a toy layer with 16
 
 ### Explore: Compare the same prompts through the hosted chat widget
 
-If your instructor provides an OpenAI-compatible endpoint, you can compare the same prompts through the embedded chat tool below:
+By default, the widget below points to the courseware-managed Ollama service and the three Lab 2 model tags above. You can still switch to another endpoint if your instructor provides one.
 
-- Paste the lab endpoint and API key into the settings row
-- Switch between `Q8_0`, `Q4_K_M`, and `UD-IQ2_M`
+- Use the preloaded managed endpoint or replace it with another compatible endpoint
+- Optionally add an API key if your chosen endpoint requires one
+- Switch between the configured Q2, Q4, and Q8 Gemma variants
 - Re-run the same prompt so you can compare coherence, stability, and SVG output
 - Try a visual prompt such as `Draw a pelican riding a bicycle.`
 
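The rewritten widget instructions map directly onto any OpenAI-compatible client, so the comparison can also be scripted. A minimal sketch with the `openai` Python package follows; the base URL and key are placeholders rather than values from this commit, and Ollama's built-in `/v1` endpoint is assumed:

```python
# Sketch, not part of the commit: send the lab's visual prompt to each of the
# three configured variants through an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # placeholder; use the managed endpoint
    api_key="ollama",                      # Ollama ignores the key, but the client requires one
)

PROMPT = "Draw a pelican riding a bicycle."  # the visual prompt from the lab

for tag in [
    "cajina/gemma4_e2b-q2_k_xl:v01",
    "batiai/gemma4-e2b:q4",
    "bjoernb/gemma4-e2b-fast:latest",
]:
    reply = client.chat.completions.create(
        model=tag,
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(f"--- {tag} ---")
    print(reply.choices[0].message.content)
```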
@@ -121,4 +118,4 @@ The important takeaway is not that one quant is always "best." The important tak
 
 ## Conclusion
 
-This lab isolates quantization as the main variable. By downloading **Gemma 4 E2B Instruct** in `UD-IQ2_M`, `Q4_K_M`, and `Q8_0`, you can directly observe one of the most important tradeoffs in local inference: balancing model quality against disk usage and resource constraints.
+This lab isolates quantization as the main variable. By comparing **Gemma 4 E2B** in Q2, Q4, and Q8 Ollama variants, you can directly observe one of the most important tradeoffs in local inference: balancing model quality against efficiency and resource constraints.