---
order: 1
title: Lab 1 - Model Structure, Tokenization, and Confidence Visualization
description: Explore GGUF model structure in Netron, inspect tokenization interactively, and visualize token confidence with a local Ollama model.
---

Lab 1 - Model Structure, Tokenization, and Confidence Visualization

In this lab, we will:

  • Visualize two small GGUF models in Netron
  • Observe how text is split into tokens and token IDs
  • Inspect the confidence of a local model one token at a time
Lab Flow Guide
Explore sections focus on observation and interpretation.
Execute steps require performing actions in the lab environment.

Objective 1: Visualize Tokenization and Token IDs

Execute: Use the Tokenizer Playground

The embedded tool below allows you to enter raw text and observe how it is converted into model tokens. Tokenization is the critical first step that enables a large language model to process and understand user input: it transforms text into numerical values the model can operate on.

Explore: Try Multiple Inputs

Enter several different inputs and compare how the tokenization changes. Use at least these three examples:

  1. The quick brown fox jumps over the lazy dog
  2. cybersecurity analyst
  3. printf("hello");

Then try a few of your own. Short English phrases, punctuation, code, and unusual spacing are all good choices.

Explore: Compare the Two Tokenization Views

This tool is especially useful because it shows both:

  • The visual split of the text into tokens
  • The underlying token ID values

Those are two views of the same process.

The visual split helps us see where the model grouped characters or subwords together. The token ID view reminds us that the model never consumes English directly. It consumes numeric identifiers that point into the tokenizer vocabulary.
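To make the two views concrete, here is a toy greedy longest-match tokenizer in Python. The vocabulary and ID numbers are entirely made up for illustration; real tokenizers (e.g. BPE) learn tens of thousands of vocabulary entries from data:

```python
# Hypothetical toy vocabulary mapping subword pieces to IDs.
VOCAB = {"cyber": 7, "security": 8, " analyst": 9}

def tokenize(text, vocab):
    """Greedy longest-match split: a toy stand-in for real subword tokenizers."""
    tokens, ids, i = [], [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest piece first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:                               # no piece matched: fall back to one character
            tokens.append(text[i])
            ids.append(256 + ord(text[i]))  # made-up byte-fallback ID range
            i += 1
    return tokens, ids

tokens, ids = tokenize("cybersecurity analyst", VOCAB)
print(tokens)  # ['cyber', 'security', ' analyst']  <- the visual split
print(ids)     # [7, 8, 9]                          <- the token ID view
```

The model only ever sees the second list; the first is just a human-friendly rendering of the same split.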

As you work through your examples, ask:

  • Which full words remain intact?
  • Which words get split into subwords or punctuation chunks?
  • When spacing changes, do the token IDs change too?

Lastly, experiment with how different tokenizers can split the same input into different tokens and numerical values. How might this affect the next steps in the transformation process?
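As a sketch of why tokenizer choice matters, the snippet below splits the same word with two hypothetical vocabularies (both invented for illustration) using a greedy longest-match rule, mimicking how different tokenizer generations can segment identical text differently:

```python
def greedy_split(text, vocab):
    """Greedy longest-match split; a toy stand-in for a real tokenizer."""
    pieces, i = [], 0
    while i < len(text):
        # longest matching piece starting at i, else a single character
        j = next((j for j in range(len(text), i, -1) if text[i:j] in vocab), i + 1)
        pieces.append(text[i:j])
        i = j
    return pieces

VOCAB_A = {"token", "ization"}           # hypothetical tokenizer A
VOCAB_B = {"tok", "en", "iz", "ation"}   # hypothetical tokenizer B

print(greedy_split("tokenization", VOCAB_A))  # ['token', 'ization']
print(greedy_split("tokenization", VOCAB_B))  # ['tok', 'en', 'iz', 'ation']
```

Different splits mean different ID sequences, so everything downstream (embeddings, attention) starts from different numbers.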

Tokenization - GPT3

Tokenization - GPT4

Objective 2: Open Netron and Download the Lab Models

Execute: Launch Netron

For this lab, model visualization now happens in Netron, a lightweight browser tool for inspecting model structure.

Use the launch panel below to open the local Netron service on port 8338.

Execute: Download the GGUF File

You will work with a small GGUF model in this objective. Download the provided file:

This file is intentionally small enough to make architecture exploration practical in a classroom lab. Save it to a convenient location such as your Downloads folder. Once you've downloaded the file, you can open it using the "Open Model" button on the Netron home page.

Netron Start Page

Once Netron is open:

  1. Select Open Model or drag a GGUF file directly into the browser window.
  2. Open Llama 3.2 1B.

Netron will display the model as a graph of tensors, operators, and named blocks. This is a more literal view than the simplified lecture diagrams, but it is still showing the same fundamental idea: the model is a large stack of numeric values, each serving a different purpose to model language.
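The GGUF container Netron is rendering starts with a small fixed preamble. The sketch below is a minimal reader based on the published GGUF layout (a 4-byte GGUF magic, then a little-endian version, tensor count, and metadata key-value count) and shows how little it takes to confirm what kind of file you downloaded:

```python
import struct

def read_gguf_header(path):
    """Read the fixed GGUF preamble: magic, version, tensor count, metadata KV count."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version, = struct.unpack("<I", f.read(4))
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}
```

The tensor count alone hints at the repetition you will see in the graph: many near-identical attention and feed-forward tensors per block.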

Explore: What to Look For

As you move around the graph, focus on these recurring structures. Each grouping of these individual layers is what defines a block:

  • Tokenization: Converts textual input into numeric values, the step that allows the model to process a user's input.
  • Embedding: Takes token ID values and converts them into position-aware vectors the model can transform.
  • Multi-head attention: "Attends" to the Query (What am I looking for?), Key (What do I contain?), and Value (What do I pass on?) of each token.
  • Feed-forward / mulmat: Applies learned transformations after attention to further refine each token's representation.

Notably, even small local models are composed of many repeated blocks. The exact count varies by model family, size, and export format, but the important pattern is the repeated attention and feed-forward structure.
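The attention computation those repeated blocks perform can be sketched in a few lines of plain Python. This is a single head with toy hand-written vectors and no learned weights, just to show the Query/Key/Value mechanics:

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of toy vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        # score each key against this query, scaled by sqrt(dimension)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)    # "how much should I attend to each token?"
        # output = weighted mix of the value vectors
        out.append([sum(w * v[t] for w, v in zip(weights, V))
                    for t in range(len(V[0]))])
    return out
```

With one query and two identical keys, the output is simply the average of the two value vectors: `attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 0.0]])` gives `[[3.0, 0.0]]`. A real block adds learned projection matrices and runs many such heads in parallel.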

Lastly, you may see labels such as MatMul, Mul, or mulmat, depending on how the graph was exported and named. In practice, these are often part of the feed-forward path that expands and reshapes the model's internal representation before passing it onward.

Inspect the Small Model

This model is small compared to modern production systems, but it is still large enough to reveal repeating architectural patterns.

As you inspect it, ask:

  • Where do the repeating blocks begin to stand out?
  • Which names remain stable across repeated blocks?
  • How many attention heads does the model have? How might this affect the transformations the model performs?
Netron Qwen 3 0.6B Layers 1 & 2

Objective 3: Visualize Prediction Confidence

Execute: Run the Local Confidence Widget

The widget below talks to the preloaded Gemma 4 E2B Q4 model through Ollama. Enter any prompt you like, generate a response, and then hover over the output tokens.

Explore: Interpret the Color Coding

Each token in the output is colored by the model's confidence in that selected token.

In general:

  • Greener tokens indicate the model was more confident in that choice
  • Warmer yellow or orange tokens indicate a weaker preference
  • Hovering over a token reveals the selected token's percentage and the strongest alternate predictions

This is useful because it shows us that model output is not magic or certainty. Each generated token is chosen from a probability distribution over many possible next tokens.
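Under the hood, that distribution comes from a softmax over the model's raw scores (logits), and the widget then maps each chosen token's probability to a color. The sketch below uses hypothetical cutoff values; the real widget's thresholds may differ:

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_color(p):
    """Map a token's selection probability to a display color (made-up cutoffs)."""
    if p >= 0.8:
        return "green"   # strong preference for this token
    if p >= 0.5:
        return "yellow"  # moderate preference
    return "orange"      # weak preference; alternatives were competitive

probs = softmax([4.0, 1.0, 0.5])   # toy logits for three candidate tokens
print(confidence_color(probs[0]))  # green
```

Note that even a "green" token was still sampled from a distribution: there is always some probability mass on the alternatives you see when hovering.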


Conclusion

In this lab, we explored three foundational views of an LLM.

First, we opened two GGUF model files in Netron and inspected the architecture directly. Then we used a tokenizer playground to see how plain text becomes tokens and token IDs. Finally, we used a local confidence visualizer to watch a small model generate output token by token while exposing how certain it was about each choice.

Together, these three perspectives give us a much more grounded picture of what an LLM actually is: a structured file of learned weights, a tokenizer that converts text into IDs, and a prediction engine that selects the next token from a probability distribution.