Refactor lab 1 for Netron and local confidence views

This commit is contained in:
2026-04-16 11:15:39 -06:00
parent a97c8a7694
commit e4621ca65b
20 changed files with 1634 additions and 280 deletions
@@ -1,19 +1,20 @@
---
order: 1
title: Lab 1 - Visualizing LLMs in TransformerLab
description: Explore model structure, tokenization, and next-token prediction inside TransformerLab.
title: Lab 1 - Model Structure, Tokenization, and Confidence Visualization
description: Explore GGUF model structure in Netron, inspect tokenization interactively, and visualize token confidence with a local Ollama model.
---
<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
# Lab 1 - Visualizing LLMs in TransformerLab
# Lab 1 - Model Structure, Tokenization, and Confidence Visualization
In this lab, we will:
- Download and Visualize Llama-3.2-1B-Instruct
- Visualize Tokenization & Prediction with Llama-3.2-1B-Instruct
- Visualize two small GGUF models in Netron
- Observe how text is split into tokens and token IDs
- Inspect the confidence of a local model one token at a time
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
@@ -21,258 +22,181 @@ In this lab, we will:
<strong>Execute</strong> steps require performing actions in the lab environment.
</div>
## Objective 1: Starting TransformerLab
## Objective 1: Visualize Tokenization and Token IDs
### Execute: Access the Lab Environment
### Execute: Use the Tokenizer Playground
To start Lab 1, ensure you've received a WireGuard configuration and system IP from your instructor. If you're unfamiliar with WireGuard, assistance will be provided to ensure you can access the lab environment for the duration of class.
The embedded tool below allows you to enter raw text and observe how it is converted into model tokens. Tokenization is the critical first step that enables a Large Language Model to process and understand user input; it is accomplished by transforming words into numerical values.
All systems use the default username and password of `student`. All labs are located in the student home folder. To start Lab 1, run
<div data-tokenizer-playground></div>
```bash
~/lab1/lab1_start.sh
```
### Explore: Try Multiple Inputs
using the `lab1_start.sh` script in the `lab1` folder.
Enter several different inputs and compare how the tokenization changes. Use at least these three examples:
Lastly, if necessary, you can `su -` to root at any time. No password will be required.
1. `The quick brown fox jumps over the lazy dog`
2. `cybersecurity analyst`
3. `printf("hello");`
Once started, you can reach TransformerLab on port 8338 of your Lab VM (`http://<IP>:8338`).
Then try a few of your own. Short English phrases, punctuation, code, and unusual spacing are all good choices.
## Objective 2: Visualizing an LLM
### Explore: Compare the Two Tokenization Views
### Explore: Understand the Model and Runtime
This tool is especially useful because it shows both:
The next steps will guide us through the process of deploying and interacting with a pre-trained LLM, `Llama-3.2-1B-Instruct`. To do this, we'll be utilizing an inference engine, software designed to execute LLMs and generate token predictions. You'll encounter models packaged in the **GGUF** format, a file format designed for efficient storage and loading of quantized LLMs, enabling them to run on a wider range of hardware. Don't worry if these terms are new to you; the specifics of inference engines and the details of **GGUF** quantized LLMs will be thoroughly explained in the following section of this course.
- The **visual split** of the text into tokens
- The underlying **token ID values**
Normally, to start, we would need to install an **inference engine** capable of running **GGUF** files.
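As a rough illustration of what an inference engine does, here is a minimal sketch using the `llama-cpp-python` bindings, one common engine for GGUF files. The package, model path, and file name are assumptions for illustration only; the lab environment already provides a working setup.

```python
# Minimal sketch of an inference engine loading a GGUF model.
# Assumes `pip install llama-cpp-python` and a local GGUF file at the
# given path -- both are illustrative, not part of the lab steps.
from llama_cpp import Llama

llm = Llama(model_path="Llama-3.2-1B-Instruct-Q4_K_M.gguf", n_ctx=512, verbose=False)

# Ask the engine to predict a short continuation, token by token.
out = llm("The quick brown fox", max_tokens=8)
print(out["choices"][0]["text"])
```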
Those are two views of the same process.
### Execute: Verify the FastChat Plugin
The visual split helps us see where the model grouped characters or subwords together. The token ID view reminds us that the model never consumes English directly. It consumes numeric identifiers that point into the tokenizer vocabulary.
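To make the two views concrete, here is a small sketch using the Hugging Face `transformers` tokenizer API. The `gpt2` tokenizer is a stand-in chosen only because it downloads without authentication; the playground above may use a different vocabulary, so the exact splits and IDs will differ.

```python
# Sketch: the two views of the same tokenization process.
# The gpt2 tokenizer is an illustrative stand-in; other tokenizers
# will split the same text differently.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = "The quick brown fox jumps over the lazy dog"

print(tok.tokenize(text))   # the visual split, e.g. ['The', 'Ġquick', 'Ġbrown', ...]
print(tok.encode(text))     # the numeric IDs the model actually consumes
```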
Navigate to **Plugins**, and in the search bar type `Fastchat`. Note that it has already been installed for you!
As you work through your examples, ask:
<figure style="text-align: center;">
<a href="https://imgur.com/9Waj8VG.png" target="_blank">
<img
src="https://imgur.com/9Waj8VG.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Plugins
</figcaption>
</figure>
<br>
- Which full words remain intact?
- Which words get split into subwords or punctuation chunks?
- When spacing changes, do the token IDs change too?
### Execute: Find and Load `Llama-3.2-1B-Instruct`
Lastly, experiment with how different tokenizers can split the same input into different numerical values. How might this affect the next steps in the transformation process?
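As a rough sketch of that experiment in code, assuming the `transformers` package, you can run the same string through two different tokenizers and compare the results (both tokenizer names here are illustrative, publicly downloadable vocabularies):

```python
# Sketch: the same input text yields different splits and different IDs
# under different tokenizers. Both tokenizer names are illustrative.
from transformers import AutoTokenizer

text = 'printf("hello");'
for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name)
    print("  split:", tok.tokenize(text))
    print("  ids:  ", tok.encode(text, add_special_tokens=False))
```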
Next, navigate to **Model Registry**. You should see `Llama-3.2-1B-Instruct` right away on your screen; if not, search for the model using the search bar.
<figure style="text-align: center;">
<a href="https://i.imgur.com/UyWdnMR.png" target="_blank">
<img
src="https://i.imgur.com/UyWdnMR.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Model Registry Selection.
</figcaption>
</figure>
<br>
Once downloaded, select **Foundation** and choose our newly downloaded `Llama-3.2-1B-Instruct` model.
<figure style="text-align: center;">
<a href="https://i.imgur.com/Aez94RU.png" target="_blank">
<img
src="https://i.imgur.com/Aez94RU.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Model Selection
</figcaption>
</figure>
<br>
Once selected, click **Run**. Give TransformerLab a moment to successfully load the model.
<figure style="text-align: center;">
<a href="https://i.imgur.com/f4YcA8P.png" target="_blank">
<img
src="https://i.imgur.com/f4YcA8P.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Starting a Model
</figcaption>
</figure>
<br>
### Explore: Inspect the Architecture View
To start, let's navigate to the **Interact** page, and then select **Model Architecture** from the Chat dropdown.
<figure style="text-align: center;">
<a href="https://i.imgur.com/X0CM31h.png" target="_blank">
<img
src="https://i.imgur.com/X0CM31h.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Model Architecture Dropdown
</figcaption>
</figure>
<br>
This page allows us to visualize the actively loaded model, in this case our downloaded `Llama-3.2-1B-Instruct`. This interactive view is equivalent to the greatly simplified version shown on the slide “Transformation: Multilayer Perceptron” from our lecture. We can explore this view by:
- Holding down both the right and left mouse buttons while dragging moves the entire model.
- Holding down just the left mouse button rotates the view.
<figure style="text-align: center;">
<a href="https://i.imgur.com/8hXTGlt.png" target="_blank">
<img
src="https://i.imgur.com/8hXTGlt.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Model Visualization
</figcaption>
</figure>
<br>
### Explore: Interpret Layers, Blocks, and Parameters
Each layer of the model performs a specific task, taking the input provided and transforming it into the statistically most likely completion of the text, token by token. This format of Llama 3.2 1B is made up of 372 **layers**. Each layer transforms the output of the layer above it until, eventually, we end up with the statistically most likely completion.
You have likely also noticed that the colors repeat. Each set of repeating **layers** is organized into **blocks**. Each **block** is a grouping of **layers** that perform the same functions, but with a slightly different focus. For example, one **block** may focus on nouns, and another may focus on adjectives, and so on.
The **layers** within Llama 3.2 1B are as follows:
<ul class="concept-pill-list">
<li>
<span class="concept-pill-label">Attention:</span>
<span>Focuses the model on specific parts of an input sequence to more accurately predict the next token.</span>
</li>
<li>
<span class="concept-pill-label">Weights:</span>
<span>The core learnable parameters of the network.</span>
</li>
<li>
<span class="concept-pill-label">Biases:</span>
<span>Additional parameters added after the weighted sum to shift (transform) the output.</span>
</li>
<li>
<span class="concept-pill-label">Scale:</span>
<span>Normalizes the output of previous <strong>layers</strong> to prepare the next round of transformation.</span>
</li>
</ul>
Each of these **layers** also has a different type, corresponding to Q, K, V, and much more. The **layers** between the small “Attention” **layers** are all considered to make up a single “block.”
To the side, we can see the actual number values of each weight within each layer.
Fundamentally, the LLM itself is this stack of numbers. Those numbers allow us to transform tokenized input (such as English) into a useful output. The more **layers** and **blocks**, the bigger the model, and the more accurately and “intelligently” the model will behave. This 1B-parameter model is incredibly small, however, so the “truthfulness” of its generated predictions is likely to be suspect (i.e., hallucinated). The model will at least sound very confident!
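If you want to convince yourself that the model really is just a stack of numbers, the following sketch counts the learned parameters and prints a few raw weight values. It assumes PyTorch and the `transformers` package, with `gpt2` as an ungated stand-in checkpoint; the lab model works the same way.

```python
# Sketch: an LLM is fundamentally a large stack of learned numbers.
# gpt2 is used here as a small, ungated stand-in for the lab model.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

total = sum(p.numel() for p in model.parameters())
print(f"{total:,} learned parameters")   # roughly 124 million for gpt2

# Peek at the first few raw values inside one weight tensor.
name, tensor = next(iter(model.named_parameters()))
print(name, tuple(tensor.shape), tensor.flatten()[:5].tolist())
```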
<br>
<figure style="text-align:center;">
<a href="https://i.imgur.com/kc8W4gU.png" target="_blank">
<img src="https://i.imgur.com/kc8W4gU.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Tokenization - GPT3</figcaption>
</figure>
<br>
<figure style="text-align:center;">
<a href="https://i.imgur.com/xMKEBwB.png" target="_blank">
<img src="https://i.imgur.com/xMKEBwB.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Tokenization - GPT4</figcaption>
</figure>
---
## Objective 3: Tokenization & Prediction with Llama-3.2-1B-Instruct
### Execute: Interactive Chat
## Objective 2: Open Netron and Download the Lab Models
Let's next move on to active conversation with the model. Navigate to the **Chat** tab from the dropdown menu.
### Execute: Launch Netron
<figure style="text-align: center;">
<a href="https://i.imgur.com/e40Jrku.png" target="_blank">
<img
src="https://i.imgur.com/e40Jrku.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Select Chat
</figcaption>
</figure>
<br>
For this lab, model visualization now happens in **Netron**, a lightweight browser tool for inspecting model structure.
Once loaded, feel free to type any message and interact with the model in any way. To speed up the pace of our lab, we recommend setting your maximum output length to 64 tokens.
Use the launch panel below to open the local Netron service on port `8338`.
<figure style="text-align: center;">
<a href="https://i.imgur.com/MdAIKLn.png" target="_blank">
<img
src="https://i.imgur.com/MdAIKLn.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Maximum Length - 64
</figcaption>
</figure>
<br>
<div data-lab1-netron-panel></div>
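If the panel is unavailable for any reason, Netron can also be started by hand. The sketch below assumes the `netron` pip package is installed; the file name is a placeholder, and the `address` tuple form is an assumption about the package's Python API.

```python
# Fallback sketch: starting Netron manually on the lab port.
# Assumes `pip install netron`; the file name is a placeholder.
import netron

netron.start("qwen3-0.6b-q8_0.gguf", address=("0.0.0.0", 8338))
```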
If text generation fails or misbehaves (for example, merely repeating your input back to you), unload and reload the model using the Foundation screen from the previous Objective.
### Execute: Download the Two GGUF Files
### Execute: View Tokenization
You will work with two small GGUF models in this objective:
If everything is in working order, review the **Tokenize** view. This lets us see visually how Llama 3.2 converts our input text into “tokens,” the numbers that represent the English input. Feel free to enter any sentence into the box to review its final tokenized form.
- [Qwen 3 0.6B](/api/lab1/models/qwen3-0.6b-q8_0.gguf)
- [Llama 3.2 1B](/api/lab1/models/llama-3.2-1b-q4_k_m.gguf)
<figure style="text-align: center;">
<a href="https://i.imgur.com/I9tU8jK.png" target="_blank">
<img
src="https://i.imgur.com/I9tU8jK.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Tokenize View
</figcaption>
</figure>
<br>
These files are intentionally small enough to make architecture exploration practical in a classroom lab. Download both files to a convenient location such as your `Downloads` folder. Once you've downloaded the files, you can open them using the **Open Model** button on the Netron home page.
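Netron is the visual route, but the same structure can also be listed programmatically. This is a sketch assuming the `gguf` Python package published alongside the llama.cpp project; the file name matches the Qwen download above.

```python
# Sketch: listing the tensors inside a GGUF file programmatically.
# Assumes the `gguf` package from the llama.cpp project (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("qwen3-0.6b-q8_0.gguf")
for t in reader.tensors[:8]:   # just the first handful of tensors
    print(t.name, list(t.shape), t.tensor_type.name)
```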
### Execute: Visualize Next-Token Activations
<figure style="text-align:center;">
<a href="https://i.imgur.com/Y7QpGpG.png" target="_blank">
<img src="https://i.imgur.com/Y7QpGpG.png" width="800" style="border:5px solid black;">
</a>
<figcaption>Netron Start Page</figcaption>
</figure>
Next, select **Model Activations**. By entering “The quick brown fox” and selecting **Visualize**, we can see how the model selects the next word, along with the model's level of confidence. Feel free to repeat this process with alternative sentences.
Once Netron is open:
<figure style="text-align: center;">
<a href="https://i.imgur.com/JeWpoqV.png" target="_blank">
<img
src="https://i.imgur.com/JeWpoqV.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Next Word Prediction
</figcaption>
</figure>
<br>
1. Select **Open Model** or drag a GGUF file directly into the browser window.
2. Start with `Qwen 3 0.6B`.
### Execute: Compare Confidence Views
Netron will display the model as a graph of tensors, operators, and named blocks. This is a more literal view than the simplified lecture diagrams, but it is still showing the same fundamental idea: the model is a large stack of numeric values, each serving a different purpose to model language.
Note how confident the model is about the word “jumps” in this famous phrase. For an alternative view of the same output, you can also select the **Visualize Logprobs** option from the menu, which shows the same information by color.
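The relationship between the two views is simple: a log-probability (“logprob”) is the natural log of the token's probability, so converting back is a single exponential. A tiny sketch, using a hypothetical logprob value:

```python
# Sketch: converting a log-probability back into the percentage shown
# in the confidence view. The logprob value here is hypothetical.
import math

logprob = -0.12                      # a hypothetical logprob for "jumps"
print(f"{math.exp(logprob):.1%}")    # about 88.7% -- a confident, green token
```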
### Explore: What to Look For
<figure style="text-align: center;">
<a href="https://i.imgur.com/PvkgQUr.png" target="_blank">
<img
src="https://i.imgur.com/PvkgQUr.png"
style="width: 90%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Green is confident; red is less confident.
</figcaption>
</figure>
<br>
As you move around the graph, focus on these recurring structures. Each grouping of these individual *layers* is what defines a *block* (a simplified sketch follows the list):
### Explore: Continue Exploring TransformerLab Features
<ul class="concept-pill-list">
<li>
<span class="concept-pill-label">Tokenization:</span>
<span>Converts textual input into numeric values, a requirement for the machine to understand a user's input.</span>
</li>
<li>
<span class="concept-pill-label">Embedding:</span>
<span>Takes token ID values and converts them into positional vectors the model can perform transformations against.</span>
</li>
<li>
<span class="concept-pill-label">Multi-head attention:</span>
<span>"Attends" to the Query (What am I looking for?), Key (What do I contain?), and Value (What do I pass on?) of each token. .</span>
</li>
<li>
<span class="concept-pill-label">Feed-forward / mulmat:</span>
<span>Applies learned "transformations" after attention to further refine each token representation.</span>
</li>
</ul>
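To see how those pieces fit together, below is a heavily simplified, PyTorch-style sketch of a single block. It is illustrative only: real GGUF graphs add normalization layers, rotary position encodings, quantized kernels, and many other details.

```python
# Heavily simplified sketch of one transformer block, showing how the
# structures listed above connect. Real models add normalization,
# positional encodings, and more.
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        # Multi-head attention: Q, K, and V are all derived from the input.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Feed-forward path (the MatMul/mulmat nodes in the graph).
        self.ffwd = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        x = x + self.attn(x, x, x)[0]   # attend, then keep a residual copy
        return x + self.ffwd(x)         # refine each token representation
```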
Please continue to explore TransformerLab until you're ready to move on. While we will utilize many different tools other than TransformerLab throughout this course due to its beta nature, this software is improving all the time and is worth watching! TransformerLab supports many advanced features, in various stages of development, such as:
Notably, Qwen 3 0.6B is composed of 28 of these blocks! That is significantly more than the original GPT-2 (12 blocks), even though Qwen 3 0.6B is still a very small model by modern standards!
- Batch Text Generation
- LLM Fine Tuning
- LLM Evaluation
- Retrieval Augmented Generation (RAG)
We will discuss these topics and more throughout the course.
Lastly, you may see labels such as **MatMul**, **Mul**, or **mulmat**, depending on how the graph was exported and named. In practice, these are often part of the feed-forward path that expands and reshapes the model's internal representation before passing it onward.
<br>
**Compare the Two Small Models**
Both models are small compared to modern production systems, but they are still large enough to reveal repeating architectural patterns.
As you compare them, ask:
- Where do the repeating blocks begin to stand out?
- Which names remain stable between the two models?
- How many *Attention Heads* does each model have? How might this affect the transformations the model predicts? (See the sketch below.)
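One way to answer the attention-head question is from each model's published configuration. This sketch assumes the `transformers` package and network access to the Hugging Face Hub; the repo IDs are the upstream checkpoints the lab's GGUF files were converted from, and the Llama repository is gated, so it may require an access token.

```python
# Sketch: reading layer and attention-head counts from model configs.
# Assumes Hub access; the Llama repo is gated and may need a token.
from transformers import AutoConfig

for repo in ["Qwen/Qwen3-0.6B", "meta-llama/Llama-3.2-1B"]:
    cfg = AutoConfig.from_pretrained(repo)
    print(repo, "layers:", cfg.num_hidden_layers, "heads:", cfg.num_attention_heads)
```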
<figure style="text-align:center;">
<a href="https://i.imgur.com/WhnFZss.png" target="_blank">
<img src="https://i.imgur.com/WhnFZss.png" width="600" style="border:5px solid black;">
</a>
<figcaption>Netron Qwen 3 0.6B Layers 1 & 2</figcaption>
</figure>
---
## Objective 3: Visualize Prediction Confidence
### Execute: Run the Local Confidence Widget
The widget below talks to the preloaded local Lab 1 model through Ollama. Enter any prompt you like, generate a response, and then hover over the output tokens.
<div data-lab1-confidence></div>
### Explore: Interpret the Color Coding
Each token in the output is colored by the model's confidence in that selected token.
In general:
- Greener tokens indicate the model was more confident in that choice
- Warmer yellow or orange tokens indicate a weaker preference
- Hovering over a token reveals the selected token's percentage and the strongest alternate predictions
This is useful because it shows us that model output is not magic or certainty. Each generated token is chosen from a probability distribution over many possible next tokens.
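For a rough idea of what the widget is doing under the hood, here is a sketch that greedily generates a few tokens while recording the model's confidence in each choice. It uses `gpt2` as a small, ungated stand-in rather than the lab's Ollama model, but the idea is identical.

```python
# Sketch: greedy generation while recording per-token confidence.
# gpt2 stands in for the lab model; the mechanism is the same.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The quick brown fox", return_tensors="pt").input_ids
for _ in range(5):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]    # scores for every possible next token
    probs = logits.softmax(dim=-1)           # scores -> probability distribution
    next_id = probs.argmax()
    print(tok.decode(next_id), f"{probs[next_id].item():.1%}")
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
```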
### Explore: Try Different Prompt Styles
To make the confidence view more interesting, compare:
1. A common phrase such as `The quick brown fox`
2. A factual question
3. A short cybersecurity prompt
Notice where the model appears highly certain and where it becomes less stable. Small local models often produce text that sounds very confident even when the underlying prediction distribution is more fragile than it first appears.
<div class="lab-screenshot-placeholder">
<strong>Screenshot Placeholder</strong>
Confidence heatmap and hover tooltip view.
</div>
---
## Conclusion
In this lab, we observed the foundational concepts of all LLMs in action using TransformerLab. Through hands-on exploration, we observed the process of tokenization (how text is converted into numerical representations for the model) and visualized the model's prediction process, including its confidence levels for different token selections. By navigating the model's layers and blocks, we gained an appreciation for the sheer scale and complexity inherent in modern LLMs.
In this lab, we explored three foundational views of an LLM.
This initial experience provides a crucial stepping stone for further exploration of LLMs, laying the groundwork for future labs focused on fine-tuning, evaluation, and advanced techniques like Retrieval Augmented Generation.
First, we opened two GGUF model files in Netron and inspected the architecture directly. Then we used a tokenizer playground to see how plain text becomes tokens and token IDs. Finally, we used a local confidence visualizer to watch a small model generate output token by token while exposing how certain it was about each choice.
Together, these three perspectives give us a much more grounded picture of what an LLM actually is: a structured file of learned weights, a tokenizer that converts text into IDs, and a prediction engine that selects the next token from a probability distribution.
@@ -179,7 +179,7 @@ We should then see:
A text listing of all of the model's tensors, and the precision of each. Because we have merely converted the model's format, and not performed quantization, the model is still in **FP16**.
- This is a text view of the previous graphical view we saw in **Lab 1, Objective 2: Visualizing an LLM**. While **TransformerLab** calls tensors **layers**, terms such as **tensors**, **layers**, and **blocks** can all be used semi-interchangeably, depending on the tool in question. We will further confuse these topics when we get to the Ollama objective below.
- This is a text view of the previous graphical view we saw in **Lab 1, Objective 2: Visualizing an LLM**. While tools such as **Netron** may expose tensors, operators, and repeating blocks with different labels, terms such as **tensors**, **layers**, and **blocks** can still be used semi-interchangeably at this level of discussion. We will further confuse these topics when we get to the Ollama objective below.
- Pedantically, the proper definitions are:
Tensor - A multi-dimensional array of numeric values used to store data
- Layer - A base computational unit in a neural network