---
order: 1
title: Lab 1 - Visualizing LLMs in TransformerLab
description: Explore model structure, tokenization, and next-token prediction inside TransformerLab.
---

# Lab 1 - Visualizing LLMs in TransformerLab

In this lab, we will:

- Download and Visualize LLama-3.2-1B-Instruct
- Visualize Tokenization & Prediction with LLama-3.2-1B-Instruct

**Lab Flow Guide**

- **Explore** sections focus on observation and interpretation.
- **Execute** steps require performing actions in the lab environment.

## Objective 1: Starting TransformerLab

### Execute: Access the Lab Environment

To start Lab 1, ensure you've received a WireGuard configuration and system IP from your instructor. If you're unfamiliar with WireGuard, assistance will be provided so that you can access the lab environment for the duration of class.

All systems use the default username and password of `student`. All labs are located in the student home folder. To start Lab 1, run the `lab1_start.sh` script in the `lab1` folder:

```bash
~/lab1/lab1_start.sh
```

If necessary, you can `su -` to root at any time. No password will be required.

Once started, you can reach TransformerLab on port 8338 of your Lab VM (http://:8338).

## Objective 2: Visualizing an LLM

### Explore: Understand the Model and Runtime

The next steps guide us through deploying and interacting with a pre-trained LLM, `LLama-3.2-1B-Instruct`. To do this, we'll use an inference engine: software designed to execute LLMs and generate token predictions. You'll encounter models packaged in the **GGUF** format, a file format designed for efficient storage and loading of quantized LLMs, enabling them to run on a wider range of hardware. Don't worry if these terms are new to you. The specifics of inference engines and the details of **GGUF** quantized LLMs will be explained thoroughly in the following section of this course.

Normally, we would first need to install an **inference engine** capable of running **GGUF** files.

### Execute: Verify the FastChat Plugin

Navigate to **Plugins**, and in the search bar type `Fastchat`. Note that it has already been installed for you!

### Execute: Find and Load `LLama-3.2-1B-Instruct`

Next, navigate to **Model Registry**. You should see `LLama-3.2-1B-Instruct` right away on your screen. If not, search for the model using the search bar, then download it.

Once downloaded, select **Foundation** and choose our newly downloaded `LLama-3.2-1B-Instruct` model.

Once selected, click **Run**. Give TransformerLab a moment to successfully load the model.

### Explore: Inspect the Architecture View

To start, let's navigate to the **Interact** page, and then select **Model Architecture** from the Chat dropdown.

This page allows us to visualize the actively loaded model, in this case our downloaded `LLama-3.2-1B-Instruct`. This interactive view corresponds to the greatly simplified version shown on the slide “Transformation: Multilayer Perceptron” from our lecture. We can explore this view by:

- Holding down both the right and left mouse buttons and dragging to move the entire model.
- Holding down just the left mouse button and dragging to rotate the view.

### Explore: Interpret Layers, Blocks, and Parameters

Each layer of the model performs a specific task, taking the input provided and transforming it, token by token, into the statistically most likely completion of the text. This format of Llama 3.2 1B is made up of 372 **layers**. Each layer transforms the output of the layer above it until, eventually, we end up with the statistically likely completion.

You have likely also noticed that the colors repeat. Each set of repeating **layers** is organized into a **block**. Each **block** is a grouping of **layers** that perform the same functions, but with a slightly different focus. As a simplified illustration, one **block** may focus on nouns, another on adjectives, and so on.

The **layers** within Llama 3.2 1B are as follows:

- **Attention**: Focuses the model on specific parts of an input sequence to more accurately predict the next token.
- **Weights**: The core learnable parameters of the network.
- **Biases**: Additional parameters added after the weighted sum to shift (transform) the output.
- **Scale**: Normalizes the output of previous layers to prepare the next round of transformation.

Each of these **layers** also has a different type, corresponding to Q, K, V, and more. The **layers** between the small “Attention” **layers** are all considered to make up a single “block.”

To the side, we can see the actual numeric values of each weight within each layer. Fundamentally, the LLM itself is this stack of numbers. Those numbers allow us to transform tokenized input (such as English text) into a useful output.

Generally, the more **layers** and **blocks** a model has, the bigger it is and the more accurate and “intelligent” it tends to behave. This 1B-parameter model is very small, however, so the truthfulness of its generated predictions is likely to be suspect (that is, hallucinated). The model will at least sound very confident!
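
To make the terms weights, biases, and scale concrete, here is a minimal Python sketch of how such numbers transform an input. It is a generic illustration under stated assumptions, not the exact computation TransformerLab displays: the values, sizes, and the helper names `linear` and `rms_norm` are all hypothetical.

```python
import math

def linear(x, weights, biases):
    """Weighted sum of the inputs plus a bias, one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def rms_norm(x, scale, eps=1e-6):
    """Rescale x to roughly unit root-mean-square, then apply a learned scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [s * v / rms for v, s in zip(x, scale)]

# Hypothetical toy values; a real model stores far more numbers per layer.
x = [3.0, 4.0]
weights = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # 3 outputs x 2 inputs
biases = [0.5, -1.0, 0.0]
scale = [1.0, 1.0]

normed = rms_norm(x, scale)            # "Scale" step: normalize the input
out = linear(normed, weights, biases)  # "Weights" and "Biases" step

# Every number above is one "parameter" of this toy layer.
param_count = sum(len(row) for row in weights) + len(biases) + len(scale)
print(param_count)  # 11
```

The parameter count here is 6 weights, 3 biases, and 2 scale values. A 1B-parameter model is the same idea repeated until the total reaches roughly one billion numbers.
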

---

## Objective 3: Tokenization & Prediction with LLama-3.2-1B-Instruct

### Execute: Interactive Chat

Next, let's move on to an active conversation with the model. Navigate to the **Chat** tab from the dropdown menu.

Once loaded, feel free to type any message and interact with the model in any way. To speed up the pace of our lab, I recommend setting your maximum output length to 64 tokens.

If text generation fails or behaves oddly (such as merely repeating your input back to you), unload and reload the model using the **Foundation** screen from the previous Objective.

### Execute: View Tokenization

If everything is in working order, review the **Tokenize** view. This lets us see how Llama 3.2 converts our input text into “tokens,” the numbers that represent the input text. Feel free to enter any sentence into the box to see its final tokenized version.
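
The idea of converting text into token numbers can be sketched in a few lines of Python. This is a toy illustration only: the vocabulary below is hypothetical and tiny, and it uses greedy longest-match splitting for brevity, whereas production tokenizers are typically built with byte-pair encoding and have vocabularies with many thousands of entries.

```python
# Hypothetical toy vocabulary mapping text pieces to token IDs.
VOCAB = {"The": 0, " quick": 1, " brown": 2, " fox": 3, " j": 4, "umps": 5, " ": 6}

def tokenize(text, vocab):
    """Greedy longest-match: take the longest vocabulary entry that
    prefixes the remaining text, emit its ID, and repeat."""
    ids = []
    pos = 0
    while pos < len(text):
        match = None
        for end in range(len(text), pos, -1):  # try longest piece first
            piece = text[pos:end]
            if piece in vocab:
                match = piece
                break
        if match is None:
            raise ValueError(f"no token for {text[pos]!r}")
        ids.append(vocab[match])
        pos += len(match)
    return ids

def detokenize(ids, vocab):
    """Map IDs back to their text pieces and join them."""
    id_to_piece = {i: p for p, i in vocab.items()}
    return "".join(id_to_piece[i] for i in ids)

ids = tokenize("The quick brown fox jumps", VOCAB)
print(ids)                     # [0, 1, 2, 3, 4, 5]
print(detokenize(ids, VOCAB))  # The quick brown fox jumps
```

Note that the word “jumps” is split into two pieces because the toy vocabulary has no single entry for it. You may see similar splits for uncommon words in the **Tokenize** view.
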

### Execute: Visualize Next-Token Activations

Next, select **Model Activations**. By entering “The quick brown fox” and selecting Visualize, we can see how the model selects the next word, along with the model's level of confidence. Feel free to repeat this process with alternative sentences.
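
As a rough sketch of where a confidence value comes from: a model produces a raw score (a logit) for each candidate token, and a softmax turns those scores into probabilities that sum to 1. The candidate tokens and scores below are hypothetical and are not taken from the actual model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for candidate tokens after "The quick brown fox".
logits = {" jumps": 6.0, " runs": 3.0, " is": 2.0, " sleeps": 1.0}

probs = softmax(logits)
best = max(probs, key=probs.get)  # greedy choice: the most probable token

for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok!r}: {p:.3f}")
print("next token:", repr(best))
```

Picking the single most probable token is only one strategy. Chat interfaces often sample from the distribution instead, which is why repeated runs can produce different text.
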

### Execute: Compare Confidence Views

Note how confident the model is about the word “jumps” in this famous phrase. For an alternative view of the same output, you can also select the **Visualize Logprobs** option from the menu, which shows the same information using color.

Green indicates high confidence. Red indicates lower confidence.
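
A log probability (“logprob”) is the natural logarithm of a token's probability, so a value near 0 means high confidence and a large negative value means low confidence. The sketch below shows one way such values could be mapped to colors. The 0.5 threshold, the two-color scheme, and the example probabilities are assumptions for illustration, not how TransformerLab actually renders its view.

```python
import math

def confidence_color(logprob, threshold=0.5):
    """Return "green" when the token's probability is at least
    `threshold` (a hypothetical cutoff), otherwise "red"."""
    prob = math.exp(logprob)  # undo the logarithm to recover the probability
    return "green" if prob >= threshold else "red"

# Hypothetical (token, logprob) pairs for a generated phrase.
tokens = [
    (" jumps", math.log(0.93)),
    (" over", math.log(0.80)),
    (" a", math.log(0.12)),
]

colored = [(tok, confidence_color(lp)) for tok, lp in tokens]
for tok, color in colored:
    print(f"{tok!r}: {color}")
```
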

### Explore: Continue Exploring TransformerLab Features

Please continue to explore TransformerLab until you're ready to move on. Because TransformerLab is still in beta, we will use many other tools throughout this course, but this software is improving all the time and is worth watching! TransformerLab supports many advanced features, in various stages of development, such as:

- Batch Text Generation
- LLM Fine Tuning
- LLM Evaluation
- Retrieval Augmented Generation (RAG)

We will discuss these topics and more throughout the course.

---

## Conclusion

In this lab, we saw the foundational concepts of all LLMs in action using TransformerLab. Through hands-on exploration, we observed the process of tokenization (how text is converted into numerical representations for the model) and visualized the model's prediction process, including its confidence levels for different token selections. By navigating the model's layers and blocks, we gained an appreciation for the sheer scale and complexity inherent in modern LLMs. This initial experience is a stepping stone for further exploration of LLMs, laying the groundwork for future labs focused on fine-tuning, evaluation, and advanced techniques like Retrieval Augmented Generation.