Lab 1 - Visualizing LLMs in TransformerLab
In this lab, we will:
- Download and Visualize Llama-3.2-1B-Instruct
- Visualize Tokenization & Prediction with Llama-3.2-1B-Instruct
Explore sections focus on observation and interpretation; Execute sections require performing actions in the lab environment.
Objective 1: Starting TransformerLab
Execute: Access the Lab Environment
To start Lab 1, ensure you've received a WireGuard configuration and system IP from your instructor. If you're unfamiliar with WireGuard, assistance will be provided to ensure you can access the lab environment for the duration of class.
All systems use the default username and password of student. All labs are located in the student home folder. To start Lab 1, run the lab1_start.sh script in the lab1 folder:
~/lab1/lab1_start.sh
Lastly, if necessary, you can su - to root at any time. No password will be required.
Once started, you can reach TransformerLab on port 8338 of your Lab VM (http://<your-lab-vm-ip>:8338).
Objective 2: Visualizing an LLM
Explore: Understand the Model and Runtime
The next steps will guide us through deploying and interacting with a pre-trained LLM, Llama-3.2-1B-Instruct. To do this, we'll be utilizing an inference engine – software designed to execute LLMs and generate token predictions. You'll encounter models packaged in the GGUF format, a file format designed for efficient storage and loading of quantized LLMs, enabling them to run on a wider range of hardware. Don't worry if these terms are new to you – the specifics of inference engines and the details of GGUF-quantized LLMs will be thoroughly explained in the following section of this course.
Normally, we would first need to install an inference engine capable of running GGUF files.
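TransformerLab's FastChat plugin fills that role for us here, but for reference, below is a minimal sketch of what an inference engine does programmatically. It assumes the llama-cpp-python bindings (not the plugin the lab actually uses) and a hypothetical local GGUF file path:

```python
# Minimal sketch of GGUF inference with llama-cpp-python (`pip install llama-cpp-python`).
# The model path below is a hypothetical example, not a file provided by the lab.
from llama_cpp import Llama

llm = Llama(model_path="./Llama-3.2-1B-Instruct-Q4_K_M.gguf")

# Ask the engine to predict a short completion, token by token.
output = llm("The quick brown fox", max_tokens=16)
print(output["choices"][0]["text"])
```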
Execute: Verify the FastChat Plugin
Navigate to Plugins, and in the search bar type FastChat. Note that it has already been installed for you!
Execute: Find and Load Llama-3.2-1B-Instruct
Next, navigate to Model Registry. You should see Llama-3.2-1B-Instruct right away on your screen; if not, search for the model using the search bar.
Once downloaded, select Foundation and choose our newly downloaded Llama-3.2-1B-Instruct model.
Once selected, click Run. Give TransformerLab a moment to successfully load the model.
Explore: Inspect the Architecture View
To start, let's navigate to the Interact page, and then select Model Architecture from the Chat dropdown.
This page allows us to visualize the actively loaded model, in this case our downloaded Llama-3.2-1B-Instruct. This interactive view is equivalent to the greatly simplified version shown on the slide “Transformation: Multilayer Perceptron” from our lecture. We can explore this view as follows:
- Holding down both right and left mouse buttons and dragging moves the entire model.
- Holding down just the left mouse button rotates the view.
Explore: Interpret Layers, Blocks, and Parameters
Each layer of the model performs a specific task, taking the input provided and transforming it into the statistically most likely completion of the text, token by token. This version of Llama 3.2 1B is made up of 372 layers. Each layer transforms the output of the layer above it until, eventually, we end up with the statistically likely completion. You have likely also noticed that the colors repeat. Each set of repeating layers is organized into blocks. Each block is a grouping of layers that perform the same functions, but with a slightly different focus. For example, one block may focus on nouns, another may focus on adjectives, and so on.
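Before breaking the layers down by type, it helps to see how simple the overall control flow is. The sketch below is a deliberately simplified, hypothetical picture of a decoder-only stack: each block takes the previous block's output as its input.

```python
# Deliberately simplified sketch of a decoder-only transformer stack.
# `blocks` stands in for Llama's real attention/MLP/normalization layers.
def forward(x, blocks):
    for block in blocks:   # each repeating "block" from the architecture view
        x = block(x)       # every block refines the previous block's output
    return x               # the final representation becomes next-token scores
```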
The layers within Llama 3.2 1B are as follows:
- Attention: Focuses the model on specific parts of an input sequence to more accurately predict the next token.
- Weights: The core learnable parameters of the network.
- Biases: Additional parameters added after the weighted sum to shift (transform) the output.
- Scale: Normalizes the output of previous layers to prepare the next round of transformation.
Each of these layers also has a specific type, corresponding to Q, K, V, and more. The layers between the small “Attention” layers are all considered to make up a single “block.” To the side, we can see the actual numeric values of each weight within each layer.
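To make the Q, K, and V types slightly more concrete, below is a toy NumPy implementation of scaled dot-product attention, the core computation inside each attention layer. The matrices here are random stand-ins, not real Llama weights:

```python
# Toy scaled dot-product attention in NumPy. Real Llama attention layers do
# this with learned Q/K/V weight matrices, across many heads in parallel.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how strongly each token attends to the others
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # a weighted mix of the value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # 4 tokens, 8 dimensions
print(attention(Q, K, V).shape)  # (4, 8)
```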
Fundamentally, the LLM itself is this stack of numbers. Those numbers allow us to transform tokenized input (such as English text) into a useful output. The more layers and blocks, the bigger the model, and the more accurately and “intelligently” it will behave. This 1B-parameter model is incredibly small, however, so the “truthfulness” of generated predictions is likely to be suspect (i.e., hallucinated). The model will at least sound very confident, however!
Objective 3: Tokenization & Prediction with Llama-3.2-1B-Instruct
Execute: Interactive Chat
Let's next move on to an active conversation with the model. Navigate to the Chat tab from the dropdown menu.
Once loaded, feel free to type any message and interact with the model in any way. To speed up the pace of our lab, we recommend setting your maximum output length to 64 tokens.
If text generation fails or behaves strangely (such as merely repeating your input back to you), unload and reload the model using the Foundation screen from the previous objective.
Execute: View Tokenization
If everything is in working order, review the Tokenize view. This lets us see how Llama 3.2 converts our input text into “tokens” – the numbers that represent the input English text. Feel free to input any sentence into the box to review what the final tokenized version will be.
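If you would like to reproduce this outside the lab, here is a sketch using the Hugging Face transformers tokenizer. It assumes the transformers package is installed and that you have access to the gated meta-llama/Llama-3.2-1B-Instruct repository; any compatible tokenizer illustrates the same idea:

```python
# Sketch: converting text into token IDs and back (`pip install transformers`).
# Assumes access to the gated meta-llama/Llama-3.2-1B-Instruct repository.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

ids = tok.encode("The quick brown fox")
print(ids)                             # a short list of integers
print([tok.decode([i]) for i in ids])  # the text fragment each ID represents
```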
Execute: Visualize Next-Token Activations
Next, select Model Activations. By entering “The quick brown fox” and selecting Visualize, we can see how the model selects the next word, along with the model's level of confidence. Feel free to repeat this process with alternative sentences.
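Under the hood, this view is derived from the model's output logits: one raw score per vocabulary entry, converted into probabilities with a softmax. Below is a hedged sketch of computing the top candidates yourself with transformers and PyTorch (same access assumptions as the tokenizer example above):

```python
# Sketch: the five most likely next tokens and their probabilities.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-1B-Instruct"  # gated repository; assumes access
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # raw scores for the next token only
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, 5)
for p, i in zip(top.values, top.indices):
    print(f"{tok.decode([int(i)])!r}: {p.item():.3f}")
```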
Execute: Compare Confidence Views
Note how confident the model is about the word “jumps” in this famous phrase. For an alternative view of the same output, you can also select the Visualize Logprobs option from the menu, which shows the same information, but encoded as color.
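A “logprob” is simply the natural logarithm of a probability: confident predictions sit near 0, while unlikely ones become large negative numbers. A quick illustration:

```python
import math
for p in (0.95, 0.50, 0.01):
    print(p, round(math.log(p), 3))  # 0.95 -> -0.051, 0.5 -> -0.693, 0.01 -> -4.605
```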
Explore: Continue Exploring TransformerLab Features
Please continue to explore TransformerLab until you're ready to move on. While we will utilize many tools other than TransformerLab throughout this course due to its beta status, this software is improving all the time and is worth watching! TransformerLab supports many advanced features, in various stages of development, such as:
- Batch Text Generation
- LLM Fine Tuning
- LLM Evaluation
- Retrieval Augmented Generation (RAG)
We will discuss these topics and more throughout the course.
Conclusion
In this lab, we saw the foundational concepts of all LLMs in action using TransformerLab. Through hands-on exploration, we observed the process of tokenization – how text is converted into numerical representations for the model – and visualized the model's prediction process, including its confidence levels for different token selections. By navigating the model's layers and blocks, we gained an appreciation for the sheer scale and complexity inherent in modern LLMs.
This initial experience provides a crucial stepping stone for further exploration of LLMs, laying the groundwork for future labs focused on fine-tuning, evaluation, and advanced techniques like Retrieval Augmented Generation.