Polish Update

This commit is contained in:
c4ch3c4d3
2026-03-30 17:25:32 -06:00
parent 1aa9310bc8
commit 6bcebd55ee
6 changed files with 154 additions and 78 deletions
@@ -28,7 +28,9 @@ All systems use the default username and password of `student`. All labs are loc
using the `lab1_start.sh` script in the `lab1` folder.
Lastly, if necessary, you can `su -` to root at any time. No password will be required.
Once started, you can reach TransformerLab on port 8338 of your Lab VM (http://<IP>:8338).
## Objective 2: Visualizing an LLM
@@ -18,6 +18,11 @@ In this lab, we will:
<strong>Execute</strong> sections require running commands and producing output.
</div>
To start this lab, you'll need CLI access:
* SSH - <IP>:22
* All necessary artifacts are in the lab2 folder
## Objective 1: HuggingFace & LLaMa.cpp
### 1. What Is LLaMa.cpp?
@@ -15,14 +15,14 @@ In this lab, we will:
<strong>Execute</strong> sections require running steps and validating output.
</div>
To start this lab, one web service has been preconfigured:
* Open WebUI - http://<IP>:8080
## Objective 1 Execute: Accessing Open WebUI
Your lab machine comes pre-installed with Open WebUI. It is accessible at your assigned system IP on port 8080 (http://<IP>:8080). You can log in or register with the following default credentials:
Username: student@openwebui.com
Password: student
<figure style="text-align: center;">
@@ -6,24 +6,27 @@
In this lab, we will:
* Explore various chunking strategies
* Explore how embeddings and vectors allow similar concepts to cluster together within n-dimensional spaces
* Connect chunking and embedding concepts to a functional RAG workflow
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on comparison, observation, and reasoning about trade-offs.<br />
Pay close attention to how chunk size and embedding behavior influence later retrieval quality.
</div>
To start this lab, two web services have been preconfigured:
* ChunkViz - http://<IP>:3000
* Embedding Atlas - http://<IP>:5055
## Objective 1 Explore: Chunking Strategy
Chunking is the first step in any RAG pipeline. It is the process of dividing a document into smaller snippets that can later be stored in a database and paired with an embedded representation of that data. Because chunking happens so early in the RAG process, the strategy chosen to create those chunks has an outsized impact on the quality of the embeddings that follow.
Successful chunking is highly dependent on the type of document being processed. In production-grade RAG systems, teams often evaluate multiple strategies across different document types, then route content through the processing path that produces the strongest retrieval results. For this lab, we will use a visualization tool to build intuition for those trade-offs.
First, ensure we've started our lab:
```bash
~/lab4/lab4_start.sh
```
In a web browser, navigate to http://<STUDENT ASSIGNED SYSTEM IP>:3000. Once loaded, you should see the ChunkViz homepage.
<figure style="text-align: center;">
<a href="https://i.imgur.com/PG6fp1V.png" target="_blank">
@@ -37,7 +40,10 @@ And then, in a web browser, navigate to http://<STUDENT ASSIGNED SYSTEM IP>:3000
</figure>
<br>
ChunkViz starts with example text that has already been split using a default character-based strategy. In this view, every 200 characters is treated as a chunk. Modify the sliders to set the following values:
* `Chunk Size` - `256`
* `Chunk Overlap` - `20`
<figure style="text-align: center;">
<a href="https://i.imgur.com/9SDyh7I.png" target="_blank">
@@ -46,23 +52,23 @@ Already, ChunkViz is populated with some example text. Additionally, the text h
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Chunk Size and Overlap
</figcaption>
</figure>
<br>
Notice how the colors in the text below dynamically change. Each color represents a single chunk, while the green text between unique colors represents the overlap. That overlap increases the likelihood that critical context appears in more than one chunk, improving retrieval resilience.
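To make the slider behavior concrete, here is a minimal Python sketch of fixed-size character chunking with overlap, similar in spirit to what ChunkViz visualizes; the sample text and parameter values are only illustrative:

```python
def chunk_by_characters(text: str, chunk_size: int = 256, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character chunks, repeating `overlap`
    characters from the end of one chunk at the start of the next."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Illustrative sample text only.
sample = "Retrieval-augmented generation pairs a language model with a search step. " * 10
for i, c in enumerate(chunk_by_characters(sample, chunk_size=256, overlap=20)):
    print(f"chunk {i}: {len(c)} chars -> {c[:40]!r}...")
```

The overlapping window is what produces the green regions in the ChunkViz view: the same text appears at the tail of one chunk and the head of the next.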
Next, explore the major chunking strategies available in ChunkViz:
| Strategy | Description |
|---|---|
| Character Splitter | Splits text into chunks based on a fixed number of characters. |
| Token Splitter | Splits chunks based on tokenization values using **tiktoken**. |
| Sentence Splitter | Splits chunks into rough sizes based on what the tool interprets as a sentence. |
| Recursive Character | Splits chunks using multiple separators, such as new lines (`\n`), periods (`.`), commas (`,`), or other language-aware section boundaries. |
Select each option and observe the different ways ChunkViz breaks text into chunks.
<figure style="text-align: center;">
<a href="https://i.imgur.com/jWY4nOd.png" target="_blank">
@@ -76,9 +82,9 @@ Select each option, and observe some peculiarities in how ChonkViz breaks text i
</figure>
<br>
Each strategy comes with its own benefits and drawbacks. Character-based splitting is often one of the easiest strategies to implement because OCR and text extraction ultimately produce characters. Token-based splitting is useful when keeping chunk sizes consistent for a specific model matters most. Sentence and recursive strategies are often better at preserving complete thoughts, although real-world documents do not always follow clean sentence boundaries.
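As a rough sketch of token-based splitting, the snippet below uses **tiktoken** to chunk on token boundaries rather than characters; the encoding name, chunk size, and input file are assumptions for illustration, and tiktoken may not be installed on the lab VM:

```python
import tiktoken  # pip install tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 256, encoding_name: str = "cl100k_base") -> list[str]:
    """Split text into chunks of at most `chunk_size` tokens, then decode
    each token window back into a string."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]

# Hypothetical local copy of the novel used later in this objective.
text = open("blindsight.txt", encoding="utf-8").read()
chunks = chunk_by_tokens(text, chunk_size=256)
print(f"{len(chunks)} chunks; first chunk:\n{chunks[0][:200]}")
```

Because token counts, not characters, drive the split, every chunk here fits the same token budget regardless of how long the individual words are.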
Explore one more chunking example using a larger document. Open your provided copy of *Blindsight* by Peter Watts in `.txt` format, paste its contents into ChunkViz, and then continue experimenting with chunk sizes from `64` up to `1024` using different strategies. Notice how different chunk sizes and separators change the resulting structure.
<figure style="text-align: center;">
<a href="https://i.imgur.com/M51ASNK.png" target="_blank">
@@ -87,20 +93,22 @@ Lets explore one more facet of chunking, this time through the process of how ch
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Chapter 1 - 1024 Chunks, Recursive Character. This strategy breaks paragraphs apart cleanly.
</figcaption>
</figure>
<br>
Imagine how difficult it would be to retrieve the right information if your chunks were too small, too large, or split in unnatural locations.
---
## Objective 2 Explore: Embedding Space
Now that we have seen some of the trade-offs involved in chunking, we can move to the next major step in a RAG pipeline: embedding. As discussed during lecture, embedding is the process of converting text into a numerical representation that captures the meaning of the content. Instead of treating text as raw strings, embedding models map each chunk into an n-dimensional space where semantically similar content ends up closer together.
This allows a system to perform similarity search efficiently. When a user submits a query, the query is embedded into the same vector space, and the system retrieves the chunks whose embeddings are closest to it. This differs from how embeddings are used internally by an LLM for attention and transformation, but it is the key step that allows a RAG system to retrieve information based on meaning rather than simple keyword matching.
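To ground the idea, here is a minimal similarity-search sketch using the `sentence-transformers` library; the model name, example chunks, and query are assumptions for illustration, not the embedding model this lab actually uses:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Assumed small embedding model; the lab's actual model may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "The attacker harvested credentials from the browser password store.",
    "A scheduled task was created to persist the implant across reboots.",
    "Large volumes of data were compressed and staged before exfiltration.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

query = "How did the adversary maintain persistence on the host?"
query_vec = model.encode([query], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity reduces to a dot product.
scores = chunk_vecs @ query_vec
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {chunks[idx]}")
```

The highest-scoring chunk should be the scheduled-task scenario, even though it shares almost no keywords with the query; that is the retrieval-by-meaning behavior RAG depends on.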
Navigate to http://<STUDENT ASSIGNED SYSTEM IP>:5055. Here, we have started a project called Embedding Atlas. Embedding Atlas is a tool that provides interactive visualizations for datasets stored in parquet format. Each chunk in this case is one row in the dataset, allowing us to visualize, cross-filter, and search embeddings and metadata interactively.
<figure style="text-align: center;">
<a href="https://i.imgur.com/8PvcZBP.png" target="_blank">
@@ -114,7 +122,7 @@ Lets explore a real embedding space. Navigate to http://<STUDENT ASSIGNED SYSTE
</figure>
<br>
The `lab4_start.sh` script automatically starts Embedding Atlas and generates embeddings for each `Scenario` in our dataset. In this lab, each scenario is a one-to-three-sentence description of an attacker action.
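If you want to poke at the same data outside the UI, a quick pandas sketch like the one below works against any parquet file; the file path and column names here are assumptions based on what the Atlas view displays, so check the lab4 folder for the real file name:

```python
import pandas as pd  # assumes pandas with parquet support (pyarrow) is installed

# Hypothetical path -- substitute the actual parquet file from the lab4 folder.
df = pd.read_parquet("~/lab4/scenarios.parquet")

print(df.shape)                  # rows = chunks, columns = metadata and projection fields
print(df.columns.tolist())       # expect fields such as Scenario and TTP_Name
print(df["Scenario"].head(3))    # a few example attacker-action snippets
```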
<figure style="text-align: center;">
<a href="https://i.imgur.com/9bGQce8.png" target="_blank">
@@ -123,14 +131,16 @@ The lab4_start.sh script will have automatically started Embedding Atlas, as wel
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Embedding Atlas CLI Backend Example
</figcaption>
</figure>
<br>
Our Embedding Atlas instance has already been preloaded with the primary dataset we will use throughout the rest of the day. Specifically, it pairs hacker scenarios with MITRE ATT&CK tactics, techniques, and procedural IDs. If you are unfamiliar with ATT&CK, it is a framework for categorizing the ways attackers execute malware, move through networks, and act on their objectives. It also provides a rich example corpus for visualizing the embedding process.
<div class="lab-callout lab-callout--info">
<strong>Tip:</strong> Before you begin exploring the visualization, select <code>TTP_Name</code> from the dropdown in the upper-left corner so the clusters are easier to interpret.
</div>
<figure style="text-align: center;">
<a href="https://i.imgur.com/996ukgZ.png" target="_blank">
@@ -144,12 +154,13 @@ To help us visualize groups more clearly, before we start, please be sure to sel
</figure>
<br>
Each color represents a semantically similar concept as defined by the generated embeddings. Explore the embedding space using the following interactions:
1. Select text categories on the right side to isolate a subset of related entries.
2. Alternatively, select any category label in the right-hand column to show only entries associated with that ID.
3. Select any single dot and click `Nearest Neighbor` to surface the datapoints that embed closest to that example.
Note: You can use the mouse wheel to zoom in and out. You can also click and drag the map to center the area you want to inspect.
<figure style="text-align: center;">
<a href="https://i.imgur.com/YkSqT4v.png" target="_blank">
@@ -163,11 +174,7 @@ Note: You can use your mouse wheel to zoom in and out. Additionally, click and
</figure>
<br>
Observe how categories naturally cluster together in the embedding space. In a real RAG pipeline, an LLM can embed a user query in a similar way and retrieve semantically related chunks from the dataset.
<figure style="text-align: center;">
<a href="https://i.imgur.com/zKa6GxD.png" target="_blank">
@@ -181,6 +188,38 @@ Lets visualize similarity in one other way:
</figure>
<br>
When using `Nearest Neighbor`, notice that some of the closest datapoints may still look far apart visually. Think about why that might happen when a high-dimensional space is projected into a lower-dimensional visualization.
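The mismatch comes from flattening many dimensions down to two. The sketch below, using assumed random vectors and scikit-learn's PCA as a stand-in for the map's projection, shows how a point's true nearest neighbor in the full space is not always its nearest neighbor in the 2D view:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 384))              # stand-in for 384-dimensional embeddings
X /= np.linalg.norm(X, axis=1, keepdims=True)

X2 = PCA(n_components=2).fit_transform(X)    # the kind of flattening a 2D map performs

anchor = 0
full_dist = np.linalg.norm(X - X[anchor], axis=1)
proj_dist = np.linalg.norm(X2 - X2[anchor], axis=1)
full_dist[anchor] = proj_dist[anchor] = np.inf  # ignore the anchor itself

print("nearest neighbor in full space:", full_dist.argmin())
print("nearest neighbor in 2D map    :", proj_dist.argmin())
# The two indices frequently disagree, which is why Nearest Neighbor
# selections can look visually distant on the projected map.
```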
If you would like to continue exploring alternative datasets and see how embeddings can flexibly cluster raw data, take a look at [Embedding Atlas' Examples Page](https://apple.github.io/embedding-atlas/examples/). The Wine dataset is a particularly useful example to review before class resumes.
---
## Objective 3 Explore: Full RAG Exploration
At this point, you have explored the first two stages of the full retrieval-augmented generation workflow (a minimal end-to-end sketch follows the list below):
1. Documents are split into chunks.
2. Chunks are embedded into a vector space.
3. A user query is embedded into that same space.
4. The most relevant chunks are retrieved and passed back to a model as context.
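A compact sketch of those four steps, assuming the same hypothetical embedding model as before, a trivial in-memory store, and an illustrative source file, might look like this:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding model

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Chunk the document (fixed-size character chunks with a small overlap).
document = open("incident_report.txt", encoding="utf-8").read()  # hypothetical source
chunks = [document[i:i + 512] for i in range(0, len(document), 512 - 64)]

# 2. Embed every chunk into the vector space.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

# 3. Embed the user query into the same space.
query = "What persistence mechanism did the attacker use?"
query_vec = model.encode([query], normalize_embeddings=True)[0]

# 4. Retrieve the top chunks and hand them to the model as context.
top_k = np.argsort(chunk_vecs @ query_vec)[::-1][:3]
context = "\n\n".join(chunks[i] for i in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt[:500])
```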
Use what you observed in ChunkViz and Embedding Atlas to reason through the following questions:
* How would a chunk that is too small affect retrieval quality?
* How would a chunk that is too large dilute the meaning of an embedding?
* Why might a semantically similar result appear visually distant on a 2D projection?
* How do chunking strategy and embedding quality work together to improve downstream answers?
This objective is meant to connect the lab tools back to the full RAG workflow. The better your chunking choices and embeddings are, the more useful the retrieved context will be for the model that answers the user.
---
## Conclusion
In this lab, we explored three connected ideas that sit at the heart of a RAG system:
1. **Chunking Strategy** - We compared multiple ways to divide text into retrievable units.
2. **Embedding Space** - We visualized how semantically similar content clusters together.
3. **RAG Workflow** - We connected chunking and embeddings to the retrieval step that powers grounded answers.
You should now have a clearer sense of how early design decisions in a RAG pipeline can dramatically influence retrieval quality and final model responses.
@@ -9,6 +9,17 @@ In this lab, we will:
* Generate a dataset with Kiln.ai
* Fine-tune Gemma3 with Unsloth Studio
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on understanding dataset choices and trade-offs.<br />
<strong>Execute</strong> sections focus on building, reviewing, and preparing data for fine-tuning workflows.
</div>
To start this lab, one web service has been preconfigured:
* Unsloth - http://<IP>:8888
You'll need to install Kiln from the following URL - https://github.com/Kiln-AI/Kiln/releases/tag/v0.18.1
## Objective 1 Explore: Public Datasets
While fine-tunes may not have the same impact as they did in the early days of LLMs, they can still provide highly specialized capabilities that let small LLMs, such as those we have used throughout this course, compete with large closed models like ChatGPT and Gemini. They are especially valuable when data must remain private, when the cost of a closed model is too high, or when we want a model focused on a specific RAG dataset.
@@ -58,16 +69,10 @@ Navigate to [GSM8K](https://huggingface.co/datasets/openai/gsm8k). Much like ho
At the heart of each dataset is the pairing of *input* and *result*. In the case of math, this is relatively easy: these are quite literally *question* and *answer* pairs for math problems.
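To see those question/answer pairs programmatically, a short sketch with the Hugging Face `datasets` library (assuming internet access and that the library is installed) might look like this:

```python
from datasets import load_dataset  # pip install datasets

# GSM8K publishes its pairs under the "main" configuration.
gsm8k = load_dataset("openai/gsm8k", "main", split="train")

example = gsm8k[0]
print("INPUT :", example["question"])
print("RESULT:", example["answer"])
```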
Larger datasets, such as [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), utilize more complicated structures, but all still fundamentally follow this same principle. In the case of [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), the inputs are titles and summaries of web pages, with links to the precise web page as scraped from the internet.
<div class="lab-callout lab-callout--info">
<strong>Explore:</strong> Open the <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/viewer/sample-10BT/train" target="_blank" rel="noreferrer">Fineweb sample viewer</a> in a new tab and inspect a subset of this <strong>15 trillion token</strong> dataset directly on Hugging Face.
</div>
#### Open-weight vs. open-source
@@ -5,28 +5,44 @@
# Lab 6 - Evaluation and Red Teaming
In this lab, we will:
* Perform prompt injection against three layers of model protection
* Use Promptfoo to programmatically evaluate a model's security protections
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on manually probing a model, then scaling that same thinking into repeatable evaluation workflows.<br />
Expect this lab to move from hands-on experimentation into structured testing.
</div>
To start this lab, one web service has been preconfigured:
* Promptfoo - http://<IP>:15500
You'll also need to access:
* Open WebUI - https://ai.zuccaro.me/
## Objective 1 Explore: Direct Prompt Injection
For the first part of this lab, we are going to explore direct prompt injection. There are three levels for this challenge:
1. **System Prompt Instructional Guardrail**
2. **System Prompt + Regex**
3. **System Prompt + LLM Evaluation**
Each level will be more difficult than the last, based on how the protection interacts with the generated output.
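As a point of intuition for level 2, a regex guardrail is usually just a post-generation filter over the model's output. The sketch below is a purely hypothetical example; the patterns and refusal text are not the ones used by the hosted challenge:

```python
import re

# Hypothetical blocklist -- the hosted challenge uses its own patterns.
BLOCKED_PATTERNS = [
    re.compile(r"flag\{[^}]*\}", re.IGNORECASE),    # e.g. a leaked flag format
    re.compile(r"system\s+prompt", re.IGNORECASE),  # discussion of the system prompt
]

def regex_guardrail(model_output: str) -> str:
    """Return the output unchanged unless a blocked pattern matches."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(model_output):
            return "I'm sorry, I can't share that."
    return model_output

print(regex_guardrail("The flag is flag{example}"))   # blocked
print(regex_guardrail("Here is a harmless answer."))  # allowed
```

Notice that a small change in formatting, such as the model spelling the secret with spaces between characters, would slip past patterns like these; that brittleness is exactly what you are probing at this level.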
<div class="lab-callout lab-callout--warning">
<strong>Warning:</strong> Due to the limitations of Open WebUI, you will see generated outputs before safety evaluation. A successful jailbreak means the protection missed the final output.
</div>
### Explore: Access the hosted challenge
To access the lab, navigate to https://ai.zuccaro.me and log in with the following credentials:
* `Username` - `student@zuccaro.me`
* `Password` - `Student9205!`
<br>
<figure style="text-align: center;">
<a href="https://i.imgur.com/YSgw3wq.png" target="_blank">
<img
@@ -34,22 +50,28 @@ To access the lab, navigate to https://ai.zuccaro.me. You can log in with the f
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Open WebUI Outside Lab Hosted Challenge
</figcaption>
</figure>
<br>
Good luck and have fun.
<div class="lab-callout lab-callout--info">
<strong>Tip:</strong> Conversations for this Open WebUI instance will not be saved. Ensure you save any interactions you want to keep.
</div>
As you test each protection level, pay attention to how the model behaves before and after the safety check. The goal is not just to trigger unsafe output, but to understand how each layer attempts to prevent it.
---
## Objective 2 Explore: Promptfoo
While manual interaction with a model is often required for a successful jailbreak, it is often unnecessary for a quick vulnerability-scan-style red team. More often, we want confidence that a model will not respond poorly during routine user interactions. For testing a wide set of prompts against a model or application, Promptfoo is an excellent open-source framework for generating and evaluating large sets of mutated prompts.
### Explore: Promptfoo red-team workflow
Promptfoo is available on our lab machine at http://<YOUR STUDENT IP>:15500. We can start by creating a new red-team configuration.
<figure style="text-align: center;">
<a href="https://i.imgur.com/YyP8mwB.png" target="_blank">
@@ -58,15 +80,15 @@ Promptfoo is available on our lab machine at https://<YOUR STUDENT IP>:15500. W
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Promptfoo Home Page
</figcaption>
</figure>
<br>
Promptfoo is designed to be approachable for both beginners and practitioners. Its wizard guides you through configuring the target, selecting datasets and mutation strategies, and tracking execution.
<div class="lab-callout lab-callout--info">
<strong>Tip:</strong> Although the Promptfoo WebUI is convenient, it hides a critical configuration option for this lab inside the YAML file. Please use the provided configuration file: [lab-6-evaluation-and-red-teaming/promptfoo.yaml](content/labs/lab-6-evaluation-and-red-teaming/promptfoo.yaml). Upload it with the <strong>Load Config</strong> button in the lower-left corner, then proceed with the following screenshot steps.
</div>
<figure style="text-align: center;">
@@ -118,7 +140,7 @@ Promptfoo is designed to be easy to use for both beginners and practitioners. I
<br>
Once we select `Start`, Promptfoo handles the rest. Mutations, tests, and results are all tracked by the WebUI. Promptfoo runs can take a significant amount of time, but when they finish you will be presented with a new results screen.
<figure style="text-align: center;">
<a href="https://i.imgur.com/2UopUGj.png" target="_blank">
@@ -132,10 +154,12 @@ Once we select start, Promptfoo handles the rest! Mutations, tests, and results
</figure>
<br>
Promptfoo is highly flexible. Anything that involves mass evaluation of prompts against a model can be performed with the framework. Likewise, we can run an evaluation against a direct Hugging Face dataset. Once again, Promptfoo provides a WebUI, but supplying the direct YAML is often easier.
### Explore: Promptfoo evaluation workflow
<div class="lab-callout lab-callout--info">
<strong>Tip:</strong> Please use the provided evaluation configuration file: [lab-6-evaluation-and-red-teaming/mmlu-promptfoo-config.yaml](content/labs/lab-6-evaluation-and-red-teaming/mmlu-promptfoo-config.yaml). Upload it with the <strong>Load Config</strong> button in the lower-left corner, then proceed with the following screenshot steps.
</div>
@@ -151,7 +175,9 @@ Promptfoo is supremely flexible! Anything that involves mass evaluation of prom
</figure>
<br>
Often, running an evaluation against a known public benchmark provides a more quantitative way to measure the precision loss in your local configuration. This can be especially useful when you are trying to squeeze the best possible performance out of limited hardware.
---
## Conclusion
@@ -162,4 +188,3 @@ In this lab, we performed red team evaluations against a target model:
3. **Promptfoo Evaluation** - We used Promptfoo to benchmark the model against a popular public benchmark, giving us a local point of comparison.
We should now have a better sense of what our next round of fine-tuning should target, or whether we need to explore additional protections for our model.