Lab 6 - Evaluation and Red Teaming

In this lab, we will:

Perform Prompt Injection against three layers of model protection
Use PromptFoo to programmatically evaluate a model's security protections

Objective 1 Explore: Direct Prompt Injection

For part 1 of our lab, we're going to explore Direct Prompt Injection. There are three levels for this lab:

System Prompt Instructional Guardrail
System Prompt + Regex
System Prompt + LLM Evaluation

Each level will be more difficult than the last, based on how the protection interacts with the generated output.

Warning: Due to the limitations of Open WebUI, you will see generated outputs BEFORE safety evaluation. A passing answer involves the protection missing the final output.

To access the lab, navigate to https://ai.zuccaro.me. You can log in with the following information:

* `Username` - student@zuccaro.me
* `Password` - Student9205!

Chapter 1 - 1024 Chunks, Recursive Character. This strategy nicely breaks paragraphs up.

Good luck and have fun!

Tip: Conversations for this Open WebUI instance will not be saved. Ensure you take steps to save any interactions you wish to keep!

Objective 2 Explore: PromptFoo

While manual interaction with a model is often required for a successful jailbreak, it is often unnecessary for a quick "Vulnerability Scan" style of red team. Often, we're concerned about ensuring our model won't respond poorly during typical user interactions. For testing a wide set of prompts against a model or application, Promptfoo is a great open source project to empower us with the ability to test a wide set of mutated prompts.

Promptfoo is available on our lab machine at https://:15500. We can start working with Promptfoo by creating a new Red Team configuration.

Promptfoo is designed to be easy to use for both beginners and practitioners. It's wizard will guide us through the process of configuration the tool for our target, selection of datasets and mutations, and track execution.

Tip: Although the Promptfoo WebUI is convenient, it unfortunately hides a critical configuration option within .yaml for our lab. As such, please use this provided configuration - [./content/labs/lab-6-evaluationn-and-red-teaming/promptfoo.yaml](Promptfoo.yaml). You can upload it with the "Load Config" button in the bottom left corner, and then proceed with the following screenshot steps.

Application Details (Direct LLM Testing)

Once we select start, Promptfoo handles the rest! Mutations, tests, and results are all tracked by the WebUI. Promptfoo tests can take a significant period of time! Once done, we'll be provided a new results screen.

Promptfoo is supremely flexible! Anything that involves mass evaluation of prompts against a model can be easily performed using the Promptfoo framework. Likewise, we can run an evaluation against a direct HuggingFace dataset. Once again, PromptFoo provides a WebUI, but providing the direct .yaml is often easier.

Tip: Please use this provided configuration - [./content/labs/lab-6-evaluation-and-red-teaming/mmlu-promptfoo-config.yaml](MMLU-Promptfoo-Config.yaml). You can upload it with the "Load Config" button in the bottom left corner, and then proceed with the following screenshot steps.

Often times, running an Evaluation of a known publicly tested dataset against a copy of your local model can be a more quantitative way to determine the precision loss of your configuration. This can be useful when trying to squeeze the maximum possible performance possible out of your hardware!

Conclusion

In this lab, we performed red team evaluations against a target model:

Direct Injections - We explored different ways to bypass common LLM controls.
Promptfoo Red Teaming - We used Promptfoo for red teaming a large number of prompts against our target model.
Promptfoo Evaluation - We used Promptfoo for benchmarking a model against a popular public benchmark, giving us a local point of comparison.

We should now have a better sense of what our next round of fine-tuning should be, or if we need to explore additional protections for our model!

7.4 KiB Raw Blame History

Lab 6 - Evaluation and Red Teaming

Objective 1 Explore: Direct Prompt Injection

Objective 2 Explore: PromptFoo

Conclusion

7.4 KiB

Raw Blame History