<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
# Lab 3 - Open WebUI & Prompting
In this lab, we will:
* Run Open WebUI
* Use an Ollama model within Open WebUI
* Experiment with inference parameters
* Experiment with prompting techniques
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on investigation and comparison.<br />
<strong>Execute</strong> sections require running steps and validating output.
</div>
## Objective 1 Execute: Accessing Open WebUI
Your lab machine has been pre-installed with Open WebUI. It is accessible at your provided system IP on port 8080 (`http://<IP>:8080`). You can log in or register with the following default credentials:
Username: student@openwebui.com
Password: student
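Before opening a browser, you can sketch a quick reachability check from any machine on the lab network. The IP below is a placeholder; substitute your assigned lab IP.

```python
import urllib.request
import urllib.error

def webui_url(ip: str, port: int = 8080) -> str:
    """Build the Open WebUI address from the lab machine's IP."""
    return f"http://{ip}:{port}"

def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if the server answers an HTTP request at all."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # server responded, just with an error status
    except (urllib.error.URLError, OSError):
        return False  # nothing listening / host unreachable

url = webui_url("192.0.2.10")  # placeholder IP; use your lab machine's
print(url, "reachable:", is_reachable(url))
```

If the check returns `False`, confirm you are on the lab network and that the IP and port match your provided details.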
<figure style="text-align: center;">
<a href="https://i.imgur.com/nwk73eW.png" target="_blank">
<img
src="https://i.imgur.com/nwk73eW.png"
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Initial Registration
</figcaption>
</figure>
<br>
## Objective 2 Execute: Downloading Our First Model through Open WebUI (OUI)
Locate, pull, and run **Qwen3.5 4B** using **Open WebUI**. By default, Open WebUI comes pre-configured to talk to a local install of Ollama, a legacy configuration from the project's original intent (it was first released as Ollama-WebUI). By the end of this section you should be able to start a model with a single click and generate a response in the UI.
### Execute: Download Qwen 3.5 4B
1. **Open the Ollama model registry**
* Go to <https://ollama.com> in your web browser.
* Locate the search box at the top of the page.
<figure style="text-align:center;">
<a href="https://i.imgur.com/btkT9IH.png" target="_blank">
<img src="https://i.imgur.com/btkT9IH.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
</a>
<figcaption>Ollama homepage: use the search bar to look for “Qwen3.5”.</figcaption>
</figure>
2. **Find the Qwen 3.5 family**
* Type **`Qwen 3.5`** and press **Enter**.
* The results page lists several parameter sizes (1B → 27B).
3. **Navigate to the list of tags**
* Click the **`Tags`** link beneath the model description.
<figure style="text-align:center;">
<a href="https://i.imgur.com/TuUbK7O.png" target="_blank">
<img src="https://i.imgur.com/TuUbK7O.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
</a>
<figcaption>Tag view: each entry shows the model size and a short description.</figcaption>
</figure>
4. **Select the 4B variant**
* Locate **`Qwen3.5:4b`** in the table.
* The size column reads **`3.4GB`**, the download size and a rough floor on the memory needed for inference.
<figure style="text-align:center;">
<a href="https://i.imgur.com/eaRaqnq.png" target="_blank">
<img src="https://i.imgur.com/eaRaqnq.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
</a>
<figcaption>Model size for `Qwen3.5:4b` (≈ 3.4GB).</figcaption>
</figure>
5. **Copy the model tag**
* Click the **copy-to-clipboard** icon next to the tag (or highlight the text and press **Ctrl+C**).
6. **Open the OpenWebUI interface**
* In a new browser tab, navigate to the URL where your OpenWebUI instance is running (e.g., `http://localhost:8080`).
7. **Pull the model through the UI**
* In the **“Select a model”** dropdown, paste the copied tag into the text field.
* Click **`Pull`**. The UI will display a progress bar while Ollama downloads the GGUF file.
<figure style="text-align:center;">
<a href="https://i.imgur.com/Sf8sSs3.png" target="_blank">
<img src="https://i.imgur.com/Sf8sSs3.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
</a>
<figcaption>OpenWebUI: paste the tag and press “Pull”.</figcaption>
</figure>
8. **Verify the model works**
* Once the download finishes, type a prompt in the chat window (e.g., “Tell me a short, funny story about a cat that learns to code”).
* Press **Enter** and watch the response appear.
<figure style="text-align:center;">
<a href="https://i.imgur.com/30OMNsk.png" target="_blank">
<img src="https://i.imgur.com/30OMNsk.png" width="600"
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
</a>
<figcaption>Successful inference: the model returns a coherent answer.</figcaption>
</figure>
9. **Download Gemma3n e2B**
* While we're downloading models, let's download one more. You can either repeat the previous steps to find and download **Gemma3n e2B**, or just use the following model tag via the Open WebUI search bar:
```bash
ollama pull gemma3n:e2b
```
Google designed the Gemma 3n models for efficient execution on resource-constrained devices such as laptops, tablets, phones, or Nvidia 2080 Super GPUs.
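Why does a 4-billion-parameter model download at only a few gigabytes? Quantization stores each weight in roughly 4 to 5 bits instead of 16. The sketch below is a back-of-the-envelope estimate (the bits-per-weight and overhead factor are illustrative assumptions, not Ollama's exact accounting):

```python
def quantized_weight_gb(n_params_billion: float, bits_per_weight: float,
                        overhead: float = 1.2) -> float:
    """Rough size of quantized model weights in GB, with a fudge factor
    for embedding tables, quantization scales, and file overhead."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# A 4B-parameter model at ~4.85 bits/weight (in the ballpark of q4_K_M)
print(f"{quantized_weight_gb(4, 4.85):.1f} GB")  # prints: 2.9 GB
```

The exact figure on the tags page differs because real quantization schemes mix precisions across layers, but the estimate explains the order of magnitude.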
---
## Objective 3: Inference Settings
### Explore: OUI Inference Parameter Valves
Prior to this lab, we discussed inference settings such as Top K, Top P, and Temperature. Let's quickly review the most common settings to customize:
* `Context Length` - The number of tokens the model is allowed to keep in active memory
* `Temperature` - Rescales token probabilities; higher values boost low-probability tokens, lower values suppress them
* `Top K` - Limits token selection during inference to the `K` most likely candidates
* `Top P` - Limits token selection to the smallest set of tokens whose cumulative probability exceeds `P`
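A minimal sketch of how these three knobs interact inside a sampler (toy vocabulary and scores, not any real model's code):

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=40, top_p=0.9, rng=None):
    """Toy next-token sampler: temperature scaling, then top-k, then
    top-p (nucleus) filtering, then a weighted draw.
    `logits` maps token -> raw score."""
    rng = rng or random.Random(0)
    # 1. Temperature: divide logits, then softmax into probabilities.
    scaled = {t: s / temperature for t, s in logits.items()}
    m = max(scaled.values())
    exp = {t: math.exp(s - m) for t, s in scaled.items()}
    z = sum(exp.values())
    probs = {t: e / z for t, e in exp.items()}
    # 2. Top K: keep only the K most likely tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # 3. Top P: keep the smallest prefix whose cumulative probability
    #    reaches P.
    kept, cum = [], 0.0
    for t, p in ranked:
        kept.append((t, p))
        cum += p
        if cum >= top_p:
            break
    # 4. Renormalize the survivors and draw one.
    total = sum(p for _, p in kept)
    r, acc = rng.random() * total, 0.0
    for t, p in kept:
        acc += p
        if acc >= r:
            return t
    return kept[-1][0]

logits = {"the": 3.0, "a": 2.0, "cat": 1.0, "zebu": -2.0}
print(sample_next(logits, top_k=1))  # greedy: always "the"
```

Raising `temperature` flattens the distribution before the filters run, while `top_k` and `top_p` trim the tail of candidates; the combinations you'll try below change which tokens survive this pipeline.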
Open WebUI allows us to easily modify these parameters on the fly through the chat controls, found on the right hand side next to your user's icon.
<figure style="text-align: center;">
<a href="https://i.imgur.com/Tp4LqGs.png" target="_blank">
<img
src="https://i.imgur.com/Tp4LqGs.png"
style="width: 600px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Chat Controls
</figcaption>
</figure>
<br>
By default, Open WebUI selects the following generically sound options, with the expectation that users have access to modest hardware:
* `Context Length` - 2048
* `Temperature` - 0.8
* `Top K` - 40
* `Top P` - 0.9
While we won't play with `Context Length`, this parameter is critical for successfully accomplishing more complicated tasks using local models. With only the small default context length value, the model will quickly forget your instructions and interactions, rendering the results the model generates less useful. Unfortunately, just increasing this value is not always an option, as your selected model + `Context Length` must fit within your available memory. As with many challenges in AI, a key to solving issues with `Context Length` is often scaling your hardware to meet the demands of the task. This generally means utilizing hardware with larger amounts of VRAM or unified memory either by purchasing it or renting access.
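Most of the memory cost of a longer context comes from the KV cache, which grows linearly with context length. A rough sketch of that growth (the layer, head, and dimension figures below are illustrative assumptions, not Qwen's actual architecture):

```python
def kv_cache_gb(ctx_len, n_layers=36, n_kv_heads=8, head_dim=128,
                bytes_per=2):
    """KV cache bytes: 2 (one K and one V tensor) * layers * kv heads
    * head dimension * context length * bytes per element (fp16 = 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per / 1e9

for ctx in (2048, 8192, 32768):
    print(f"{ctx:>6} tokens -> {kv_cache_gb(ctx):.2f} GB")
```

Note the linear scaling: quadrupling the context quadruples the cache, which is why "just turn up Context Length" eventually collides with your VRAM budget.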
Additionally, these defaults can be overridden by the Ollama model file, which can specify its own "preferred" default hyperparameters. Below are the defaults that ship with the model we've downloaded; feel free to interactively explore the `params` page for the model at this link: [qwen3.5:4b-q4_K_M](https://ollama.com/library/qwen3.5:4b-q4_K_M/blobs/9371364b27a5).
<br>
<figure style="text-align: center;">
<a href="https://i.imgur.com/HfnH17e.png" target="_blank">
<img
src="https://i.imgur.com/HfnH17e.png"
style="width: 600px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Modelfile Defaults
</figcaption>
</figure>
<br>
The best model makers often override the defaults with their own preferred values, as we've just seen. These Qwen-selected defaults are the values they found to produce the best outputs for most tasks. When possible, you'll likely want to stick with them unless you have a very good reason to change them.
Thankfully, our lab gives us just such a reason! We can manually modify these options with the aforementioned Chat Controls. Depending on our end goal, we can either help the model to write more "creatively" or more "precisely" by setting `Temperature`, `Top K`, and `Top P`.
Let's test this with a series of interactions themed around Magic the Gathering. Qwen is considered a multi-modal model, meaning we're not just limited to inputting text! Input the following image, and ask `What is this? What does it do?`
Next, set our inference parameters to the following:
* `Temperature` - 1.1
* `Top K` - 100
* `Top P` - 0.95
Repeat your first interaction, noting the differences in model output. Less "likely" or common words were hopefully selected!
When satisfied, let's next set our inference parameters to the following:
* `Temperature` - 2
* `Top K` - 400
* `Top P` - 0.95
This time the model has likely gone off the rails, answering for an extended period and trailing off incoherently. This is because we increased the likelihood of improbable tokens far beyond the performance thresholds the model's creators tuned for. Let's next test the opposite:
* `Temperature` - Default
* `Top K` - 1
* `Top P` - Default
Feel free to continue exploring with other topics or images. Note how each time we restart our conversation, the model gives us the exact same answer. This is because a Top K of 1 limits the model to selecting only the single most likely token for the provided input! Even with this restriction, however, the model can still occasionally produce different answers due to GPU hardware differences, non-deterministic ordering of floating-point operations, or other similarly subtle effects. Never forget that LLMs are not perfectly deterministic, and even when highly restricted, can output unexpected results.
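The usual culprit behind those rare mismatches is floating-point arithmetic: addition is not associative, so summing the same values in a different order (as parallel GPU reductions may) can produce a slightly different result and flip a nearly-tied argmax. A minimal illustration:

```python
# Floating-point addition is not associative: accumulating the same
# values in a different order can yield a different sum.
vals = [1e16, 1.0, -1e16, 1.0]
left_to_right = sum(vals)       # 1e16 + 1.0 rounds back to 1e16, then
                                # cancellation leaves exactly 1.0
reordered = sum(sorted(vals))   # both 1.0s are absorbed; result is 0.0
print(left_to_right, reordered)  # prints: 1.0 0.0
```

When two candidate tokens have nearly identical probabilities, a last-bit difference like this is enough to change which one "wins" greedy selection.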
<br>
---
## Objective 4: Prompting Techniques
### Explore: Prompt Engineering & System Prompting
<div class="lab-callout lab-callout--warning">
<strong>Warning:</strong> As you explore chat via Open WebUI, ensure you turn <code>think (Ollama)</code> to OFF. <strong>Qwen3.5 4B</strong> is likely to enter an infinite thinking loop for these tasks otherwise, which will require a VM reboot.
<br><br>
Alternatively, choose to perform these steps with <strong>Gemma3n e2B</strong>, which can handle tight environments more gracefully.
</div>
Next, let's review different ways we can coax a model to perform better without fine-tuning or parameter customization. We can do this by "priming" the model with our first prompt in a number of ways:
<br>
* Few Shot Prompting - Providing examples of our desired outcome up front
* Meta Prompting - Providing a guide to reach the desired outcome
* Chain of Thought - Providing the model guidance to think through its response
* Self Criticism - Asking the model to play "devil's advocate" against itself
<br>
Each of these tools can be combined to help achieve a greater effect. Below is a suggested list of Magic the Gathering game design challenges which we can task Qwen 3.5 with, but each will require either some luck, or great prompt engineering. If you have a different topic you're more familiar with, feel free to first use Qwen 3.5 to adapt these challenges to a more familiar theme:
<br>
* Design a black rare creature card that fits thematically and mechanically into a Graveyard Matters Magic the Gathering set. Provide a few existing cards to help give the model a template.
* Design the same card, but this time outline the type, mechanics, tone, and identity
* Invent a new keyword. Have the model reason step by step how the keyword will work within the game
* Review your new keyword for game balance. Have the model challenge its decisions.
<br>
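Under the hood, each of these techniques is just structure in the messages you send. As a sketch, here is how a few-shot request might be assembled for Ollama's `/api/chat` endpoint (payload construction only; sending it requires a running Ollama instance, and the example cards are placeholders, not real Magic cards):

```python
import json

# Few-shot prompting: show the model finished examples before the task.
examples = [
    ("Design a black common creature for a graveyard-themed set.",
     "Gravewaker Rat {1}{B} - Creature, Rat (1/1). "
     "When this creature dies, surveil 1."),
    ("Design a black uncommon creature for a graveyard-themed set.",
     "Tomb Shepherd {2}{B} - Creature, Zombie Cleric (2/2). "
     "Sacrifice another creature: draw a card."),
]

messages = []
for question, answer in examples:
    messages.append({"role": "user", "content": question})
    messages.append({"role": "assistant", "content": answer})
# The real task comes last, in the same shape as the examples.
messages.append({"role": "user",
                 "content": "Design a black rare creature for a "
                            "graveyard-themed set."})

payload = {"model": "qwen3.5:4b", "messages": messages, "stream": False}
print(json.dumps(payload, indent=2)[:300])
```

Meta prompting and chain-of-thought fit the same shape; they simply change the content of the final user message rather than prepending worked examples.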
There is one final prompting tool that we have yet to explore in depth: system prompting. While the `chat controls` menu provides the option to override the default system prompt, Open WebUI provides a powerful flow for "creating" new models with saved system prompts and inference parameters. This is also a great convenience feature, as changing hyperparameters via Chat Controls for every chat becomes tedious. It is especially useful once you've created a system prompt you prefer, or would like to set inference parameters once and reuse them many times.
Let's create a new model by selecting the `Workspace` link, and then selecting the `+` button to create a new model:
<figure style="text-align: center;">
<a href="https://i.imgur.com/TjNyWNa.png" target="_blank">
<img
src="https://i.imgur.com/TjNyWNa.png"
style="width: 600px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Create custom model
</figcaption>
</figure>
<br>
In the new model window, we can customize many different options for our model, even beyond the previously used chat-specific controls. Create a new model named `Qwen 3.5 LLM Demo` by performing the following steps:
1. Set the name to `Qwen 3.5 LLM Demo`
2. Set the Base Model to `Qwen3.5:4b`
3. Provide a system prompt. You can set this to be any task you'd like the model to focus on, or we can stick with our Magic the Gathering theme. Use the following prompt, or for bonus points, have Qwen 3.5 generate one for you.
```text
"You are a creative designer for Magic: The Gathering, tasked with generating new Sliver creature cards. Follow these guidelines to ensure the cards align with the game's mechanics and lore:
Card Outline Structure:
* Name: Give the Sliver a unique name that reflects its abilities or traits (e.g., 'Predatory Sliver', 'Aetherwing Sliver').
* Mana Cost: Assign a mana cost appropriate for the card's power level and complexity. Use standard Magic symbols (e.g., {1}{G}{U}).
* Type Line: Always include 'Creature — Sliver' in the type line.
* Power/Toughness: Set values that balance the card's abilities.
* Abilities: Include one or more keyword abilities, triggered abilities, or static effects. Ensure they synergize with existing Sliver mechanics.
* Flavor Text (optional): Add a short, thematic quote or description to enhance the card's lore.
Sliver Mechanics:
Slivers are a tribe of creatures that share abilities among themselves. Include the phrase 'All Slivers have...' in the ability text to reflect this tribal synergy.
Abilities should be consistent with existing Sliver themes, such as combat enhancements, adaptability, or swarm tactics.
Balance and Creativity:
Ensure the card is balanced for gameplay while introducing innovative mechanics or flavor.
Example:
Name: Swiftwing Sliver
Mana Cost: {2}{W}
Type Line: Creature — Sliver
Power/Toughness: 2/2
Abilities: Flying, All Slivers have flying.
Flavor Text: 'The skies belong to the swift and the bold.'
When provided a name, generate a new Sliver card following this structure."
```
<figure style="text-align: center;">
<a href="https://i.imgur.com/ZtLpw9y.png" target="_blank">
<img
src="https://i.imgur.com/ZtLpw9y.png"
style="width: 600px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
System Prompt Creation
</figcaption>
</figure>
<br>
4. To help generate the best cards, show the `Advanced Params` and set the following to add creativity:
* `Temperature` - 1.1
* `Top K` - 100
* `Top P` - 0.95
* `Think (Ollama)` - Off
Note: While we haven't actively discussed them as a part of this lab, as you play with more advanced inference problems, you may also find the following parameters of interest:
* `Max Tokens` - Limit the possible length of a response to the desired number of tokens
* `num_gpu` - Manually override Ollama's built in layer offload determination. Useful for increasing performance on mixed GPU setups.
* `use_mlock` - Manually force Ollama to ensure all model components are kept within active memory. Useful for smaller systems.
<figure style="text-align: center;">
<a href="https://i.imgur.com/9RcJVjK.png" target="_blank">
<img
src="https://i.imgur.com/9RcJVjK.png"
style="width: 600px; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Custom Parameters
</figcaption>
</figure>
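These same knobs are exposed as an `options` object when talking to Ollama's API directly, which is what Open WebUI does on your behalf. Below is a sketch of the equivalent request body (option names follow Ollama's Modelfile parameters; the payload is only built and printed, not sent):

```python
import json

options = {
    "temperature": 1.1,   # add creativity
    "top_k": 100,
    "top_p": 0.95,
    "num_predict": 512,   # cap response length ("Max Tokens")
    # "num_gpu": 99,      # force layer offload on mixed GPU setups
    # "use_mlock": True,  # pin model memory on smaller systems
}
payload = {
    "model": "qwen3.5:4b",
    "prompt": "Design a new Sliver card named Emberhide Sliver.",
    "options": options,
    "stream": False,
}
print(json.dumps(payload, indent=2))
```

Saving these values on a Workspace model simply bakes this `options` block (plus the system prompt) into every request made with that model.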
<br>
5. When done, hit save. We can now test creating new Sliver cards! Select our newly created model from the chat drop down, and try inventing a few names.
<br>
---
## Conclusion
Throughout this lab, we've explored the fascinating world of Open WebUI and prompt engineering. Let's summarize the key topics we've covered:
1. **Model Selection and Management**: We explored how to download and manage models like Qwen 3.5, understanding their resource requirements and capabilities. This taught us about the practical considerations of working with different model sizes.
2. **Inference Parameters**: We experimented with critical inference parameters including:
- Temperature: Controls randomness in output
- Top K: Limits token selection to top K most likely options
- Top P: Uses nucleus sampling based on cumulative probability
3. **Prompting Techniques**: We examined various prompting strategies:
- Few Shot Prompting: Providing examples of desired outputs
- Meta Prompting: Giving guidance to reach outcomes
- Chain of Thought: Encouraging step-by-step reasoning
- Self Criticism: Having the model evaluate its own responses
4. **System Prompting**: We created custom models with specific system prompts and parameter settings, learning how to tailor LLM behavior for specialized tasks.
These concepts are foundational for effectively working with large language models in real-world applications. Remember that prompt engineering is both an art and a science - it requires understanding both the capabilities of the model and the nuances of human language. As you continue your journey with LLMs, don't hesitate to experiment with different approaches and parameters to find what works best for your specific use cases.