483 lines
28 KiB
Markdown
483 lines
28 KiB
Markdown
---
|
||
order: 7
|
||
title: Lab 7 - Dataset Generation and Fine Tuning
|
||
description: Review dataset options, generate examples with Kiln.ai, and fine-tune a model in Unsloth.
|
||
---
|
||
|
||
<!-- breakout-style: instruction-rails -->
|
||
<!-- step-style: underline -->
|
||
<!-- objective-style: divider -->
|
||
|
||
# Lab 7 - Dataset Generation and Fine Tuning
|
||
|
||
In this lab, we will:
|
||
|
||
- Explore public datasets
|
||
- Generate a dataset with Kiln.ai
|
||
- Fine-tune Gemma3 with Unsloth Studio
|
||
|
||
<div class="lab-callout lab-callout--info">
|
||
<strong>Lab Flow Guide</strong><br />
|
||
<strong>Explore</strong> sections focus on understanding dataset choices and trade-offs.<br />
|
||
<strong>Execute</strong> sections focus on building, reviewing, and preparing data for fine-tuning workflows.
|
||
</div>
|
||
|
||
To start this lab, one web service has been preconfigured:
|
||
|
||
- Unsloth - {{service-url:unsloth}}
|
||
|
||
You'll need to install Kiln from the following URL - https://github.com/Kiln-AI/Kiln/releases/tag/v0.18.1
|
||
|
||
## Objective 1 Explore: Public Datasets
|
||
|
||
While fine tunes may not have the same level of impact as in the early days of LLMs, they can still provide hyper specialized capabilities to enable small LLMs such as those we've used throughout the course to compete with large, closed LLMs such as ChatGPT and Gemini. For use cases where data needs to be private, where the costs of a closed model are too high, or we want a model that is focused for a specific RAG dataset.
|
||
|
||
There are multiple ways to generate a useful dataset, including but not limited to:
|
||
|
||
| # | Method | Typical use‑case | Key advantage |
|
||
| --- | ----------------------------- | ------------------------------------------------------------ | --------------------------------------------- |
|
||
| 1 | **Manual data collection** | Surveys, interviews, domain‑expert annotation | Highest specificity; fully controlled quality |
|
||
| 2 | **Web scraping** | Harvesting public articles, forum posts, code snippets | Scalable; leverages existing web content |
|
||
| 3 | **APIs & databases** | Accessing structured resources (e.g., Wikipedia API, PubMed) | Structured data; often well‑documented |
|
||
| 4 | **Crowdsourcing** | Large‑scale labeling (e.g., image bounding boxes) | Cost‑effective for repetitive tasks |
|
||
| 5 | **Data augmentation** | Expanding a small set of images or text | Improves diversity without new collection |
|
||
| 6 | **Public datasets** | Ready‑made corpora from repositories like HuggingFace | Immediate availability; often pre‑processed |
|
||
| 7 | **Synthetic data generation** | Simulated sensor readings, procedurally generated text | Useful when real data is scarce or sensitive |
|
||
|
||
Let's at least quickly touch on option 6, **Public Datasets**. While they may vary in quality, they're a great way to jumpstart a particular focus for a fine tune. Many are found on https://huggingface.co/datasets, and we can see there are over 400k datasets readily accessible for many different tasks, from many different providers, including [OpenAI](https://huggingface.co/datasets/openai/gsm8k), [Nvidia](https://huggingface.co/datasets/nvidia/Nemotron-CrossThink), and more. Much like with models, there are numerous tools we can utilize to filter these datasets, such as on format, modality, or license.
|
||
|
||
<figure style="text-align: center;">
|
||
<img
|
||
src="https://i.imgur.com/kdnBCyL.png"
|
||
width="600"
|
||
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
|
||
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
|
||
Example Datasets.
|
||
</figcaption>
|
||
</figure>
|
||
|
||
#### Explore a dataset (GSM8K)
|
||
|
||
Navigate to [GSM8K](https://huggingface.co/datasets/openai/gsm8k). Much like how models have **model cards**, datasets have **dataset cards**. These perform a similar job, providing:
|
||
|
||
1. Tags
|
||
2. Example data & a _Data Studio_ button for interacting with the dataset on **HuggingFace** directly.
|
||
3. Easy Download Links (although we can also use `git clone`)
|
||
4. The Description
|
||
|
||
<figure style="text-align: center;">
|
||
<img
|
||
src="https://i.imgur.com/Y55FAPV.png"
|
||
width="600"
|
||
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
|
||
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
|
||
Dataset Model Card Contents.
|
||
</figcaption>
|
||
</figure>
|
||
|
||
At the heart of each data set is the pairing of _input_ and _result_. In the case of math, this is relatively easy, as these are quite literally _question_ and _answer_ pairs to math problems.
|
||
|
||
Larger datasets, such as [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), utilize more complicated structures, but all still fundamentally follow this same principle. In the case of [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), the inputs are titles and summaries of web pages, with links to the precise web page as scraped from the internet.
|
||
|
||
<div class="lab-callout lab-callout--info">
|
||
<strong>Explore:</strong> Open the <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/viewer/sample-10BT/train" target="_blank" rel="noreferrer">Fineweb sample viewer</a> in a new tab and inspect a subset of this <strong>15 trillion token</strong> dataset directly on Hugging Face.
|
||
</div>
|
||
|
||
#### Open‑weight vs. open‑source
|
||
|
||
One last note on public datasets. A common misconception is that _open weight_ models are **open source**.
|
||
|
||
<br>
|
||
|
||
- _Open‑weight_ models (e.g., Gemma, DeepSeek R1, Qwen) provide publicly released checkpoints but **do not** include permissive source‑code licenses.
|
||
- True **open‑source** LLMs remain rare; there are very few models that freely share their Dataset and Training pipeline. Examples are **INTELLECT‑2**, which was built via a distributed "SETI@Home‑style" effort, or Nvidia's **Nemotron 3** family of models.
|
||
|
||
<br>
|
||
|
||
Unfortunately, **INTELLECT‑2** does not favorably compare to existing _open weight_ models such as **Gemma**, **DeepSeek R1**, **Qwen**, or other bleeding edge models. **Nemotron 3** also is behind the State of the Art (SOTA) models, but instead serves as a showcase on how anyone can train models using Nvidia hardware.
|
||
|
||
Regardless of model type though, when using any _open weight_ model for corporate purposes, review the license for allowed use!
|
||
|
||
<br>
|
||
|
||
---
|
||
|
||
## Objective 2: Synthetic Dataset Generation
|
||
|
||
If you can, I strongly encourage you to try and find ready made, or easily massaged datasets that do not require synthetic data. You'll often obtain better results with less effort this way. After all, the original frontier ChatGPT family of models merely scraped the entire internet, every book, scientific papers, and other "pre made" raw data to help generate their first dataset. However, this is often unrealistic, as at minimum, we need **1000** input-output pairs in order to begin fine tuning, so...
|
||
|
||
### Why Use Synthetic Data?
|
||
|
||
| Reason | Explanation |
|
||
| ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
|
||
| **Data scarcity** | Niche domains (e.g., MITRE ATT&CK classification) often lack ≥ 1 000 labeled examples. |
|
||
| **Scalability** | A single large model can produce thousands of examples in minutes, saving manual effort. |
|
||
| **Quality control** | By generating with a _larger_ model than the target (e.g., Gemma‑12B qat → Gemma‑4B), you can distill richer responses within specific domains. |
|
||
| **Iterative refinement** | Kiln lets you rate or repair each pair, turning noisy outputs into a clean training set. |
|
||
|
||
<div class="lab-callout lab-callout--warning">
|
||
<strong>Rule of Thumb:</strong> Never generate data with a model that is smaller than the model you plan to fine-tune.
|
||
</div>
|
||
|
||
---
|
||
|
||
### Execute: Install & Launch Kiln AI
|
||
|
||
### 1. Install & Launch Kiln AI
|
||
|
||
If you haven't yet, download [Kiln AI](https://github.com/Kiln-AI/Kiln/releases/tag/v0.18.1) and run the installer for your OS.
|
||
|
||
<div class="lab-callout lab-callout--info">
|
||
<strong>Tip:</strong> These steps were designed for <strong>Kiln v0.18</strong>. While compatible with newer versions, v0.18 features a polished, simplified UI ideal for this lab. Note that Kiln undergoes active development with frequent UI changes across versions.
|
||
</div>
|
||
|
||
1. **Open Kiln**. It should automatically go to `http://localhost:3000` in your machine's browser.
|
||
2. Click **`Get Started`**.
|
||
|
||
<figure style="text-align:center;">
|
||
<img src="https://i.imgur.com/hJNehuE.png" width="400"
|
||
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
|
||
<figcaption>Welcome screen – click "Get Started".</figcaption>
|
||
</figure>
|
||
|
||
3. Choose **`Continue`** (or **`Skip Tour`** if you prefer).
|
||
4. Dismiss the newsletter prompt (optional).
|
||
|
||
Kiln is now ready for configuration.
|
||
|
||
### 2. Connect Kiln to Ollama
|
||
|
||
1. In Kiln's left‑hand **Providers** panel, click **`Connect`** under the Ollama entry.
|
||
|
||
<div class="lab-callout lab-callout--warning">
|
||
Use your Ollama instance IP to connect (I.E. http://<STUDENT IP>:11434). You must be connected to the VPN for this to work.
|
||
</div>
|
||
|
||
<figure style="text-align:center;">
|
||
<img src="https://i.imgur.com/vEwUszl.png" width="600"
|
||
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
|
||
<figcaption>Connect to a local or remote Ollama instance.</figcaption>
|
||
</figure>
|
||
|
||
2. Click **`Continue`** to confirm the connection.
|
||
|
||
<div class="lab-callout lab-callout--info">
|
||
<strong>Tip:</strong> If you have access to a commercial LLM (for example, OpenAI GPT-4o), you can point Kiln to that endpoint for higher-quality synthetic data by replacing the Ollama URL in <strong>Providers → Connect</strong>.
|
||
</div>
|
||
---
|
||
|
||
### 3. Create a Kiln Project
|
||
|
||
1. Kiln will prompt you to **Create a Project**. Enter any descriptive name (e.g., `MITRE‑ATTACK‑FineTune`).
|
||
|
||
<figure style="text-align:center;">
|
||
<img src="https://i.imgur.com/8CLEp9s.png" width="400"
|
||
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
|
||
<figcaption>Name your project.</figcaption>
|
||
</figure>
|
||
|
||
2. Press **`Create`**. You are now inside the project workspace.
|
||
|
||
---
|
||
|
||
### 4. Define the Fine‑Tuning Task
|
||
|
||
1. Click **`Add Task`** and fill out the form with the details below.
|
||
- **Task name:** `ATT&CK Classification`
|
||
- **Goal:** "Given a description of an attack technique, tactic, or procedure, return only an accurate MITRE ATT&CK ID and Name in the format: "ID# - Technique". "
|
||
- **System prompt (auto‑filled):** Kiln will prepend this text to every generation request.
|
||
|
||
<figure style="text-align:center;">
|
||
<img src="https://i.imgur.com/43o2s0Y.png" width="400"
|
||
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
|
||
<figcaption>Task definition screen.</figcaption>
|
||
</figure>
|
||
|
||
2. Click **`Save Task`**. The task now appears in the left‑hand **Tasks** list.
|
||
|
||
---
|
||
|
||
### 5. Kiln Main Interface Overview
|
||
|
||
| Sidebar item | Primary use |
|
||
| ------------------ | ---------------------------------------------------------------------------- |
|
||
| **Run** | Manually generate one input‑output pair at a time (useful for quick checks). |
|
||
| **Dataset** | View, edit, export, or import the entire collection of pairs. |
|
||
| **Synthetic Data** | Bulk‑generate pairs using a model of your choice. |
|
||
| **Evals** | Run automatic evaluation against a held‑out test set. |
|
||
| **Settings** | Project‑level configuration (e.g., default model, output format). |
|
||
|
||
When you first open a project, Kiln lands on the **Run** page.
|
||
|
||
---
|
||
|
||
## 6 Manual Generation (Run Page)
|
||
|
||
1. In the **Run** view, set the parameters as shown below (you may substitute a larger model if your hardware permits).
|
||
|
||
<figure style="text-align:center;">
|
||
<img src="https://i.imgur.com/vvW0wjk.png" width="600"
|
||
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
|
||
<figcaption>Configure the Run settings.</figcaption>
|
||
</figure>
|
||
|
||
2. Type a **scenario description** (e.g., "An attacker dumps LSASS memory using Mimikatz") and click **`Run`**.
|
||
3. Kiln sends the prompt to the selected Ollama model (by default `gemma3:12b‑it‑qat`).
|
||
4. When the model returns an answer, you can **rate** it from 1 ★ to 5 ★.
|
||
|
||
_5 ★_ → Accept and click **`Next`**.
|
||
_< 5 ★_ → Click **`Attempt Repair`**, edit the response, then **`Accept Repair`** or **`Reject`**.
|
||
|
||
<figure style="text-align:center;">
|
||
<img src="https://i.imgur.com/wqVsYMk.png" width="600"
|
||
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
|
||
<figcaption>Rate a correct response with 5 ★.</figcaption>
|
||
</figure>
|
||
|
||
5. Repeat until you have a handful of high‑quality pairs. This manual step is optional but useful for seeding the dataset with "gold‑standard" examples.
|
||
|
||
---
|
||
|
||
### 7 Bulk Synthetic Data Generation
|
||
|
||
#### 7.1 Open the Generator
|
||
|
||
1. In the sidebar, click **`Synthetic Data` → `Generate Fine-Tuning Data`**.
|
||
|
||
<figure style="text-align:center;">
|
||
<img src="https://i.imgur.com/l6OiUeP.png" width="600"
|
||
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
|
||
<figcaption>Enter the bulk‑generation workflow.</figcaption>
|
||
</figure>
|
||
|
||
#### 7.2 Generate Top‑Level Topics
|
||
|
||
1. Click **`Add Topics`**. This will generate top level topics that follow broad MITRE ATT&CK categories.
|
||
2. Choose **`Gemma-3n-2B`**.
|
||
3. Set **Number of topics** to **8** and click **`Generate`**.
|
||
|
||
<figure style="text-align:center;">
|
||
<img src="https://i.imgur.com/SHh8v0y.png" width="400"
|
||
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
|
||
<figcaption>Select model & number of topics.</figcaption>
|
||
</figure>
|
||
|
||
4. Review the generated list. Delete any unsatisfactory topics (hover → click the trash icon) or click **`Add Topics`** again to generate more. Alternatively, if additoinal depth is required, click **`Add Subtopics`** to drill down deeper into any of the high level topics created by Gemma initially.
|
||
|
||
<figure style="text-align:center;">
|
||
<img src="https://i.imgur.com/wHNv3Om.png" width="800"
|
||
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
|
||
<figcaption>Final set of 8 topics.</figcaption>
|
||
</figure>
|
||
|
||
#### 7.3 Create Input Scenarios for All Topics
|
||
|
||
1. With the topics selected, click **`Generate Model Inputs`**. Ensure **`Gemma-3n-2B`** is still chosen, and then affirm your selection.
|
||
Kiln now asks the model to produce a short _scenario description_ for each topic.
|
||
2. After the model finishes, review the generated inputs. You may edit any that look off.
|
||
|
||
#### 7.4 Generate Corresponding Outputs
|
||
|
||
1. Click **`Save All Model Outputs`**. Kiln now runs the model a second time—this time using each generated input as the prompt—to produce the _output_ (the ATT&CK technique label).
|
||
|
||
<figure style="text-align:center;">
|
||
<img src="https://i.imgur.com/A47GRVr.png" width="800"
|
||
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
|
||
<figcaption>Produce the "output" side and store the pair.</figcaption>
|
||
</figure>
|
||
|
||
2. The full input‑output pairs are automatically added to the project's dataset.
|
||
|
||
#### 7.5 Review the Completed Dataset
|
||
|
||
1. Switch to the **`Dataset`** tab.
|
||
2. You should see a table of 64 (8 topics × 8 samples) pairs. Clicking any row opens the same **Run** view, where you can **rate**, **repair**, or **delete** the pair.
|
||
|
||
<figure style="text-align:center;">
|
||
<img src="https://i.imgur.com/DnyXYJO.png" width="800"
|
||
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
|
||
<figcaption>Dataset overview with generated pairs.</figcaption>
|
||
</figure>
|
||
|
||
---
|
||
|
||
### 8. Dataset Export (Create a Fine-Tune)
|
||
|
||
1. Once you are satisfied with the dataset, you can export it to numerous forms of JSONL via the **Fine Tune → Create a Fine Tune** button.
|
||
|
||
2. Kiln will first ask what format it would like our data to be exported to. We can leave the default setting of *Download: OpenAI chat format (JSONL). Next, select *Create a New Fine-Tuning Dataset.\*
|
||
3. Kiln supports splitting our generated data into a number of buckets, including _`Training`_ _`Test`_ and _`Validation`_. Each of these dataset segments is critical to a great fine tune, but at our generated 64 examples, we don't have the luxury of creating a split. As such, under **`Advanced Options`**, select _100% training_, and click _Create Dataset_.
|
||
|
||
<figure style="text-align:center;">
|
||
<img src="https://i.imgur.com/vp6jobS.png" width="400"
|
||
style="display:block; margin-left:auto; margin-right:auto; border:5px solid black;">
|
||
<figcaption>Dataset overview with generated pairs.</figcaption>
|
||
</figure>
|
||
|
||
4. We can ignore all further options, and select _Download Split_. A new .jsonl file will be saved!
|
||
|
||
---
|
||
|
||
## Objective 3: Fine Tuning with Unsloth Studio
|
||
|
||
There are many popular options for performing fine tunes, although many have their drawbacks:
|
||
|
||
- [Unsloth](https://unsloth.ai) is the most popular solution, but currently does not support multi-gpu setups without a commercial license.
|
||
- [Axoltl](https://axolotl.ai) is built off of Unsloth, and does support multi-gpu setups, but often lags behind Unsloth in features and capability, and does not feature any Web UI.
|
||
- [LLaMaFactory](https://github.com/hiyouga/LLaMA-Factory) is the most flexible of these options, supporting both Unsloth & Axlotle, as well as additional backends. However, this tool is daunting for a beginner to approach fine tuning, and is best left for later experimentation.
|
||
<br>
|
||
While I encourage you to explore all of these tools, they are unfortunately out of the scope for this lab. Instead, we're going to focus on **Unsloth**, as it provides the best web UI to easily navigate the fine tuning process.
|
||
|
||
### Explore: Touring Unsloth Studio
|
||
|
||
Although Unsloth Studio does its best to simplify the fine tuning process, there are still many dials and knobs to turn! Lets take a brief tour of the most important options:
|
||
|
||
1. Model Selection - This area allows us to select any model that we're interested in fine tuning. Unsloth Studio will handle downloading the FP16 version of the model from **HuggingFace** for us.
|
||
2. Quantization Selection - Without much better hardware, we will usually be training **LoRA**s (Low-Rank Adapters). These will slightly nudge the parameters of the model in the direction we're interested in. If we need additional headroom, we can instead **quantize the base model** (e.g., reduce its precision from 16-bit to 4-bit) and then apply **LoRA** to the quantized model, generating a **QLoRA** (Quantized LoRA). This approach combines the efficiency of quantization with the parameter-efficiency of LoRA. Unsloth will conveniently tell us its estimate for how well a given combination of _Model_ & **QLoRA** will fit in our system's available VRAM.
|
||
|
||
<figure style="text-align: center;">
|
||
<img
|
||
src="https://i.imgur.com/XwAdaKJ.png"
|
||
width="800"
|
||
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
|
||
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
|
||
Model & LoRA Type Selections. Note how models are labeled "OOM" or "Tight" based on hardware.
|
||
</figcaption>
|
||
</figure>
|
||
|
||
3. Dataset Selection - This is where we can utilize our custom made dataset. Unfortunately, while we've gone through the process of making a dataset, we had to use a very small model to simulate the process. Conveniently, Unsloth allows us to search for any dataset available publicly on HuggingFace. We can select conveniently select the sarahwei/cyber_MITRE_CTI_dataset_v15 for our purposes. You can select "View Dataset" if you'd like to see some of the raw contents of this data.
|
||
|
||
<figure style="text-align: center;">
|
||
<img
|
||
src="https://i.imgur.com/8xBdcnd.png"
|
||
width="400"
|
||
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
|
||
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
|
||
Dataset Selection
|
||
</figcaption>
|
||
</figure>
|
||
|
||
4. Train Settings - This is where we can configure exactly how our model will be trained. The majority of these settings can stay default, until you've a specific need that pushes you down the rabbit hole. In particular, we'll be interested in
|
||
- **Learning Rate** - Controls how large an adjustment to the model's weights are made during each step
|
||
- **Epoch** - Determines the number of times the training algorithm will iterate over the entire dataset (aka repeats training 3 times by default). Critical to help avoid under or over fitting.
|
||
- **Cutoff length** - Equivalent to Ollama's context. As always, larger context training requires more memory.
|
||
- **Batch Size** - Can speed up training, as long as we have the hardware to support.
|
||
- **Warmup Steps** - The number of initial training steps during which the learning rate gradually increases to the set target. Helps with stability.
|
||
|
||
<figure style="text-align: center;">
|
||
<img
|
||
src="https://i.imgur.com/fzSvggY.png"
|
||
width="400"
|
||
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
|
||
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
|
||
Fine Tuning Settings
|
||
</figcaption>
|
||
</figure>
|
||
|
||
### Execute: Unsloth Studio Fine Tuning
|
||
|
||
Set the following before we start to fine tune Gemma:
|
||
|
||
1. **Model**: `unsloth/gemma-3-270m-it`
|
||
2. **Max Steps**: `100` (NOTE: For real fine tuning, use Epochs, not Steps.)
|
||
3. **Learning Rate**: `0.00005`
|
||
4. **Dataset**: `sarahwei/cyber_MITRE_CTI_dataset_v15`
|
||
5. **Warmup Steps**: `100`
|
||
|
||
- Scroll to the bottom of the page, and click `Preview command`. The WebUI is merely a front end for constructuing `llamafactory-cli` commands, and this shows exactly what will be run.
|
||
- When done reviewing, next click `Start`. It will take some time for Unsloth Studio to start its process, as it will first need to download the full `FP16` raw `Gemma-3-4B` model files.
|
||
|
||
<figure style="text-align: center;">
|
||
<img
|
||
src="https://i.imgur.com/fzSvggY.png"
|
||
width="400"
|
||
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
|
||
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
|
||
Setting Max Steps, Learning Rate, and Warmup Steps
|
||
</figcaption>
|
||
</figure>
|
||
|
||
**Monitor the loss graph** | The graph is measuring **Loss** per **Training step** (roughly 8k steps, 2.5k examples \* 3 epochs), or put simply, how different the model's predicted answer is from our data. This should gradually, logarithmically slope downwards if training is stable.
|
||
|
||
#### What to Look For
|
||
|
||
- **Training Loss:** Decreasing smoothly → model is learning effectively and training is stable
|
||
- **Gradient Norm:** Drops then stabilizes → gradients are well-behaved (no major spikes)
|
||
- **Learning Rate:** Gradually increasing, then eventually decreasing → expected warmup behavior helping stable early training
|
||
|
||
<figure style="text-align: center;">
|
||
<img
|
||
src="https://i.imgur.com/Cue7afQ.png"
|
||
width="600"
|
||
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
|
||
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
|
||
Typical Training Run
|
||
</figcaption>
|
||
</figure>
|
||
|
||
Unfortunately, due to the time constraints of a live classroom, we'll be unable to pursue this training run to completion. On the lab provided GPUs, a full Epoch could take up to two hours! Feel free to cancel it at your leisure.
|
||
|
||
We can however chat with a version of Gemma 3 4B that was trained before this class. It was trained against roughly 60,000 examples, partially generated using kiln, partially harvested from various datasets throughout Huggingface. While not perfect, we can see that the model is signifigantly better than the default.
|
||
|
||
<figure style="text-align: center;">
|
||
<img
|
||
src="https://i.imgur.com/FKZXaV3.png"
|
||
width="600"
|
||
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
|
||
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
|
||
Load Model for Chat
|
||
</figcaption>
|
||
</figure>
|
||
|
||
To test this ourselves, select:
|
||
|
||
1. The chat button at the very top of the screen
|
||
2. Download our model. Its under my personal HuggingFace Account name, c4ch3c4d3
|
||
3. Set the system prompt to the one we selected when using **Kiln.ai** - "Given a description of an attack technique, tactic, or procedure, return only an accurate MITRE ATT&CK ID and Name in the format: "ID# - Technique".
|
||
|
||
<figure style="text-align: center;">
|
||
<img
|
||
src="https://i.imgur.com/GHExjE3.png"
|
||
width="600"
|
||
style="display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
|
||
<figcaption style="margin-top: 8px; font-size: 1.1em; ">
|
||
Test prompt
|
||
</figcaption>
|
||
</figure>
|
||
|
||
| Test Prompt | Expected Output Format |
|
||
| ---------------------------------------------------------------------------- | -------------------------------------------- |
|
||
| "A malicious actor uses PowerShell to download a file from a remote server." | `T1059.001 – PowerShell` |
|
||
| "The adversary exfiltrates data via a compressed archive sent over HTTP." | `T1567.001 – Exfiltration Over Web Services` |
|
||
| "Credential dumping is performed using Mimikatz." | `T1003.001 – LSASS Memory` |
|
||
|
||
The Unsloth chat view is relatively simplistic, but does provide options for changing inference perameters, such as Top-P or Temperature, as well as a location for us to input our system prompt. If we're looking to test the model's accuracy with our fine tune, we normally need to ensure these values match the desired endstate values as closely as possible.
|
||
|
||
### Export the Fine‑Tuned Model
|
||
|
||
<div class="lab-callout lab-callout--warning">
|
||
<strong>Skippable:</strong> These steps are provided for reference as we never successfully finished a fine tune within the lab time period.
|
||
</div>
|
||
|
||
1. Switch to the **Export** tab.
|
||
2. Select the training run of the model you've performed.
|
||
3. Select the latest checkpoint, or if you'd like to explore an alternative, the checkpoint desired.
|
||
4. We can export in a number of formats:
|
||
- **Merged Model** – A BF16 .safetensors format of the model which can be utilized in other projects
|
||
- **LORA** – Only export the LORA adapter layers generated during training. Useful if we wish to share only our new files with other users who already have the model downloaded, but not our fine tune.
|
||
- **GGUF** – A compact file ready for import into **Ollama** or other GGUF‑compatible runtimes.
|
||
|
||
<br>
|
||
|
||
---
|
||
|
||
## Conclusion
|
||
|
||
In this lab, we completed a LoRA fine-tuning workflow:
|
||
|
||
1. **Dataset Generation** - We explored public datasets on HuggingFace and used Kiln AI to generate a synthetic dataset for MITRE ATT&CK classification.
|
||
2. **Fine Tuning** - We used Unsloth Studio to fine-tune Gemma-3-4B on our generated dataset.
|
||
3. **Validation & Export** - We tested the model with sample prompts and exported the fine-tuned model in both FP16 and GGUF formats.
|
||
|
||
If all has gone well, then the model should be much more accurate at identifying MITRE ATT&CK codes from user input scenarios. If not, additional experimentation may be necessary to produce a good fine tune. Playing with the parameters we've discussed, improving and expanding our dataset, or even fine tuning a larger or better base model can also help affect our success rate.
|