Lab 5 - Dataset Generation and Fine Tuning
In this lab, we will:
- Explore public datasets
- Generate a dataset with Kiln.ai
- Fine-tune Gemma3 with LLaMA Factory
Objective 1: Explore Public Datasets
While fine-tunes may not have the same level of impact as in the early days of LLMs, they can still provide hyper-specialized capabilities that let small LLMs, such as those we've used throughout the course, compete with large, closed LLMs such as ChatGPT and Gemini. This matters for use cases where data needs to stay private, where the costs of a closed model are too high, or where we want a model focused on a specific RAG dataset.
There are multiple ways to generate a useful dataset, including but not limited to:
| # | Method | Typical use‑case | Key advantage |
|---|---|---|---|
| 1 | Manual data collection | Surveys, interviews, domain‑expert annotation | Highest specificity; fully controlled quality |
| 2 | Web scraping | Harvesting public articles, forum posts, code snippets | Scalable; leverages existing web content |
| 3 | APIs & databases | Accessing structured resources (e.g., Wikipedia API, PubMed) | Structured data; often well‑documented |
| 4 | Crowdsourcing | Large‑scale labeling (e.g., image bounding boxes) | Cost‑effective for repetitive tasks |
| 5 | Data augmentation | Expanding a small set of images or text | Improves diversity without new collection |
| 6 | Public datasets | Ready‑made corpora from repositories like HuggingFace | Immediate availability; often pre‑processed |
| 7 | Synthetic data generation | Simulated sensor readings, procedurally generated text | Useful when real data is scarce or sensitive |
Let's at least quickly touch on option 6, public datasets. While they vary in quality, they're a great way to jumpstart a particular focus for a fine-tune. Many are found on https://huggingface.co/datasets, where over 400k datasets are readily accessible for many different tasks, from many different providers, including OpenAI, Nvidia, and more. Much like with models, there are numerous tools we can use to filter these datasets, such as by format, modality, or license.
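If you prefer to search programmatically, the HuggingFace Hub client can query the same index. A minimal sketch, assuming `pip install huggingface_hub` (the search term is just an example):

```python
# Query the HuggingFace dataset index from Python.
from huggingface_hub import list_datasets

# Print the first few dataset IDs matching a search term.
for ds in list_datasets(search="mitre attack", limit=5):
    print(ds.id)
```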
Explore a dataset (GSM8K)
Navigate to GSM8K. Much like how models have model cards, datasets have dataset cards. These perform a similar job, providing:
- Tags
- Example data & a Data Studio button for interacting with the dataset on HuggingFace directly.
- Easy download links (although we can also use `git clone`)
- The description
At the heart of each dataset is the pairing of input and result. In the case of math, this is relatively easy, as these are quite literally question-and-answer pairs for math problems.
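We can see this pairing directly with the `datasets` library. A minimal sketch, assuming `pip install datasets` (the `openai/gsm8k` repo ID and `main` config are current at time of writing):

```python
from datasets import load_dataset

# Pull the GSM8K training split; each record is a question/answer pair.
gsm8k = load_dataset("openai/gsm8k", "main", split="train")

example = gsm8k[0]
print(example["question"])  # input: a grade-school math word problem
print(example["answer"])    # result: worked reasoning ending in "#### <answer>"
```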
Larger datasets, such as Fineweb, use more complicated structures, but all still fundamentally follow this same principle. In the case of Fineweb, the inputs are titles and summaries of web pages, with links to the precise page as scraped from the internet. Feel free to explore a subset of this 15-trillion-token dataset.
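Because Fineweb is far too large to download outright, streaming mode lets us peek at a few records. A hedged sketch, assuming the `HuggingFaceFW/fineweb` repo ID and its `sample-10BT` subset (both current at time of writing):

```python
from itertools import islice

from datasets import load_dataset

# Stream records instead of downloading the full corpus.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True
)

for record in islice(fineweb, 3):
    print(record["url"])         # link to the scraped page
    print(record["text"][:200])  # first 200 characters of the page text
```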
Open‑weight vs. open‑source
One last note on public datasets: a common misconception is that open-weight models are open source.
- Open‑weight models (e.g., Gemma, DeepSeek R1, Qwen) provide publicly released checkpoints but do not include permissive source‑code licenses.
- True open‑source LLMs remain rare; the only notable example at time of writing is INTELLECT‑2, which was built via a distributed "SETI@Home‑style" effort.
Unfortunately, INTELLECT-2 does not compare favorably to existing open-weight models such as Gemma, DeepSeek R1, Qwen, or other bleeding-edge models. When using these open-weight models for corporate purposes, review the license!
Objective 2: Synthetic Dataset Generation
If you can, I strongly encourage you to find ready-made (or easily massaged) datasets that do not require synthetic data; you'll often obtain better results with less effort this way. After all, the original frontier ChatGPT family of models simply scraped the entire internet, every book, scientific papers, and other "pre-made" raw data to help build their first datasets. However, this is often unrealistic, as at minimum we need roughly 1,000 input-output pairs in order to begin fine-tuning, so...
Why Use Synthetic Data?
| Reason | Explanation |
|---|---|
| Data scarcity | Niche domains (e.g., MITRE ATT&CK classification) often lack ≥ 1,000 labeled examples. |
| Scalability | A single large model can produce thousands of examples in minutes, saving manual effort. |
| Quality control | By generating with a larger model than the target (e.g., Gemma‑12B qat → Gemma‑4B), you can distill richer responses within specific domains. |
| Iterative refinement | Kiln lets you rate or repair each pair, turning noisy outputs into a clean training set. |
Execute: Install & Launch Kiln AI
1. Install & Launch Kiln AI
If you haven't yet, download Kiln AI and run the installer for your OS.

- Open Kiln. It should automatically open `http://localhost:3000` in your browser.
- Click `Get Started`.
  *Welcome screen – click "Get Started".*
- Choose `Continue` (or `Skip Tour` if you prefer).
- Dismiss the newsletter prompt (optional).

Kiln is now ready for configuration.
2. Connect Kiln to Ollama
- In Kiln's left-hand Providers panel, click `Connect` under the Ollama entry.
  *Connect to a local or remote Ollama instance.*
- Click `Continue` to confirm the connection.
3. Create a Kiln Project
- Kiln will prompt you to Create a Project. Enter any descriptive name (e.g., `MITRE-ATTACK-FineTune`).
  *Name your project.*
- Press `Create`. You are now inside the project workspace.
4. Define the Fine‑Tuning Task
- Click `Add Task` and fill out the form with the details below.
  - Task name: `ATT&CK Classification`
  - Goal: "Fine-tune Gemma-3-4B so it can map a textual scenario to the correct MITRE ATT&CK technique."
  - System prompt (auto-filled): Kiln will prepend this text to every generation request.

  *Task definition screen.*
- Click `Save Task`. The task now appears in the left-hand Tasks list.
5. Kiln Main Interface Overview
| Sidebar item | Primary use |
|---|---|
| Run | Manually generate one input‑output pair at a time (useful for quick checks). |
| Dataset | View, edit, export, or import the entire collection of pairs. |
| Synthetic Data | Bulk‑generate pairs using a model of your choice. |
| Evals | Run automatic evaluation against a held‑out test set. |
| Settings | Project‑level configuration (e.g., default model, output format). |
When you first open a project, Kiln lands on the Run page.
6. Manual Generation (Run Page)
- In the Run view, set the parameters as shown below (you may substitute a larger model if your hardware permits).
  *Configure the Run settings.*
- Type a scenario description (e.g., "An attacker dumps LSASS memory using Mimikatz") and click `Run`.
- Kiln sends the prompt to the selected Ollama model (by default `gemma3:12b-it-qat`).
- When the model returns an answer, you can rate it from 1 ★ to 5 ★.
  - 5 ★ → Accept and click `Next`.
  - < 5 ★ → Click `Attempt Repair`, edit the response, then `Accept Repair` or `Reject`.

  *Rate a correct response with 5 ★.*
- Repeat until you have a handful of high-quality pairs. This manual step is optional but useful for seeding the dataset with "gold-standard" examples.
7. Bulk Synthetic Data Generation
7.1 Open the Generator
- In the sidebar, click `Synthetic Data` → `Generate Fine-Tuning Data`.
  *Enter the bulk-generation workflow.*
7.2 Generate Top‑Level Topics
- Click `Add Topics`. This will generate top-level topics that follow broad MITRE ATT&CK categories.
- Choose `Gemma-3:12b-it-qat` (or any larger model you prefer).
- Set Number of topics to 8 and click `Generate`.
  *Select model & number of topics.*
- Review the generated list. Delete any unsatisfactory topics (hover → click the trash icon) or click `Add Topics` again to generate more. Alternatively, if additional depth is required, click `Add Subtopics` to drill down into any of the high-level topics Gemma created initially.
  *Final set of 8 topics.*
7.3 Create Input Scenarios for All Topics
- With the topics selected, click `Generate Model Inputs`. Ensure `Gemma-3:12b-it-qat` is still chosen, then affirm your selection. Kiln now asks the model to produce a short scenario description for each topic.
- After the model finishes, review the generated inputs. You may edit any that look off.
7.4 Generate Corresponding Outputs
- Click `Save All Model Outputs`. Kiln now runs the model a second time, using each generated input as the prompt, to produce the output (the ATT&CK technique label).
  *Produce the "output" side and store the pair.*
- The full input-output pairs are automatically added to the project's dataset.
7.5 Review the Completed Dataset
- Switch to the `Dataset` tab.
- You should see a table of 64 pairs (8 topics × 8 samples). Clicking any row opens the same Run view, where you can rate, repair, or delete the pair.
  *Dataset overview with generated pairs.*
8. Dataset Export (Create a Fine-Tune)
- Once you are satisfied with the dataset, you can export it to numerous forms of JSONL via the Fine Tune → `Create a Fine Tune` button.
- Kiln will first ask what format we would like our data exported to. We can leave the default setting of `Download: OpenAI chat format (JSONL)`. Next, select `Create a New Fine-Tuning Dataset`.
- Kiln supports splitting our generated data into a number of buckets, including `Training`, `Test`, and `Validation`. Each of these dataset segments is important to a great fine-tune, but at our generated 64 examples, we don't have the luxury of creating a split. As such, under `Advanced Options`, select 100% training, and click `Create Dataset`.
- We can ignore all further options and select `Download Split`. A new `.jsonl` file will be saved!
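Before moving on, it's worth sanity-checking the export. A minimal sketch (the filename is illustrative; use whatever Kiln saved):

```python
import json

# Each line of an OpenAI chat-format JSONL file is one JSON object:
# {"messages": [{"role": "system"|"user"|"assistant", "content": "..."}]}
with open("kiln_export.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} training pairs")  # expect 64
for message in records[0]["messages"]:
    # Typically: the system prompt, the user scenario, then the ATT&CK label.
    print(message["role"], "->", message["content"][:80])
```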
Objective 3: Fine Tuning with LLaMA Factory
There are many popular options for performing fine-tunes, although many have their drawbacks:
- Unsloth is the most popular solution, but currently does not support multi-GPU setups without a commercial license.
- Axolotl is built off of Unsloth, and does support multi-GPU setups, but often lags behind Unsloth in features and capability.
- Both of these options are also CLI-only. While not the end of the world, it does mean we need to learn how each tool works before we can be productive with it.

While I encourage you to explore both of these tools, they are unfortunately out of scope for this lab. Instead, we're going to use a project that tries to make these tools easier to use: LLaMA Factory. To do so, we'll need to perform some additional setup of our lab environment.
Explore: Touring LLaMA Factory
Although LLaMA Factory does its best to simplify the fine-tuning process, there are still many dials and knobs to turn! Let's take a brief tour of the most important options:
- Model Selection - This area allows us to select any model that we're interested in fine-tuning. LLaMA Factory will handle downloading the FP16 version of the model from HuggingFace for us. Note that while you can fine-tune an already-quantized model, you'll often obtain a better result (as measured by perplexity) by starting with the "raw" model.
- Quantization Selection - Without much better hardware, we will usually be training LoRAs (Low-Rank Adapters). These slightly nudge the parameters of the model in the direction we're interested in. If we need additional headroom, we can instead quantize the base model (e.g., reduce its precision from 16-bit to 4-bit) and then apply LoRA to the quantized model, producing a QLoRA (Quantized LoRA). This approach combines the memory efficiency of quantization with the parameter-efficiency of LoRA.
- Dataset Selection - This is where we can use our custom-made dataset. Unfortunately, adding datasets is a rather manual effort. This lab has already pre-loaded our dataset, but in a normal setup you would copy your JSONL into LLaMA Factory's `data/` directory and register it in `data/dataset_info.json` (see the sketch after this list).
- Train Settings - This is where we configure exactly how our model will be trained. The majority of these settings can stay default until you have a specific need that pushes you down the rabbit hole. In particular, we'll be interested in:
  - Learning Rate - Controls how large an adjustment to the model's weights is made during each step.
  - Epochs - Determines the number of times the training algorithm will iterate over the entire dataset (the default of 3 repeats training three times). Critical to help avoid under- or over-fitting.
  - Cutoff Length - Equivalent to Ollama's context. As always, larger-context training requires more memory.
  - Batch Size - Can speed up training, as long as we have the hardware to support it.
  - Warmup Steps - The number of initial training steps during which the learning rate gradually increases to the set target. Helps with stability.
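For reference, registering a Kiln-style OpenAI chat-format export looks roughly like the entry below. This is a sketch based on LLaMA Factory's documented `data/dataset_info.json` schema; double-check the field names against the version you're running:

```json
{
  "mitre": {
    "file_name": "mitre.jsonl",
    "formatting": "sharegpt",
    "columns": { "messages": "messages" },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant",
      "system_tag": "system"
    }
  }
}
```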
Execute: LLaMA Factory Fine-Tuning
Set the following before we start to fine-tune Gemma:
- Model: `Gemma-3-4B`
- Chat template: `Gemma3`
- Learning Rate: `5e-6`
- Dataset: `mitre`
- Warmup Steps: `100`
- Scroll to the bottom of the page and click `Preview command`. The WebUI is merely a front end for constructing `llamafactory-cli` commands, and this shows exactly what will be run (see the illustrative command below).
- When done reviewing, click `Start`. It will take some time for LLaMA Factory to start its process, as it will first need to download the full FP16 raw `Gemma-3-4B` model files.
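For reference, the previewed command will look roughly like `llamafactory-cli train --stage sft --model_name_or_path <model> --template gemma3 --dataset mitre --finetuning_type lora --learning_rate 5e-6 --warmup_steps 100 --output_dir saves/...`; the flag set shown here is abbreviated and illustrative, and the exact arguments simply mirror whatever you configured in the UI.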
Monitor the loss graph. The graph measures loss per training step (roughly 8k steps: 2.5k examples × 3 epochs), or put simply, how different the model's predicted answer is from our data. It should gradually, logarithmically slope downwards if training is working.
What to Look for in the Loss Curve
- Steady decline → model is learning.
- Rapid flattening early → learning‑rate may be too low or the model is under‑parameterized.
- Loss flattening near zero at the end → possible over‑fitting; consider reducing the number of epochs or adding regularization.
If the curve behaves unexpectedly, you can stop the job, adjust the learning‑rate or warm‑up steps, and restart from the latest checkpoint.
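If you'd rather inspect the curve outside the WebUI, LLaMA Factory writes a `trainer_log.jsonl` into the run's output directory. A hedged sketch (the path and field names are assumptions based on current LLaMA Factory behavior; verify against your install):

```python
import json

import matplotlib.pyplot as plt

steps, losses = [], []
# Illustrative path: adjust to your run's output directory.
with open("saves/Gemma-3-4B/lora/train_run/trainer_log.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        if "loss" in entry:  # skip progress-only entries
            steps.append(entry["current_steps"])
            losses.append(entry["loss"])

plt.plot(steps, losses)
plt.xlabel("training step")
plt.ylabel("loss")
plt.title("Fine-tune training loss")
plt.show()
```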
Once completed, we can scroll back up and:
- Select `Chat`
- Select our newly trained LoRA checkpoint. The name of this checkpoint will match the date on which you performed the lab.
- Click `Load Model`
Scrolling down will show all the options for interacting with the model, as we'd expect in most other interfaces. We have options for changing inference parameters, such as Top-P or Temperature, as well as a place to enter our system prompt. If we're looking to test the model's accuracy with our fine-tune, we'd normally need to ensure these values match the desired end-state values as closely as possible, but we're only going to set the system prompt, as that is most critical for our fine-tune.
Set the system prompt to the one we selected when using Kiln.ai - "Given a description of an attack technique, tactic, or procedure, the model should return only a MITRE ATTACK ID and Name."
| Test Prompt | Expected Output Format |
|---|---|
| "A malicious actor uses PowerShell to download a file from a remote server." | T1059.001 – PowerShell |
| "The adversary exfiltrates data via a compressed archive sent over HTTP." | T1567.001 – Exfiltration Over Web Services |
| "Credential dumping is performed using Mimikatz." | T1003.001 – LSASS Memory |
If we're happy with our final model, lastly we can export the model for easy import into Ollama.
Export the Fine‑Tuned Model
- Switch to the Export tab.
- Choose a directory on your local machine (or a mounted drive) where you want the exported files to live.
- Select one of the following output formats:
  - FP16 Safetensors – a high-quality checkpoint you can load again with LLaMA Factory or Hugging Face.
  - GGUF (4-bit) – a compact file ready for import into Ollama or other GGUF-compatible runtimes.
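If you chose the GGUF export, importing into Ollama takes only a small Modelfile. A sketch, with an illustrative filename and the same system prompt we used above:

```
# Modelfile (the GGUF filename below is illustrative)
FROM ./gemma3-4b-mitre.gguf

# Bake in the system prompt used during fine-tuning.
SYSTEM """Given a description of an attack technique, tactic, or procedure, the model should return only a MITRE ATTACK ID and Name."""
```

Then `ollama create gemma3-mitre -f Modelfile` registers the model, and `ollama run gemma3-mitre` lets you re-test the prompts from the table above.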
Conclusion
In this lab, we completed a full fine-tuning workflow:
- Dataset Generation - We explored public datasets on HuggingFace and used Kiln AI to generate a synthetic dataset for MITRE ATT&CK classification.
- Fine Tuning - We used LLaMA Factory to fine-tune Gemma-3-4B on our generated dataset.
- Validation & Export - We tested the model with sample prompts and exported the fine-tuned model in both FP16 and GGUF formats.
If all has gone well, the model should be much more accurate at identifying MITRE ATT&CK codes from user-supplied scenarios. If not, additional experimentation may be necessary to produce a good fine-tune: playing with the parameters we've discussed, improving and expanding the dataset, or even fine-tuning a larger or better base model can all improve the odds of success.