Polish Update

This commit is contained in:
c4ch3c4d3
2026-03-30 17:25:32 -06:00
parent 1aa9310bc8
commit 6bcebd55ee
6 changed files with 154 additions and 78 deletions
@@ -9,6 +9,17 @@ In this lab, we will:
* Generate a dataset with Kiln.ai
* Fine-tune Gemma3 with Unsloth Studio
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on understanding dataset choices and trade-offs.<br />
<strong>Execute</strong> sections focus on building, reviewing, and preparing data for fine-tuning workflows.
</div>
To start this lab, one web service has been preconfigured:
* Unsloth - http://<IP>:8888
You'll need to install Kiln from the following URL - https://github.com/Kiln-AI/Kiln/releases/tag/v0.18.1
## Objective 1 Explore: Public Datasets
While fine-tunes may not have the same level of impact as in the early days of LLMs, they can still provide hyper-specialized capabilities, enabling small LLMs such as those we've used throughout the course to compete with large, closed LLMs such as ChatGPT and Gemini. This is especially true for use cases where data needs to stay private, where the costs of a closed model are too high, or where we want a model focused on a specific RAG dataset.
@@ -58,16 +69,10 @@ Navigate to [GSM8K](https://huggingface.co/datasets/openai/gsm8k). Much like ho
At the heart of each data set is the pairing of *input* and *result*. In the case of math, this is relatively easy, as these are quite literally *question* and *answer* pairs to math problems.
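The input/result pairing above can be sketched in code. This is a minimal, illustrative example: the record below merely mimics GSM8K's question/answer shape (it is not pulled from the dataset), and `to_chat_pair` is a hypothetical helper showing how such a pair is typically reshaped into the chat-message format most fine-tuning tools expect.

```python
# A hypothetical record in the question/answer shape GSM8K uses.
record = {
    "question": "Natalia has 48 clips and buys 2 more packs of 24. How many clips does she have?",
    "answer": "48 + 2 * 24 = 96. #### 96",
}

def to_chat_pair(rec):
    """Turn an input/result pair into user/assistant chat messages for fine-tuning."""
    return [
        {"role": "user", "content": rec["question"]},
        {"role": "assistant", "content": rec["answer"]},
    ]

messages = to_chat_pair(record)
print(messages[0]["role"], "->", messages[1]["role"])  # user -> assistant
```

However a dataset is stored on disk, fine-tuning pipelines ultimately consume it as ordered pairs like this.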
Larger datasets, such as [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), utilize more complicated structures, but all still fundamentally follow this same principle. In the case of [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), the inputs are titles and summaries of web pages, with links to the precise web page as scraped from the internet.
<div class="lab-callout lab-callout--info">
<strong>Explore:</strong> Open the <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/viewer/sample-10BT/train" target="_blank" rel="noreferrer">Fineweb sample viewer</a> in a new tab and inspect a subset of this <strong>15 trillion token</strong> dataset directly on Hugging Face.
</div>
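The same input/result principle scales up to web-scale corpora. The sketch below uses hypothetical rows that only mimic Fineweb's row shape (the field names here are illustrative, not the dataset's exact schema) to show a common curation step: filtering scraped pages by length before they ever reach training.

```python
# Hypothetical rows imitating Fineweb's shape: scraped page text plus metadata.
rows = [
    {"url": "https://example.com/a", "text": "Short page.", "token_count": 3},
    {"url": "https://example.com/b", "text": "A longer scraped article. " * 20, "token_count": 100},
]

# A typical curation step: drop pages below a minimum token count.
MIN_TOKENS = 50
kept = [r for r in rows if r["token_count"] >= MIN_TOKENS]
print(len(kept))  # 1
```

Even at 15 trillion tokens, each row is still fundamentally an input (the scraped text) paired with the metadata needed to trace and filter it.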
#### Open-weight vs. open-source