In this lab, we will:
* Generate a dataset with Kiln.ai
* Fine-tune Gemma3 with Unsloth Studio

<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on understanding dataset choices and trade-offs.<br />
<strong>Execute</strong> sections focus on building, reviewing, and preparing data for fine-tuning workflows.
</div>

To start this lab, one web service has been preconfigured:
* Unsloth - http://<IP>:8888

You'll need to install Kiln from the following URL - https://github.com/Kiln-AI/Kiln/releases/tag/v0.18.1

## Objective 1 Explore: Public Datasets
While fine-tunes may not have the same level of impact as in the early days of LLMs, they can still provide hyper-specialized capabilities, enabling small LLMs such as those we've used throughout the course to compete with large, closed LLMs such as ChatGPT and Gemini. This is especially true for use cases where data needs to stay private, where the costs of a closed model are too high, or where we want a model focused on a specific RAG dataset.

Navigate to [GSM8K](https://huggingface.co/datasets/openai/gsm8k).

At the heart of each dataset is the pairing of *input* and *result*. In the case of math, this is relatively easy, as these are quite literally *question* and *answer* pairs to math problems.
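To make the input/result pairing concrete, here is a minimal sketch of question/answer records serialized to JSONL, a common interchange format for fine-tuning data. The examples are illustrative, not records from the actual GSM8K dataset, though GSM8K-style answers do end with a `#### <number>` marker:

```python
import json

# Minimal sketch of question/answer pairs in the style of GSM8K
# (illustrative toy examples, NOT records from the actual dataset).
pairs = [
    {
        "question": "A pen costs $2 and a notebook costs $5. "
                    "How much do 3 pens and 2 notebooks cost?",
        "answer": "3 * 2 + 2 * 5 = 16. #### 16",
    },
    {
        "question": "A train travels 60 miles per hour for 2 hours. "
                    "How far does it go?",
        "answer": "60 * 2 = 120. #### 120",
    },
]

# JSONL = one JSON object per line; each line is one input/result pair.
jsonl = "\n".join(json.dumps(p) for p in pairs)

for line in jsonl.splitlines():
    record = json.loads(line)
    print(record["question"][:40], "->", record["answer"].split("####")[-1].strip())
```

However a dataset dresses this up, fine-tuning frameworks ultimately consume exactly this shape: a prompt-like field paired with a target completion.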
Larger datasets, such as [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), utilize more complicated structures, but all still fundamentally follow this same principle. In the case of [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), the inputs are titles and summaries of web pages, with links to the precise web page as scraped from the internet.
<div class="lab-callout lab-callout--info">
<strong>Explore:</strong> Open the <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/viewer/sample-10BT/train" target="_blank" rel="noreferrer">Fineweb sample viewer</a> in a new tab and inspect a subset of this <strong>15 trillion token</strong> dataset directly on Hugging Face.
</div>
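Even with Fineweb's richer structure, each record still reduces to the same input-plus-metadata principle. As a rough sketch (the field names below are illustrative, not Fineweb's exact schema), web-scrape records carry the scraped text plus metadata that curation pipelines use for filtering:

```python
# Hypothetical web-scrape records in the spirit of Fineweb.
# Field names are illustrative, NOT Fineweb's exact schema.
records = [
    {
        "url": "https://example.com/graphs",
        "title": "Intro to Graph Theory",
        "text": "A graph is a set of vertices connected by edges...",
        "token_count": 950,
    },
    {
        "url": "https://example.com/spam",
        "title": "Buy now!!!",
        "text": "buy now buy now buy now",
        "token_count": 12,
    },
]

def keep(record, min_tokens=100):
    """Toy quality filter: drop very short pages. A simplified stand-in
    for the heuristics large web datasets apply during curation."""
    return record["token_count"] >= min_tokens

filtered = [r for r in records if keep(r)]
print([r["title"] for r in filtered])
```

The real curation pipelines behind datasets at this scale apply many such filters (language detection, deduplication, quality scoring), but each one is still just a predicate over records of this general shape.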
#### Open‑weight vs. open‑source