---
order: 6
title: Lab 6 - Embedding and Chunking
description: Explore chunking strategies and embeddings, then connect them to retrieval workflows.
---
<!-- breakout-style: instruction-rails -->
<!-- step-style: underline -->
<!-- objective-style: divider -->
# Lab 6 - Embedding and Chunking
In this lab, we will:
- Explore various chunking strategies
- Explore how embeddings and vectors allow similar concepts to cluster together within n-dimensional spaces
- Connect chunking and embedding concepts to a functional RAG workflow
<div class="lab-callout lab-callout--info">
<strong>Lab Flow Guide</strong><br />
<strong>Explore</strong> sections focus on comparison, observation, and reasoning about trade-offs.<br />
Pay close attention to how chunk size and embedding behavior influence later retrieval quality.
</div>
To start this lab, two web services have been preconfigured:
- ChunkViz - {{service-url:chunkviz}}
- Embedding Atlas - {{service-url:embedding-atlas}}
## Objective 1 Explore: Chunking Strategy
Chunking is the first step in any RAG pipeline. It is the process of dividing a document into smaller snippets that can later be stored in a database and paired with an embedded representation of that data. Because chunking happens so early in the RAG process, the strategy chosen to create those chunks has an outsized impact on the quality of the embeddings that follow.
Successful chunking is highly dependent on the type of document being processed. In production-grade RAG systems, teams often evaluate multiple strategies across different document types, then route content through the processing path that produces the strongest retrieval results. For this lab, we will use a visualization tool to build intuition for those trade-offs.
In a web browser, navigate to {{service-url:chunkviz}}. Once loaded, you should see the ChunkViz homepage.
<figure style="text-align: center;">
<a href="https://i.imgur.com/PG6fp1V.png" target="_blank">
<img
src="https://i.imgur.com/PG6fp1V.png"
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
ChunkViz Default Page
</figcaption>
</figure>
<br>
ChunkViz starts with example text that has already been split using a default character-based strategy. In this view, every 200 characters is treated as a chunk. Modify the sliders to set the following values:
- `Chunk Size` - `256`
- `Chunk Overlap` - `20`
<figure style="text-align: center;">
<a href="https://i.imgur.com/9SDyh7I.png" target="_blank">
<img
src="https://i.imgur.com/9SDyh7I.png"
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Chunk Size and Overlap
</figcaption>
</figure>
<br>
Notice how the colors in the text below dynamically change. Each color represents a single chunk, while the green text between unique colors represents the overlap. That overlap increases the likelihood that critical context appears in more than one chunk, improving retrieval resilience.
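Under the hood, a character splitter with overlap is only a few lines of code. The sketch below is illustrative rather than ChunkViz's actual implementation; it shows how a sliding window produces the overlapping chunks highlighted in green:

```python
def chunk_by_characters(text, chunk_size=256, overlap=20):
    """Split text into fixed-size character chunks, where each chunk
    repeats the last `overlap` characters of the previous one."""
    chunks = []
    step = chunk_size - overlap  # advance less than a full chunk each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

sample = ("Chunk overlap repeats a small window of text so that "
          "context straddling a boundary appears in both chunks.")
for chunk in chunk_by_characters(sample, chunk_size=40, overlap=10):
    print(repr(chunk))
```

Because each window starts `chunk_size - overlap` characters after the previous one, the tail of one chunk is always repeated at the head of the next, which is exactly the overlap behavior the sliders control.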
Next, explore the major chunking strategies available in ChunkViz:
| Strategy | Description |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| Character Splitter | Splits text into chunks based on a fixed number of characters. |
| Token Splitter | Splits text into chunks based on token counts computed with **tiktoken**. |
| Sentence Splitter | Splits chunks into rough sizes based on what the tool interprets as a sentence. |
| Recursive Character | Splits text using an ordered list of separators, such as new lines (`\n`), periods (`.`), and commas (`,`), falling back to finer separators when a piece is still too large. |
Select each option and observe the different ways ChunkViz breaks text into chunks.
<figure style="text-align: center;">
<a href="https://i.imgur.com/jWY4nOd.png" target="_blank">
<img
src="https://i.imgur.com/jWY4nOd.png"
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Chunking Strategies
</figcaption>
</figure>
<br>
Each strategy comes with its own benefits and drawbacks. Character-based splitting is often one of the easiest strategies to implement because OCR and text extraction ultimately produce characters. Token-based splitting is useful when keeping chunk sizes consistent for a specific model matters most. Sentence and recursive strategies are often better at preserving complete thoughts, although real-world documents do not always follow clean sentence boundaries.
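To make the recursive strategy concrete, here is a minimal sketch of the core idea: try the coarsest separator first, and fall back to finer separators only when a piece is still larger than the chunk size. (Production splitters, such as LangChain's `RecursiveCharacterTextSplitter`, also merge small pieces back together up to the chunk size; this sketch only splits.)

```python
def recursive_split(text, chunk_size=256,
                    separators=("\n\n", "\n", ". ", ", ", " ")):
    """Illustrative recursive splitter: coarse separators first,
    finer ones only for pieces that remain too large."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard character split.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            chunks.extend(recursive_split(piece, chunk_size, rest))
        elif piece:
            chunks.append(piece)
    return chunks

text = "Intro paragraph.\n\nA long paragraph. It has several sentences. Each one short."
for chunk in recursive_split(text, chunk_size=30):
    print(repr(chunk))
```

Notice that the paragraph break is respected first, and sentence boundaries are only used for the paragraph that would not fit on its own. This is why the recursive strategy tends to preserve complete thoughts better than a fixed character window.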
Explore one more chunking example using a larger document. Open the provided file: [Blindsight.md](/labs/lab-6-embedding-and-chunking/Blindsight.md). Copy the novel text, paste it into ChunkViz, and then continue experimenting with chunk sizes from `64` up to `1024` using different strategies. Notice how different chunk sizes and separators change the resulting structure, especially around paragraph breaks, scene breaks, and chapter headings.
<figure style="text-align: center;">
<a href="https://i.imgur.com/M51ASNK.png" target="_blank">
<img
src="https://i.imgur.com/M51ASNK.png"
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Chapter 1 - 1024 Chunks, Recursive Character. This strategy breaks paragraphs apart cleanly.
</figcaption>
</figure>
<br>
Imagine how difficult it would be to retrieve the right information if your chunks were too small, too large, or split in unnatural locations.
---
## Objective 2 Explore: Embedding Space
Now that we have seen some of the trade-offs involved in chunking, we can move to the next major step in a RAG pipeline: embedding. As discussed during lecture, embedding is the process of converting text into a numerical representation that captures the meaning of the content. Instead of treating text as raw strings, embedding models map each chunk into an n-dimensional space where semantically similar content ends up closer together.
This allows a system to perform similarity search efficiently. When a user submits a query, the query is embedded into the same vector space, and the system retrieves the chunks whose embeddings are closest to it. This differs from how embeddings are used internally by an LLM for attention and transformation, but it is the key step that allows a RAG system to retrieve information based on meaning rather than simple keyword matching.
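The similarity search described above can be sketched with a few toy vectors. The chunk "embeddings" and query vector below are made up for illustration; a real system would obtain them from an embedding model, with hundreds of dimensions rather than four:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 for identical directions, near 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Made-up 4-dimensional "embeddings" for three stored chunks.
chunk_vectors = {
    "attacker enumerates host info": (0.9, 0.1, 0.0, 0.2),
    "malware moves laterally":       (0.1, 0.8, 0.3, 0.0),
    "recipe for sourdough bread":    (0.0, 0.1, 0.1, 0.9),
}

# The user query is embedded into the same space...
query = (0.8, 0.2, 0.1, 0.1)  # e.g. "what system details did the attacker gather?"

# ...and chunks are ranked by how close their embeddings are to the query.
ranked = sorted(chunk_vectors.items(),
                key=lambda kv: cosine_similarity(query, kv[1]),
                reverse=True)
print(ranked[0][0])  # prints "attacker enumerates host info"
```

The retrieval step is nothing more than this ranking: the chunk whose vector points in nearly the same direction as the query vector wins, regardless of whether the two share any keywords.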
Navigate to {{service-url:embedding-atlas}} to open Embedding Atlas, a tool that provides interactive visualizations for datasets stored in Parquet format. Each chunk in our dataset is one row, allowing us to visualize, cross-filter, and search embeddings and metadata interactively.
<figure style="text-align: center;">
<a href="https://i.imgur.com/8PvcZBP.png" target="_blank">
<img
src="https://i.imgur.com/8PvcZBP.png"
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Embedding Atlas Flow Diagram
</figcaption>
</figure>
<br>
The `lab4_start.sh` script automatically starts Embedding Atlas and generates embeddings for each `Scenario` in our dataset. In this lab, each scenario is a one-to-three-sentence description of an attacker action.
<figure style="text-align: center;">
<a href="https://i.imgur.com/9bGQce8.png" target="_blank">
<img
src="https://i.imgur.com/9bGQce8.png"
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Embedding Atlas CLI Backend Example
</figcaption>
</figure>
<br>
Our Embedding Atlas instance has already been preloaded with the primary dataset we will use throughout the rest of the day. Specifically, it pairs hacker scenarios with MITRE ATT&CK tactic, technique, and procedure (TTP) IDs. If you are unfamiliar with ATT&CK, it is a framework for categorizing how attackers deploy malware, move through networks, and act on their objectives. It also provides a rich example corpus for visualizing the embedding process.
<div class="lab-callout lab-callout--info">
<strong>Tip:</strong> Before you begin exploring the visualization, select <code>TTP_Name</code> from the dropdown in the upper-left corner so the clusters are easier to interpret.
</div>
<figure style="text-align: center;">
<a href="https://i.imgur.com/996ukgZ.png" target="_blank">
<img
src="https://i.imgur.com/996ukgZ.png"
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
TTP_Name Grouping
</figcaption>
</figure>
<br>
Each color represents a semantically similar concept as defined by the generated embeddings. Explore the embedding space using the following interactions:
1. Select text categories on the right side to isolate a subset of related entries.
2. Alternatively, select any category label in the right-hand column to show only entries associated with that ID.
3. Select any single dot and click `Nearest Neighbor` to surface the datapoints that embed closest to that example.
Note: You can use the mouse wheel to zoom in and out. You can also click and drag the map to center the area you want to inspect.
<figure style="text-align: center;">
<a href="https://i.imgur.com/YkSqT4v.png" target="_blank">
<img
src="https://i.imgur.com/YkSqT4v.png"
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Single Visible Category - System Information Discovery
</figcaption>
</figure>
<br>
Observe how categories naturally cluster together in the embedding space. In a real RAG pipeline, the system embeds a user query in the same way and retrieves semantically related chunks from the dataset.
<figure style="text-align: center;">
<a href="https://i.imgur.com/zKa6GxD.png" target="_blank">
<img
src="https://i.imgur.com/zKa6GxD.png"
style="width: 50%; display: block; margin-left: auto; margin-right: auto; border: 5px solid black;">
</a>
<figcaption style="margin-top: 8px; font-size: 1.1em;">
Nearest Neighbors
</figcaption>
</figure>
<br>
When using `Nearest Neighbor`, notice that some of the closest datapoints may still look far apart visually. Think about why that might happen when a high-dimensional space is projected into a lower-dimensional visualization.
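One way to see why is with a deliberately simple example. If a naive "projection" keeps only two of four dimensions, any distance carried by the discarded dimensions vanishes from the picture. Real projection methods (such as UMAP) are far smarter than truncation, but no 2D layout can preserve every high-dimensional distance:

```python
import math

def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical 4-D "embeddings" (real spaces have hundreds of dimensions).
A = (0, 0, 0, 0)
B = (2, 0, 0, 0)   # genuinely close to A in the full space
C = (1, 0, 5, 0)   # far from A, but its distance lives in dimension 3

# Naive 2-D "projection": keep only the first two coordinates.
def proj(p):
    return p[:2]

print(dist(A, B), dist(A, C))                          # full space: B is A's nearest neighbor
print(dist(proj(A), proj(B)), dist(proj(A), proj(C)))  # 2-D view: C *looks* closer
```

The nearest-neighbor search in Embedding Atlas runs in the full embedding space, so its results are trustworthy even when the 2D scatterplot suggests otherwise.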
If you would like to continue exploring alternative datasets and see how embeddings can flexibly cluster raw data, take a look at [Embedding Atlas' Examples Page](https://apple.github.io/embedding-atlas/examples/). The Wine dataset is a particularly useful example to review before class resumes.
---
## Objective 3 Explore: Full RAG Exploration
At this point, you have seen the full sequence of stages that makes retrieval-augmented generation possible:
1. Documents are split into chunks.
2. Chunks are embedded into a vector space.
3. A user query is embedded into that same space.
4. The most relevant chunks are retrieved and passed back to a model as context.
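Those four steps can be strung together in a toy end-to-end sketch. The "embedding" here is just a bag-of-words count vector so the example stays self-contained; a real pipeline would call an embedding model and a vector database instead:

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector. A real pipeline
    would call an embedding model here instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Split a document into chunks (fixed-size word windows for brevity).
document = ("The attacker ran discovery commands to list system information. "
            "Later the attacker moved laterally using stolen credentials. "
            "Finally the attacker exfiltrated archives over an encrypted channel.")
words = document.split()
chunks = [" ".join(words[i:i + 10]) for i in range(0, len(words), 10)]

# 2. Embed every chunk and store the pairs as a tiny "vector index".
index = [(chunk, embed(chunk)) for chunk in chunks]

# 3. Embed the user query into the same space.
query_vec = embed("how did the attacker move laterally")

# 4. Retrieve the most similar chunk to pass to the model as context.
best_chunk, _ = max(index, key=lambda pair: cosine(query_vec, pair[1]))
print(best_chunk)
```

Even with this crude embedding, the lateral-movement chunk outranks the others for a lateral-movement question; with a real embedding model, the same mechanism matches on meaning rather than shared words.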
Use what you observed in ChunkViz and Embedding Atlas to reason through the following questions:
- How would a chunk that is too small affect retrieval quality?
- How would a chunk that is too large dilute the meaning of an embedding?
- Why might a semantically similar result appear visually distant on a 2D projection?
- How do chunking strategy and embedding quality work together to improve downstream answers?
This objective is meant to connect the lab tools back to the full RAG workflow. The better your chunking choices and embeddings are, the more useful the retrieved context will be for the model that answers the user.
---
## Conclusion
In this lab, we explored three connected ideas that sit at the heart of a RAG system:
1. **Chunking Strategy** - We compared multiple ways to divide text into retrievable units.
2. **Embedding Space** - We visualized how semantically similar content clusters together.
3. **RAG Workflow** - We connected chunking and embeddings to the retrieval step that powers grounded answers.
You should now have a clearer sense of how early design decisions in a RAG pipeline can dramatically influence retrieval quality and final model responses.