| order | title | description |
|---|---|---|
| 6 | Lab 6 - Embedding and Chunking | Explore chunking strategies and embeddings, then connect them to retrieval workflows. |
# Lab 6 - Embedding and Chunking
In this lab, we will:
- Explore various chunking strategies
- Explore how embeddings and vectors allow similar concepts to cluster together within n-dimensional spaces
- Connect chunking and embedding concepts to a functional RAG workflow
Explore sections focus on comparison, observation, and reasoning about trade-offs.
Pay close attention to how chunk size and embedding behavior influence later retrieval quality.
To start this lab, two web services have been preconfigured:
- ChunkViz - http://:3000
- Embedding Atlas - http://:5055
## Objective 1 Explore: Chunking Strategy
Chunking is the first step in any RAG pipeline. It is the process of dividing a document into smaller snippets that can later be stored in a database and paired with an embedded representation of that data. Because chunking happens so early in the RAG process, the strategy chosen to create those chunks has an outsized impact on the quality of the embeddings that follow.
Successful chunking is highly dependent on the type of document being processed. In production-grade RAG systems, teams often evaluate multiple strategies across different document types, then route content through the processing path that produces the strongest retrieval results. For this lab, we will use a visualization tool to build intuition for those trade-offs.
In a web browser, navigate to http://:3000. Once loaded, you should see the ChunkViz homepage.
ChunkViz starts with example text that has already been split using a default character-based strategy. In this view, every 200 characters is treated as a chunk. Modify the sliders to set the following values:
- Chunk Size: 256
- Chunk Overlap: 20
Notice how the colors in the text below dynamically change. Each color represents a single chunk, while the green text between unique colors represents the overlap. That overlap increases the likelihood that critical context appears in more than one chunk, improving retrieval resilience.
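The fixed-size splitting with overlap that ChunkViz visualizes can be sketched in a few lines of Python. This is a minimal illustration, not ChunkViz's actual implementation; the 600-character sample document is invented for demonstration.

```python
def chunk_text(text, size=256, overlap=20):
    """Split text into fixed-size character chunks, where consecutive
    chunks share `overlap` characters of context."""
    step = size - overlap  # advance by less than `size` so chunks overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
    return chunks

# Hypothetical 600-character document for demonstration.
doc = "".join(chr(65 + i % 26) for i in range(600))
chunks = chunk_text(doc)
```

With these settings, the last 20 characters of each chunk reappear at the start of the next one, which is exactly the green overlap text ChunkViz highlights.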
Next, explore the major chunking strategies available in ChunkViz:
| Strategy | Description |
|---|---|
| Character Splitter | Splits text into chunks based on a fixed number of characters. |
| Token Splitter | Splits chunks based on tokenization values using tiktoken. |
| Sentence Splitter | Splits chunks into rough sizes based on what the tool interprets as a sentence. |
| Recursive Character | Splits chunks using multiple separators, such as new lines (\n), periods (.), commas (,), or other language-aware section boundaries. |
Select each option and observe the different ways ChunkViz breaks text into chunks.
Each strategy comes with its own benefits and drawbacks. Character-based splitting is often one of the easiest strategies to implement because OCR and text extraction ultimately produce characters. Token-based splitting is useful when keeping chunk sizes consistent for a specific model matters most. Sentence and recursive strategies are often better at preserving complete thoughts, although real-world documents do not always follow clean sentence boundaries.
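To make the recursive strategy concrete, here is a simplified sketch of the idea: try the coarsest separator first, and only fall back to finer ones when a piece is still too large. Real implementations also merge adjacent small pieces back up toward the size limit, which this toy version omits.

```python
def recursive_split(text, max_size=200,
                    separators=("\n\n", "\n", ". ", ", ", " ")):
    """Recursively split on the coarsest separator until pieces fit max_size.
    If every separator is exhausted, the oversize text is returned as-is."""
    if len(text) <= max_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:  # separator not present; try the next, finer one
        return recursive_split(text, max_size, rest)
    chunks = []
    for piece in pieces:
        if len(piece) <= max_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, max_size, rest))
    return [c for c in chunks if c.strip()]

# A paragraph with no newlines falls through to sentence-level splitting.
demo = recursive_split("First sentence. " * 20, max_size=100)
```

Because the sample has no blank lines or newlines, the splitter falls through to the `". "` separator and produces sentence-sized chunks, mirroring what you see in ChunkViz.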
Explore one more chunking example using a larger document. Open your provided copy of Blindsight by Peter Watts in .txt format, paste its contents into ChunkViz, and then continue experimenting with chunk sizes from 64 up to 1024 using different strategies. Notice how different chunk sizes and separators change the resulting structure.
Imagine how difficult it would be to retrieve the right information if your chunks were too small, too large, or split in unnatural locations.
## Objective 2 Explore: Embedding Space
Now that we have seen some of the trade-offs involved in chunking, we can move to the next major step in a RAG pipeline: embedding. As discussed during lecture, embedding is the process of converting text into a numerical representation that captures the meaning of the content. Instead of treating text as raw strings, embedding models map each chunk into an n-dimensional space where semantically similar content ends up closer together.
This allows a system to perform similarity search efficiently. When a user submits a query, the query is embedded into the same vector space, and the system retrieves the chunks whose embeddings are closest to it. This differs from how embeddings are used internally by an LLM for attention and transformation, but it is the key step that allows a RAG system to retrieve information based on meaning rather than simple keyword matching.
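Similarity search usually ranks candidates by cosine similarity between vectors. The sketch below uses tiny hand-made 3-dimensional vectors and invented scenario strings purely for illustration; real embedding models emit hundreds or thousands of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "embeddings" for three chunks (values invented for demonstration).
chunk_vectors = {
    "lateral movement over SMB": [0.9, 0.1, 0.0],
    "phishing email with macro": [0.1, 0.9, 0.1],
    "data exfiltration via DNS": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend this came from embedding a user query

# Retrieve the chunk whose embedding points in the most similar direction.
best = max(chunk_vectors, key=lambda k: cosine(query_vector, chunk_vectors[k]))
```

The retrieval step is just this `max` over similarity scores, scaled up with an approximate nearest-neighbor index so it stays fast over millions of chunks.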
Navigate to http://:5055. Here, we have started a project called Embedding Atlas. Embedding Atlas is a tool that provides interactive visualizations for datasets stored in parquet format. Each chunk in this case is one row in the dataset, allowing us to visualize, cross-filter, and search embeddings and metadata interactively.
The lab4_start.sh script automatically starts Embedding Atlas and generates embeddings for each Scenario in our dataset. In this lab, each scenario is a one-to-three-sentence description of an attacker action.
Our Embedding Atlas instance has already been preloaded with the primary dataset we will use throughout the rest of the day. Specifically, it pairs hacker scenarios with MITRE ATT&CK tactics, techniques, and procedural IDs. If you are unfamiliar with ATT&CK, it is a framework for categorizing the ways attackers execute malware, move through networks, and act on their objectives. It also provides a rich example corpus for visualizing the embedding process.
Select TTP_Name from the dropdown in the upper-left corner so the clusters are easier to interpret.
Each color represents a semantically similar concept as defined by the generated embeddings. Explore the embedding space using the following interactions:
- Select text categories on the right side to isolate a subset of related entries.
- Alternatively, select any category label in the right-hand column to show only entries associated with that ID.
- Select any single dot and click Nearest Neighbor to surface the datapoints that embed closest to that example.
Note: You can use the mouse wheel to zoom in and out. You can also click and drag the map to center the area you want to inspect.
Observe how categories naturally cluster together in the embedding space. In a real RAG pipeline, an LLM can embed a user query in a similar way and retrieve semantically related chunks from the dataset.
When using Nearest Neighbor, notice that some of the closest datapoints may still look far apart visually. Think about why that might happen when a high-dimensional space is projected into a lower-dimensional visualization.
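The distortion you are asked to reason about can be demonstrated with a toy projection: dropping one dimension (a crude stand-in for the dimensionality reduction the Atlas performs) can make a genuinely distant point look like the closest one. The coordinates below are invented for illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

p  = (0.0, 0.0, 0.0)
q1 = (0.1, 0.1, 5.0)   # looks close in 2D, but is far away in 3D
q2 = (1.0, 1.0, 0.0)   # looks farther in 2D, yet is the true 3D neighbor

def project(v):
    """Crude 'projection': discard the third dimension."""
    return v[:2]

looks_closest = q1 if dist(project(p), project(q1)) < dist(project(p), project(q2)) else q2
truly_closest = q1 if dist(p, q1) < dist(p, q2) else q2
```

Here `q1` wins in the flattened view while `q2` is the actual nearest neighbor, which is why Nearest Neighbor results in Embedding Atlas can appear scattered across the 2D map.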
If you would like to continue exploring alternative datasets and see how embeddings can flexibly cluster raw data, take a look at Embedding Atlas' Examples Page. The Wine dataset is a particularly useful example to review before class resumes.
## Objective 3 Explore: Full RAG Exploration
At this point, you have seen the two major preprocessing stages, chunking and embedding, that make retrieval-augmented generation possible. Together they enable the full workflow:
1. Documents are split into chunks.
2. Chunks are embedded into a vector space.
3. A user query is embedded into that same space.
4. The most relevant chunks are retrieved and passed back to a model as context.
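The steps above can be sketched end-to-end in one short program. To keep it self-contained, a bag-of-words count vector stands in for a real embedding model, and the scenario chunks are invented examples in the spirit of the lab dataset.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy 'embedding': one dimension per vocabulary word (word counts).
    A real pipeline would call an embedding model here instead."""
    counts = Counter(text.lower().replace(".", "").split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Documents are split into chunks (here, already sentence-sized).
chunks = [
    "Attackers moved laterally using stolen credentials.",
    "A phishing email delivered the initial payload.",
    "Backups were encrypted by the ransomware.",
]
vocab = sorted({w for c in chunks for w in c.lower().replace(".", "").split()})

# 2. Each chunk is embedded into the shared vector space.
index = [(c, embed(c, vocab)) for c in chunks]

# 3. The user query is embedded into that same space.
query = "phishing email payload"
qvec = embed(query, vocab)

# 4. The most relevant chunk is retrieved as context for the model.
context = max(index, key=lambda pair: cosine(qvec, pair[1]))[0]
```

Swapping the toy `embed` function for a real embedding model and the list for a vector database turns this sketch into the shape of a production RAG retriever.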
Use what you observed in ChunkViz and Embedding Atlas to reason through the following questions:
- How would a chunk that is too small affect retrieval quality?
- How would a chunk that is too large dilute the meaning of an embedding?
- Why might a semantically similar result appear visually distant on a 2D projection?
- How do chunking strategy and embedding quality work together to improve downstream answers?
This objective is meant to connect the lab tools back to the full RAG workflow. The better your chunking choices and embeddings are, the more useful the retrieved context will be for the model that answers the user.
## Conclusion
In this lab, we explored three connected ideas that sit at the heart of a RAG system:
- Chunking Strategy - We compared multiple ways to divide text into retrievable units.
- Embedding Space - We visualized how semantically similar content clusters together.
- RAG Workflow - We connected chunking and embeddings to the retrieval step that powers grounded answers.
You should now have a clearer sense of how early design decisions in a RAG pipeline can dramatically influence retrieval quality and final model responses.