Overview
In this learning path, learn how to use Data Prep Kit (DPK) to prepare data for large language model (LLM) applications.
Skill level
This learning path assumes basic Python skills and uses Google Colab as the cloud-based Jupyter notebook environment.
Estimated time to complete
Approximately 2 hours.
Learning objectives
In this learning path, you will learn:
- The fundamental concepts and features of Data Prep Kit (DPK) for building LLM applications
- The practical aspects of data ingestion
- How to extract data from various sources like PDFs, HTML, and code, and convert the data into tokens suitable for LLMs and vector databases
- Ethical considerations for data preparation, and how transforms such as license filtering, hate, abuse, and profanity (HAP) detection, and personally identifiable information (PII) redaction help users prepare data responsibly
- How to build DPK transforms and integrate them into retrieval-augmented generation (RAG) and fine-tuning pipelines (see the pipeline sketch after this list)
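To give a feel for the transform-pipeline pattern covered in this learning path, here is a minimal, self-contained sketch. It does not use the actual DPK APIs; the `Transform` type alias, `redact_pii`, `whitespace_tokenize`, and the sample regular expression are hypothetical stand-ins that illustrate how independent transforms (for example, redaction and tokenization) can be chained over a batch of documents.

```python
# Minimal sketch of a transform pipeline (hypothetical helpers, not the DPK API).
import re
from typing import Callable

# A "transform" is any callable that maps a list of documents to a new list.
Transform = Callable[[list[dict]], list[dict]]

def redact_pii(docs: list[dict]) -> list[dict]:
    """Replace email addresses with a placeholder (toy PII redaction)."""
    email = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    return [{**d, "text": email.sub("[REDACTED_EMAIL]", d["text"])} for d in docs]

def whitespace_tokenize(docs: list[dict]) -> list[dict]:
    """Attach a naive whitespace token list to each document."""
    return [{**d, "tokens": d["text"].split()} for d in docs]

def run_pipeline(docs: list[dict], transforms: list[Transform]) -> list[dict]:
    """Apply each transform in order, as a data preparation pipeline would."""
    for t in transforms:
        docs = t(docs)
    return docs

if __name__ == "__main__":
    sample = [{"id": "doc1", "text": "Contact jane@example.com for the dataset."}]
    out = run_pipeline(sample, [redact_pii, whitespace_tokenize])
    print(out[0]["text"])    # Contact [REDACTED_EMAIL] for the dataset.
    print(out[0]["tokens"])  # ['Contact', '[REDACTED_EMAIL]', 'for', ...]
```

The key design idea this sketch illustrates is that each transform is independent and composable, which is how the learning path combines steps such as extraction, filtering, redaction, and tokenization into pipelines for RAG and fine-tuning.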
By completing this learning path, you'll be able to apply these skills to real-world data preparation for LLM applications like RAG and fine-tuning.