Train open source LLMs with collected knowledge with InstructLab
IMPORTANT: The steps in this tutorial can be run with InstructLab 0.17.1, the most recent stable release at the time of writing. Some commands may have changed since this tutorial was published. If you run into issues, explore the InstructLab README file for help.
InstructLab is a model-agnostic open source AI project that facilitates contributions to large language models (LLMs). It is a new community-based approach to building truly open source LLMs. To learn more about InstructLab, read the article "What is InstructLab, and why do developers need it?"
InstructLab uses a synthetic-data-based alignment tuning method to train LLMs, driven by manually created taxonomies. It provides a process for optimizing and tuning LLMs by collecting knowledge and skills as part of a taxonomy tree.
In this tutorial, you learn how to train an open source LLM to know all about the movie Oppenheimer by creating a knowledge base and training the model on it in InstructLab. Knowledge comprises data and facts, which are supported by documents. The knowledge base that you create in this tutorial isn't currently eligible for contribution to the InstructLab taxonomy tree, but you can still try it out locally to understand how to train a model with a knowledge base.
Prerequisites
You need to install the InstructLab CLI (ilab). You can follow the instructions from my previous tutorial, or you can refer to the instructions in the InstructLab repo to install the ilab CLI.
Step 1. Download the base model
After you have installed the InstructLab CLI on your system, you can start by downloading the base model that you want to train. You can find the supported open source models on Hugging Face. The default is merlinite-7b-lab; this tutorial uses its 4-bit quantized version, merlinite-7b-lab-Q4_K_M.
In the terminal, run the following command to initialize the ilab CLI:
ilab config init --non-interactive
To download the model, run this command:
ilab model download
To download a different base model, run the following command:
ilab model download --repository <huggingface_repo> --filename <filename>.gguf
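For example, to fetch the default model explicitly, the command would look like the following (the repository and filename here match the InstructLab defaults at the time of writing; adjust them if they have changed):
ilab model download --repository instructlab/merlinite-7b-lab-GGUF --filename merlinite-7b-lab-Q4_K_M.gguf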
Step 2. Create the required files
To train an open source model with InstructLab, you need to create a knowledge base in the taxonomy directory. When you initialized the ilab CLI, it automatically cloned the InstructLab taxonomy repository, which is the source of truth for your model training.
To create a knowledge base, you need to create a qna.yaml file and an attribution.txt file. Then, you need to create a public GitHub repo and upload all the knowledge files in Markdown (.md) format.
In the qna.yaml file, you must reference the supporting documents by specifying the repo, the SHA of the commit to your repo, and the glob patterns that match the Markdown files (such as *.md).
Here is the template for creating a knowledge qna.yaml file:
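(The exact schema has changed across InstructLab releases; this sketch follows the knowledge schema used around release 0.17, with illustrative values. Check the taxonomy repo for the current schema.)
created_by: your_github_username      # your GitHub username
domain: movies                        # the domain this knowledge belongs to
task_description: Details and facts about the movie Oppenheimer.
seed_examples:                        # include 5 or more question-and-answer pairs
  - question: Who directed the movie Oppenheimer?
    answer: Christopher Nolan directed the movie Oppenheimer.
  - question: When was the movie Oppenheimer released?
    answer: Oppenheimer was released on July 21, 2023.
document:
  repo: https://github.com/<your_username>/<your_knowledge_repo>   # public repo holding the .md files
  commit: <SHA of the commit that added the markdown files>
  patterns:
    - '*.md'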
Here is the template for creating a knowledge attribution.txt file:
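(The values below are illustrative; list the actual sources of your Markdown documents.)
Title of work: Oppenheimer (film)
Link to work: https://en.wikipedia.org/wiki/Oppenheimer_(film)
Revision: <link to the specific revision of the source you used>
License of the work: CC-BY-SA-4.0
Creator names: Wikipedia Authors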
Create the qna.yaml and attribution.txt files in the taxonomy/knowledge/movies/oppenheimer/ directory of the cloned repo.
It is recommended to have 5 or more examples in the qna.yaml file for the knowledge base. Copy the qna.yaml and attribution.txt files from my GitHub repo into the taxonomy/knowledge/movies/oppenheimer/ directory of the cloned InstructLab taxonomy repo.
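For example, assuming your qna.yaml and attribution.txt are in the current directory and the taxonomy repo was cloned into taxonomy/ (the paths here are illustrative):
mkdir -p taxonomy/knowledge/movies/oppenheimer
cp qna.yaml attribution.txt taxonomy/knowledge/movies/oppenheimer/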
Step 3. Serve the base model
Open two terminals and activate your ilab virtual environment in each. In the first terminal, run the following command to serve the model:
ilab model serve
If you want to serve a different model, run this command:
ilab model serve --model-path <modelpath>.gguf
You should see output similar to the following:
INFO 2024-05-30 17:24:41,256 lab.py:320 Using model 'models/merlinite-7b-lab-Q4_K_M.gguf' with -1 gpu-layers and 4096 max context size.
INFO 2024-05-30 17:24:41,659 server.py:206 Starting server process, press CTRL+C to shutdown server...
INFO 2024-05-30 17:24:41,659 server.py:207 After application startup complete see http://127.0.0.1:8000/docs for API.
Keep this terminal open; the model must stay served while you generate synthetic data, train the model, and test the model in the second terminal.
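Because ilab serves an OpenAI-compatible API on port 8000 (see the log output above), you can also sanity-check the server with a plain HTTP request. A minimal sketch, assuming the default address:
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'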
Step 4. Test the base model output by chatting with it
Now, in the second terminal, run the following command to chat with the base model and see whether it knows about the movie Oppenheimer:
ilab model chat -gm
An interactive shell opens where you can chat with the model. Go ahead and ask the model anything about the movie Oppenheimer, such as “Who starred in the movie Oppenheimer?” or “What are the release dates for Oppenheimer movie?”.
The model output without training for the “Who starred in the movie Oppenheimer?” query looks something like this:
The model output without training for the “What are the release dates for Oppenheimer movie?” query looks something like this:
As you can see from this output, the model was last updated on May 17, 2022, and doesn't have knowledge of newer events. You will now train the model with the Oppenheimer movie details and evaluate the results.
Step 5. Generate synthetic data
Back in the second terminal (exit the chat session first), run the following command:
ilab data generate
The generate command produces 100 samples by default. To generate a different number of samples, run the following command:
ilab data generate --num-instructions <int>
You can see the new synthetic data sets being generated in the output. If you are not satisfied with the generated data set, you can quit the process by pressing Ctrl+C, modify the examples in the qna.yaml file, and then rerun the generate command.
This process will take some time depending on your system; it took about 30 minutes on my M1 MacBook Pro. You can see the ETA in the output.
Once the synthetic data is generated, you will see a summary of how many samples were generated and how many were discarded. Samples might be discarded by the critic model for formatting problems or filtered out by the ROUGE score threshold, which drops samples that are too similar to existing ones.
Example Output:
100 instructions generated, 12 discarded due to format (see generated/discarded_merlinite-7b-lab-Q4_K_M_2024-05-30T17_24_56.log), 1 discarded due to rouge score
A generated directory will also be created in the ilab directory. In it, you can see four files, shown in the example listing after this list:
Discarded data set (log file)
Generated data set (json file)
Train data set (jsonl file)
Test data set (jsonl file)
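For example, the directory contents look something like this (the discarded log name matches the summary above; the other names follow the same pattern, and exact names and timestamps will differ on your machine):
ls generated/
discarded_merlinite-7b-lab-Q4_K_M_2024-05-30T17_24_56.log
generated_merlinite-7b-lab-Q4_K_M_2024-05-30T17_24_56.json
test_merlinite-7b-lab-Q4_K_M_2024-05-30T17_24_56.jsonl
train_merlinite-7b-lab-Q4_K_M_2024-05-30T17_24_56.jsonl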
Step 6. Train the model
Once the synthetic data is ready, all you have to do is run the following command in your terminal to train the model:
ilab model train
If you want to use a GPU for training, you can run the following command:
ilab model train --device 'cuda'
This process will take some time depending on your system and the number of iterations. It took approximately 30 minutes on my M1 MacBook Pro to complete 100 iterations. You can see the ETA in the output.
A new directory will be created in the ilab directory, with a name similar to this: instructlab-merlinite-7b-lab. This directory will have the new model weights and adapters.
Step 7. Test the model
InstructLab can also run basic tests to ensure model correctness. In your terminal, run the following:
ilab model test
The output shows the model's responses before and after training.
If you are training on a macOS computer, you need to quantize the model to run it on your Mac. In the terminal, run the following:
ilab model convert
All the weights and adapters are converted to a quantized GGUF model after running the command. A directory will be created in the ilab directory, with a name similar to this: instructlab-merlinite-7b-lab-trained.
Step 8. Serve and chat with the trained model
Go back to the first terminal where you served the base model and press Ctrl+C to stop serving it. Then, run the following command to serve the newly trained model:
ilab model serve --model-path instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf
In the second terminal, where you generated the synthetic data set, trained the model, and tested it, run the following command to chat with the trained model:
ilab model chat -gm -m instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf
You can now ask the model anything about the movie Oppenheimer and the model should be able to answer it!
The model output after training for the “Who starred in the movie Oppenheimer?” query looks something like this:
The model output after training for the “What are the release dates for Oppenheimer movie?” query looks something like this:
Summary and next steps
In this tutorial, you learned how to create a knowledge base for the movie Oppenheimer. After you set up the InstructLab CLI, you downloaded the base model and trained it using the qna.yaml file. Then, you tested your fine-tuned model and chatted with it.
To get started, join the InstructLab community on GitHub, create other knowledge bases, and contribute them to the InstructLab taxonomy tree by raising a pull request. You can also explore the IBM foundation models in IBM watsonx.ai studio that are designed to support knowledge and skills contributed by the open source community.