Contributing knowledge to the open source Granite model using InstructLab
InstructLab is a community-based approach to building truly open source LLMs. The InstructLab community model is updated on a periodic release cycle, with models and data shared regularly on Hugging Face. Learn more about what InstructLab is and why developers need it in this article on IBM Developer.
In this step-by-step tutorial, you will learn how to contribute knowledge to open source large language models (LLMs) on your personal Linux or Mac device using the latest version (v0.23) of InstructLab. This hands-on guide walks you through the entire process, from setting up your InstructLab environment to adding your first piece of knowledge to an open source LLM, such as the IBM Granite models. After completing this tutorial, you will have gained the necessary skills and knowledge to become a valuable contributor to the InstructLab community and the broader generative AI ecosystem.
While this guide has been tested and verified on Mac M-series hardware (specifically, an Apple M1 Max with 32 GB RAM), it should work in a similar fashion on any other supported platform. Additionally, the steps in this tutorial are based on the InstructLab documentation but focus on showing step-by-step instructions for contributing knowledge using taxonomy version 3 on a Mac.
Prerequisites
Check the Requirements section of the InstructLab documentation for the operating system, hardware, and software prerequisites.
Steps
Step 1. Install InstructLab on a Mac or Linux system
Before you start, make sure that you have Python 3.10 or 3.11 installed. In my environment, I have used Python 3.11.x.
python3.11 --version
# Output
Python 3.11.11
Important: Do not proceed further in this tutorial if the output of the above command is not 3.10.x or 3.11.x. Python 3.12 and Python 3.13 are not yet supported.
If you are using a Mac, you can install Python 3.11 using Homebrew if you don't have the correct Python version. If you are using Red Hat Enterprise Linux, you can use the dnf package manager.
brew install python@3.11
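On Red Hat Enterprise Linux, a comparable command would be the following (a sketch only; the exact package name can vary by RHEL release and enabled repositories):
sudo dnf install python3.11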
Create a new directory called instructlab, and navigate to it.
mkdir instructlab
cd instructlab
Install InstructLab using pip. Create a Python virtual environment, activate it, and then install InstructLab with the default Python package manager, pip.
The llama_cpp_python version cached on your device may not be the version that InstructLab uses, so remove it from the pip cache before you install the InstructLab package.
Make sure that InstructLab is installed. This tutorial is based on the latest version of InstructLab, which is version 0.23 at the time of writing.
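The following is a sketch of these installation commands based on the flow described above; the InstructLab documentation lists platform-specific variants (for example, for Apple Metal acceleration), so check it for your platform.
# Create and activate a Python virtual environment inside the instructlab directory
python3.11 -m venv --upgrade-deps venv
source venv/bin/activate
# Remove any cached llama_cpp_python wheel so the version required by InstructLab is installed
pip cache remove llama_cpp_python
# Install InstructLab from PyPI
pip install instructlab
# Verify the installation and check the installed version
ilab --version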
Step 2. Fork and clone the InstructLab taxonomy repository
InstructLab uses a synthetic-data-based alignment tuning method for large language models (LLMs). It is driven by carefully curated taxonomies that are organized into a taxonomy tree. The taxonomy allows you to create models tuned with your additional skills or knowledge.
Fork https://github.com/instructlab/taxonomy into your own GitHub organization.
The forked repo will look something like this: https://github.com/ahmed-azraq/taxonomy.
Clone the forked repository.
# Replace <YOUR_ORG_NAME> with your GitHub organization name.
git clone https://github.com/<YOUR_ORG_NAME>/taxonomy
Step 3. Initialize InstructLab and chat with the model
Make sure that you do not make any changes to the taxonomy before this step. You will update the taxonomy in subsequent steps with your added knowledge. You first need to initialize the InstructLab environment with the original taxonomy, without any updates to the qna.yaml files, so that InstructLab can detect your changes to the taxonomy in subsequent steps.
Initialize the InstructLab configuration with the freshly forked taxonomy, and keep all the defaults by pressing Enter at each prompt. Since the InstructLab v0.21 release, ilab auto-detects hardware profiles. The following code block shows the defaults for my local environment setup.
ilab config init --taxonomy-path ./taxonomy
Verify that the taxonomy is valid.
ilab taxonomy diff
# Output
Taxonomy in ./taxonomy is valid :)
You must have a Hugging Face account to be able to download models from Hugging Face. Create an access token by clicking + Create new token on the Access Tokens page. For the token type, select Read, then click Create token, and copy the token.
Download compact pre-trained models from Hugging Face, replacing <YOUR_HF_TOKEN> with the token you retrieved in the previous step.
By default, this step downloads the granite-7b-lab-GGUF model as a student model for inference and the Mistral-7B-Instruct-GGUF model as a teacher model for synthetic data generation and model training, which is suitable for running on a Mac M-series laptop. It also downloads merlinite-7b-lab in case you would like to use it for inference. This step downloads around 12 GB of models, so it might take some time depending on your network speed.
# Replace <YOUR_HF_TOKEN> with your Hugging Face token that you retrieved from the previous step
ilab model download --hf-token <YOUR_HF_TOKEN>
Download the full safetensors version of the IBM Granite model; you will use it as the student model for training.
ilab model download --repository instructlab/granite-7b-lab --hf-token <YOUR_HF_TOKEN>
List the downloaded models to verify that the models are downloaded successfully.
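A quick way to do this is with the ilab model list command:
ilab model list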
On a Mac system, InstructLab data is stored in four locations:
~/.local/share/instructlab: Contains the generated synthetic data and training data.
~/.config/instructlab: Contains the configurations.
~/.cache/instructlab/: Contains the downloaded models.
Current working directory: Contains the forked taxonomy, the Python virtual environment, and the trained models.
Serve and chat with the IBM Granite model. This step serves the model and makes it available for end-user consumption; that is, it lets end users interact with the model through chat or even through REST APIs for integration.
It is preferable to specify the model path explicitly to make sure that the model being served is the intended one.
As mentioned above, the downloaded models are located in ~/.cache/instructlab/.
ilab model chat --model ~/.cache/instructlab/models/granite-7b-lab-Q4_K_M.gguf
Notice that the model name served above is mentioned in the output.
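If you also want to expose the model as a REST endpoint, as mentioned above, you can serve it explicitly in a separate terminal. The following is a minimal sketch that assumes the same downloaded GGUF path:
ilab model serve --model-path ~/.cache/instructlab/models/granite-7b-lab-Q4_K_M.gguf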
In this tutorial, the knowledge you are going to add is related to an Egyptian minister, Hikmat Abu Zayd. Now, in this step, you will chat with the existing model to see if the model produces accurate results. In the same terminal window, type an inquiry to the model Who is Hikmat Abu Zayd? as shown in the following screen capture.
Note that the existing model did not identify Hikmat Abu Zayd as the first female Egyptian cabinet minister. Because it lacks this knowledge, it hallucinates and instead claims that Hikmat Abu Zayd is a male chief scientist. Since the model is hallucinating, you may receive different results. You can use prompt engineering to make sure that the model responds only when it has the knowledge, but that's a discussion for a different article.
On a personal note, Hikmat Abu Zayd is my father's aunt, and we all consider her our grandmother and our role model. So it was a bit frustrating to see the model hallucinate when we asked about her, and I decided to take action and contribute the publicly available knowledge about her to the open source IBM Granite large language model through InstructLab.
In the terminal window where you are serving and chatting with the model, type exit and then press Enter to stop serving the model and get ready for the fun part of contributing new knowledge to the model.
Step 4. Add the new knowledge as Markdown on GitHub
In this section, you'll create a GitHub repository to hold the new added knowledge. This is the source knowledge file: Hikmat Abu Zayd from Wikipedia.
Create a new GitHub repository in your GitHub organization to hold the new knowledge, for example instructlab_knowledge.
Convert the Wikipedia article into Markdown format. Try to make the Markdown file readable, and do not use Markdown tables. Commit the new Markdown file (for example, hikmat-abu-zayd/hikmat.md) to your new GitHub repository. This is an example of the Markdown file that I created in my GitHub repository.
You need to find and copy the commit ID for the committed Markdown file. In your GitHub repository user interface, view the commit history at this URL: https://github.com/ahmed-azraq/instructlab_knowledge/commits/main/. Replace ahmed-azraq with your GitHub organization details. Copy the commit ID for the latest commit.
The latest commit ID at the time of writing is f8c9621f2c80449093a790a8c7713bfbc1447bcb, in case you would like to use the same knowledge in your training instead of creating your own Markdown file in your own GitHub repository.
Step 5. Update the forked taxonomy locally to add the new knowledge
In this section, you'll update your forked InstructLab taxonomy repository to add this new knowledge.
In a previous step, you cloned the forked taxonomy into a local directory, the instructlab/taxonomy folder. Create a new directory in the taxonomy to hold your newly added knowledge.
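For example, assuming the same taxonomy path used in this tutorial (it matches the path that appears later in the ilab taxonomy diff output), you can create the directory as follows:
cd taxonomy
mkdir -p knowledge/history/biography/egypt/hikmat_abu_zayd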
In that directory, use your favorite editor (preferably one that understands YAML like Visual Studio Code) to create these new files: qna.yaml and attribution.txt.
The qna.yaml file includes details about the new knowledge that you are going to contribute. Review the main mandatory elements of the qna.yaml schema in the InstructLab repo. Most importantly, you must provide at least 5 seed examples, and each context must include at least 3 question-and-answer pairs. You must also use version 3, as this is the latest taxonomy version for knowledge submissions at the time of writing this tutorial. You can use the example qna.yaml that I created and stored in my GitHub organization and repo here.
The attribution.txt file includes the public link to the source of the knowledge, the license details of the knowledge being contributed, and the specific revision used. See the CONTRIBUTING.md for information about what needs to be specified in the attribution.txt file. You can use the example attribution.txt file that I created and stored in my GitHub organization and repo located here.
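For illustration only, here is an abbreviated sketch of what the two files could look like for this contribution. The field values (context text, revision, and license) are placeholders or taken from earlier steps; refer to the qna.yaml schema and CONTRIBUTING.md in the InstructLab repository for the authoritative format.
# Abbreviated qna.yaml sketch (taxonomy version 3 knowledge schema); at least 5 seed examples,
# each with at least 3 question-and-answer pairs, are required.
cat > knowledge/history/biography/egypt/hikmat_abu_zayd/qna.yaml <<'EOF'
version: 3
domain: history
created_by: <YOUR_GITHUB_USERNAME>
document_outline: Biography of Hikmat Abu Zayd, the first female Egyptian cabinet minister
seed_examples:
  - context: |
      Hikmat Abu Zayd was the first female cabinet minister of Egypt.
    questions_and_answers:
      - question: Who is Hikmat Abu Zayd?
        answer: Hikmat Abu Zayd was the first female cabinet minister of Egypt.
      # ...at least 3 question-and-answer pairs per context
  # ...at least 5 seed examples in total
document:
  repo: https://github.com/<YOUR_ORG_NAME>/instructlab_knowledge
  commit: f8c9621f2c80449093a790a8c7713bfbc1447bcb
  patterns:
    - hikmat-abu-zayd/*.md
EOF
# Abbreviated attribution.txt sketch; check CONTRIBUTING.md for the exact fields expected.
cat > knowledge/history/biography/egypt/hikmat_abu_zayd/attribution.txt <<'EOF'
Title of work: Hikmat Abu Zayd
Link to work: <LINK_TO_THE_SOURCE_WIKIPEDIA_ARTICLE>
Revision: <REVISION_OR_DATE_OF_THE_SOURCE_ARTICLE>
License of the work: <LICENSE_OF_THE_SOURCE_ARTICLE>
Creator names: Wikipedia contributors
EOF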
Use any tool to validate the YAML and fix any issues. You can use online tools (if suitable), such as a YAML or JSON formatter, to validate and format the file.
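If you prefer a local check over an online tool, one option is yamllint, an optional third-party linter that is not part of InstructLab:
pip install yamllint
yamllint -d relaxed knowledge/history/biography/egypt/hikmat_abu_zayd/qna.yaml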
Verify that InstructLab detects the new taxonomy change you created and that it has valid syntax, by using the same command used earlier.
ilab taxonomy diff
# Output
knowledge/history/biography/egypt/hikmat_abu_zayd/qna.yaml
Taxonomy in ./taxonomy is valid :)
Notice that the command now detects the newly created qna.yaml and verifies that its syntax is valid. Make sure to remove any trailing spaces on each line and fix any issues reported by the above command before proceeding further.
The InstructLab community has developed the InstructLab UI to provide an easier method for creating and validating the qna.yaml and attribution.txt files. Check out this tutorial on how to contribute to open source LLMs, such as Granite, using the InstructLab UI.
Step 6. Generate synthetic data
The beauty of InstructLab is that you don't have to create tons of training data. In this step, you generate synthetic data for training using InstructLab. The teacher model generates question-and-answer pairs, reviews each pair for accuracy, and eliminates unacceptable outputs containing HAP (hate, abuse, and profanity) as well as any redundancy. By default, this uses the mistral-7b-instruct model on your local device and the full pipeline for synthetic data generation (SDG).
If you want to run the end-to-end workflow with a faster turnaround, you can set the --max-num-tokens flag to 512 instead of the default 4096. This flag controls the number of tokens generated with each SDG run; reducing the value means less generated data.
Use the --detached argument to run the data generation in the background, because it might take a long time. The command starts the SDG in a separate process and generates a log file with the outcome of the process.
ilab data generate --max-num-tokens 512 --detached
Check the status of the running process, and copy the location of the log file from the output.
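For example, using the ilab process list command:
ilab process list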
This might take a while. It took around an hour with the full pipeline to generate the synthetic data on a Mac M1 Max with 32 GB RAM, as shown in the output of the ilab process list command.
If you aren't satisfied with the generated instructions, try adjusting your qna.yaml file. Adding more examples may help. The generated synthetic data on a Mac is stored in ~/.local/share/instructlab/datasets, as mentioned previously.
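To inspect the generated files directly, you can list that directory:
ls ~/.local/share/instructlab/datasets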
Step 7. Train the model
Now, you train the model with the newly generated synthetic data. In this step, the student model (granite-7b-lab) is trained on the newly vetted synthetic data.
Training runs for multiple iterations and, by default, 10 epochs on the local device.
An iteration is one training pass over a single batch from the training data set; once the entire data set has been traversed, that is called an epoch. For example, with 1,000 training samples and a batch size of 100, one epoch consists of 10 iterations.
There are three different model fidelity pipelines for training:
simple is used for rapid prototyping on a laptop and produces low-fidelity models.
full also works on a laptop but requires a high-spec laptop with at least 32 GB of RAM, takes more time, and produces medium-fidelity models.
accelerated uses GPU acceleration and produces high-fidelity models.
In short, it's a trade-off between time, quality, and the hardware resources available.
It took around 10 minutes on my Mac M1 Max with 32 GB RAM using the simple model fidelity pipeline. Keep in mind that because you are using a quantized teacher model and the simple pipeline, the outcome of the training will not be of high quality.
Important: For more accurate results, it is strongly advisable not to use the simple pipeline in production. Use it only for quick prototyping to see the full end-to-end flow.
ilab model train --model-path instructlab/granite-7b-lab --pipeline simple
Step 8. Quantize the trained model into GGUF format
To run the model on your Mac device, you need to convert the newly trained model into the quantized GGUF format. GGUFs are quantized models, which means they are not as precise as a full-fledged model. However, using GGUFs allows us to run LLMs on personal computers.
--adapter-file: The location of the LoRA adapter to fuse. This is the result of the previous training step; you can find it at ~/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-100.npz.
--model-dir: The location where the newly trained model is stored. This is the root directory of the above folder, including all the details about the newly trained model: ~/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q. Make sure there is no trailing / at the end.
This task is usually quick and takes a couple of minutes.
ilab model convert --adapter-file ~/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q/adapters-100.npz --model-dir ~/.local/share/instructlab/checkpoints/instructlab-granite-7b-lab-mlx-q
Step 9. Serve the newly trained model and chat with it
In this section, you serve the newly trained model and chat with it.
Chat with the newly trained model. The previous conversion step created the GGUF model in the current working directory.
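Before chatting, you can optionally confirm that the GGUF file exists; the directory name below matches the path used in the chat command that follows:
ls instructlab-granite-7b-lab-trained/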
ilab model chat --model instructlab-granite-7b-lab-trained/instructlab-granite-7b-lab-Q4_K_M.gguf
Notice that the model name served above is mentioned in the output.
Now, the moment of truth. Ask the new model a question about the newly contributed knowledge. In the same terminal window, type an inquiry to the model, Who is Hikmat Abu Zayd?, as shown below.
Notice that the newly trained model now identifies Hikmat Abu Zayd as a female minister, which is much more accurate than the untrained model. You might also notice that the quality of the output is not 100% accurate, which is expected when running on a laptop. Here are some of the reasons for this imprecision:
On your laptop, the data generation pipeline uses a quantized teacher model, which affects the quality of the synthetic data. Higher-quality synthetic data could be produced by a full-fledged teacher model, which demands GPUs that are not available on a consumer laptop.
The student model is also a quantized GGUF model, and GGUFs are not as precise.
The training is performed with a limited number of epochs and a limited set of training data.
As described previously, there are three different model fidelity pipelines for training (simple, full, and accelerated), and the choice is a trade-off between time, quality, and the hardware resources available.
This will change when the contributions are accepted and the actual training takes place, or if you run the entire process using GPUs.
Step 10. Optionally, contribute the knowledge to InstructLab
In this final section, you learn how to contribute the knowledge to InstructLab to improve large language models. This is an optional step, so you can just read through it to understand the process and follow it later when you have actual knowledge that you would like to contribute to InstructLab.
Navigate to the taxonomy directory.
cd taxonomy
Commit the change to your forked GitHub repository. Make sure to sign off the commit using the -s flag.
git add -A
git commit -m "First female minister in Egypt knowledge" -s
git push
Submit the pull request.
Navigate to your forked taxonomy GitHub repository.
Make sure that you have already pulled in all upstream changes through Sync fork. Your fork should show no commits behind the upstream repository before going to the next step.
Click Contribute to open the pull request, and then click Open pull request.
Input all the required details as per the form template while opening the pull request, including:
Describe the contribution to the taxonomy: A concise description of what the contribution brings.
Input given at the prompt: An example question.
Response from the original model: The response from the model before training.
Response from the fine-tuned model: The response from the model after training.
Review and mark the checklist items. If you have followed this article step-by-step, you should be ready to pass all the checklist requirements.
In this tutorial, you learned how to set up your InstructLab environment on a Mac M-series computer, add new knowledge by forking the InstructLab taxonomy repo to include that additional knowledge, generate synthetic data and train the model, quantize the trained model into GGUF format, and finally serve the newly trained model and chat with it. And, you learned how you can contribute your knowledge to the InstructLab taxonomy as a pull request.
Now that you've seen the power of InstructLab, check out Red Hat Enterprise Linux AI, which brings together the open source Granite family of LLMs, the InstructLab model alignment tools, a bootable image of Red Hat Enterprise Linux including popular AI libraries such as PyTorch, and enterprise-grade technical support and open source assurance legal protections. Then, you can check out this tutorial for fine-tuning IBM Granite language models for enterprise applications using RHEL AI.
And, you can scale your AI workflows with Red Hat OpenShift AI and begin using IBM watsonx.ai, which provides additional capabilities for enterprise AI development, data management, and model governance.
The author deeply appreciates the support of Red Hat engineers Jaideep Rao and Charlie Doern in troubleshooting and resolving some issues faced during the model training process. Additionally, the author would like to acknowledge Sumabala Nair, Lisa Waugh, and Keely Wright for their guidance and expertise in reviewing and contributing to this tutorial.