Train open source LLMs with collected skills with InstructLab - IBM Developer
IMPORTANT: The steps in this tutorial should run with InstructLab 0.17.1, the most recent stable release at the time of writing. Some commands might have changed since this tutorial was published. If you run into issues, explore the InstructLab README file for help.
InstructLab is a model-agnostic open source AI project that facilitates contributions to large language models (LLMs). It is a new community-based approach to build truly open-source LLMs. To learn more about InstructLab, read this article, "What is InstructLab, and why do developers need it."
InstructLab uses a synthetic-data-based alignment tuning method to train LLMs. The InstructLab tuning method is driven by manually created taxonomies. InstructLab provides a process for optimizing and tuning LLMs by collecting knowledge and skills as part of a taxonomy tree.
In this tutorial, you learn how to train an open source LLM to generate test cases for a specific Python code snippet by creating a grounded compositional skill and training the model with it in InstructLab. The skill that you create in this tutorial isn't currently accepted as a contribution to the InstructLab taxonomy tree, but you can still try it out in your local environment to understand how to train the model with a skill.
Prerequisites
Currently, InstructLab CLI (ilab) supports only Apple Mac M1/M2/M3 processors or Linux systems.
Additionally, you need the following installed on your system:
C++ compiler
Python 3.11.9
Xcode (for macOS systems)
Finally, approximately 60 GB of disk space is required for the entire process.
To install the InstructLab CLI (ilab), in a terminal, run the following commands:
Set up a Python virtual environment:
python -m venv --upgrade-deps venv
Activate the virtual environment:
source venv/bin/activate
Install the InstructLab CLI:
pip install instructlab==0.17.1
You can also refer to the instructions in the InstructLab repo to install the ilab CLI.
Steps
Step 1. Download the base LLM
After you have installed the InstructLab CLI on your system, you can start by downloading the base model that you want to train. You can find the supported open source models on Hugging Face. The default is merlinite-7b-lab-Q4_K_M, the 4-bit quantized version of the Merlinite model, which is the one this tutorial uses.
In the terminal, run the following command to initialize the ilab CLI:
ilab config init --non-interactive
To download the model, run this command:
ilab model download
To download a different base model, run the following command:
ilab model download --repository <huggingface_repo> --filename <filename>.gguf
Once the model is downloaded, you can chat with it.
Step 2. Serve the base model and check the output
Open two terminals. In the first terminal, run the following command to serve the model:
ilab model serve
If you want to serve a different model, run this command:
ilab model serve --model-path <modelpath>.gguf
You should see output similar to the following:
INFO 2024-05-30 17:24:41,256 lab.py:320 Using model 'models/merlinite-7b-lab-Q4_K_M.gguf' with -1 gpu-layers and 4096 max context size.
INFO 2024-05-30 17:24:41,659 server.py:206 Starting server process, press CTRL+C to shutdown server...
INFO 2024-05-30 17:24:41,659 server.py:207 After application startup complete see http://127.0.0.1:8000/docs for API.
Keep the model served in this terminal while you generate synthetic data, train the model, and test it.
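Optionally, you can verify that the server is responding before you start chatting. The ilab model serve command exposes an OpenAI-compatible API (note the /docs URL in the log output above), so a quick smoke test from another terminal might look like the following sketch. It assumes the openai Python package (v1 or later) is installed in your virtual environment, which is not part of this tutorial's setup:

# Minimal smoke test against the local server's OpenAI-compatible API.
# Assumes: `pip install openai` (v1+) and `ilab model serve` running on the
# default port 8000. The api_key value is a placeholder; the local server
# does not check it.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="merlinite-7b-lab-Q4_K_M",  # informational for the local server
    messages=[
        {"role": "user", "content": "generate test case for division of two integers in python"}
    ],
)
print(response.choices[0].message.content)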
Now, in the second terminal, run the following command to chat with the base model and see whether it can handle test case generation.
ilab model chat -gm
An interactive shell is presented where you can chat with the model. Go ahead and ask the model to generate a test case for something. For example, use one of these queries:
"generate test case for division of two integers in python"
"generate test case for multiplication of two integers in python".
Without training, the model's output shows that it is not able to generate proper test cases. Now let's train the model and then check the results.
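For reference, the kind of output we want the tuned model to produce is a small, runnable unit test like the one below. This is a hand-written illustration of the target, not actual model output:

import unittest

def divide(a: int, b: int) -> float:
    """Divide two integers; raises ZeroDivisionError when b is 0."""
    return a / b

class TestDivide(unittest.TestCase):
    def test_positive_numbers(self):
        self.assertEqual(divide(10, 2), 5)

    def test_negative_numbers(self):
        self.assertEqual(divide(-10, 2), -5)

    def test_division_by_zero(self):
        with self.assertRaises(ZeroDivisionError):
            divide(10, 0)

if __name__ == "__main__":
    unittest.main()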
Step 3. Create a skill
To train an open source model with InstructLab, you need to create a skill or knowledge in the taxonomy directory. In this tutorial, you create a skill. When you initialized the ilab CLI, it automatically cloned the InstructLab taxonomy repository, which is the source of truth for your model training.
It is recommended to have five or more examples in the qna.yaml file for a skill. Copy the qna.yaml file from my GitHub repo into the taxonomy/compositional_skills/coding/grounded/testcase-generation/ directory of the cloned taxonomy repo.
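If you would rather write the file yourself, a grounded compositional skill's qna.yaml follows the general shape shown below. The exact fields and schema version are defined by the templates in the taxonomy repo, so check them against the version you cloned; the values here are only illustrative:

version: 2
task_description: Generate Python test cases for a given code snippet.
created_by: your-github-username
seed_examples:
  - context: |
      def add(a, b):
          return a + b
    question: Generate a test case for the add function.
    answer: |
      import unittest

      class TestAdd(unittest.TestCase):
          def test_add(self):
              self.assertEqual(add(2, 3), 5)

      if __name__ == "__main__":
          unittest.main()
  # ...add four or more similar context/question/answer examples

The context field is what makes the skill grounded: both the question and the answer refer back to the supplied code snippet.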
To verify that InstructLab detects the new skill, in a terminal, run the following command:
ilab taxonomy diff
You should see the newly added qna.yaml file in the path.
The relative position of a skill in the tree does not affect training; it only affects how the information is organized for humans.
Step 4. Generate synthetic data
Now that you have the skill ready, you can generate synthetic data for it. By default, InstructLab generates 100 samples, but you can configure how many samples to generate.
In a terminal, run the following command:
ilab data generate
To generate a different number of samples, run the following command:
ilab data generate --num-instructions <int>
You can watch the new synthetic data samples being generated in the output. If you are not satisfied with the generated data set, quit the process by pressing Ctrl+C, modify the examples in the qna.yaml file, and then run the generate command again.
This process takes some time depending on your system configuration; it took approximately 32 minutes on my M1 MacBook Pro. You can see the ETA in the output.
Once the synthetic data is generated, you see a summary of how many samples were generated and how many were discarded. Samples might be discarded by the critic model for formatting issues or filtered out by a ROUGE score threshold (a text-similarity measure used to drop near-duplicate samples).
Example output:
100 instructions generated, 8 discarded due to format (see generated/discarded_merlinite-7b-lab-Q4_K_M_2024-05-24T13_00_21.log), 5 discarded due to rouge score
A generated directory is also created in the ilab directory. In it, you can see four files:
Discarded data set (log file)
Generated data set (json file)
Train data set (jsonl file)
Test data set (jsonl file)
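The train and test data sets are in JSON Lines format, with one JSON object per line. If you want to inspect a few of the generated samples before training, a short script like the following works. The exact file name includes the model name and a timestamp (as in the log message above), so adjust the path to match your run, and treat the field layout as version-dependent:

import json

# Print the first three synthetic training samples. The path below is an
# example; your file name will contain your model name and generation timestamp.
path = "generated/train_merlinite-7b-lab-Q4_K_M_2024-05-24T13_00_21.jsonl"
with open(path) as f:
    for i, line in enumerate(f):
        if i == 3:
            break
        sample = json.loads(line)
        print(json.dumps(sample, indent=2))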
Step 5. Train the model
Once the synthetic data is ready, all you have to do is run the following command in your terminal to train the model:
ilab model train
If you want to use a GPU for training, you can run the following command:
ilab model train --device 'cuda'
This process takes some time depending on your system and the number of iterations; it took approximately 30 minutes on my M1 MacBook Pro to complete 100 iterations. You can see the ETA in the output.
A new directory will be created in the ilab directory, with a name similar to this: instructlab-merlinite-7b-lab. This directory will have the new model weights and adapters.
Step 6. Test the model
InstructLab can also run basic tests to ensure model correctness. In your terminal, run the following command:
ilab model test
The output shows the model's responses before and after training.
If you are training on a macOS computer, you need to quantize the model to run it on your Mac. In the terminal, run the following command:
ilab model convert
After running the command, all the weights and adapters are converted to a quantized GGUF model. A directory is created in the ilab directory, with a name similar to this: instructlab-merlinite-7b-lab-trained.
Step 7. Serve and chat with the model
Go back to the first terminal where you served the base model and press Ctrl+C to stop serving it. Then run the following command to serve the newly trained model:
ilab model serve --model-path instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf
In the second terminal, where you generated the synthetic data set, trained the model, and tested it, run the following command to chat with the model:
ilab model chat -gm -m <model_filepath/model_name>.gguf
Here, the model file path and name will be something like this: instructlab-merlinite-7b-lab-trained/instructlab-merlinite-7b-lab-Q4_K_M.gguf.
You can ask the model to generate test cases now and observe the results.
For example, use one of these queries:
"generate test case for division of two integers in python"
"generate test case for multiplication of two integers in python".
After tuning, the model's output for the queries "generate test case for division of two integers in python" and "generate test case for multiplication of two integers in python" includes working test case code along with an explanation.
The newly trained model performs much better than the default model. It also generates an explanation for the code that it writes; if you don't want explanations, modify the examples in qna.yaml (for example, make the answer fields contain only code) and train the model again.
Summary and next steps
In this tutorial, you learned how to create a grounded compositional skill for generating test cases in Python. After you set up the InstructLab CLI, you downloaded the base model and trained it using the qna.yaml file. Then, you tested your fine-tuned model and chatted with it.
To get started, join the InstructLab community on GitHub, and create other compositional skills and contribute them to the InstructLab taxonomy tree by raising a pull request. You can also explore IBM foundation models in IBM watsonx.ai studio that are designed to support knowledge and skills contributed by the open source community.