Build a local AI co-pilot using IBM Granite 4, Ollama, and Continue
In this tutorial, I will show how to use a collection of open source components to run a feature-rich development agent in Visual Studio Code while meeting the data privacy, licensing, and cost challenges common to enterprise users. The setup is powered by local large language models (LLMs) from IBM's latest open source LLM family, Granite 4. All components run on a developer's workstation and have business-friendly licensing. For the quick version, just jump to the TL;DR end-to-end setup script.
The developer world is quickly becoming the best place for AI practitioners to drink our own champagne, with generative AI promising to accelerate our own work. There are numerous excellent AI assistant tools on the market (Claude Code, Codex, Cursor, Windsurf, GitHub Copilot, Tabnine, Sourcegraph Amp, and Project Bob, to name a few). These tools offer in-editor chatbots, code completion, code explanation, test generation, auto-documentation, CLI development agents, and a host of other developer-centric features. Unfortunately, for many of us, these tools sit out of reach behind corporate data privacy policies (yes, we can access Project Bob here at IBM, but the rest are not available).
There are three main barriers to adopting these tools in an enterprise setting:
Data Privacy: Many corporations have privacy regulations that prohibit sending internal code or data to third party services.
Generated Material Licensing: Many models, even those with permissive usage licenses, do not disclose their training data and therefore may produce output that is derived from training material with licensing restrictions.
Cost: Many of these tools are paid solutions which require investment by the organization. For larger organizations, this would often include paid support and maintenance contracts which can be extremely costly and slow to negotiate.
Step 1. Install Ollama
The first problem to solve is avoiding the need to send code to a remote service. One of the most widely used tools in the AI world right now is Ollama, which wraps the underlying model serving project llama.cpp. The ollama CLI makes it seamless to run LLMs on a developer's workstation, exposing an OpenAI-compatible API with the /completions and /chat/completions endpoints. Users can take advantage of available GPU resources and offload to CPU where needed. My workstation is a MacBook Pro with an Apple M3 Max and 64GB of shared memory, which means I have roughly 45GB of usable VRAM to run models with! Users with less powerful hardware can still use ollama with smaller models and/or models with higher levels of quantization.
On a Mac workstation, the simplest way to install ollama is via their webpage: https://ollama.com/download. This will install a menu-bar app to run the ollama server in the background and keep you up-to-date with the latest releases. There is also a convenient chat GUI that can be used to chat directly with models.
Step 2. Fetch the Granite 4 models
The second problem to solve is choosing a model that gives high-quality output and was trained on enterprise-safe data. There are numerous good code models available in the ollama library and on Hugging Face. The IBM Granite 4 models achieved ISO 42001 certification, which establishes trustworthiness and transparency benchmarks for enterprise AI systems, including training data licensing. Since generated material licensing is one of the primary concerns I've already identified, and since I work for IBM, I chose this family of models for my own use.
Granite comes in a range of sizes and architectures to fit your workstation's available resources. Generally, the bigger models perform best, but require more resources and will be slower. I chose the tiny-h option as my starting point for chat and the 350m-h option for autocomplete. Ollama offers a convenient pull feature to download models:
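A minimal sketch of the pull commands; the granite4:tiny-h and granite4:350m-h tags are assumed to match the sizes described above, so check the ollama library for the exact tags available:
ollama pull granite4:tiny-h
ollama pull granite4:350m-h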
In addition to the language models for chat and code generation, you will need a strong embedding model to enable the Retrieval Augmented Generation (RAG) capabilities of Continue. The Granite family also contains strong, lightweight embedding models. I chose granite-embedding:30m since my code is entirely in English and the 30m model performs well at a fraction of the weights of other leading models. You can pull it too!
ollama pull granite-embedding:30m
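With ollama running and the models pulled, you can sanity-check the server's OpenAI-compatible endpoint directly. This assumes ollama's default port (11434) and the granite4:tiny-h tag pulled above:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "granite4:tiny-h", "messages": [{"role": "user", "content": "Say hello"}]}'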
Step 3. Set up Continue
With the Granite models available and ollama running, it's time to start using them in your editor. The first step is to get Continue installed into Visual Studio Code. This can be done with a quick command line call:
code --install-extension continue.continue
Alternatively, you can install Continue using the Extensions tab in VS Code:
Open the Extensions tab.
Search for "continue."
Click the Install button.
Next, you need to configure Continue to use your Granite models with Ollama. By default, Continue comes with a local config that you can edit through the side panel with the following steps:
Click Configure Models.
Click Configure next to your Chat model.
This will open the config.yaml file in your editor. You can also simply edit it directly (by default it lives in $HOME/.continue/config.yaml).
Configure your Chat model.
To enable your ollama Granite models, you'll need to edit the models section. Each model requires the following fields:
name: The name you'll see in your UI when selecting models
provider: The provider that will run the model (ollama in our case)
model: The name that the provider uses for the model
roles: The roles that the model can serve within Continue
Here's what the entry for Granite 4 Tiny looks like:
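A minimal sketch of the models entry, assuming the granite4:tiny-h tag and an illustrative context length (adjust both for the model and hardware you actually use):
models:
  - name: Granite 4 Tiny
    provider: ollama
    model: granite4:tiny-h
    roles:
      - chat
      - edit
      - apply
    capabilities:
      - tool_use
    defaultCompletionOptions:
      contextLength: 131072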
The tool_use capability enables the model to be used in Agent Mode.
The contextLength field determines how much context the model can handle at once. The Granite 4 hybrid models (denoted with -h) are extremely efficient, so you can run with long context even on lighter weight workstations!
Configure your Autocomplete model.
Just like chat, the model for autocomplete is configured under the models section:
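A sketch of the additional entries, assuming the granite4:350m-h and granite-embedding:30m tags; the embedding model is registered with the embed role so Continue can use it for indexing and RAG:
  - name: Granite 4 350M
    provider: ollama
    model: granite4:350m-h
    roles:
      - autocomplete
  - name: Granite Embedding 30M
    provider: ollama
    model: granite-embedding:30m
    roles:
      - embed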
If you have multiple models configured for a given role, you can now select which to use from the models configuration panel.
Step 4. Try your agent
With Continue installed and Granite running, you should be ready to try out your new local AI development agent. Click the new Continue icon in your sidebar.
Make sure you select Agent mode from the mode selector.
Continue Hub
Continue also offers a hub of shareable configuration blocks, such as models, rules, and prompts. To take advantage of the hub, you first need to create a free account with Continue. You can then log in through the config dropdown:
Once logged in, you can follow the docs to start using different blocks from the hub!
Continue CLI
Continue also comes with a great CLI (command line interface) for interacting with your development agent directly from the command line. It uses all the same configuration as the editor extension and is simple to install:
npm i -g @continuedev/cli
Once installed, it can be launched as an interactive terminal UI (TUI) or run in headless mode:
# TUI mode
cn
# Headless mode
cn -p "Review the last 5 commits for issues"
More models
Another nice feature of Continue is the ability to easily toggle between different models in the chat panel. You can configure this using the models section of the core config.yaml. For me, this was useful for experimenting with the differences between the various sizes in the Granite family and other popular open models such as gpt-oss:20b.
To set this up, you simply add additional entries to the models list:
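For example, adding gpt-oss:20b alongside the Granite entries looks like this (a sketch, assuming you have already pulled the gpt-oss:20b tag with ollama):
  - name: GPT-OSS 20B
    provider: ollama
    model: gpt-oss:20b
    roles:
      - chat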
There are other models in the ollama library that may also be worth trying out. Keep in mind that many models on Ollama do not carry a standard OSS license, so check each model's license before relying on its output.
Import local models
While the ollama library is a great tool to manage your models, many of us also have numerous model files already downloaded on our machines that we don't want to duplicate. The ollama Modelfile is a powerful tool that can be used to create customized model setups by deriving from known models and customizing the inference parameters, including the ability to add (Q)LoRA adapters (see the docs for more details).
For our purpose, we only need the simple FROM statement, which can point to a known model in the ollama library or a local file on disk. This makes it really easy to wrap the process into an import-to-ollama bash script:
#!/usr/bin/env bash
file_path=""
model_name=""
model_label="local"

# Parse the command line arguments
while [[ $# -gt 0 ]]
do
  key="$1"
  case $key in
    -f|--file)
      file_path="$2"
      shift
      ;;
    -m|--model-name)
      model_name="$2"
      shift
      ;;
    -l|--model-label)
      model_label="$2"
      shift
      ;;
    *)
      echo "Unknown option: $key"
      exit 1
      ;;
  esac
  shift
done

if [ "$file_path" == "" ]
then
  echo "Missing required argument -f|--file"
  exit 1
fi
file_path="$(realpath $file_path)"

# Check if model_name is empty and assign file name as model_name if true
if [ "$model_name" == "" ]
then
  model_name=$(basename $file_path)
  model_name="${model_name%.*}"
fi

# Append the model label to the model name
model_name="$model_name:$model_label"
echo "model_name: $model_name"

# Create a temporary directory for working
tempdir=$(mktemp -d)
echo "Working Dir: $tempdir"

# Write the file path to Modelfile in the temporary directory
echo "FROM $file_path" > $tempdir/Modelfile

# Import the model using the ollama create command
echo "importing model $model_name"
ollama create $model_name -f $tempdir/Modelfile
Local LLM Web UI
There are numerous additional AI applications, use cases, and patterns that can be adapted to work with local LLMs. Exploring LLMs locally can be greatly accelerated with a local web UI. The Open WebUI project works seamlessly with ollama to provide a web-based LLM workspace for experimenting with prompt engineering, retrieval augmented generation (RAG), and tool use.
To set up Open WebUI, follow the steps in their documentation. The simplest options are:
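A minimal sketch of the pip-based route from the Open WebUI docs (the Docker route described there works as well; a recent Python is assumed):
pip install open-webui
open-webui serve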
Once running, you can open the UI at http://localhost:8080.
open http://localhost:8080
The first time you log in, you'll need to set up an "account." Since this is entirely local, you can fill in garbage values (foo@bar.com/asdf) and be off to the races!
TL;DR
For the impatient, here's the end-to-end setup script:
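A consolidated sketch of the steps above; the Homebrew cask and the granite4 tags are assumptions, so adjust them for your platform and the model sizes you want:
#!/usr/bin/env bash
# Install ollama (or download the app from https://ollama.com/download)
brew install --cask ollama

# Pull the Granite 4 models for chat, autocomplete, and embeddings
ollama pull granite4:tiny-h
ollama pull granite4:350m-h
ollama pull granite-embedding:30m

# Install the Continue extension in VS Code
code --install-extension continue.continue

# Optional: install the Continue CLI for terminal use
npm i -g @continuedev/cli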
I've demonstrated how to solve the problems of cost, licensing, and data privacy when adopting AI co-pilot tools in an enterprise setting using IBM's Granite models, Ollama, Visual Studio Code, and Continue. With this setup, developers can harness the capabilities of AI-driven code completion, refactoring, and analysis while keeping their code on their own workstation, ensuring the integrity and security of their codebase.
The Granite models are all available in watsonx.ai.
Build an AI strategy for your business on one collaborative AI and data platform called IBM watsonx, which brings together new generative AI capabilities, powered by foundation models, and traditional machine learning into a powerful platform spanning the AI lifecycle. With watsonx.ai, you can train, validate, tune and deploy models with ease and build AI applications in a fraction of the time with a fraction of the data. These models are accessible to all as many no-code and low-code options are available for beginners.
Try watsonx.ai, the next-generation studio for AI builders.