Build a local AI co-pilot using IBM Granite 4, Ollama, and Continue
In this tutorial, I will show how to use a collection of open source components to run a feature-rich development agent in Visual Studio Code while meeting the data privacy, licensing, and cost challenges common to enterprise users. The setup is powered by local large language models (LLMs) from IBM's latest open source LLM family, Granite 4. All components run on a developer's workstation and have business-friendly licensing. For the quick version, just jump to the TL;DR end-to-end setup script.
The developer world is quickly becoming the best place for AI practitioners to drink our own champagne, with generative AI promising to accelerate our own work. There are numerous excellent AI assistant tools on the market (Claude Code, Codex, Cursor, Windsurf, GitHub Copilot, Tabnine, Sourcegraph Amp, and Project Bob, to name a few). These tools offer in-editor chatbots, code completion, code explanation, test generation, auto-documentation, CLI development agents, and a host of other developer-centric features. Unfortunately, for many of us, these tools sit out of reach behind corporate data privacy policies (yes, we can access Project Bob here at IBM, but the rest are not available).
There are three main barriers to adopting these tools in an enterprise setting:
Data Privacy: Many corporations have privacy regulations that prohibit sending internal code or data to third party services.
Generated Material Licensing: Many models, even those with permissive usage licenses, do not disclose their training data and therefore may produce output that is derived from training material with licensing restrictions.
Cost: Many of these tools are paid solutions which require investment by the organization. For larger organizations, this would often include paid support and maintenance contracts which can be extremely costly and slow to negotiate.
Step 1. Install Ollama
The first problem to solve is avoiding the need to send code to a remote service. One of the most widely used tools in the AI world right now is Ollama, which wraps the underlying model serving project llama.cpp. The ollama CLI makes it seamless to run LLMs on a developer's workstation, exposing an OpenAI-compatible API with the /completions and /chat/completions endpoints. Users can take advantage of available GPU resources and offload to CPU where needed. My workstation is a MacBook Pro with an Apple M3 Max and 64GB of shared memory, which means I have roughly 45GB of usable VRAM to run models with! Users with less powerful hardware can still use ollama with smaller models and/or models with higher levels of quantization.
On a Mac workstation, the simplest way to install ollama is via their webpage: https://ollama.com/download. This will install a menu-bar app to run the ollama server in the background and keep you up-to-date with the latest releases. There is also a convenient chat GUI that can be used to chat directly with models.
Step 2. Fetch the Granite 4 models
The second problem to solve is choosing a model that gives high-quality output and was trained on enterprise-safe data. There are numerous good code models available in the ollama library and on Hugging Face. The IBM Granite 4 models achieved ISO 42001 certification, which establishes trustworthiness and transparency benchmarks for enterprise AI systems, including training data licensing. Since generated material licensing is one of the primary concerns I've already identified, and since I work for IBM, I chose this family of models for my own use.
Granite comes in a range of sizes and architectures to fit your workstation's available resources. Generally, the bigger models perform best, but require more resources and will be slower. I chose the tiny-h option as my starting point for chat and the 350m-h option for autocomplete. Ollama offers a convenient pull feature to download models:
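A minimal sketch of the pull commands; the granite4:tiny-h and granite4:350m-h tags are assumed to match the sizes described above, so check the ollama library for the exact tags available:
ollama pull granite4:tiny-h
ollama pull granite4:350m-h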
In addition to the language models for chat and code generation, you will need a strong embedding model to enable the Retrieval Augmented Generation (RAG) capabilities of Continue. The Granite family also contains strong, lightweight embedding models. I chose granite-embedding:30m since my code is entirely in English and the 30m model performs well at a fraction of the weights of other leading models. You can pull it too!
ollama pull granite-embedding:30m
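With ollama running and the models pulled, you can sanity-check the server's OpenAI-compatible endpoint directly. This assumes ollama's default port (11434) and the granite4:tiny-h tag pulled above:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "granite4:tiny-h", "messages": [{"role": "user", "content": "Say hello"}]}'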
Step 3. Set up Continue
With the Granite models available and ollama running, it's time to start using them in your editor. The first step is to get Continue installed into Visual Studio Code. This can be done with a quick command line call:
code --install-extension continue.continue
Alternatively, you can install Continue using the Extensions tab in VS Code:
Open the Extensions tab.
Search for "continue."
Click the Install button.
Next, you need to configure Continue to use your Granite models with Ollama. By default, Continue comes with a local config that you can edit through the side panel with the following steps:
Click Configure Models.
Click Configure next to your Chat model.
This will open the config.yaml file in your editor. You can also simply edit it directly (by default it lives in $HOME/.continue/config.yaml).
Configure your Chat model.
To enable your ollama Granite models, you'll need to edit the models section. Each model requires the following fields:
name: The name you'll see in your UI when selecting models
provider: The provider that will run the model (ollama in our case)
model: The name that the provider uses for the model
roles: The roles that the model can serve within Continue
Here's what the entry for Granite 4 Tiny looks like:
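A minimal sketch of the models entry, assuming the granite4:tiny-h tag and an illustrative context length (adjust both for the model and hardware you actually use):
models:
  - name: Granite 4 Tiny
    provider: ollama
    model: granite4:tiny-h
    roles:
      - chat
      - edit
      - apply
    capabilities:
      - tool_use
    defaultCompletionOptions:
      contextLength: 131072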
The tool_use capability enables the model to be used in Agent Mode.
The contextLength field determines how much context the model can handle at once. The Granite 4 hybrid models (denoted with -h) are extremely efficient, so you can run with long context even on lighter weight workstations!
Configure your Autocomplete model.
Just like chat, the model for autocomplete is configured under the models section:
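A sketch of the additional entries, assuming the granite4:350m-h and granite-embedding:30m tags; the embedding model is registered with the embed role so Continue can use it for indexing and RAG:
  - name: Granite 4 350M
    provider: ollama
    model: granite4:350m-h
    roles:
      - autocomplete
  - name: Granite Embedding 30M
    provider: ollama
    model: granite-embedding:30m
    roles:
      - embed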
If you have multiple models configured for a given role, you can now select which to use from the models configuration panel.
Step 4. Try your agent
With Continue installed and Granite running, you should be ready to try out your new local AI development agent. Click the new Continue icon in your sidebar.
Make sure you select Agent mode from the mode selector.
Continue Hub
Continue also offers a hub of shareable configuration blocks, such as models, rules, and prompts. To take advantage of the hub, you first need to create a free account with Continue. You can then log in through the config dropdown:
Once logged in, you can follow the docs to start using different blocks from the hub!
Continue CLI
Continue also comes with a great CLI (command line interface) for interacting with your development agent directly from the command line. It uses all the same configuration as the editor extension and is simple to install:
npm i -g @continuedev/cli
Once installed, it can be launched as an interactive terminal UI (TUI) or run in headless mode:
# TUI mode
cn
# Headless mode
cn -p "Review the last 5 commits for issues"
More models
Another nice feature of Continue is the ability to easily toggle between different models in the chat panel. You can configure this using the models section of the core config.yaml. For me, this was useful for experimenting with the differences between the various sizes in the Granite family and other popular open models such as gpt-oss:20b.
To set this up, you simply add additional entries to the models list:
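For example, adding gpt-oss:20b alongside the Granite entries looks like this (a sketch, assuming you have already pulled the gpt-oss:20b tag with ollama):
  - name: GPT-OSS 20B
    provider: ollama
    model: gpt-oss:20b
    roles:
      - chat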
There are other models in the ollama library that may also be worth trying out. Keep in mind that many models on Ollama do not carry a standard OSS license, so check each model's license before relying on its output.
Import local models
While the ollama library is a great tool to manage your models, many of us also have numerous model files already downloaded on our machines that we don't want to duplicate. The ollama Modelfile is a powerful tool that can be used to create customized model setups by deriving from known models and customizing the inference parameters, including the ability to add (Q)LoRA adapters (see the docs for more details).
For our purpose, we only need the simple FROM statement, which can point to a known model in the ollama library or a local file on disk. This makes it really easy to wrap the process into an import-to-ollama bash script:
#!/usr/bin/env bash
file_path=""
model_name=""
model_label="local"

# Parse the command line arguments
while [[ $# -gt 0 ]]
do
  key="$1"
  case $key in
    -f|--file)
      file_path="$2"
      shift
      ;;
    -m|--model-name)
      model_name="$2"
      shift
      ;;
    -l|--model-label)
      model_label="$2"
      shift
      ;;
    *)
      echo "Unknown option: $key"
      exit 1
      ;;
  esac
  shift
done

if [ "$file_path" == "" ]
then
  echo "Missing required argument -f|--file"
  exit 1
fi
file_path="$(realpath $file_path)"

# Check if model_name is empty and assign file name as model_name if true
if [ "$model_name" == "" ]
then
  model_name=$(basename $file_path)
  model_name="${model_name%.*}"
fi

# Append the model label to the model name
model_name="$model_name:$model_label"
echo "model_name: $model_name"

# Create a temporary directory for working
tempdir=$(mktemp -d)
echo "Working Dir: $tempdir"

# Write the file path to Modelfile in the temporary directory
echo "FROM $file_path" > $tempdir/Modelfile

# Import the model using the ollama create command
echo "importing model $model_name"
ollama create $model_name -f $tempdir/Modelfile
Local LLM Web UI
There are numerous additional AI applications, use cases, and patterns that can be adapted to work with local LLMs. Exploring LLMs locally can be greatly accelerated with a local web UI. The Open WebUI project works seamlessly with ollama to provide a web-based LLM workspace for experimenting with prompt engineering, retrieval augmented generation (RAG), and tool use.
To set up Open WebUI, follow the steps in their documentation. The simplest options are:
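A minimal sketch of the pip-based route from the Open WebUI docs (the Docker route described there works as well; a recent Python is assumed):
pip install open-webui
open-webui serve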
Once running, you can open the UI at http://localhost:8080.
open http://localhost:8080
The first time you log in, you'll need to set up an "account." Since this is entirely local, you can fill in garbage values (foo@bar.com/asdf) and be off to the races!
TL;DR
For the impatient, here's the end-to-end setup script:
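A consolidated sketch of the steps above; the Homebrew cask and the granite4 tags are assumptions, so adjust them for your platform and the model sizes you want:
#!/usr/bin/env bash
# Install ollama (or download the app from https://ollama.com/download)
brew install --cask ollama

# Pull the Granite 4 models for chat, autocomplete, and embeddings
ollama pull granite4:tiny-h
ollama pull granite4:350m-h
ollama pull granite-embedding:30m

# Install the Continue extension in VS Code
code --install-extension continue.continue

# Optional: install the Continue CLI for terminal use
npm i -g @continuedev/cli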
I've demonstrated how to solve the problems of cost, licensing, and data privacy when adopting AI co-pilot tools in an enterprise setting using IBM's Granite models, Ollama, Visual Studio Code, and Continue. With this setup, developers can harness the capabilities of AI-driven code completion, refactoring, and analysis while keeping their code on their own workstation, ensuring the integrity and security of their codebase.
The Granite models are all available in watsonx.ai.
Build an AI strategy for your business on one collaborative AI and data platform called IBM watsonx, which brings together new generative AI capabilities, powered by foundation models, and traditional machine learning into a powerful platform spanning the AI lifecycle. With watsonx.ai, you can train, validate, tune and deploy models with ease and build AI applications in a fraction of the time with a fraction of the data. These models are accessible to all as many no-code and low-code options are available for beginners.
Try watsonx.ai, the next-generation studio for AI builders.