Tutorial
Build a RAG-powered Markdown documentation assistant
Turn your Markdown files into a dynamic, interactive, and conversational resource
Imagine you’re heads-down in a project, searching a GitHub repository’s Markdown files for that one small unit test command or an elusive detail about configuring an API. You’re flipping between READMEs, wikis, and scattered “docs” folders, losing time and patience. What if you could just ask your documentation "How do I run a single unit test from the suite?" or "What’s the retry policy for the endpoint?" and get a precise, context-aware answer in seconds? This is where Retrieval-Augmented Generation (RAG) can help make your documentation conversational.
In this tutorial, we’ll build an intelligent documentation assistant that lets you chat with your project’s Markdown documentation (those .md files you see in GitHub, like READMEs). Using JavaScript, LangChain, and the IBM Granite model via Ollama, we’ll create a command-line interface (CLI) that connects to your GitHub repository, pulls in your documentation, and answers your questions in plain language. It’s like having a super-seasoned teammate who knows every word of your project’s docs: a pair-programming buddy for your day-to-day workflow.
Why Markdown? Markdown is the lingua franca of developer documentation: lightweight, readable, packed with critical information, and the go-to format for project docs on GitHub. Our assistant makes those files interactive, saving you time and frustration. It’s the perfect starting point for a RAG-powered assistant.
Here’s what we’re building
We’re creating a command-line assistant that lets you chat with any Markdown file, instantly turning your documentation into an interactive, AI-powered resource.
Ask Questions, Get Answers: Provide a public URL to a markdown file (like a README or guide from GitHub). The assistant downloads and processes it, so you can ask questions about its content in natural language.
AI-Powered, Contextual Responses: When you ask a question, the assistant searches the document for the most relevant sections and uses a local large language model (IBM Granite 3.3 via Ollama) to generate accurate, context-aware answers.
No Complex Setup: There’s no need to clone repositories, manage tokens, or set up databases. Just paste a markdown file URL and start chatting.
Proof of Concept: This demo focuses on a single markdown file to showcase the Retrieval-Augmented Generation (RAG) workflow. The design is simple, but the approach can be extended to entire documentation sites, web chat interfaces, or large-scale document search.
Prerequisites
Before we start coding, you’ll need the following tools to follow along step by step. This tutorial assumes you’re comfortable with JavaScript and Node.js, but we’ll explain each step clearly for newcomers to RAG or LangChain.
- Node.js (v18 or later). Node.js runs the JavaScript code for your assistant. LangChain.js officially supports Node.js 18.x, 19.x, and 20.x; for best compatibility and performance, use Node.js 20 or newer.
- Ollama. Ollama runs large language models (LLMs) like IBM’s Granite 3.3 directly on your machine for privacy and offline use.
- A GitHub repository. Have a public repo with Markdown files (for example, a docs folder with .md files) ready for testing.
Setting up your environment
After you have installed Node.js and Ollama, follow these steps to set up your environment:
Use Ollama to pull the Granite 3.3 model:
ollama pull granite3.3:2b

Start Ollama (if it’s not already running):

ollama serve

No GPU is required for small demos, but a modern CPU/GPU will improve performance.
Instead of cloning your whole repo, you'll provide a direct link to a markdown file, such as a README or guide from GitHub. Click "Raw" to get the direct URL (for example,
https://raw.githubusercontent.com/user/repo/main/docs/guide.md). Copy this link for use in the CLI.
Setting up your project
Clone this demo repository:
git clone https://github.com/adigidh/markdown-rag-tutorial-demo.git

Install dependencies. These packages cover CLI prompts, HTTP requests, text chunking, embeddings, and LLM interfacing with Ollama, enabling RAG, markdown parsing, and the chat interface:

npm install

Save the code from the artifact and make it executable:
chmod +x index.js
You’re now ready to build the assistant. Let’s explore how it works before diving into the code.
How It Works: The RAG Pipeline Explained
Our AI assistant uses a technique called Retrieval-Augmented Generation (RAG) to make your documentation conversational. Think of it as a super-smart search engine combined with a chatbot. The RAG pipeline turns your Markdown files into a dynamic, interactive, and conversational resource.
Here’s a high-level overview of how the process works in this simplified demo:
Input collection: The CLI prompts you for a direct URL to a markdown file (for example, a raw .md file from GitHub or any public web link). This makes it easy to get started: no need for GitHub tokens or cloning entire repositories.
Markdown downloading: The assistant fetches the markdown file from the provided URL.
This ensures it always works with the latest version of your documentation, with no manual downloads required.
Markdown processing and chunking: The assistant splits the markdown content into smaller, manageable "chunks" using a smart text splitter.
Chunking allows the AI to focus on the most relevant pieces of information and ensures each chunk fits within the AI’s processing limits.
Embedding and indexing: Each chunk is transformed into a mathematical "embedding" using the Granite 3.3 model (running locally via Ollama).
Because this is a POC demo, these embeddings are stored in an in-memory vector database, enabling fast and accurate similarity search.
RAG pipeline: Retrieval and generation. When you ask a question in the CLI, the assistant turns your question into an embedding.
- It searches the vector database for the most relevant chunks of your markdown.
- These chunks are combined with your question and sent to the Granite 3.3 model, which generates a clear, context-aware answer.
- This ensures every answer is grounded in your actual documentation, not just the AI’s general knowledge.
Conversational chat interface: You interact with the assistant through a simple command-line chat. Type your questions and get instant, helpful answers; type exit at any time to quit the session.
This makes your documentation interactive and user-friendly, even for non-technical users.
LangChain handles the heavy lifting of document processing and retrieval, while Ollama powers the AI components locally.
Think of it as a conversation with a librarian who not only finds the right books (your Markdown files) but also summarizes the exact pages you need.
Why Local? Running Ollama locally ensures your documentation stays private and works offline—perfect for sensitive projects or air-gapped environments.
Step-by-step implementation: Building the AI assistant
Let’s walk through the code to build the assistant, focusing on the key components. The full code is in index.js, but we’ll break it down into manageable steps with explanations. Follow along to see how each piece fits together.
Step 1. User-friendly start: Asking for a Markdown file
When you launch the assistant, it greets you with a simple question:
"Enter the URL to your markdown file"
Instead of making you download code or connect to any complicated systems, the assistant just needs a link to a markdown file (like a README or guide from GitHub). This makes setup easy for anyone.
Here's the code for this:
const url = await input({ message: 'Enter the URL to your markdown file:' });
The program will wait for you to type or paste a link and hit enter.
Step 2. Fetching the Markdown content
The assistant retrieves the file from the URL provided by the user. Accessing the file is necessary so the assistant can review its contents and later respond to questions about it.
Here's the code for this:
const response = await fetch(url);
const markdown = await response.text();
If the file can’t be downloaded (for example, if the link is broken), the assistant will let you know and stop.
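The two lines above don’t show that failure path. Here is one way the check might look, as a hedged sketch: the `fetchMarkdown` helper name is illustrative, not from the demo code, and `fetch` is built into Node.js 18+.

```javascript
// Illustrative sketch: fail fast when the URL can't be fetched.
// The function name is hypothetical; the demo's actual error handling may differ.
async function fetchMarkdown(url) {
  const response = await fetch(url);
  if (!response.ok) {
    // A broken link surfaces as a clear error instead of garbage content.
    throw new Error(`Could not download ${url}: HTTP ${response.status}`);
  }
  return response.text();
}
```

Wrapping the check in a helper keeps the main flow readable and gives the user a precise message when a link is broken.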
Step 3. Breaking the document into chunks
The assistant splits the markdown file into smaller, meaningful pieces (“chunks”).
AI models work best when they process information in bite-sized pieces. This also helps the assistant find the most relevant part of the document for each question. The code below uses a "text splitter" to break up the file, making sure each chunk isn’t too big and that important context isn’t lost between chunks.
Here's the code for this:
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000,
chunkOverlap: 200,
separators: ['\n\n', '\n', ' ', ''],
});
const docs = [new Document({ pageContent: markdown, metadata: { source: url } })];
const chunks = await splitter.splitDocuments(docs);
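To make the chunkSize and chunkOverlap settings concrete, here is a dependency-free sketch of fixed-size chunking with overlap. This is not the actual LangChain implementation: RecursiveCharacterTextSplitter is smarter, preferring natural boundaries like paragraphs and newlines per the separators list above.

```javascript
// Simplified illustration of chunking with overlap (not the LangChain code).
function chunkText(text, chunkSize, overlap) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    // Step forward by less than a full chunk so adjacent chunks share context.
    start += chunkSize - overlap;
  }
  return chunks;
}

const demo = chunkText('x'.repeat(2500), 1000, 200);
console.log(demo.length);    // 3 chunks
console.log(demo[2].length); // 900: the tail chunk is shorter
```

The 200-character overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, so retrieval doesn’t lose it.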
Step 4. Turning chunks into AI-searchable data (Embeddings and indexing)
Each chunk is converted into a special mathematical "fingerprint" (called an embedding) that captures its meaning. This is critical because it allows the assistant to quickly compare your question to all the document chunks and find the ones most likely to contain the answer.
Here's the code for this:
const embeddings = new OllamaEmbeddings({
model: 'granite3.3:2b',
baseUrl: 'http://localhost:11434',
});
const vectorStore = await MemoryVectorStore.fromDocuments(chunks, embeddings);
The assistant uses the Granite 3.3 model (running locally via Ollama) to create these embeddings. For simplicity in this proof of concept, all the embeddings are stored in memory for fast searching.
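Under the hood, that “fast searching” means comparing vectors. The following toy sketch shows the kind of cosine-similarity ranking a vector store performs internally; the two-dimensional vectors and chunk texts here are made-up stand-ins for real embeddings, which have hundreds of dimensions.

```javascript
// Cosine similarity: how closely two embedding vectors point the same way.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored chunks by similarity to the query vector and keep the top k.
function topK(queryVec, store, k) {
  return store
    .map((entry) => ({ text: entry.text, score: cosineSimilarity(queryVec, entry.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 2-D "embeddings" standing in for real model output.
const toyStore = [
  { text: 'install instructions', vector: [1, 0] },
  { text: 'API retry policy', vector: [0, 1] },
  { text: 'unit test commands', vector: [0.9, 0.1] },
];
console.log(topK([1, 0.05], toyStore, 2).map((e) => e.text));
// ['install instructions', 'unit test commands']
```

A question whose embedding points in roughly the same direction as a chunk’s embedding gets that chunk back, even when the exact words differ.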
Step 5. Setting up the conversational AI (Chatbot piece)
At this stage, we configure the core chatbot component of the assistant. This is where the AI model (IBM’s Granite 3.3 running locally via Ollama) is prepared to generate answers based on the context retrieved from your markdown document.
- We create a prompt template that instructs the AI how to behave, acting as an expert documentation assistant that answers questions accurately and references the provided context.
- The prompt template defines a system message (the assistant’s role and guidelines) and a human message (the user’s question).
- The language model (LLM) is initialized to receive these prompts and produce conversational responses.
- This is the heart of the chatbot: it takes your questions plus the relevant document chunks and generates helpful, context-aware answers.
Here's the code for this:
const llm = new ChatOllama({
model: 'granite3.3:2b',
temperature: 0.1,
baseUrl: 'http://localhost:11434',
});
const promptTemplate = ChatPromptTemplate.fromMessages([
[
'system',
`You are an expert documentation assistant. Use the following context to answer questions about the documentation accurately and helpfully.
Context: {context}
Guidelines:
- Provide accurate information based only on the provided context
- Include relevant code examples when available
- Mention the source document when possible
- If information is not in the context, clearly state that`,
],
['human', '{question}'],
]);
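Conceptually, the template just substitutes the {context} and {question} placeholders into the message pair. A hand-rolled equivalent, for illustration only (ChatPromptTemplate does this for you via formatMessages):

```javascript
// Plain-JavaScript sketch of the substitution ChatPromptTemplate performs.
// The parameter names mirror the {context} and {question} placeholders above.
function buildMessages(context, question) {
  return [
    {
      role: 'system',
      content:
        'You are an expert documentation assistant. Use the following context ' +
        `to answer questions about the documentation accurately and helpfully.\nContext: ${context}`,
    },
    { role: 'human', content: question },
  ];
}

const messages = buildMessages('Run tests with `npm test`.', 'How do I run tests?');
console.log(messages[0].content.includes('npm test')); // true
```

Seeing the substitution spelled out makes clear that the retrieved chunks travel to the model inside the system message, while your question arrives unchanged as the human message.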
Step 6. The interactive chat loop
The assistant opens a chat prompt in your terminal, just like a messaging app. This lets you ask questions in plain English and get instant answers. When you type a question, the assistant:
- Searches for the most relevant chunks in the document.
- Feeds those chunks and your question to the AI.
- Prints the AI’s answer in the terminal.
Here's the code for this:
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout,
prompt: '> ',
});
console.log('\nReady! Ask your questions about the document. Type "exit" to quit.');
rl.prompt();
rl.on('line', async (line) => {
if (line.trim().toLowerCase() === 'exit') {
rl.close();
return;
}
try {
const retriever = vectorStore.asRetriever({ k: 5 });
const relevantDocs = await retriever.getRelevantDocuments(line.trim());
const context = relevantDocs.map(doc => doc.pageContent).join('\n\n');
const promptMessages = await promptTemplate.formatMessages({
context: context,
question: line.trim(),
});
const response = await llm.invoke(promptMessages);
console.log(`Answer: ${response.content}\n`);
} catch (err) {
console.error('Error:', err.message);
}
rl.prompt();
});
Step 7. Finding the answer (The RAG pipeline)
Let's break down the RAG pipeline:
- Retrieval: The assistant searches the markdown chunks for the most relevant information.
- Augmentation: It gives this information (the “context”) to the AI model.
- Generation: The AI crafts a helpful answer, grounded in the actual documentation.
This means you get answers that are accurate and grounded in your real docs, not just the AI’s general knowledge or training data.
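The three stages can be tied together in one small function. In this sketch, the retrieve and generate callbacks are stand-ins; in the demo they correspond to the retriever and the ChatOllama call, and the function name answerQuestion is hypothetical.

```javascript
// The RAG pipeline in miniature: retrieval, augmentation, generation.
// `retrieve` and `generate` are injected so the sketch stays model-agnostic.
async function answerQuestion(question, retrieve, generate) {
  const docs = await retrieve(question);  // Retrieval: find relevant chunks
  const context = docs.join('\n\n');      // Augmentation: assemble the context
  return generate(context, question);     // Generation: let the LLM answer
}

// Stub example: fake retrieval plus a fake model that echoes its inputs.
answerQuestion(
  'How do I run tests?',
  async () => ['Run tests with `npm test`.'],
  (context, question) => `Q: ${question}\nAnswer based on: ${context}`,
).then((answer) => console.log(answer));
```

Keeping the stages separated like this is also what makes the upgrades discussed later possible: you can swap the in-memory retriever for a hosted vector database without touching the generation side.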
Step 8. Exiting the assistant gracefully
When you’re done asking questions, it’s important to close the assistant cleanly. The CLI listens for the command exit (case-insensitive). When you type exit and press Enter:
- The assistant stops the chat loop.
- The readline interface closes, freeing up resources.
- A friendly exit ensures no hanging processes remain, and your terminal returns to normal.
- This simple exit mechanism provides a smooth user experience and prevents unexpected crashes or resource leaks.
Here's the code for this:
rl.on('line', async (line) => {
if (line.trim().toLowerCase() === 'exit') {
console.log('Goodbye! Exiting the assistant.');
rl.close();
return;
}
// ... handle question ...
});
Future enhancements and making the AI assistant production-ready
This proof-of-concept demonstrates how Retrieval-Augmented Generation (RAG) can transform a single Markdown file into a conversational, searchable resource. However, the potential for this assistant goes far beyond a single document or a command-line interface. Here’s how you can take this idea to the next level and build a truly production-ready solution:
Integrate it with a full-featured chat interface
- Web or in-app chat: The backend RAG pipeline can be connected to any open-source chat UI (like Rasa or Botpress) or commercial solutions (such as IBM Watson Assistant). This allows users to interact with documentation through a modern, user-friendly web interface, embedded widget, or even mobile app.
- Embedded as a component on your documentation site: Integrate the assistant directly into your documentation website, so users can ask questions and get answers in real time, improving support and onboarding.
Expand it to use multiple files and complete directories
- Directory Support: Instead of a single markdown file, extend the assistant to process entire directories (or even repositories) of documentation. This enables intelligent search and Q&A over all your technical guides, API docs, or knowledge bases.
- Efficient indexing: Implement automatic re-indexing as background jobs or scheduled tasks so documentation is re-indexed as it changes, ensuring answers are always up to date.
Scalable and efficient data retrieval
- Advanced Vector Stores: For large documentation sets and knowledge bases, consider using scalable vector databases (like Pinecone, Milvus, or Weaviate) instead of in-memory or local stores. These solutions handle millions of documents efficiently and support distributed deployments.
- Intelligent Ranking: Add semantic ranking, metadata filtering, and feedback loops to improve answer accuracy and relevance.
Security, privacy, and access control
- Authentication: Integrate user authentication and authorization to restrict access to sensitive documentation.
- On-premises or cloud deployment: Depending on your privacy requirements, deploy the assistant on your own infrastructure, in a private cloud, or as a managed SaaS.
Performance and hardware considerations
- Scaling up to a full-fledged application: This demo uses a single markdown file for simplicity and to minimize hardware requirements. For production use, especially with large documentation sets or many users, higher computing power (more RAM, CPU, and optionally GPU) will be necessary.
- Optimized Models: Consider using quantized or distilled models for faster inference, or leverage cloud-based LLM APIs for elastic scaling.
Enhanced user experience
- Rich Answers: Add support for images, diagrams, code highlighting, and clickable links in answers.
- Personalization: Remember user context and preferences to provide tailored responses.
Summary
The idea presented in this tutorial is to have a digital assistant tailored specifically for your project’s guides. You type a question, like “How do I install the app?” or “How do I roll out a deployment?”, and the assistant reads through your Markdown documents to find the answer. It’s smart enough to understand your question and pull out just the right information, even if it’s buried deep in a file. It saves you from digging through files yourself, and it works offline, so your private project details stay secure. Whether you’re a manager checking how a feature works or a new team member learning the ropes, this assistant makes your life easier.
While this POC focuses on a single markdown file to showcase the core RAG workflow, the architecture is designed for extensibility. With additional engineering, you can scale this assistant to cover all your documentation, provide intelligent search, and deliver a seamless conversational experience, whether embedded in your docs site, integrated with a chatbot, or deployed as a standalone app.
In summary, this tutorial demonstrated how conversational technology can make technical documentation more approachable and useful for everyone. By starting with a simple, interactive CLI demo, we’ve laid the groundwork for a powerful tool that can grow and adapt to users’ needs. As you iterate and expand on this foundation, you’ll unlock new ways to help users find answers quickly and confidently, shaping the future of intelligent retrieval, one conversation at a time.