What is a vector database?

What is a vector database?

A vector database is a specialized database designed to store, manage, and search high-dimensional vector embeddings. Its core capability is semantic similarity search, which identifies data points that are conceptually similar rather than merely matching keywords. To support this, it indexes and stores both dense and sparse vector embeddings produced by machine learning models, enabling fast similarity search and retrieval. These embeddings capture the semantic relationships within unstructured data such as text, images, or audio; in the vector space created by the database, related items sit closer to one another, which allows the system to rank results by relevance. A common application is to serve as an external knowledge base that a large language model (LLM) can query, "grounding" the model's responses in the stored data and mitigating the risk of hallucination.
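To make the idea concrete, here is a minimal, self-contained sketch of semantic ranking over an in-memory collection. The item names and embedding values are invented for illustration, and a real vector database would use an approximate nearest neighbor index rather than brute-force scoring:

```python
import numpy as np

# Toy in-memory "vector store": item IDs mapped to made-up embeddings.
store = {
    "doc_dog": np.array([0.9, 0.1, 0.0]),
    "doc_cat": np.array([0.8, 0.3, 0.1]),
    "doc_car": np.array([0.0, 0.2, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: angle between two vectors, ignoring magnitude.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vec, k=2):
    # Rank stored items by similarity to the query and return the top k.
    scored = [(item_id, cosine(query_vec, vec)) for item_id, vec in store.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

# A query vector that is conceptually close to the dog/cat documents.
print(search(np.array([0.85, 0.2, 0.05])))
```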

Vector embeddings

What are vector embeddings and how are they created?

Vector embeddings are numerical arrays of floating-point values that represent data such as words, phrases, or entire documents. They are generated by machine learning models, such as large language models, that map text, images, and other media to points in a high-dimensional space. This process captures the underlying semantic meaning and relationships of the original data. For instance, an image of a "golden retriever playing in a park" could be converted into an embedding that is numerically close to the embedding for the text "happy dog outside." Note that embeddings are only comparable within the model that produced them; an embedding from an OpenAI model, for example, cannot be meaningfully compared with one generated by a different provider's model.
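As a hedged example of how embeddings are created in practice, the sketch below uses the sentence-transformers library; the model name is just a commonly used public checkpoint and stands in for whichever embedding model a project actually uses:

```python
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Any embedding model could be used; 'all-MiniLM-L6-v2' is one small public example.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "happy dog outside",
    "a golden retriever playing in a park",
    "quarterly tax filing",
]
embeddings = model.encode(sentences)  # numpy array, shape (3, 384) for this model

print(embeddings.shape)
# The two dog-related sentences end up with similar vectors; the unrelated one does not.
```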

What are dense vectors (embeddings)?

Dense vectors are high-dimensional numerical embeddings where almost all elements are non-zero values. A critical characteristic of dense vectors is that all vectors generated by a particular model must have the same fixed number of dimensions, which is a prerequisite for measuring similarity. For instance, embeddings from Azure OpenAI models have 1,536 dimensions. Typically produced by transformer models, they capture rich and nuanced semantic meaning, making them ideal for semantic similarity search. A dense vector for the word "cat," for example, might appear as [0.135, -0.629, 0.327, 0.366, ...].

What are sparse vectors (embeddings)?

Sparse vectors are high-dimensional numerical embeddings in which the majority of elements are zero, a structure that optimizes both storage and computational efficiency. Unlike dense retrievers, sparse retrievers use traditional search techniques such as term frequency-inverse document frequency (TF-IDF) or BM25 to match queries to documents based on keywords. For example, a search for "healthy snack" might produce a sparse vector that expands the query to related weighted terms such as ["apple" (3.0), "carrot" (2.5), "vitamin" (1.2)], while every other term in the vocabulary has a weight of zero. This structure is highly compatible with traditional inverted indices, allowing for efficient retrieval.
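To make the "mostly zeros" structure concrete, here is a small sketch using scikit-learn's TfidfVectorizer on an invented three-document corpus (one common way to build sparse term-weight vectors):

```python
# Requires: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "apple and carrot are a healthy snack",
    "cars need fuel and regular maintenance",
    "vitamin rich snacks like apple slices",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # SciPy sparse matrix: most entries are zero

print(X.shape)                         # (3 documents, vocabulary size)
print(X.nnz, "non-zero entries out of", X.shape[0] * X.shape[1])

# Inspect the non-zero terms and their weights for the first document.
row = X[0].tocoo()
terms = vectorizer.get_feature_names_out()
print({terms[j]: round(v, 3) for j, v in zip(row.col, row.data)})
```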

Measuring similarity

How are vector similarity and distance measured?

In vector search, similarity is quantified by calculating the distance or the angle between two vectors in a high-dimensional space; vectors that are closer together are considered more semantically similar. Common metrics used to measure this proximity include cosine similarity, Euclidean distance, dot product, Hamming distance, and Manhattan distance; each is described below, with a short code sketch after the list.

  • L2 distance (Euclidean distance) is the most common metric and represents the straight-line "as the crow flies" distance between two vector points.
  • L1 distance (Manhattan distance) measures the distance by summing the absolute differences of the vector components, as if navigating a city grid.
  • Linf distance (Chebyshev distance) is the maximum difference along any single dimension.
  • Cosine similarity measures the cosine of the angle between two vectors to determine if they point in a similar direction, irrespective of their magnitude. A score of 1 means the vectors point in the same direction, 0 means they are orthogonal, and –1 means they point in opposite directions. This is a common choice for normalized embedding spaces, such as those from OpenAI models.
  • Dot product similarity considers both the angle and the magnitude of the vectors. It is equivalent to cosine similarity for normalized vectors but is often more computationally efficient.
  • Hamming distance counts the number of dimensions at which two vectors differ; it is typically used with binary vectors.
  • Max inner product (MaxSim) is a similarity metric used when a single piece of data (like a document) is represented by multiple vectors (e.g., a vector for each word). It calculates similarity by comparing each vector in one document to the most similar vector in the other document and then aggregating the results.
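The following NumPy sketch restates these metrics as code purely to make the formulas concrete; the example vectors are arbitrary, and this is not how a vector database computes them at scale:

```python
import numpy as np

a = np.array([0.1, 0.3, -0.2])
b = np.array([0.2, 0.1, 0.4])

l2 = np.linalg.norm(a - b)                    # Euclidean (L2) distance
l1 = np.sum(np.abs(a - b))                    # Manhattan (L1) distance
linf = np.max(np.abs(a - b))                  # Chebyshev (Linf) distance
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity
dot = np.dot(a, b)                            # dot-product similarity

# Hamming distance, typically computed on binary vectors.
ba = np.array([1, 0, 1, 1])
bb = np.array([1, 1, 0, 1])
hamming = int(np.sum(ba != bb))

# MaxSim-style late interaction: each query vector is matched to its best
# document vector and the per-vector maxima are aggregated (summed here).
Q = np.random.rand(4, 8)    # 4 query token vectors
D = np.random.rand(20, 8)   # 20 document token vectors
maxsim = float(np.sum(np.max(Q @ D.T, axis=1)))

print(l2, l1, linf, cos, dot, hamming, maxsim)
```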

Efficient search algorithms

What is a multistage search pipeline in vector search systems?

A multistage retrieval pipeline, or retriever framework (often simply called a search pipeline), is an orchestrated workflow that defines the sequence of steps for processing a query. This typically includes query analysis, initial retrieval from one or more indices (for example, combining lexical and vector search for a hybrid approach), filtering of results, and a final reranking stage before returning the results to the user.
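Sketched below is a tiny, runnable version of such a pipeline over an invented three-document corpus; the keyword-overlap and cosine functions are simplistic stand-ins for a real lexical index and ANN index:

```python
import numpy as np

# Toy corpus: each document has text, a made-up embedding, and metadata.
DOCS = [
    {"id": 1, "text": "healthy apple snack ideas", "vec": np.array([0.9, 0.1]), "lang": "en"},
    {"id": 2, "text": "carrot and vitamin rich recipes", "vec": np.array([0.8, 0.3]), "lang": "en"},
    {"id": 3, "text": "motor oil change intervals", "vec": np.array([0.1, 0.9]), "lang": "de"},
]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def lexical_search(query, size=10):
    # Crude keyword-overlap score standing in for BM25 over an inverted index.
    terms = set(query.lower().split())
    return sorted(DOCS, key=lambda d: len(terms & set(d["text"].split())), reverse=True)[:size]

def vector_search(query_vec, size=10):
    # Brute-force cosine similarity standing in for an ANN index lookup.
    return sorted(DOCS, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)[:size]

def search_pipeline(query, query_vec, lang="en", k=2):
    # 1. Initial retrieval from both a lexical and a vector index (hybrid approach).
    candidates = {d["id"]: d for d in lexical_search(query) + vector_search(query_vec)}
    # 2. Metadata filtering of the merged candidate set.
    filtered = [d for d in candidates.values() if d["lang"] == lang]
    # 3. Final reranking stage (cosine again here; a real pipeline would call a
    #    heavier semantic reranker such as a cross-encoder).
    filtered.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["text"] for d in filtered[:k]]

print(search_pipeline("healthy snack", np.array([0.85, 0.2])))
```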

What are the benefits of using the retriever framework for building search pipelines?

The primary benefit is modularity and flexibility. It allows developers to easily combine different search and ranking strategies (such as hybrid search) and construct complex, multistage retrieval pipelines tailored to specific needs without having to build the entire system from scratch.

What is semantic reranking?

Semantic reranking is a second-stage process that improves the relevance of search results. After an initial, fast retrieval stage fetches a broad set of candidate documents, a more computationally intensive but more accurate model is used to reorder this smaller set to produce a more precise final ranking.

How does a "retrieve-and-rerank" multistage process work?

A "retrieve-and-rerank" pipeline operates in two distinct stages:

  1. Retrieve: An efficient, scalable retrieval method (like ANN vector search or lexical BM25 search) is used to fetch an initial set of candidate documents from the full index.
  2. Rerank: This smaller candidate set is then passed to a more powerful model (like a cross-encoder) that performs a deeper analysis of the semantic relationship between the query and each document, reordering them to improve the final relevance.
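Here is a minimal sketch of the two-stage pattern using the sentence-transformers library; the documents are invented and the model names are common public checkpoints chosen only for illustration:

```python
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, CrossEncoder, util

documents = [
    "A vector database stores and searches high-dimensional embeddings.",
    "BM25 is a lexical ranking function based on term statistics.",
    "Golden retrievers are a friendly dog breed.",
]
query = "How do vector databases find similar items?"

# Stage 1: fast bi-encoder retrieval over the whole collection.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi_encoder.encode(documents, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]  # small candidate set

# Stage 2: slower but more accurate cross-encoder reranking of the candidates.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, documents[h["corpus_id"]]) for h in hits]
scores = cross_encoder.predict(pairs)

reranked = sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True)
for (_, doc), score in reranked:
    print(round(float(score), 3), doc)
```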

What is the difference between bi-encoder and cross-encoder architectures for reranking?

  • A bi-encoder generates separate embeddings for the query and the documents independently. Because the document embeddings can be precalculated and indexed, this architecture is very fast and is used for the initial retrieval stage.
  • A cross-encoder processes the query and a document together as a single input. This allows it to capture much deeper contextual interactions, making it highly accurate but also much slower. Due to its computational cost, it is only suitable for the reranking stage on a small set of candidate results.

Storage and optimization

How are vectors typically stored in a vector database, and what storage challenges arise?

Vectors are typically stored as arrays of 32-bit floating-point numbers (float32). The primary challenge is the immense storage footprint; a single 384-dimension vector consumes approximately 1.5KB. An index of 100 million documents can therefore increase in size by seven times just by adding one vector field. Because vector search algorithms like HNSW require the index to be loaded into RAM for performance, this creates significant challenges related to memory cost and scalability.
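The arithmetic behind those figures is straightforward; this sketch simply restates them (384 float32 dimensions, 100 million documents):

```python
DIMS = 384
BYTES_PER_FLOAT32 = 4
NUM_DOCS = 100_000_000

bytes_per_vector = DIMS * BYTES_PER_FLOAT32   # 1,536 bytes, i.e. ~1.5 KB
total_bytes = bytes_per_vector * NUM_DOCS     # one vector field per document

print(f"{bytes_per_vector} bytes per vector (~{bytes_per_vector / 1024:.1f} KB)")
print(f"~{total_bytes / 1024**3:.0f} GiB of raw vector data for {NUM_DOCS:,} documents")
```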

What is vector quantization?

Vector quantization is a lossy compression technique that reduces memory and computation requirements by representing each vector component with fewer bits. By converting high-precision float32 values into lower-precision representations such as int8 or even single bits, quantization can significantly shrink the index and speed up distance calculations with minimal impact on search accuracy. The same idea is widely used to compress model parameters, for example reducing LLM weights to int8 or int4.

What is scalar quantization (SQ)?

Scalar quantization compresses vectors by mapping the continuous range of float32 values to a discrete set of lower-precision integer values (e.g., int8). This can achieve up to 4x reduction in storage size while preserving a significant amount of the vector's magnitude information, which is important for relevance.
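Here is a minimal sketch of the idea, assuming simple per-vector min/max scaling to int8; production implementations choose their quantization ranges and error correction more carefully:

```python
import numpy as np

def scalar_quantize(vec):
    # Map this vector's float32 range onto the 256 available int8 buckets.
    lo, hi = float(vec.min()), float(vec.max())
    scale = (hi - lo) / 255.0 or 1.0
    q = np.round((vec - lo) / scale - 128).astype(np.int8)
    return q, lo, scale

def scalar_dequantize(q, lo, scale):
    # Approximate reconstruction of the original float values.
    return (q.astype(np.float32) + 128) * scale + lo

v = np.random.randn(8).astype(np.float32)
q, lo, scale = scalar_quantize(v)
print(v.nbytes, "bytes ->", q.nbytes, "bytes (4x smaller)")
print("max reconstruction error:", float(np.max(np.abs(v - scalar_dequantize(q, lo, scale)))))
```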

What is binary quantization (BQ)?

Binary quantization is a more aggressive compression technique that converts each component of a float32 vector into a single bit (for example, based on its sign). This can achieve up to 32x compression, offering maximum memory savings and enabling faster distance computations using bitwise operations such as Hamming distance, often at the cost of some precision loss.
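A sketch of the idea, assuming sign-based binarization with Hamming distance computed via XOR and popcount (real systems typically rescore the top candidates against the original vectors):

```python
import numpy as np

def binary_quantize(vec):
    # Keep only the sign of each component: 1 bit instead of 32 bits.
    return np.packbits(vec > 0)

def hamming(a_bits, b_bits):
    # Hamming distance on the packed bit arrays via XOR + popcount.
    return int(np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum())

a = np.random.randn(128).astype(np.float32)   # 128 * 4 = 512 bytes
b = np.random.randn(128).astype(np.float32)

qa, qb = binary_quantize(a), binary_quantize(b)
print(a.nbytes, "bytes ->", qa.nbytes, "bytes (32x smaller)")
print("Hamming distance between the binarized vectors:", hamming(qa, qb))
```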

What are the benefits of an integrated vector storage (database) and search platform?

An integrated platform that combines vector storage and search with traditional database functionalities (like lexical search and filtering) offers significant benefits. It simplifies the architecture by eliminating the need to synchronize data between separate systems. Most importantly, it enables powerful hybrid search, where lexical search, vector search, and metadata filtering can be performed in a single, unified query, leading to more relevant results and a simpler developer experience.
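To illustrate the value of combining these signals, the sketch below fuses a lexical ranking and a vector ranking with reciprocal rank fusion (RRF), one common fusion technique; the document IDs and rankings are invented, and the metadata filter is assumed to have been applied before these lists were produced:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Combine several ranked lists of document IDs into one hybrid ranking.
    # Each document scores 1 / (k + rank) in every list it appears in.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results of a lexical (BM25) query and a vector (kNN) query,
# both already restricted to documents matching the same metadata filter.
lexical_ranking = ["doc_7", "doc_2", "doc_9"]
vector_ranking = ["doc_2", "doc_5", "doc_7"]

print(reciprocal_rank_fusion([lexical_ranking, vector_ranking]))
# doc_2 and doc_7 rise to the top because both retrieval strategies agree on them.
```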