What is a vector database?
A vector database is a specialised database that stores, manages, and searches high-dimensional vector embeddings to enable semantic similarity search.
These embeddings, generated by machine learning models, capture the semantic relationships within unstructured data such as text, images, or audio, positioning conceptually related items closer together in vector space so the system can rank results by relevance. Vector databases index and store both dense and sparse embeddings for fast retrieval, and they commonly serve as an external knowledge base that a large language model (LLM) can query to ground its responses in trusted data and mitigate the risk of hallucination.
Vector embeddings
What are vector embeddings and how are they created?
Vector embeddings are numerical arrays of floating-point values that represent data such as words, phrases, or entire documents. They are generated by machine learning models, such as large language models, that transform digital media into points within a high-dimensional space. This process captures the underlying semantic meaning and relationships of the original data. For instance, an image of a "golden retriever playing in a park" could be converted into an embedding that is numerically close to the embedding for the text "happy dog outside." It's important to note that embeddings produced by different models are not interchangeable; an embedding generated by an OpenAI model, for example, cannot be meaningfully compared with one produced by a different provider's model.
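As a rough illustration of how embeddings are produced, the sketch below uses the open source sentence-transformers library and the all-MiniLM-L6-v2 model; both are assumptions made for this example, and any embedding model follows the same pattern of turning text into a fixed-length array of floats.

```python
# A minimal sketch of generating text embeddings, assuming the
# sentence-transformers library and the all-MiniLM-L6-v2 model (384 dimensions).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["happy dog outside", "a golden retriever playing in a park"]
embeddings = model.encode(sentences)

print(embeddings.shape)   # (2, 384): one fixed-length float32 vector per sentence
print(embeddings[0][:5])  # the first few floating-point components
```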
What are dense vectors (embeddings)?
Dense vectors are high-dimensional numerical embeddings where almost all elements are non-zero values. A critical characteristic of dense vectors is that all vectors generated by a particular model must have the same fixed number of dimensions, which is a prerequisite for measuring similarity. For instance, embeddings from Azure OpenAI models have 1,536 dimensions. Typically produced by transformer models, they capture rich and nuanced semantic meaning, making them ideal for semantic similarity search. A dense vector for the word "cat," for example, might appear as [0.135, -0.629, 0.327, 0.366, ...].
What are sparse vectors (embeddings)?
Sparse vectors are high-dimensional embeddings in which most elements are zero, with non-zero weights kept only for the terms or features that are actually present in, or strongly associated with, the data. Each dimension typically corresponds to a token in a vocabulary, which makes sparse vectors more interpretable than dense vectors and well suited to keyword-style matching enriched with semantic term expansion. Models such as Elastic's ELSER produce sparse embeddings, and many systems combine sparse and dense vectors in hybrid search.
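Purely as an illustration (the tokens and weights below are made up), a sparse vector can be thought of as a map from terms to weights, in contrast to a dense vector where every dimension holds a value:

```python
# Illustrative values only: a dense vector fills every dimension, while a
# sparse vector stores only the non-zero weights, typically keyed by tokens.
dense_vector = [0.135, -0.629, 0.327, 0.366]  # every position has a value

sparse_vector = {                             # most dimensions are zero and omitted
    "dog": 1.92,
    "park": 0.87,
    "play": 0.64,
}
```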
What is vector search?
Vector search is a technique that finds similar data by representing it as high-dimensional numerical vectors, often called embeddings. This method is highly versatile because machine learning models can generate embeddings for any type of digital media, including text, images, and audio. The fundamental concept involves transforming data into a vector space where the geometric distance between vectors indicates their semantic similarity. A vector search is a query operation that finds the vectors most similar to a given query vector based on a specific similarity metric. For example, a search for "canine" could semantically match a document containing the word "dog," unlike traditional keyword search, which matches the literal term rather than its underlying meaning.
Measuring similarity
How are vector similarity and distance measured?
In vector search, similarity is quantified by calculating the distance or the angle between two vectors in a high-dimensional space; vectors that are closer together are considered more semantically similar. Common metrics used to measure this proximity include cosine similarity, Euclidean distance, dot product, Hamming distance, and Manhattan distance (a short code sketch after the list below shows how several of them are computed).
- L2 distance (Euclidean distance) is the most common metric and represents the straight-line "as the crow flies" distance between two vector points.
- L1 distance (Manhattan distance) measures the distance by summing the absolute differences of the vector components, as if navigating a city grid.
- L∞ distance (Chebyshev distance) is the maximum absolute difference between the two vectors along any single dimension.
- Cosine similarity measures the cosine of the angle between two vectors to determine if they point in a similar direction, irrespective of their magnitude. A score of 1 means identical vectors, and –1 means they are opposite. This is a common choice for normalized embedding spaces, such as those from OpenAI models.
- Dot product similarity considers both the angle and the magnitude of the vectors. It is equivalent to cosine similarity for normalized vectors but is often more computationally efficient.
- Hamming distance calculates the number of dimensions at which two vectors differ.
- Max inner product (MaxSim) is a similarity metric used when a single piece of data (like a document) is represented by multiple vectors (e.g., a vector for each word). It calculates similarity by comparing each vector in one document to the most similar vector in the other document and then aggregating the results.
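Here is a minimal numpy sketch of how several of these metrics are computed between two small example vectors (the numbers are arbitrary):

```python
# Computing common similarity and distance metrics between two vectors a and b.
import numpy as np

a = np.array([0.1, 0.3, -0.2, 0.7])
b = np.array([0.2, 0.1, -0.1, 0.6])

l2 = np.linalg.norm(a - b)                                  # Euclidean (L2) distance
l1 = np.sum(np.abs(a - b))                                  # Manhattan (L1) distance
linf = np.max(np.abs(a - b))                                # Chebyshev (L-inf) distance
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))    # cosine similarity
dot = a @ b                                                 # dot product similarity

# Hamming distance applies to binary vectors: the count of differing positions.
p = np.array([1, 0, 1, 1], dtype=np.uint8)
q = np.array([1, 1, 0, 1], dtype=np.uint8)
hamming = np.count_nonzero(p != q)
```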
Efficient search algorithms
What is k-nearest neighbour (kNN) search?
k-nearest neighbour (kNN) search is the core operation behind vector similarity search: given a query vector, it finds the k vectors in the database that are closest to it according to a chosen distance metric. The "k" is simply the number of results you want returned, for example, the 10 most similar product images or the 5 most relevant document chunks for a RAG query. An exact kNN search guarantees it has found the true closest matches by comparing the query against every vector in the index, which is precise but scales poorly. This is why production systems typically rely on approximate methods instead.
What is the difference between exact kNN and approximate nearest neighbour (ANN) search?
Exact kNN compares a query vector against every vector in the dataset to guarantee it returns the true closest matches. It is accurate but computationally expensive, and latency grows linearly with the size of the index. Approximate nearest neighbour (ANN) search trades a small amount of accuracy for a large gain in speed by narrowing the search to a promising subset of vectors rather than scanning all of them. In practice, ANN recall is high enough that most production vector search systems use it by default, reserving exact kNN for small datasets or cases where perfect recall is required.
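As a rough sketch of the exact approach, brute-force kNN can be written in a few lines of numpy; the random data below simply stands in for real embeddings.

```python
# Exact (brute-force) kNN over an in-memory matrix of vectors using cosine
# similarity: every stored vector is scored against the query, which is
# precise but scales linearly with the size of the collection.
import numpy as np

def exact_knn(query, vectors, k=5):
    # Normalize so the dot product equals cosine similarity.
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = vectors @ query                 # one score per stored vector
    top_k = np.argsort(scores)[::-1][:k]     # indices of the k best matches
    return top_k, scores[top_k]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 384)).astype(np.float32)
query = rng.normal(size=384).astype(np.float32)
indices, scores = exact_knn(query, vectors, k=5)
```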
How do algorithms like HNSW and ANN enable efficient vector search?
Searching for similar vectors in a massive, high-dimensional dataset presents a significant challenge. A brute-force approach, which compares a query vector to every other vector, becomes computationally infeasible as the dataset grows. This is solved by using approximate nearest neighbor (ANN) algorithms, which rapidly find vectors close to a query without performing an exhaustive comparison. A common ANN algorithm is hierarchical navigable small world (HNSW), which organizes vectors into a layered graph structure where vectors are connected based on similarity, enabling fast traversal. This is far more efficient than a FLAT (brute-force) search, which is more precise but computationally intensive. By dramatically reducing the search scope, these structures achieve massive speed gains in exchange for a small, and typically acceptable, reduction in absolute accuracy.
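For illustration, the sketch below builds and queries an HNSW index with the open source hnswlib library (an assumption made for this example; vector databases expose the same idea through their own index settings).

```python
# A minimal HNSW example using the hnswlib library (assumed available).
# M and ef_construction control graph connectivity and build quality;
# ef controls the breadth of the graph traversal at query time.
import hnswlib
import numpy as np

dim, num_vectors = 384, 100_000
rng = np.random.default_rng(0)
data = rng.normal(size=(num_vectors, dim)).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_vectors, ef_construction=200, M=16)
index.add_items(data, np.arange(num_vectors))
index.set_ef(50)  # higher ef = better recall, slower queries

query = rng.normal(size=dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)  # approximate 10 nearest neighbors
```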
What is multistage search in vector search systems?
A multistage retrieval pipeline, often built with a retriever framework and sometimes simply called a search pipeline, is an orchestrated workflow that defines the sequence of steps for processing a query. It typically includes query analysis, initial retrieval from one or more indices (e.g., combining lexical and vector search for a hybrid approach), filtering of results, and a final reranking stage before results are returned to the user.
What are the benefits of using the retriever framework for building search pipelines?
The primary benefit is modularity and flexibility. It allows developers to easily combine different search and ranking strategies (such as hybrid search) and construct complex, multistage retrieval pipelines tailored to specific needs without having to build the entire system from scratch.
What is semantic reranking?
Semantic reranking is a second-stage process that improves the relevance of search results. After an initial, fast retrieval stage fetches a broad set of candidate documents, a more computationally intensive but more accurate model is used to reorder this smaller set to produce a more precise final ranking.
How does a "retrieve-and-rerank" multistage process work?
A "retrieve-and-rerank" pipeline operates in two distinct stages:
- Retrieve: An efficient, scalable retrieval method (like ANN vector search or lexical BM25 search) is used to fetch an initial set of candidate documents from the full index.
- Rerank: This smaller candidate set is then passed to a more powerful model (like a cross-encoder) that performs a deeper analysis of the semantic relationship between the query and each document, reordering them to improve the final relevance.
What is the difference between bi-encoder and cross-encoder architectures for reranking?
- A bi-encoder generates separate embeddings for the query and the documents independently. Because the document embeddings can be precalculated and indexed, this architecture is very fast and is used for the initial retrieval stage.
- A cross-encoder processes the query and a document together as a single input. This allows it to capture much deeper contextual interactions, making it highly accurate but also much slower. Due to its computational cost, it is only suitable for the reranking stage on a small set of candidate results.
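A rough sketch of how the two architectures above fit together in a retrieve-and-rerank pipeline, assuming the sentence-transformers library and two publicly available models (the model names are illustrative, not a requirement):

```python
# Retrieve-and-rerank sketch: a fast bi-encoder narrows the candidate set,
# then a slower cross-encoder reorders only that small set.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

documents = [
    "How to train a puppy to sit",
    "Best running shoes for flat feet",
    "Dog obedience classes near me",
    "Marathon training plan for beginners",
]
query = "teaching my dog basic commands"

# Stage 1: bi-encoder retrieval (document embeddings could be precomputed and indexed).
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi_encoder.encode(documents, normalize_embeddings=True)
query_emb = bi_encoder.encode(query, normalize_embeddings=True)
candidate_ids = np.argsort(doc_emb @ query_emb)[::-1][:3]  # top 3 candidates

# Stage 2: cross-encoder reranking of the small candidate set.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, documents[i]) for i in candidate_ids]
rerank_scores = cross_encoder.predict(pairs)
reranked = [documents[i] for i in candidate_ids[np.argsort(rerank_scores)[::-1]]]
```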
Storage and optimization for vector databases
How are vectors typically stored in a vector database, and what storage challenges arise?
Vectors are typically stored as arrays of 32-bit floating-point numbers (float32). The primary challenge is the immense storage footprint; a single 384-dimension vector consumes approximately 1.5KB. An index of 100 million documents can therefore grow roughly sevenfold just by adding a single vector field. Because vector search algorithms like HNSW require the index to be loaded into RAM for performance, this creates significant challenges around memory cost and scalability.
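A quick back-of-the-envelope calculation makes the footprint concrete (using the illustrative figures above):

```python
# Rough memory arithmetic for float32 vectors.
dims = 384
bytes_per_vector = dims * 4                 # float32 = 4 bytes per dimension
print(bytes_per_vector)                     # 1536 bytes, roughly 1.5KB

num_docs = 100_000_000
total_gb = num_docs * bytes_per_vector / 1024**3
print(round(total_gb, 1))                   # ~143 GB of raw vector data alone
```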
What is vector quantization?
Vector quantization is a lossy compression technique that reduces the memory and compute requirements of vector search by representing each embedding with fewer bits. By converting high-precision float32 components to lower-precision representations such as int8 integers or single bits, quantization can dramatically shrink the size of a vector index and speed up distance calculations, typically with only a minimal impact on search accuracy.
What is scalar quantization (SQ)?
Scalar quantization compresses vectors by mapping the continuous range of float32 values to a discrete set of lower-precision integer values (e.g., int8). This can achieve up to 4x reduction in storage size while preserving a significant amount of the vector's magnitude information, which is important for relevance.
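A toy sketch of the idea follows; real implementations calibrate the value range across many vectors rather than a single one.

```python
# Scalar quantization sketch: map float32 values onto 256 8-bit buckets
# using the observed min/max of the vector's components.
import numpy as np

def scalar_quantize(vector):
    lo, hi = vector.min(), vector.max()
    scale = (hi - lo) / 255.0
    quantized = np.round((vector - lo) / scale).astype(np.uint8)  # 1 byte per dimension
    return quantized, lo, scale

def dequantize(quantized, lo, scale):
    return quantized.astype(np.float32) * scale + lo  # approximate reconstruction

v = np.random.default_rng(0).normal(size=384).astype(np.float32)
q, lo, scale = scalar_quantize(v)
print(v.nbytes, q.nbytes)  # 1536 bytes vs 384 bytes: a 4x reduction
```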
What is binary quantization (BQ)?
Binary quantization is a more aggressive compression technique that converts each component of a float32 vector into a binary representation (e.g., 1-bit). This can achieve up to 32x compression, offering maximum memory savings and enabling faster computations using integer-based operations, often at the cost of some precision loss.
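A toy sketch of binary quantization, keeping only the sign of each component and comparing the results with Hamming distance:

```python
# Binary quantization sketch: 1 bit per dimension, compared via XOR + popcount.
import numpy as np

def binary_quantize(vector):
    bits = (vector > 0).astype(np.uint8)      # 1 if positive, else 0
    return np.packbits(bits)                  # 8 dimensions packed into each byte

def hamming_distance(a_bits, b_bits):
    return int(np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=384).astype(np.float32)
b = rng.normal(size=384).astype(np.float32)
print(a.nbytes, binary_quantize(a).nbytes)    # 1536 bytes vs 48 bytes: 32x smaller
print(hamming_distance(binary_quantize(a), binary_quantize(b)))
```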
What are the benefits of a vector database?
Vector databases offer several advantages over traditional databases when working with unstructured data and AI applications:
- Semantic search beyond keyword matching: Vector databases retrieve results based on meaning rather than exact word matches. A query for "affordable laptops for students" can surface products described as "budget notebooks for college" because the underlying embeddings capture conceptual similarity rather than literal text overlap.
- Fast similarity search at scale: By combining vector indexing with approximate nearest neighbor algorithms such as HNSW, vector databases return relevant results from billions of vectors in milliseconds, making real-time AI applications feasible.
- Support for unstructured data: Text, images, audio, video, and other unstructured formats can all be represented as vector embeddings and searched through a single, unified system, removing the need for separate pipelines per data type.
- Grounding for large language models: Vector databases serve as the retrieval layer in retrieval augmented generation (RAG) architectures, supplying LLMs with relevant, up-to-date context from trusted sources and reducing hallucinations.
- Hybrid search capabilities: Modern vector databases combine dense vector search, sparse vector search, and traditional keyword (BM25) search in a single query, delivering more accurate results than any single method on its own.
- Scalability and performance: Purpose-built indexing structures, quantization, and distributed architectures allow vector databases to handle growing datasets and high query volumes without significant degradation in latency.
- Lower operational complexity: An integrated vector database removes the need to stitch together separate systems for storage, embedding management, and search, simplifying the architecture for AI-powered applications.
What are the benefits of an integrated vector database and search platform?
An integrated platform that combines vector storage and search with traditional database functionalities (like lexical search and filtering) offers significant benefits. It simplifies the architecture by eliminating the need to synchronize data between separate systems. Most importantly, it enables powerful hybrid search, where lexical search, vector search, and metadata filtering can be performed in a single, unified query, leading to more relevant results and a simpler developer experience.
What is metadata filtering in a vector database?
Metadata filtering is the process of narrowing vector search results based on structured attributes attached to each vector, such as date, category, author, price, language, or user permissions. While vector search retrieves results by semantic similarity, metadata filters apply hard constraints that results must satisfy, so a query for "lightweight running shoes" can be restricted to items that are in stock, priced under £100, and available in the user's region.
Combining vector similarity with metadata filtering is essential for production applications. A RAG system, for example, may need to limit responses to documents the current user is authorized to see, or to articles published within the last six months. Without metadata filtering, semantically relevant but contextually wrong results would surface routinely.
There are two common approaches. Pre-filtering applies metadata constraints before the similarity search, reducing the candidate set up front; this guarantees the requested number of results but can be slower on highly selective filters. Post-filtering runs the similarity search first and then discards non-matching results, which is faster but can return fewer results than requested if the filter is restrictive. Modern vector databases typically combine both strategies, choosing dynamically based on filter selectivity to balance recall, latency, and accuracy.
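As an illustration, the sketch below issues a pre-filtered kNN query with the Elasticsearch Python client; the index name, field names, and filter values are hypothetical.

```python
# Pre-filtered vector search sketch using the Elasticsearch Python client.
# "products", "embedding", "in_stock", and "price" are hypothetical names.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query_embedding = [0.0] * 384  # replace with the embedding of "lightweight running shoes"

response = es.search(
    index="products",
    knn={
        "field": "embedding",
        "query_vector": query_embedding,
        "k": 10,
        "num_candidates": 100,
        "filter": {
            "bool": {
                "filter": [
                    {"term": {"in_stock": True}},
                    {"range": {"price": {"lte": 100}}},
                ]
            }
        },
    },
)
```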
How does vector search differ from lexical search?
- Lexical search (e.g., BM25) is based on keyword matching. It finds documents that contain the exact terms present in the query. It is precise but does not understand context or synonyms.
- Vector search is based on semantic meaning. It finds documents that are conceptually similar to the query, even if they do not share any keywords. It is excellent for understanding user intent but can be less precise than lexical search.
What are common use cases for vector databases? What can developers build with vector search?
Developers use vector databases to build sophisticated applications that rely on understanding the semantic meaning of data. Common use cases include:
- Semantic search: Creating search experiences that understand user intent beyond keywords, such as in ecommerce or document discovery systems
- Retrieval augmented generation (RAG): Providing LLMs and chatbots with access to external, up-to-date knowledge to generate more accurate and factual answers
- Recommendation engines: Recommending products, articles, or media based on conceptual similarity to a user's interests or past behavior
- Image and multimodal search: Finding visually similar images or searching across different data types (e.g., using text to find an image)
Vector database capabilities with Elastic
Elasticsearch is a vector database built on the foundations of the world's most widely deployed search engine, combining vector search, lexical search, and metadata filtering in a single platform. Rather than running a dedicated vector store alongside a separate search system, teams can use one engine to power semantic search, hybrid retrieval, and retrieval augmented generation at production scale.
Key capabilities include:
- Native dense and sparse vector support: Index, store, and query both dense vector embeddings and sparse embeddings (including those produced by Elastic's ELSER model) within the same index.
- Hybrid search out of the box: Combine BM25 lexical search, dense vector kNN, and sparse vector retrieval in a single query using the retriever framework, with reciprocal rank fusion (RRF) for merging results (see the sketch after this list).
- Built-in inference: The semantic_text field type and inference API handle chunking, embedding generation, and query-time vectorization automatically, removing the need to manage a separate embedding pipeline.
- Memory-efficient storage: Better Binary Quantization (BBQ) is enabled by default for dense vectors, reducing memory footprint by up to 32x while preserving recall, which lowers infrastructure costs at scale.
- Filtered vector search: Apply metadata filters, geospatial constraints, and document-level security alongside vector similarity in a single query, with the query planner choosing pre- or post-filtering based on selectivity.
- Production-grade operations: Inherits Elasticsearch's distributed architecture, including sharding, replication, role-based access control, snapshots, and monitoring, all of which are mature and battle-tested in large deployments.
- Semantic reranking: The Elastic Rerank model provides a second-stage relevance boost on top of initial retrieval, improving result quality without requiring reindexing.
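As an illustration of the retriever framework, the sketch below combines a BM25 match query and a kNN query with RRF using the Elasticsearch Python client (assuming a recent 8.x release); the index and field names are hypothetical.

```python
# Hybrid retrieval sketch: lexical (BM25) + vector (kNN) merged with RRF.
# "articles" and "content_embedding" are hypothetical names.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query_embedding = [0.0] * 384  # replace with the real embedding of the query text

response = es.search(
    index="articles",
    retriever={
        "rrf": {
            "retrievers": [
                {"standard": {"query": {"match": {"content": "affordable laptops for students"}}}},
                {
                    "knn": {
                        "field": "content_embedding",
                        "query_vector": query_embedding,
                        "k": 10,
                        "num_candidates": 100,
                    }
                },
            ]
        }
    },
)
```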
Get started with vector search in Elasticsearch with the Elasticsearch Labs documentation.
Deployment flexibility: On-premises and air-gapped vector databases
Where a vector database runs is increasingly as important as how it performs. Most vector databases are available as managed cloud services, but a growing number of organizations need to deploy them inside their own infrastructure, either on-premises or in fully air-gapped environments.
On-premises deployment means the vector database runs on hardware the organization controls, typically inside its own data centre or private cloud. Data never leaves the organization's network perimeter, and the operations team manages installation, scaling, upgrades, and security directly. Air-gapped deployment goes a step further: The environment has no connection to the public internet at all. Software updates, model weights, and embedding pipelines must be brought in through controlled, offline processes, and no telemetry or data can be sent out.
This matters more for vector databases than for most other infrastructure. In a RAG pipeline, the vector database holds the knowledge base that grounds the language model's responses, which often means it contains the most sensitive content an organization owns: internal research, customer records, legal documents, source code, classified intelligence, or proprietary operational data. The embeddings themselves can also leak information about the source content, so the storage layer is a meaningful part of the threat model, not an afterthought.
Several sectors have hard requirements that effectively rule out public-cloud-only vector databases:
- Federal, defense, and intelligence agencies often operate in classified or air-gapped networks where no external connectivity is permitted and where software must meet specific accreditation standards before it can be deployed.
- Healthcare providers handling protected health information must meet strict regulations around where patient data is stored and processed, and many prefer to keep AI workloads inside hospital or payer infrastructure rather than send embeddings of clinical notes to a third-party service.
- Financial services firms face data residency rules, regulatory audit requirements, and internal policies that often mandate keeping customer and transaction data within specific jurisdictions or inside the firm's own controlled environments.
- Enterprises with data residency or sovereignty requirements, including any organization operating under regulations such as GDPR or similar regional frameworks, need the ability to pin their vector database to a specific country or region or to run it entirely inside their own infrastructure.
Choosing a vector database that supports the full deployment spectrum (managed cloud, self-managed cloud, on-premises, and air-gapped) lets organizations standardize on a single technology across workloads with very different security and compliance profiles, rather than running separate stacks for sensitive and non-sensitive data.