Cassandra for AI: No More Small Data
A common design strategy for a new application’s data storage is to build the app on a simple, smaller-scale database. However, the last few years have seen global data usage grow exponentially, driven largely by AI use cases. The old adage of “SQLite is all you need” just doesn’t cut it anymore, especially when you are building an AI application.
AI is driving the global data footprint
In the last few years, we have seen an explosion of AI-driven, data-hungry applications. This new generation of applications needs its data to be curated, chunked into manageable parts, loaded, and often vectorized in backend systems.
The advent of AI applications built around retrieval augmented generation (RAG) accelerated this trend. Data is no longer just used to train the large language model (LLM), but also to build out the context used to improve the model’s accuracy. Often, that data will be used to generate vector embeddings to allow for vector search operations. The result of all this is that the world is storing and using more data than ever before, moving from 101 zettabytes globally in 2022 to more than 180 zettabytes today.
This data growth trend is not likely to change any time soon. Cloud providers are expanding their data center holdings, as both the compute and storage needs of AI are expected to far exceed current capabilities. Some providers are even working on building data centers in space as a way to reduce requirements for power, cooling, and construction. In the future, data storage strategies will need to account for network distance as well as on-disk robustness.
Apache Cassandra is the solution
One path to solving the data needs of the future is the Apache Cassandra® database project. Cassandra is an open source, NoSQL database that delivers high-performance access to data at large scale. Companies like Netflix, Apple, Disney, and Walmart use Cassandra to serve large amounts of data to customers all over the world. In fact, Netflix has a Cassandra footprint that supports more than 20 petabytes of data, with its most active cluster serving more than 10 million queries per second. Given its track record in the tech industry over the last 15 years, it is easy to see that Cassandra truly is a battle-tested, high-performance database solution.
Cassandra fits neatly into the AI needs of the future for several reasons. It has a native vector type allowing it to support Approximate Nearest Neighbor (ANN) searches. This capability is an absolute must for any database that backs AI.
Also, Cassandra’s best-in-class geographic awareness is a game-changer. If an application needs its data replicated to specific geographic areas, operators can divide a cluster’s nodes into logical data centers (which map underneath to physical data centers). A common use case for this is e-commerce, where the same data might be required to support customers on the US West Coast, the East Coast, and in Western Europe. Cassandra does not care whether a cluster’s data centers are in North America, Asia, or in orbit.
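As a hedged sketch of what that looks like in practice, the keyspace below asks for three replicas in each of three logical data centers. The keyspace and data center names (shop, us_west, us_east, eu_west) are made up for illustration and must match the data center names your cluster’s snitch actually reports.

```python
# Sketch: pinning replicas to specific logical data centers with
# NetworkTopologyStrategy. The keyspace and data center names are
# illustrative and must match your cluster's configured data centers.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Three replicas near West Coast customers, three near East Coast
# customers, and three in Western Europe.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_west': 3,
        'us_east': 3,
        'eu_west': 3
    }
""")
```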
Many current and future applications will need to support AI data across the world. Cassandra is the right tool for that job.
Key AI features in Cassandra
Cassandra has come a long way in recent years. Gone are the days of it being a simple, highly available key/value store. Newer versions of Cassandra include features commonly found in most enterprise-grade databases. These have not only improved its usability and performance but also given it the means to support robust AI data deployments. Specifically, these features are:
Vector search. Cassandra has had vector search since version 5.0. Built on the JVector library, it gives Cassandra users a native CQL vector type and lets developers build solutions that leverage cosine and Euclidean-based ANN vector searches. Existing Cassandra clusters can also be upgraded to 5.0, providing a path for adding vector search and AI capabilities to existing applications.
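As a rough sketch, assuming Cassandra 5.0 or later and a driver release that understands the CQL vector type, the flow looks like this; the keyspace, table, and tiny 3-dimensional embeddings are purely illustrative (real embeddings typically have hundreds or thousands of dimensions).

```python
# Sketch: a table with a small vector column, a storage-attached (SAI)
# index on it, and an approximate nearest neighbor (ANN) query.
# Assumes Cassandra 5.0+; names and the 3-dimension size are illustrative.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")

session.execute("""
    CREATE TABLE IF NOT EXISTS docs (
        id int PRIMARY KEY,
        content text,
        embedding vector<float, 3>
    )
""")
session.execute("""
    CREATE INDEX IF NOT EXISTS docs_embedding_idx
    ON docs (embedding) USING 'sai'
""")

session.execute("""
    INSERT INTO docs (id, content, embedding)
    VALUES (1, 'hello vectors', [0.12, 0.05, 0.91])
""")

# Return the rows whose embeddings are closest to the query vector.
rows = session.execute(
    "SELECT id, content FROM docs ORDER BY embedding ANN OF [0.1, 0.0, 0.9] LIMIT 3"
)
for row in rows:
    print(row.id, row.content)
```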
ACID transactions. With the implementation of the Accord consensus protocol, Cassandra gains transactions that are atomic, consistent, isolated, and durable (ACID). That means transactional updates to embeddings and other AI-supporting data can be applied at scale, across multiple Cassandra nodes. While still in the later stages of development, this feature is slated for Cassandra 6.0, expected in the spring of 2026.
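The exact transaction syntax is still being finalized upstream, so take the following only as a loose illustration of the idea behind Accord (CEP-15), not as final CQL; the table and values are invented, and the statement will only run on a build with transactions enabled.

```python
# Loose illustration only: the Accord (CEP-15) transaction syntax is not
# final and will not run on released Cassandra versions. The shape below
# follows examples shown in the CEP-15 design discussions.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")

session.execute("""
    BEGIN TRANSACTION
        LET doc = (SELECT content FROM docs WHERE id = 1);
        IF doc.content = 'hello vectors' THEN
            UPDATE docs SET content = 'hello transactions' WHERE id = 1;
        END IF
    COMMIT TRANSACTION
""")
```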
UnifiedCompactionStrategy (UCS). This new compaction strategy was added in Cassandra 5.0. It combines the best features of the previous strategies into a single, configurable compaction strategy. This leads to improved performance for compaction, read, and streaming operations, as well as reduced write amplification. Ultimately, UCS allows clusters with high performance requirements to run on far fewer cloud resources than previous versions of Cassandra. If your AI embeddings require semi-frequent updates, this feature will be a substantial help in keeping the disk footprint under control.
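Switching an existing table over is a one-line schema change. The sketch below assumes Cassandra 5.0+ and reuses the earlier table name; scaling_parameters is just one of UCS’s tuning knobs, and the value shown is illustrative rather than a recommendation.

```python
# Sketch: moving a table to UnifiedCompactionStrategy (Cassandra 5.0+).
# 'scaling_parameters' controls how tiered or leveled compaction behaves;
# 'T4' (tiered behavior with fan-out 4) is one illustrative setting.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")

session.execute("""
    ALTER TABLE docs WITH compaction = {
        'class': 'UnifiedCompactionStrategy',
        'scaling_parameters': 'T4'
    }
""")
```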
Additional features of Cassandra
The features that have been built into Cassandra over the years are too many to mention here. But several others have proven very helpful to both operators and AI developers. These features include:
Trie Memtables
Storage Attached Indexes
Support for newer versions of Java
Dynamic Data Masking (see the sketch after this list)
Guardrails framework
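To give a flavor of one item from the list above, here is a hedged sketch of dynamic data masking as introduced in Cassandra 5.0: a masking function is attached to a column, unprivileged roles see the masked value, and a role can be granted UNMASK to read the real data. The table, column, and role names are made up.

```python
# Sketch: dynamic data masking (Cassandra 5.0+). mask_inner(2, 2) keeps
# the first and last two characters visible and masks the middle for any
# role without the UNMASK permission. Names here are made up.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")

session.execute("""
    CREATE TABLE IF NOT EXISTS customers (
        id int PRIMARY KEY,
        email text MASKED WITH mask_inner(2, 2)
    )
""")

# Allow one specific role to read the unmasked values.
session.execute("GRANT UNMASK ON TABLE customers TO analytics_role")
```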
As you can see, Cassandra is no longer a niche NoSQL project. It is a fully featured, enterprise-grade database capable of supporting the AI workloads of today and tomorrow.
Summary
Cassandra has made the journey from the fringes to the forefront of the database world. It is a solid, proven database that underpins always-on systems like video streaming platforms and worldwide e-commerce. If you haven’t checked out Cassandra in a while, it might be time to give it another look.
In this new era of AI data, applications are going to need a database that can adapt and grow with their dataset. Cassandra is that database. Gone are the days of it being dismissed as the “database of last resort.” It should be a core part of any future large-scale data platform.
The bottom line is that if your AI application needs to serve its dataset at large-scale, it needs Cassandra.
Get started with Cassandra
Want to give Cassandra a try, but not sure how to get started? Have a look at IBM’s Astra DB, a database-as-a-service (DBaaS) built on Cassandra. Astra DB lets you quickly deploy a serverless Cassandra cluster on the industry’s most generous free tier, giving you the freedom to test and explore how to serve your application’s data needs. Just visit https://astra.datastax.com and sign up today!
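If you do spin up an Astra database, connecting from Python looks roughly like the sketch below, assuming you have downloaded your database’s secure connect bundle and generated an application token in the Astra console; the bundle path and token string are placeholders.

```python
# Sketch: connecting to an Astra DB (hosted Cassandra) database with the
# DataStax Python driver. The bundle path and token are placeholders you
# download/generate from the Astra console.
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

cloud_config = {"secure_connect_bundle": "/path/to/secure-connect-database.zip"}
auth_provider = PlainTextAuthProvider("token", "AstraCS:<your-application-token>")

cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()

# Quick smoke test: print the Cassandra version the cluster reports.
print(session.execute("SELECT release_version FROM system.local").one())
```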