How to determine whether your application can migrate from an RDBMS to Cassandra as its data store

Jayaraj Jayaraman
5 min read · Apr 17, 2020


Answering this questionnaire helps determine whether you need, and are able, to adopt Cassandra as your data store, or whether you should stick with an RDBMS and pursue alternative approaches to scaling out.

Query Patterns Analysis

  • Are the current query patterns final, or do you see scope for the access/query patterns to change? If you foresee changes to the data model, be aware that such changes cannot be accommodated easily without compromising performance. Unlike an RDBMS, a Cassandra data model is designed around the currently known query patterns; once built, it cannot easily be changed to accommodate a new model or new query patterns.
  • What kind of workload is it? Mostly inserts? Read-heavy? Or will there be many updates and deletes? Frequent updates scatter row versions across SSTables, and deletes create tombstones; both add JVM heap pressure, read-performance problems, and SSTable management (compaction) overhead.
  • Are there queries using the IN clause? Can they be avoided?
  • Do you support unrestricted queries? If not, are your queries against a table always based on its primary key? If not, how many other columns are queried on? In other words, how many indexes are there on the table with the most indexes that supports search by multiple columns? Can the queries that use columns/indexes other than the primary key be avoided? Secondary indexes are not recommended: they are local to each node, so querying them results in scatter/gather queries across multiple nodes, which affects performance.
  • If you cannot query only by primary key, can you create multiple tables with the same (or almost the same) data duplicated across them, each table having a different primary key matching one of your query patterns? A schema sketch follows below. Remember that if the duplicated data needs to be updated, there is no multi-table transaction support, so atomicity and consistency of those updates are not guaranteed.
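For example, if users need to be looked up both by id and by email, a common approach is to keep two tables holding the same data under different primary keys. A minimal sketch using the DataStax Python driver; the keyspace, table, and column names are illustrative assumptions, as is the local cluster address:

```python
from cassandra.cluster import Cluster

# Assumes a Cassandra node on localhost and an existing keyspace named
# "shop"; both are illustrative assumptions, not part of the article.
session = Cluster(["127.0.0.1"]).connect("shop")

# The same user data duplicated into two tables, each keyed for one query.
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_id (
        user_id uuid PRIMARY KEY,
        email   text,
        name    text
    )""")
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_email (
        email   text PRIMARY KEY,
        user_id uuid,
        name    text
    )""")

# Each lookup now hits exactly one partition in one table.
user = session.execute(
    "SELECT * FROM users_by_email WHERE email = %s", ["a@example.com"]
).one()
```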

In such cases, a BATCH can provide transaction-like behaviour at the statement-execution level (atomicity, and isolation only within a single partition). There are best practices to follow, such as keeping the batched statements within a single partition; otherwise we lose out on latency.

If the application is write-heavy, BATCH may not be the right way to handle it: a batch should not be used to increase throughput.
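
To keep such duplicated tables in step, a logged batch gives atomicity (both inserts are eventually applied, or neither). Note that this particular batch spans two partitions in two tables, so it costs extra coordinator/batchlog latency and offers no isolation. A sketch continuing the hypothetical users_by_id/users_by_email tables above:

```python
import uuid
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

session = Cluster(["127.0.0.1"]).connect("shop")  # assumed local node/keyspace

insert_by_id = session.prepare(
    "INSERT INTO users_by_id (user_id, email, name) VALUES (?, ?, ?)")
insert_by_email = session.prepare(
    "INSERT INTO users_by_email (email, user_id, name) VALUES (?, ?, ?)")

user_id = uuid.uuid4()

# A logged batch (the default) is atomic: both inserts are eventually
# applied or neither is. This batch spans two partitions in two tables,
# so it pays batchlog overhead and is not isolated; use it to keep the
# copies consistent, never to bulk-load or raise throughput.
batch = BatchStatement()
batch.add(insert_by_id, (user_id, "a@example.com", "Alice"))
batch.add(insert_by_email, ("a@example.com", user_id, "Alice"))
session.execute(batch)
```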

  • Do you have join queries? Can you avoid them by combining and/or denormalising the tables?
  • Do you have queries that use ORDER BY or GROUP BY clauses? Can they be avoided? If not, the column you want to order/group by should be a clustering column. An ORDER BY query cannot be unrestricted: the partition key must always be restricted in the WHERE clause (equality or IN), and ordering is only enforced among rows of the same partition, never across partitions (see the sketch after this list).
  • Are there queries independent of the partition key (the first part of the primary key), filtering only on a clustering column or an arbitrary column's value? Such queries fan out to every node and will affect performance. Querying on an arbitrary column is not recommended, though you can force it by adding ALLOW FILTERING.
  • Do you need cross-table/cross-partition aggregates? They are neither supported nor recommended by Cassandra; you can use Spark or Hive on top of Cassandra to achieve the same.
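
To illustrate the ordering rule above: the column you order by must be a clustering column, and the partition key must be restricted. A hedged sketch with hypothetical sensor-reading names:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")  # assumed local setup

# event_time is a clustering column, so rows within one sensor's partition
# are stored (and returned) in time order.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings_by_sensor (
        sensor_id  text,
        event_time timestamp,
        value      double,
        PRIMARY KEY ((sensor_id), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)""")

# Valid: the partition key is restricted and ORDER BY uses the clustering column.
rows = session.execute(
    "SELECT * FROM readings_by_sensor WHERE sensor_id = %s "
    "ORDER BY event_time ASC LIMIT 100", ["sensor-42"])

# Not valid without ALLOW FILTERING (and a scan of every node):
#   SELECT * FROM readings_by_sensor WHERE value > 10;
```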

Sizing/Capacity Analysis

  • How many tables are there? What are the top 3 tables that can grow to millions/billions of records before purging takes place, and what is the size per row? What are the top 3 tables with the most columns (wide tables), and what is the size per row?
  • Considering that all DCs use the cloud version and that the purge policy is in place, what would be the record count of the biggest table, and what disk capacity would be required? A rough sizing sketch follows.
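
A back-of-envelope way to answer the disk question; the row count, average row size, replication factor, and overhead factor below are all placeholder assumptions to be replaced with your own numbers:

```python
def estimated_disk_bytes(rows, avg_row_bytes, replication_factor=3,
                         overhead_factor=2.0):
    """Very rough cluster-wide estimate: raw data x replication x an
    overhead factor for SSTable structures, snapshots and compaction headroom."""
    return rows * avg_row_bytes * replication_factor * overhead_factor

# Example: 2 billion rows of ~300 bytes each, RF = 3, 2x overhead -> ~3.6 TB.
print(estimated_disk_bytes(2_000_000_000, 300) / 1e12, "TB")
```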

Horizontal Partitioning & Blended Usage Analysis

  • Can you separate the tables into two categories: (a) read-heavy and (b) write/update-heavy? If yes, the insert-only and read-heavy data can be stored in Cassandra while the update-heavy data stays in the RDBMS. This arrangement might also help you settle for the RDBMS you already have, despite its scaling limitations.
  • If there are many tables, can they be grouped into a few small sets of related tables, such that each set forms a single domain that can be managed/operated by a single microservice with its own bounded context?

Relational Characteristics Analysis

  • Do the related tables have parent-child and/or foreign-key relationships among them?
  • Can the related tables be combined by denormalising them into one table (see the sketch after this list)?
  • If combining is not possible, can the data be duplicated across tables in such a way that a change never needs to be applied to both copies? Changes to duplicated data across multiple tables cannot be made atomically and consistently, because there are no multi-table transactions in Cassandra!
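
For a parent-child relation, the usual alternative to a foreign key is to fold the children into the parent's partition as clustering rows. A sketch with hypothetical order/line-item names (not from the article):

```python
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")  # assumed local setup

# RDBMS shape: orders (parent) plus order_items (child with an order_id FK).
# Cassandra shape: one table where the order id is the partition key, each
# line item is a clustering row in that partition, and order-level columns
# are STATIC so they are stored once per partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS order_with_items (
        order_id   uuid,
        item_id    int,
        order_date timestamp static,
        product    text,
        quantity   int,
        PRIMARY KEY ((order_id), item_id)
    )""")

# One single-partition read returns the order and all of its items; no join.
order_id = uuid.uuid4()  # would normally come from the application
rows = session.execute(
    "SELECT * FROM order_with_items WHERE order_id = %s", [order_id])
```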

Transaction and Consistency Requirements Analysis

  • Do you have transactions involving different transactional resources, such as JMS and a database, or two databases? Do you have transactions that perform multiple DB operations atomically using the RDBMS commit/rollback mechanism? Do you have distributed transactions involving multiple remote transactional resources? Can these be handled using the Compensation/Saga patterns instead?
  • Do you need strong consistency? Reading and writing at QUORUM can give strong consistency semantics (a sketch follows this list), but the application must tolerate the extra latency incurred.
  • Do you need ACID-compliant transactions, with nested transactions, commits/rollbacks, and full referential integrity? Can you achieve these in application code instead? Using Cassandra's BATCH capability for this is not recommended, as it results in multiple operations across nodes, and configuring QUORUM consistency on top adds further latency.
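
Tuning the consistency level per statement is straightforward in the driver. A sketch reusing the hypothetical users_by_id table from earlier, showing QUORUM reads and writes:

```python
import uuid
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("shop")  # assumed local setup

# Consistency level can be set per statement.
read_user = SimpleStatement(
    "SELECT * FROM users_by_id WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
write_user = SimpleStatement(
    "INSERT INTO users_by_id (user_id, email, name) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)

user_id = uuid.uuid4()
session.execute(write_user, (user_id, "a@example.com", "Alice"))

# With RF = 3, a QUORUM write and a QUORUM read overlap on at least one
# replica, so the read sees the acknowledged write; the price is waiting
# on two replicas for every request instead of one.
row = session.execute(read_user, (user_id,)).one()
```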

Potential Tombstone Creation Scenario Analysis

  • Do you have columns that can hold NULL values? If so, the code needs to be written so that such a column is skipped (left unset) instead of being written with a NULL value, which would create a tombstone (see the sketch after this list).
  • Do you use the database as a queue, i.e. store data, process/act on it, and eventually delete it after some time? This pattern generates tombstones, which lead to JVM heap pressure, performance issues, and SSTable management (compaction) overhead.
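
To avoid the NULL-column tombstones mentioned above, the Python driver lets you bind UNSET_VALUE so the column is simply not written. A sketch reusing the hypothetical users_by_id table:

```python
import uuid
from cassandra.cluster import Cluster
from cassandra.query import UNSET_VALUE

session = Cluster(["127.0.0.1"]).connect("shop")  # assumed local setup

insert_user = session.prepare(
    "INSERT INTO users_by_id (user_id, email, name) VALUES (?, ?, ?)")

def save_user(user_id, email=None, name=None):
    # Binding None would write an explicit NULL, which is stored as a
    # tombstone; binding UNSET_VALUE skips the column entirely
    # (requires native protocol v4+, i.e. Cassandra 2.2 or newer).
    session.execute(insert_user, (
        user_id,
        email if email is not None else UNSET_VALUE,
        name if name is not None else UNSET_VALUE,
    ))

save_user(uuid.uuid4(), email="a@example.com")  # 'name' is left unset
```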
