Cassandra: Introduction

Zora Hirbodvash
Sep 13, 2022


Apache Cassandra is one of the most popular NoSQL databases. It is an open-source, distributed, decentralized storage system (database) that manages very large amounts of structured data spread across many machines, potentially around the world. It offers high scalability and performance, and because it has no single point of failure, it does so without compromising availability.

Cassandra-Architecture

· Components of Cassandra

Cassandra consists of the following key components:

Node − A node is a place where data is stored.

Data center − A collection of nodes related to one another.

Cluster − Clusters are composed of one or more data centers.

Commit log − The commit log is Cassandra's crash-recovery mechanism. Every write operation is recorded in the commit log.

Mem-table − A memory-resident data structure. After a write is recorded in the commit log, the data is written to the mem-table. A single column family may have several mem-tables.

SSTable − The on-disk file to which mem-table data is flushed once its contents reach a threshold value.

Bloom filter − A quick, probabilistic structure for testing whether an element is a member of a set: it may report a false positive but never a false negative. It acts as a special kind of cache, and bloom filters are consulted on every query.
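To make the bloom filter idea concrete, here is a minimal sketch in Python. It is not Cassandra's implementation; the class name, the bit-array size, and the use of salted SHA-256 hashes are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a fixed-size bit array plus k hash functions."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size              # number of bits (illustrative value)
        self.num_hashes = num_hashes  # number of hash functions (illustrative)
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


bloom = BloomFilter()
for key in ("alice", "bob"):
    bloom.add(key)

print(bloom.might_contain("alice"))  # True: added keys are never reported absent
```

A "no" answer lets a lookup skip a file entirely, which is why such a filter is cheap to consult on every query.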

· Cassandra Query Language

Clients access Cassandra through its nodes using the Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables. Programmers work with CQL either through the cqlsh shell or through drivers for their application language.

Clients can connect to any node for read and write operations. That node, the coordinator, acts as a proxy between the client and the nodes that hold the data.

· Write Operations

Every write received by a node is first recorded in its commit log. The data is then stored in the mem-table. When the mem-table is full, its data is written to an SSTable data file. The cluster automatically partitions and replicates all writes. Cassandra periodically consolidates SSTables, discarding unnecessary data.
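The write path can be sketched as a toy, single-node model. This is only an illustration under simplifying assumptions: ToyNode, memtable_limit, and compact() are hypothetical names, the "files" are plain dictionaries, and partitioning and replication are left out.

```python
class ToyNode:
    """Single-node sketch of the write path; not how Cassandra is implemented."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory, latest value per key
        self.sstables = []     # flushed tables, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log first, for crash recovery
        self.memtable[key] = value            # 2. then update the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush once the mem-table is full

    def flush(self):
        # Write the mem-table out as a key-sorted table and start afresh.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self):
        # Merge all SSTables into one; newer values overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))]


node = ToyNode(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    node.write(key, value)
node.compact()

print(node.sstables)  # [{'a': 3, 'b': 2, 'c': 4}]
```

The stale value a=1 disappears during the merge, which is the sense in which consolidation discards unnecessary data.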

· Read Operations

On a read, Cassandra first checks the mem-table for the requested values, then consults the bloom filters to find the SSTables that may hold the required data.
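A rough sketch of that lookup order follows. The helper names read() and make_sstable() are hypothetical, a plain Python set stands in for the bloom filter, and returning the first match from the newest table is a simplification (Cassandra itself reconciles versions by write timestamp).

```python
def read(key, memtable, sstables):
    """Return the newest value for key, or None if it is not stored."""
    if key in memtable:                 # most recent writes live in memory
        return memtable[key]
    for table in reversed(sstables):    # newest SSTable first
        if key not in table["filter"]:  # filter says "definitely not here"
            continue                    # skip without touching the data
        if key in table["data"]:
            return table["data"][key]
    return None


def make_sstable(data):
    # A plain set stands in for the bloom filter; a real one may also
    # return false positives, which the data lookup above tolerates.
    return {"filter": set(data), "data": dict(data)}


memtable = {"c": 30}
sstables = [make_sstable({"a": 1, "b": 2}), make_sstable({"a": 10})]

print(read("a", memtable, sstables))  # 10, from the newest SSTable
```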

Cassandra-Data model

The data model of Cassandra is significantly different from what we normally see in an RDBMS. This section provides an overview of how Cassandra stores its data.

· Cluster

A Cassandra database is distributed over several machines that operate together. The cluster is the outermost container. Each node holds a replica, so if one node fails, a replica on another node takes over. Cassandra arranges the nodes of a cluster in a ring and assigns data to them.
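The ring idea can be illustrated with a small sketch. Everything here is an assumption made for the example: the Ring class, the node names, one token per node, and a SHA-256 based token function (Cassandra uses its own partitioners).

```python
import bisect
import hashlib


def token(value):
    # Hypothetical token function: SHA-256 digest read as an integer.
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # Each node owns one position (token) on the ring.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key, count=1):
        # First node clockwise from the key's token, then its successors.
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.entries)
        count = min(count, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1]
                for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.replicas("user:42", count=3))  # three distinct nodes
```

If the first node in that list fails, the remaining entries are the replicas that can still serve the key.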

· Keyspace

A keyspace is the outermost container for data in Cassandra. A keyspace has the following basic attributes:

· Replication factor − The number of machines in the cluster that will receive copies of the same data.

· Replica placement strategy − The strategy for placing replicas in the ring. The options are SimpleStrategy (not rack-aware), the legacy OldNetworkTopologyStrategy (rack-aware), and NetworkTopologyStrategy (datacenter-aware).

· Column families − A keyspace contains at least one, and often many, column families. A column family is a collection of rows, and the columns within each row are ordered. Together, the column families represent the structure of your data.

The syntax of creating a Keyspace is as follows −

CREATE KEYSPACE keyspace_name WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
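When the statement is issued from application code, it can help to build it with a small helper. The function below is a hypothetical convenience, not part of any driver; it assumes SimpleStrategy and adds IF NOT EXISTS so that re-running it is harmless.

```python
import re


def create_keyspace_cql(name, replication_factor=3):
    """Build a CREATE KEYSPACE statement that uses SimpleStrategy."""
    # The name is interpolated into the statement, so validate it first.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
        raise ValueError(f"invalid keyspace name: {name!r}")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = "
        f"{{'class': 'SimpleStrategy', "
        f"'replication_factor': {int(replication_factor)}}}"
    )


print(create_keyspace_cql("mykeyspace"))
```

With a live Session from the Python driver, the resulting string could be passed to session.execute().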

The following illustration shows a schematic view of a Keyspace.

· Column Family

Column families are containers for ordered collections of rows. In turn, each row consists of a collection of columns.

Fig. 1. Data modeling in Cassandra architecture

Cassandra-Python driver

Connecting to Cassandra

In order to execute queries against a Cassandra cluster, we must first set up an instance of Cluster. Every Cassandra cluster you want to interact with will typically require one Cluster instance.

Before creating a Cluster, install the Cassandra driver. pip is the recommended installer, and it installs the driver's Python dependencies along with the driver itself. Here is how to install the driver:

pip install cassandra-driver

To check if the installation was successful, you can run:

python -c 'import cassandra; print(cassandra.__version__)'

Creating a Cluster is as simple as this:

from cassandra.cluster import Cluster

cluster = Cluster()

Session Keyspace

The Cluster isn’t actually connected to any nodes when it is instantiated. A Session is created by calling Cluster.connect() in order to establish connections and execute queries:

cluster = Cluster()

session = cluster.connect()

The connect() method takes an optional keyspace argument, which sets the default keyspace for all queries made through that session:

cluster = Cluster()

session = cluster.connect('mykeyspace')

Execution Profiles

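An execution profile bundles settings such as the load balancing policy, retry policy, consistency level, request timeout, and row factory. Profiles are passed to the Cluster through the execution_profiles dictionary.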

The base ExecutionProfile can be constructed by passing all attributes:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import WhiteListRoundRobinPolicy, DowngradingConsistencyRetryPolicy
from cassandra.query import tuple_factory

# Collect every per-request setting in one profile.
profile = ExecutionProfile(
    load_balancing_policy=WhiteListRoundRobinPolicy(['127.0.0.1']),
    retry_policy=DowngradingConsistencyRetryPolicy(),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    serial_consistency_level=ConsistencyLevel.LOCAL_SERIAL,
    request_timeout=15,
    row_factory=tuple_factory
)

# Register the profile as the default for this cluster.
cluster = Cluster(execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()

print(session.execute("SELECT release_version FROM system.local").one())

Executing Queries

We can now execute queries once we have a Session. Using execute() is the simplest way to execute a query:

rows = session.execute('SELECT name, age, email FROM users')
for user_row in rows:
    print(user_row.name, user_row.age, user_row.email)

This will transparently pick a Cassandra node to execute the query against and handle any retries that are necessary if the operation fails.

Each row of the result set is a namedtuple by default. For each column defined in the schema, such as name, age, and so on, each row will have a matching attribute. The fields can also be accessed by position or unpacked as normal tuples. These three examples are equivalent:

rows = session.execute('SELECT name, age, email FROM users')
for row in rows:
    print(row.name, row.age, row.email)

rows = session.execute('SELECT name, age, email FROM users')
for (name, age, email) in rows:
    print(name, age, email)

rows = session.execute('SELECT name, age, email FROM users')
for row in rows:
    print(row[0], row[1], row[2])
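The equivalence comes from how namedtuples behave, which can be checked without a running cluster. The Row type and its values below are hypothetical stand-ins for a row returned by the users query.

```python
from collections import namedtuple

# Hypothetical stand-in for one row returned by the users query.
Row = namedtuple("Row", ["name", "age", "email"])
row = Row(name="Ada", age=36, email="ada@example.com")

by_attribute = (row.name, row.age, row.email)  # access by column name
name, age, email = row                         # unpack like a normal tuple
by_position = (row[0], row[1], row[2])         # access by position

print(by_attribute == (name, age, email) == by_position)  # True
```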

Prepared Statements

Cassandra parses prepared statements and saves them for future use. With a prepared statement, the driver only needs to send the parameters’ values. Because Cassandra does not have to re-parse the query each time, the network traffic and CPU utilization within Cassandra are reduced.

To prepare a query, use Session.prepare():

user_lookup_stmt = session.prepare("SELECT * FROM users WHERE user_id=?")

# user_ids_to_query is assumed to be an iterable of user IDs defined elsewhere.
users = []
for user_id in user_ids_to_query:
    user = session.execute(user_lookup_stmt, [user_id])
    users.append(user)

Source:

https://docs.datastax.com/en/developer/python-driver/3.25/getting_started/

https://www.tutorialspoint.com/cassandra/index.htm
