How to store your clickstream data? (Part II)
In the first part of this blog series, I tried to explain how to collect and access clickstream data. In this part, I will try to explain how to store your clickstream data in your database.
Table of Contents:
- Reading data from Apache Kafka via Python.
- Transforming the Apache Kafka message.
- About the data design.
- Writing data to Apache Cassandra.
- Demo
- What’s next?
First, I set up a demo for it on Docker. You can clone my repo and play with it.
1. Reading data from Apache Kafka:
There are many different options for consuming data from Apache Kafka. I will use the “kafka-python” package.
You can find a sample in my demo in the “click_stream_consumer.py” file.
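For reference, a minimal consumer could look like the sketch below. The topic name (“divolte”) and the broker address are assumptions based on a typical Divolte setup, so adjust them to your own configuration.

from kafka import KafkaConsumer

# Minimal sketch, assuming Divolte publishes to a topic named "divolte"
# and Kafka is reachable on localhost:9092 (both are assumptions).
consumer = KafkaConsumer(
    'divolte',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    group_id='click_stream_consumer',
)

for msg in consumer:
    # msg.value holds the raw, Avro-encoded Divolte event
    print(msg.value)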
2. Transforming the Apache Kafka message:
In our case, we can’t use the clickstream data directly, because Divolte produces the data compressed with lz4. We defined the compression type in the Divolte configuration as below:
...
acks = 1
retries = 0
compression.type = lz4
max.in.flight.requests.per.connection = 1
client.id = divolte.collector_0
...
We need to use an Apache Avro schema file to decode the messages coming from Kafka. We already defined this schema when we configured Divolte, and we use the same “avro” file here to decode the clickstream data.
Here is sample code for decoding a message that is encoded with Apache Avro:
import io
import avro.io
import avro.schema

schema = avro.schema.Parse(open('PATH_OF_AVRO FILE').read())

def decode(msg):
    bytes_reader = io.BytesIO(msg.value)
    decoder = avro.io.BinaryDecoder(bytes_reader)
    reader = avro.io.DatumReader(schema)
    return reader.read(decoder)
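Putting it together with the consumer from the first section, each incoming message can be decoded before it is stored. This is only a sketch; the names are mine and may differ from “click_stream_consumer.py”.

import logging

for msg in consumer:
    event = decode(msg)   # helper defined above
    logging.info(event)   # a plain dict with the clickstream fields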
3. About the data design:
The choice of database and the design of the data are very important when you work with “big data”.
I preferred Apache Cassandra to store the data. I already tried to explain why I chose Apache Cassandra in my previous blog. You can take a look if you want:
https://medium.com/@mustafa.ileri/set-up-cassandra-node-cluster-for-production-with-python-972aae84a0a0
I learned two important things when I started to store clickstream data in Apache Cassandra.
- The primary key mechanism works completely differently in Apache Cassandra.
- The choice of partition key is very important for how the data is stored.
For example, you can use your partition key in a WHERE condition, but only with the equality operator; you can’t use the < or > operators on the partition key.
On the other hand, if you would like to query by a date range, the date needs to be part of the primary key, together with the other key columns.
I used the date as the partition key (and part of the primary key), because I need to fetch the data for the last 30, 60, or 90 days. The disadvantage of this design is that a new partition is created for each day.
You can get more info about Apache Cassandra primary keys from this blog:
https://www.datastax.com/blog/2016/02/most-important-thing-know-cassandra-data-modeling-primary-key
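To make the design above concrete, here is a sketch of how such a table could be modelled with cqlengine, the object mapper that ships with cassandra-driver. The keyspace, table, and column names are my own illustration and may differ from the demo repository.

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class ClickStream(Model):
    __keyspace__ = 'demo'
    # Partition key: one partition per day, so queries for the last
    # 30/60/90 days only touch the matching day partitions.
    event_day = columns.Date(partition_key=True)
    # Clustering columns keep rows within a day unique and ordered.
    event_date = columns.DateTime(primary_key=True, clustering_order='DESC')
    click_stream_cookie_id = columns.Text(primary_key=True)
    event_name = columns.Text()
    item_id = columns.Text()
    url = columns.Text()
    client_ip = columns.Text()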
4. Writing data to Apache Cassandra:
I used the “cassandra-driver” library for writing the data to Apache Cassandra.
I especially liked the object mapper that comes with it.
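A minimal write path with the object mapper might look like the sketch below, reusing the hypothetical ClickStream model from the previous section. The host and keyspace are assumptions for a local Docker setup.

import datetime
from cassandra.cqlengine import connection
from cassandra.cqlengine.management import sync_table

# Register a connection and create the table if it does not exist yet.
connection.setup(['127.0.0.1'], default_keyspace='demo')
sync_table(ClickStream)

# Persist one decoded clickstream event (values taken from the demo below).
ClickStream.create(
    event_day=datetime.date.today(),
    event_date=datetime.datetime.utcnow(),
    click_stream_cookie_id='0:k1co3zug:yAGV1PjuaWIQEmZ1Yx~f8eHnB~Yr9Eqo',
    event_name='event',
    item_id='101321',
    url='http://localhost:8290/',
    client_ip='192.168.224.1',
)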
5. Demo
You can clone my repository from GitHub and then run my docker-compose file:
docker-compose up --build
Sending an event to Divolte:
Visit http://localhost:8290/ and send this snippet from the browser console:
divolte.signal('event', {"item_id": 101321, "event_date":3213213})
You can then see the log output in your console:
click_stream_consumer_1 | INFO:root:{'click_stream_cookie_id': '0:k1co3zug:yAGV1PjuaWIQEmZ1Yx~f8eHnB~Yr9Eqo', 'event_name': 'event', 'item_id': '101', 'event_date': 1571607632237, 'url': 'http://localhost:8290/', 'client_ip': '192.168.224.1'}
To connect to Apache Cassandra:
docker-compose exec cassandra /bin/bash
cqlsh localhost -ucassandra -pcassandra
cassandra@cqlsh> SELECT * FROM demo.click_stream;
6. What’s next?
So far, we have collected, transformed, and stored our clickstream data. In my next blog, I will try to explain how to transfer the data to Apache Spark, process it to build a recommendation engine, and manage the whole process with Apache Airflow.