How to store your clickstream data? (Part II)
In the first part of this blog series, I tried to explain how to collect and access clickstream data. In this part, I will try to explain how to store your clickstream data in your database.
Table of Contents:
- Reading data from Apache Kafka via Python.
- Transforming the Apache Kafka message.
- About the data design.
- Writing data to Apache Cassandra.
- Demo
- What’s next?
First, I set up a demo for it on Docker. You can clone my repo and play with it.
1. Reading data from Apache Kafka:
There are many different options for consuming data from Apache Kafka. I will use the “kafka-python” package.
You can find a sample in my demo in the “click_stream_consumer.py” file.
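For reference, a minimal consumer could look like the sketch below. The topic name (“divolte”) and the broker address are assumptions based on a typical Divolte setup, so adjust them to your own configuration.

from kafka import KafkaConsumer

# Minimal sketch, assuming Divolte publishes to a topic named "divolte"
# and Kafka is reachable on localhost:9092 (both are assumptions).
consumer = KafkaConsumer(
    'divolte',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    group_id='click_stream_consumer',
)

for msg in consumer:
    # msg.value holds the raw, Avro-encoded Divolte event
    print(msg.value)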
2. Transforming the Apache Kafka message:
In our case, we can’t use the clickstream data directly, because Divolte produces the data compressed with lz4. We defined the compression type in the Divolte configuration as below:
...
acks = 1
retries = 0
compression.type = lz4
max.in.flight.requests.per.connection = 1
client.id = divolte.collector_0
...
We need to use an Apache Avro schema file to decode the messages coming from Kafka. We already defined this schema when we configured Divolte, and we use the same “avro” file here to decode the clickstream data.
Here is sample code for decoding a message that is encoded with Apache Avro:
import io
import avro.io
import avro.schema

schema = avro.schema.Parse(open('PATH_OF_AVRO FILE').read())

def decode(msg):
    bytes_reader = io.BytesIO(msg.value)
    decoder = avro.io.BinaryDecoder(bytes_reader)
    reader = avro.io.DatumReader(schema)
    return reader.read(decoder)
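Putting it together with the consumer from the first section, each incoming message can be decoded before it is stored. This is only a sketch; the names are mine and may differ from “click_stream_consumer.py”.

import logging

for msg in consumer:
    event = decode(msg)   # helper defined above
    logging.info(event)   # a plain dict with the clickstream fields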
3. About the data design:
The choice of database and the design of the data are very important when you work with “big data”.
I preferred Apache Cassandra to store the data. I already tried to explain why I chose Apache Cassandra in my previous blog. You can take a look if you want:
https://medium.com/@mustafa.ileri/set-up-cassandra-node-cluster-for-production-with-python-972aae84a0a0
I learned two important things when I started to store clickstream data in Apache Cassandra.
- The primary key mechanism works completely differently in Apache Cassandra.
- The choice of partition key is very important for how the data is stored.
For example, you can use your partition key in a WHERE condition, but only with the equality operator; you can’t use the < or > operators on the partition key.
On the other hand, if you would like to query by a date range, the date needs to be part of the primary key, together with the other key columns.
I used the date as the partition key (and part of the primary key), because I need to fetch the data for the last 30, 60, or 90 days. The disadvantage of this design is that a new partition is created for each day.
You can get more info about Apache Cassandra primary keys from this blog:
https://www.datastax.com/blog/2016/02/most-important-thing-know-cassandra-data-modeling-primary-key
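To make the design above concrete, here is a sketch of how such a table could be modelled with cqlengine, the object mapper that ships with cassandra-driver. The keyspace, table, and column names are my own illustration and may differ from the demo repository.

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class ClickStream(Model):
    __keyspace__ = 'demo'
    # Partition key: one partition per day, so queries for the last
    # 30/60/90 days only touch the matching day partitions.
    event_day = columns.Date(partition_key=True)
    # Clustering columns keep rows within a day unique and ordered.
    event_date = columns.DateTime(primary_key=True, clustering_order='DESC')
    click_stream_cookie_id = columns.Text(primary_key=True)
    event_name = columns.Text()
    item_id = columns.Text()
    url = columns.Text()
    client_ip = columns.Text()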
4. Writing data to Apache Cassandra:
I used the “cassandra-driver” library for writing the data to Apache Cassandra.
I especially liked the object mapper that comes with it.
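A minimal write path with the object mapper might look like the sketch below, reusing the hypothetical ClickStream model from the previous section. The host and keyspace are assumptions for a local Docker setup.

import datetime
from cassandra.cqlengine import connection
from cassandra.cqlengine.management import sync_table

# Register a connection and create the table if it does not exist yet.
connection.setup(['127.0.0.1'], default_keyspace='demo')
sync_table(ClickStream)

# Persist one decoded clickstream event (values taken from the demo below).
ClickStream.create(
    event_day=datetime.date.today(),
    event_date=datetime.datetime.utcnow(),
    click_stream_cookie_id='0:k1co3zug:yAGV1PjuaWIQEmZ1Yx~f8eHnB~Yr9Eqo',
    event_name='event',
    item_id='101321',
    url='http://localhost:8290/',
    client_ip='192.168.224.1',
)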
5. Demo
You can clone my repository from GitHub and then run my docker-compose file:
docker-compose up --build
Sending an event to Divolte:
Visit http://localhost:8290/ and send this snippet from the browser console:
divolte.signal('event', {"item_id": 101321, "event_date":3213213})
You can then see the log output in your console:
click_stream_consumer_1 | INFO:root:{'click_stream_cookie_id': '0:k1co3zug:yAGV1PjuaWIQEmZ1Yx~f8eHnB~Yr9Eqo', 'event_name': 'event', 'item_id': '101', 'event_date': 1571607632237, 'url': 'http://localhost:8290/', 'client_ip': '192.168.224.1'}
To connect to Apache Cassandra:
docker-compose exec cassandra /bin/bash
cqlsh localhost -ucassandra -pcassandra
cassandra@cqlsh> SELECT * FROM demo.click_stream;
6. What’s next?
So far, we have collected, transformed, and stored our clickstream data. In my next blog, I will try to explain how to transfer the data to Apache Spark, process it to build a recommendation engine, and manage the whole process with Apache Airflow.