Setting up and Configuring Spark Cluster

Mustafa İleri
Nov 1, 2018


Apache Spark is an engine for analyzing big data, and it needs a lot of resources to do so. In this blog, I will show how to set up a Spark cluster, which involves installing a Spark master and the workers that depend on it.

Table of Contents:

  1. Installing Java.
  2. Download and Install Spark & Hadoop.
  3. SSH Configuration.
  4. Configuring Spark Master.
  5. Starting / Stopping Spark.
  6. Docker Compose Implementation (Bonus)

1. Installing Java (Master & Slaves)

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install openjdk-8-jdk

To check the java version, run this command:

java -version

The output should look like this:

openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1ubuntu0.16.04.1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
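If you also want the exact JDK path for the "JAVA_HOME" setting used later, this quick check prints it (on Ubuntu with OpenJDK 8 it typically resolves to /usr/lib/jvm/java-8-openjdk-amd64):

readlink -f $(which java)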

2. Download and Install Spark 2.3.1 and Hadoop 2.7 (Master & Slaves)

Download Spark 2.3.1

** I prefer version 2.3.1 of Spark because in the following days I will write a blog post about how to integrate Spark with "cassandra", and "cassandra-spark-connector" supports Spark 2.3.1 as its latest version.

wget https://www-us.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz

Extract the "tgz" file:

tar -xzvf spark-2.3.1-bin-hadoop2.7.tgz

Download Hadoop 2.7

wget https://www-us.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
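The Hadoop archive needs to be extracted as well, mirroring the Spark step above:

tar -xzvf hadoop-2.7.7.tar.gz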

Now we have to configure the ".bashrc" file to define the "HADOOP_HOME" variable. I generally prefer vim for editing files on the server; you can use any alternative you prefer.

export HADOOP_HOME="<YOUR HADOOP PATH>"
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH

By the way, I prefer to install Spark and Hadoop into the "/opt/" directory. In my case the configuration looks like this:

export HADOOP_HOME="/opt/hadoop-2.7.7"
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
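As a sketch of that layout (the /opt/ target is only my preference; adjust it to yours), you can move both extracted directories and reload ".bashrc":

sudo mv spark-2.3.1-bin-hadoop2.7 /opt/
sudo mv hadoop-2.7.7 /opt/
source ~/.bashrc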

At this point, Spark and Hadoop should be installed on your server.
You can check your installation with the "pyspark" command like this.

But first, go to your Spark installation directory.

cd spark-2.3.1-bin-hadoop2.7
./bin/pyspark
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/
Using Python version 2.7.12 (default, Dec 4 2017 14:50:18)
SparkSession available as 'spark'.
>>>

Now everything is OK: you have installed Spark and Hadoop successfully on your server.
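As an extra sanity check, you can run a tiny computation inside the PySpark shell (a minimal example, not part of the original setup steps):

>>> sc.parallelize(range(100)).count()
100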

3. SSH Configuration (Only Master)

This step is not strictly necessary, but it is very useful for starting and stopping the master and all workers with a single command.

First, we generate a public SSH key on the master:

ssh-keygen -t rsa -P ""

Copy the content of your public SSH key (~/.ssh/id_rsa.pub) to .ssh/authorized_keys on each of your workers.
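If password authentication is still enabled on the workers, "ssh-copy-id" can do this step for you; the user and host names below are placeholders, so replace them with your own:

ssh-copy-id ubuntu@slave_01
ssh-copy-id ubuntu@slave_02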

Make sure that you can access the slaves from the master via SSH:

ssh slave_01
ssh slave_02

4. Configuring spark-env.sh and slaves files (Only Master)

Now, we define "SPARK_MASTER_HOST" and "JAVA_HOME".
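If "spark-env.sh" and "slaves" do not exist yet, Spark ships templates you can copy (the path assumes the /opt installation used above):

cd /opt/spark-2.3.1-bin-hadoop2.7/conf
cp spark-env.sh.template spark-env.sh
cp slaves.template slaves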

Add the lines below to <SPARK_INSTALLATION_DIRECTORY>/conf/spark-env.sh:

export SPARK_MASTER_HOST=<SPARK_MASTER_HOST_OR_URL>
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Then, we define the slaves in <SPARK_INSTALLATION_DIRECTORY>/conf/slaves like this:

<HOST_OR_URL_SLAVE_01>
<HOST_OR_URL_SLAVE_02>
<HOST_OR_URL_SLAVE_03>

** Warning: don't forget to modify these values according to your installation.
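For example, with a master at 10.0.0.1 and slaves at 10.0.0.2-10.0.0.4 (purely hypothetical addresses), the two files would look like this:

# conf/spark-env.sh
export SPARK_MASTER_HOST=10.0.0.1
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# conf/slaves
10.0.0.2
10.0.0.3
10.0.0.4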

5. Starting or Stopping Spark (Only Master)

So far, we have installed Spark on both the master and the slaves, configured SSH so that the master can access the slaves directly, and defined the master host and the slave hosts in the Spark configuration on the master.

Now we can start or stop the whole cluster with a single command.

To start it, run this command from your Spark directory:

./sbin/start-all.sh

If it works correctly, you will get output like this:

starting org.apache.spark.deploy.master.Master, logging to /opt/spark-2.3.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-ip-X-X-X-X.out
slave_01: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-2.3.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-X-X-X-X.out
slave_02: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-2.3.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-X-X-X-X.out
slave_03: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-2.3.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-X-X-X-X.out
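The counterpart script stops the master and all workers in one go:

./sbin/stop-all.sh

To check that the cluster actually accepts work, you can submit the Pi example that ships with Spark; this is just a quick sanity test, and the master URL below is a placeholder (the standalone master listens on port 7077 by default):

./bin/spark-submit --master spark://<SPARK_MASTER_HOST_OR_URL>:7077 examples/src/main/python/pi.py 10

You can also open the master's web UI on port 8080 to see the registered workers.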

6. Docker Compose Implementation

I have a sample Spark cluster infrastructure on GitHub. You can use it if you want:

➜ git clone git@github.com:mustafaileri/spark-cluster.git
➜ cd spark-cluster
➜ docker-compose up --build

If everything is working correctly, you can now access the web user interface at http://localhost:8080.

As you can see, there is one Spark master server and one slave.

Now we can increase the number of workers by scaling up the slave service.

➜  spark-cluster git:(master) docker-compose scale slave=3
Starting spark-cluster_slave_1 ... done
Creating spark-cluster_slave_2 ... done
Creating spark-cluster_slave_3 ... done

Now we have 3 workers on a single master.
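If you want to run a quick job against the dockerized cluster, one option is to submit it from inside the master container. The service name "master" and the Spark paths below are assumptions about this particular repository, so adjust them to whatever "docker-compose ps" shows:

➜ docker-compose exec master ./bin/spark-submit --master spark://master:7077 examples/src/main/python/pi.py 10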

Next, I will try to write about how to run a "jupyter notebook" on a remote host.

Thanks for reading.
