Setting up and Configuring Spark Cluster

Mustafa İleri
Nov 1, 2018


Apache Spark is an engine for analyzing big data, and it needs a lot of resources to do so. In this blog, I will show how to set up a Spark cluster, which involves installing a Spark master and the workers that depend on it.

Table of Contents:

  1. Installing Java.
  2. Download and Install Spark & Hadoop.
  3. SSH Configuration.
  4. Configuring Spark Master.
  5. Starting / Stopping Spark.
  6. Docker Compose Implementation (Bonus)

1. Installing Java (Master & Slaves)

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install openjdk-8-jdk

To check the java version, run this command:

java -version

The output should look like this:

openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1ubuntu0.16.04.1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
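If you also want the exact JDK path for the "JAVA_HOME" setting used later, this quick check prints it (on Ubuntu with OpenJDK 8 it typically resolves to /usr/lib/jvm/java-8-openjdk-amd64):

readlink -f $(which java)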

2. Download and Install Spark 2.3.1 and Hadoop 2.7 (Master & Slaves)

Download Spark 2.3.1

** I prefer version 2.3.1 of Spark because in the following days I will write a blog post about how to integrate Spark with "cassandra", and "cassandra-spark-connector" supports Spark 2.3.1 as its latest version.

wget https://www-us.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz

Extract the "tgz" file:

tar -xzvf spark-2.3.1-bin-hadoop2.7.tgz

Download Hadoop 2.7

wget https://www-us.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
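The Hadoop archive needs to be extracted as well, mirroring the Spark step above:

tar -xzvf hadoop-2.7.7.tar.gz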

Now we have to configure the ".bashrc" file to define the "HADOOP_HOME" variable. I generally prefer vim for editing files on the server; you can use any alternative you prefer.

export HADOOP_HOME="<YOUR HADOOP PATH>"
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH

By the way, I prefer to install Spark and Hadoop into the "/opt/" directory. In my case the configuration looks like this:

export HADOOP_HOME="/opt/hadoop-2.7.7"
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
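As a sketch of that layout (the /opt/ target is only my preference; adjust it to yours), you can move both extracted directories and reload ".bashrc":

sudo mv spark-2.3.1-bin-hadoop2.7 /opt/
sudo mv hadoop-2.7.7 /opt/
source ~/.bashrc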

At this point, Spark and Hadoop should be installed on your server.
You can check your installation with the "pyspark" command like this.

But first, go to your Spark installation directory.

cd spark-2.3.1-bin-hadoop2.7
./bin/pyspark
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/
Using Python version 2.7.12 (default, Dec 4 2017 14:50:18)
SparkSession available as 'spark'.
>>>

Now everything is OK: you have installed Spark and Hadoop successfully on your server.
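As an extra sanity check, you can run a tiny computation inside the PySpark shell (a minimal example, not part of the original setup steps):

>>> sc.parallelize(range(100)).count()
100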

3. SSH Configuration (Only Master)

This step is not strictly necessary, but it is very useful for starting and stopping the master and all workers with a single command.

First, we generate a public SSH key on the master:

ssh-keygen -t rsa -P ""

Copy the content of your public SSH key (~/.ssh/id_rsa.pub) to .ssh/authorized_keys on each of your workers.
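If password authentication is still enabled on the workers, "ssh-copy-id" can do this step for you; the user and host names below are placeholders, so replace them with your own:

ssh-copy-id ubuntu@slave_01
ssh-copy-id ubuntu@slave_02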

Make sure that you can access the slaves from the master via SSH:

ssh slave_01
ssh slave_02

4. Configuring spark-env.sh and slaves files (Only Master)

Now, we define "SPARK_MASTER_HOST" and "JAVA_HOME".
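If "spark-env.sh" and "slaves" do not exist yet, Spark ships templates you can copy (the path assumes the /opt installation used above):

cd /opt/spark-2.3.1-bin-hadoop2.7/conf
cp spark-env.sh.template spark-env.sh
cp slaves.template slaves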

Add the lines below to <SPARK_INSTALLATION_DIRECTORY>/conf/spark-env.sh:

export SPARK_MASTER_HOST=<SPARK_MASTER_HOST_OR_URL>
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Then, we define the slaves in <SPARK_INSTALLATION_DIRECTORY>/conf/slaves like this:

<HOST_OR_URL_SLAVE_01>
<HOST_OR_URL_SLAVE_02>
<HOST_OR_URL_SLAVE_03>

** Warning: don't forget to modify these values according to your installation.
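For example, with a master at 10.0.0.1 and slaves at 10.0.0.2-10.0.0.4 (purely hypothetical addresses), the two files would look like this:

# conf/spark-env.sh
export SPARK_MASTER_HOST=10.0.0.1
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# conf/slaves
10.0.0.2
10.0.0.3
10.0.0.4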

5. Starting or Stopping Spark (Only Master)

So far, we have installed Spark on both the master and the slaves, configured SSH so that the master can access the slaves directly, and defined the master host and the slave hosts in the Spark configuration on the master.

Now we can start or stop the whole cluster with a single command.

To start it, run this command from your Spark directory:

./sbin/start-all.sh

If it works correctly, you will get output like this:

starting org.apache.spark.deploy.master.Master, logging to /opt/spark-2.3.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-ip-X-X-X-X.out
slave_01: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-2.3.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-X-X-X-X.out
slave_02: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-2.3.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-X-X-X-X.out
slave_03: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-2.3.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-X-X-X-X.out
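The counterpart script stops the master and all workers in one go:

./sbin/stop-all.sh

To check that the cluster actually accepts work, you can submit the Pi example that ships with Spark; this is just a quick sanity test, and the master URL below is a placeholder (the standalone master listens on port 7077 by default):

./bin/spark-submit --master spark://<SPARK_MASTER_HOST_OR_URL>:7077 examples/src/main/python/pi.py 10

You can also open the master's web UI on port 8080 to see the registered workers.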

6. Docker Compose Implementation

I have a sample Spark cluster infrastructure on GitHub. You can use it if you want:

➜ git clone git@github.com:mustafaileri/spark-cluster.git
➜ cd spark-cluster
➜ docker-compose up --build

If everything is working correctly, you can now access the web user interface at http://localhost:8080.

As you can see, there is one Spark master server and one slave.

Now we can increase the number of workers by scaling up the slave service.

➜  spark-cluster git:(master) docker-compose scale slave=3
Starting spark-cluster_slave_1 ... done
Creating spark-cluster_slave_2 ... done
Creating spark-cluster_slave_3 ... done

Now we have 3 workers on a single master.
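If you want to run a quick job against the dockerized cluster, one option is to submit it from inside the master container. The service name "master" and the Spark paths below are assumptions about this particular repository, so adjust them to whatever "docker-compose ps" shows:

➜ docker-compose exec master ./bin/spark-submit --master spark://master:7077 examples/src/main/python/pi.py 10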

Next, I will try to write about how to run a "jupyter notebook" on a remote host.

Thanks for reading.
