Setting Up and Configuring a Spark Cluster
Apache Spark is an engine for analyzing big data, and it needs a lot of resources to do so. In this blog post, I will show how to set up a Spark cluster, which involves installing a Spark master and the workers that depend on it.
Table of Contents:
- Installing Java.
- Download and Install Spark & Hadoop.
- SSH Configuration.
- Configuring Spark Master.
- Starting / Stopping Spark.
- Docker Compose Implementation (Bonus)
1. Installing Java (Master & Slaves)
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install openjdk-8-jdk
To check the Java version, run this command:
java -version
The output should look something like this:
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1ubuntu0.16.04.1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
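We will also need "JAVA_HOME" later on. If you are not sure where the JDK was installed, you can resolve it from the java binary; on Ubuntu it usually ends up under /usr/lib/jvm/java-8-openjdk-amd64.
readlink -f "$(which java)"
# e.g. /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java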
2. Downloading and Installing Spark 2.3.1 and Hadoop 2.7 (Master & Slaves)
Download Spark 2.3.1
** I prefer version 2.3.1 of Spark because in the following days I will write a blog post about how to integrate Spark with Cassandra, and the "spark-cassandra-connector" supports Spark 2.3.1 as its latest version.
wget https://www-us.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
Extract the "tgz" file into the target directory:
tar -xzvf spark-2.3.1-bin-hadoop2.7.tgz
Download Hadoop 2.7
wget https://www-us.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
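Extract the Hadoop archive as well. The commands below are just a sketch of my own layout; I move both directories to "/opt/", but you can keep them anywhere as long as the environment variables in the next step point to the right place.
tar -xzvf hadoop-2.7.7.tar.gz
# optional: keep both installations under /opt/ (my preference)
sudo mv spark-2.3.1-bin-hadoop2.7 hadoop-2.7.7 /opt/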
Now we have to edit the ".bashrc" file to define the "HADOOP_HOME" variable. I generally prefer vim for editing on a server, but you can use any editor you like.
export HADOOP_HOME="<YOUR HADOOP PATH>"
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
By the way, I prefer to install Spark and Hadoop under the "/opt/" directory, so in my case the variables look like this:
export HADOOP_HOME="/opt/hadoop-2.7.7"
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
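After editing ".bashrc", reload it in your current shell and check that the variable is set:
source ~/.bashrc
echo $HADOOP_HOME
# should print /opt/hadoop-2.7.7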
At this point, Spark and Hadoop should be installed on your server.
You can check the installation with the "pyspark" shell. First, go to your Spark installation directory:
cd spark-2.3.1-bin-hadoop2.7
./bin/pyspark
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 2.7.12 (default, Dec 4 2017 14:50:18)
SparkSession available as 'spark'.
>>>
If you see this prompt, everything is fine: you have installed Spark and Hadoop successfully on your server.
3. SSH Configuration (Only Master)
This step is not strictly necessary, but it is very useful for starting and stopping the master and workers with a single command.
First, generate an SSH key pair on the master:
ssh-keygen -t rsa -P ""
Copy the content of your public key (~/.ssh/id_rsa.pub) to .ssh/authorized_keys on each worker.
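If the workers still accept password logins, "ssh-copy-id" is a convenient way to do this; a minimal sketch, assuming your workers are reachable as slave_01 and slave_02:
ssh-copy-id -i ~/.ssh/id_rsa.pub <YOUR_USER>@slave_01
ssh-copy-id -i ~/.ssh/id_rsa.pub <YOUR_USER>@slave_02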
You can then verify that the master can access the slaves via SSH:
ssh slave_01
ssh slave_02
4. Configuring spark-env.sh and slaves files (Only Master)
Now, we define "SPARK_MASTER_HOST" and "JAVA_HOME".
Add the lines below to <SPARK_INSTALLATION_DIRECTORY>/conf/spark-env.sh:
export SPARK_MASTER_HOST=<SPARK_MASTER_HOST_OR_URL>
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Then, we define the slaves in <SPARK_INSTALLATION_DIRECTORY>/conf/slaves like this:
<HOST_OR_URL_SLAVE_01>
<HOST_OR_URL_SLAVE_02>
<HOST_OR_URL_SLAVE_03>
** Warning: don't forget to modify these values according to your own installation.
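For example, a filled-in configuration could look like this (the addresses below are purely hypothetical):
# conf/spark-env.sh
export SPARK_MASTER_HOST=192.168.1.10
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# conf/slaves
192.168.1.11
192.168.1.12
192.168.1.13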
5. Starting or Stopping Spark (Only Master)
So far, we have installed Spark on both the master and the slaves, configured SSH so that the master can reach the slaves directly, and defined the master and slave hosts in the Spark configuration on the master.
Now we can start or stop the whole cluster with a single command.
To start it, run this command in your Spark folder:
./sbin/start-all.sh
If it works correctly, you will get output like this:
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-2.3.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-ip-X-X-X-X.out
slave_01: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-2.3.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-X-X-X-X.out
slave_02: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-2.3.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-X-X-X-X.out
slave_03: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-2.3.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-X-X-X-X.out
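To stop everything again, the matching script is "./sbin/stop-all.sh". You can also check the master's web UI on port 8080 and, if you like, submit the SparkPi example that ships with Spark as a smoke test. The snippet below is a sketch that assumes the default master port 7077 and the Scala 2.11 examples jar bundled with the 2.3.1 binary distribution:
# stop the master and every worker listed in conf/slaves
./sbin/stop-all.sh
# start it again and run the bundled SparkPi example against the cluster
./sbin/start-all.sh
./bin/spark-submit \
  --master spark://<SPARK_MASTER_HOST_OR_URL>:7077 \
  --class org.apache.spark.examples.SparkPi \
  ./examples/jars/spark-examples_2.11-2.3.1.jar 100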
6. Docker Compose Implementation
I have a sample Spark cluster setup on GitHub. You can use it if you want:
➜ git clone git@github.com:mustafaileri/spark-cluster.git
➜ cd spark-cluster
➜ docker-compose up --build
If everything is working correctly, you can now access the web user interface at http://localhost:8080.
As you can see, there is one Spark master and one slave.
Now we can increase the number of workers by scaling up the slave service:
➜ spark-cluster git:(master) docker-compose scale slave=3
Starting spark-cluster_slave_1 ... done
Creating spark-cluster_slave_2 ... done
Creating spark-cluster_slave_3 ... done
Now we have 3 workers attached to a single master.
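To confirm that the new workers actually joined, you can list the running containers and refresh the web UI at http://localhost:8080; the extra workers should show up there as well.
➜ spark-cluster git:(master) docker-compose ps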
In a future post, I will try to write about how to run a "jupyter notebook" on a remote host.
Thanks for reading.