Machine Learning & Big Data Blog

How to Use Jupyter Notebooks with Apache Spark

2 minute read
Walker Rowe

In this article, we explain how to set up PySpark for your Jupyter notebook. This setup lets you write Python code to work with Spark in Jupyter.

Many programmers use Jupyter, formerly known as IPython Notebook, to write Python code because it's easy to use and supports inline graphics. Unlike Zeppelin notebooks, Jupyter requires some initial configuration before you can use it with Apache Spark.

An important note on Spark clusters: this setup runs your code on a single Spark instance, not across a cluster. Jupyter sends your jobs to whatever Spark instance you point it at, but it will not run the code in distributed mode. That only matters for very large data sets, in which case you would use spark-submit to run your code on the cluster.

Now, let's get started setting up PySpark for your Jupyter notebook.

Setting PySpark and Jupyter environment variables

First, set these environment variables. They tell PySpark to use Jupyter Notebook as its driver program, so your PySpark session runs in the Jupyter browser.

Below, I use an IP address that's routable on my internal network, so that I can reach my Jupyter notebook from the public internet. I add --no-browser so that Jupyter won't try to open a browser on the server. If you only want to run this on your laptop, you can use the loopback address (127.0.0.1) instead.

export SPARK_HOME='/usr/share/spark/spark-3.0.0-preview-bin-hadoop2.7'
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_DRIVER_PYTHON="jupyter" 
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8889" 
export SPARK_LOCAL_IP="172.31.46.15"

Now run PySpark. You will get output like the screen below. Paste the URL containing the token into your browser.

(In the example output, parisx is the server's internal hostname. I would replace it with the public address, such as mydomain.com:8889/?token=6cfc363cf7dab1f2e1f2c73b37113ef496155595b29baac5.)

[I 17:50:27.046 NotebookApp] Serving notebooks from local directory: /home/ubuntu
[I 17:50:27.046 NotebookApp] The Jupyter Notebook is running at:
[I 17:50:27.046 NotebookApp] http://parisx:8889/?token=6cfc363cf7dab1f2e1f2c73b37113ef496155595b29baac5
[I 17:50:27.046 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 17:50:27.050 NotebookApp]

    To access the notebook, open this file in a browser:
        file:///run/user/1000/jupyter/nbserver-30498-open.html
    Or copy and paste one of these URLs:
        http://parisx:8889/?token=6cfc363cf7dab1f2e1f2c73b37113ef496155595b29baac5
[W 17:50:28.112 NotebookApp] 404 GET /api/kernels/a769e52d-eaf2-49f7-b79b-4fe588a7bdd0/channels?session_id=fbab46a7332344e48d3052f36f6e589f (71.12.95.23): Kernel does not exist: a769e52d-eaf2-49f7-b79b-4fe588a7bdd0
[W 17:50:28.137 NotebookApp] 404 GET /api/kernels/a769e52d-eaf2-49f7-b79b-4fe588a7bdd0/channels?session_id=fbab46a7332344e48d3052f36f6e589f (71.12.95.23) 30.85ms referer=None
[W 17:50:33.313 NotebookApp] Replacing stale connection: a769e52d-eaf2-49f7-b79b-4fe588a7bdd0:fbab46a7332344e48d3052f36f6e589f

If you want the notebook to keep running after you disconnect, use nohup pyspark & to run it as a background job. Then cat the nohup.out file to find the token to use.
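Once the notebook is open in your browser, a quick sanity check confirms that the kernel can reach Spark. This is a minimal sketch: getOrCreate() starts a SparkContext if one is not already running, and printing the version proves the connection works.

import pyspark

# Start (or reuse) a SparkContext and print the Spark version to confirm
# the notebook kernel is talking to Spark.
sc = pyspark.SparkContext.getOrCreate()
print(sc.version)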

The code, explained

Below is sample code to prove that it works. Unlike the PySpark shell, when you use Jupyter you have to create the SparkContext and SQLContext yourself, as shown below. Using getOrCreate() means that if PySpark has already started a SparkContext, it is reused rather than created a second time.

from pyspark.sql.types import StructType, StructField, FloatType, BooleanType
from pyspark.sql.types import DoubleType, IntegerType, StringType
import pyspark
from pyspark import SQLContext

# Reuse the SparkContext that PySpark started, then wrap it in a SQLContext.
conf = pyspark.SparkConf()
sc = pyspark.SparkContext.getOrCreate(conf=conf)
sqlcontext = SQLContext(sc)

# Define the schema: an integer sales figure and the salesperson's name.
schema = StructType([
    StructField("sales", IntegerType(), True),
    StructField("sales person", StringType(), True)
])

# A couple of rows of sample data.
data = [(10, 'Walker'),
        (20, 'Stepher')]

df = sqlcontext.createDataFrame(data, schema=schema)

Lastly, display the data.

df.show()
+-----+------------+
|sales|sales person|
+-----+------------+
|   10|      Walker|
|   20|     Stepher|
+-----+------------+
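Because the DataFrame is backed by the SQLContext, you can also query it with SQL. The sketch below assumes the df and sqlcontext from the example above are still in scope; the view name salespeople is just an illustrative choice. The query selects rows with sales greater than 15, which here returns only Stepher's row.

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("salespeople")

# Query the view through the same SQLContext; the backticks are needed
# because the column name contains a space.
sqlcontext.sql("SELECT `sales person`, sales FROM salespeople WHERE sales > 15").show()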



These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.



About the author

Walker Rowe

Walker Rowe is an American freelance tech writer and programmer living in Cyprus. He writes tutorials on analytics and big data and specializes in documenting SDKs and APIs. He is the founder of the Hypatia Academy Cyprus, an online school that teaches secondary school children programming.