Machine Learning & Big Data Blog

Basics of Graphing Streaming Big Data

3 minute read
Walker Rowe

Imagine creating a live chart that updates as data flows in. With it you could watch currency fluctuations, streaming IoT data, application performance, cybersecurity events, or other data in real time.

It is not so hard to write Spark Streaming code; we give an example below. But creating graphs any more elaborate than Zeppelin's simple SQL charts is quite complex. The problem is you need to know either Matplotlib (which works with Python) or Highcharts. Highcharts requires that you know AngularJS, a JavaScript framework for building web pages.

To make live charts using JavaScript you need to add watch variables and pass data to them, which the Apache Zeppelin documentation explains. Zeppelin supports that with Spark and Scala; it does not yet support live charts with any language other than Scala.

Below we show how to pass variables to AngularJS.

Matplotlib is more of a graphing package for data science and Python programmers. It does not support live updates out of the box. But, on the positive side, it is not complicated. Creating charts by yourself in plain JavaScript would be quite complex, which is why you need a framework like Highcharts.

Highcharts is free for non-commercial use. You could also look at Google Charts, whose interface for the R programming language is called googleVis. Google Charts, however, is designed for Google's own cloud and Google Bigtable rather than open source Apache Spark.

One thing that makes graphing difficult in any language is understanding the different kinds of charts; there are dozens. You can learn something about that by studying the design principles on the Highcharts website.

Install Zeppelin

So let's get going. Here we give the simplest possible example of passing data to a web page (i.e., AngularJS). Then we give the simplest possible example of Spark Streaming.

First, you need to install Zeppelin. The easiest way to do that is to use Docker:

docker pull dylanmei/zeppelin
docker run --rm -p 8080:8080 dylanmei/zeppelin

Binding

Zeppelin lets you pass variables to Angular using z-commands, such as z.put(), z.run(), and z.angularBind(). Here we give the simplest possible example. (The problem with most examples on the internet is that they are too complicated; no one starts with a very simple example that you can easily understand. So we do.)

Below we make an Array of one element. Then we create an RDD from it and loop over it once to bind the value 1 to the Angular variable "1" used in the HTML table shown below.

val data = Array(1)
val distData = sc.parallelize(data)
// collect() brings the data back to the driver, where the z
// (ZeppelinContext) object is available. Note that map() alone is
// lazy, so a bind inside map() would never actually run.
distData.collect().foreach(l => z.angularBind(l.toString, 1))

This is an HTML table. We use %angular to tell Zeppelin that this paragraph is AngularJS/HTML and not Spark code. Note that we put the name of the variable we want to pass data to in double curly braces {{}}. We do not actually use any JavaScript, just simple HTML.

%angular
<html>
<table>
<tr>
<td>value={{1}}</td>
</tr>
</table>
</html>

So when you run that paragraph along with the Spark Scala code, it will output:

value=1

Streaming code

Below we give an example of Spark Streaming code. In part II of this blog post we will show how to make a graph from this data using Highcharts. For now, just get a feel for how to write streaming code. We also wrote another, more complex example of streaming Twitter tweets here.

Know that when you run streaming code in Zeppelin it will tie up your browser, since the stream runs indefinitely. So you have to kill the Docker process like this:

docker stop $(docker ps -aq)

If you kill it any other way you will get errors, because you would be trying to run two SparkContexts at the same time.

What this example does is read from port 80 on your laptop, i.e., web page traffic. If you were to dump that as text you could see data packets. Here we just create a DStream and use the print() method to echo some of that; it does not show the whole packet.

To prove that you have data coming into port 80 on your laptop, you could install nmap and then run:

while true; do nmap --packet-trace -A localhost -p 80; sleep 1; done

Here is the code.

import org.apache.spark.streaming._

// Create a streaming context that collects data in 5-second batches.
val ssc = new StreamingContext(sc, Seconds(5))
// Read text from a socket connection to localhost on port 80.
val lines = ssc.socketTextStream("localhost", 80)
// Print the first few elements of each batch to the notebook output.
lines.print()
ssc.start()
ssc.awaitTermination()

It should output something like this in a continuous loop:

Time: 1499497850000 ms
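If you prefer not to kill the container every time, one alternative is to block for a fixed period instead of forever. This is just a sketch, assuming Zeppelin's predefined SparkContext sc, and a 30-second window chosen for illustration:

```scala
import org.apache.spark.streaming._

val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 80)
lines.print()
ssc.start()

// Block for at most 30 seconds instead of forever, then stop the
// streaming context while keeping the SparkContext alive so that
// Zeppelin can keep using it.
ssc.awaitTerminationOrTimeout(30000)
ssc.stop(stopSparkContext = false)
```

Because stopSparkContext is false, re-running the paragraph does not hit the two-SparkContexts error described above.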

Next steps

The next step would be to make AngularJS variables listen for updates from the Spark-bound variables and then draw them with Highcharts, the JavaScript charting framework. That is far more complicated than what we just showed.
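As a rough sketch of that direction, the streaming example above can push a per-batch number into an Angular variable. This assumes Zeppelin's Spark interpreter, where sc and the z (ZeppelinContext) object are predefined; the variable name linesPerBatch is our own invention:

```scala
import org.apache.spark.streaming._

val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 80)

// foreachRDD runs on the driver, where the z object is available
// (it is not available in executor-side code). For each batch,
// bind the number of lines received to an Angular variable.
lines.foreachRDD { rdd =>
  z.angularBind("linesPerBatch", rdd.count())
}

ssc.start()
```

An %angular paragraph containing {{linesPerBatch}} would then update as each batch arrives; a chart library could watch the same variable.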





About the author

Walker Rowe

Walker Rowe is an American freelance tech writer and programmer living in Cyprus. He writes tutorials on analytics and big data and specializes in documenting SDKs and APIs. He is the founder of the Hypatia Academy Cyprus, an online school to teach secondary school children programming. You can find Walker here and here.