
An overview of Apache Spark RDD & Java 8 Streams

Introduction

Ivan Hristov



java 8 streams api apache spark resilient distributed dataset


Posted by Ivan Hristov on .

Abstract: Every day we make conscious or subconscious decisions based on information we receive from different sources. We analyse data in our own way without necessarily being certified data analysts. Broadly speaking, data crunching can happen to any one of us at any time of the day. If you happen to be a software engineer, chances are rather high that at times you would like to automatise data extraction and processing. In this post we are going to take a look at two quite different tools that can help you with data analysis - Apache Spark & Java Development Kit (JDK) 8. In particular, we will examine one of their components each - the Resilient Distributed Dataset (RDD) and Java Streams respectively.

Goal: Learn about Apache Spark RDD and Java 8 Streams

Acknowledgement: My gratitude goes to the open source community and to the researchers at the UC Berkeley AMPLab, thanks to whom we have Spark today.

Code: The examples associated with this article can be found at GitHub.

Before we go any further, and regardless of your background, we need to make one thing crystal clear: Spark and the JDK are very different platforms. Besides having their own implementations and historical backgrounds, they also have different overall purposes. Spark is a powerful data analysis platform, whereas the JDK helps you develop applications of many kinds (not only for data analysis and, to be precise, not only in Java). In this article we will focus on Spark's resilient distributed dataset (RDD) and the JDK 8 Stream API. We will talk about some of their dissimilarities and common characteristics. There will be indications of when you may consider using Spark over JDK Streams or vice versa, but it is equally important to note that you can use the JDK and Spark together.

A short stop at the home page of Apache Spark project (or in the book Learning Spark; Chapter 1: What Is Apache Spark?) reveals that what we have in our hands is a platform composed of several components and designed to facilitate fast data analysis on a cluster of machines.

Digging deeper, we find that it has been developed as a research project at the UC Berkeley AMPLab in 2009. A year later it became an open-source project (you can find it on GitHub) and in 2013 joined the Apache Software Foundation.

There are books, articles and a lot of documentation describing different aspects and components of Spark. Here we are going to take a closer look at the core concept of Apache Spark, namely the resilient distributed dataset (RDD for short).

Now, let us look at the Java Development Kit (JDK). As its name suggests, the JDK is here to help you with any development you may wish to do, mainly (but not only) in Java. The first version came at the beginning of 1996, and about two years later it brought the Collections framework. Thanks to this addition, developers were able to go beyond playing with simple data structures. On 18 March 2014 Java 8 was released, and arguably the most important feature it brought is the Streams API (or simply streams).

For your convenience, I have reprinted some definitions you may find for Java 8 Streams:

  1. "Classes in the new java.util.stream package provide a Stream API to support functional-style operations on streams of elements. The Stream API is integrated into the Collections API, which enables bulk operations on collections, such as sequential or parallel map-reduce transformations." Source: What's New in JDK 8

  2. "a sequence of elements from a source that supports aggregate operations" Source: Processing Data with Java SE 8 Streams, Part 1

  3. "A stream is a sequence of elements. Unlike a collection, it is not a data structure that stores elements. Instead, a stream carries values from a source through a pipeline." Source: Java Documentation : Aggregate Operations

  4. "A Stream is a free flowing sequence of elements. They do not hold any storage as that responsibility lies with collections such as arrays, lists and sets. Every stream starts with a source of data, sets up a pipeline, processes the elements through a pipeline and finishes with a terminal operation. They allow us to parallelise the load that comes with heavy operations without having to write any parallel code. A new package java.util.stream was introduced in Java 8 to deal with this feature." Source: Java 8 streams API and parallelism : What are streams?

Now, compare these with the following definitions for Resilient Distributed Datasets (RDDs):

"The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel." Source: The Docs

and here is what the scientific community has to say:

"In this paper, we propose a new abstraction called resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications. RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators." Source: Resilient Distributed Datasets

... and also:

"Formally, an RDD is a read-only, partitioned collection of records. ... RDDs do not need to be materialized at all times. Instead, an RDD has enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage." Source: Resilient Distributed Datasets

... and as a conclusion:

"We have presented resilient distributed datasets (RDDs), an efficient, general-purpose and fault-tolerant abstraction
for sharing data in cluster applications."
Source: Resilient Distributed Datasets

The first thing that stands out is probably the term cluster, which we see mentioned on several occasions. It is quite an important term, since it brings a whole new dimension to our workspace, a dimension that comes hand-in-hand with the problem of fault tolerance (luckily for us, the authors of Spark have already thought about it). When we use a Java Stream we don't usually think of a cluster, and that's OK: not every problem needs a cluster to be solved. Spark, on the other hand, is designed from the ground up to operate on a cluster of machines while remaining general-purpose.

Reflecting on all this should give you some hints on when to consider using Spark and when to stay with Java 8 Streams. Hint: the moment you start scratching your head over how to distribute your streams is a good time to step back and ask yourself whether it would be appropriate to switch to Spark. That being said, you should know that you can run Spark on your local machine and play with it. However, in order to unleash the full power of Spark, you should give it problems that require fast analysis of large datasets.

Moving forward, we come across the parallel idiom, which is also stressed several times. Both RDDs and Streams are designed to facilitate parallel computation. In addition, Spark will help you distribute both data and computation over the cluster, whereas the JDK cannot do this for you, at least not yet on its own. Who knows, maybe one day we will see similar functionality within the JDK.
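On a single machine, the Streams side of that parallelism is one method call away. A small sketch (class and method names are my own, for illustration):

```java
import java.util.stream.IntStream;

public class ParallelSum {

    // Sum of squares 1^2 + 2^2 + ... + n^2, computed in parallel on local CPU cores.
    static long sumOfSquares(int n) {
        return IntStream.rangeClosed(1, n)
                .parallel()                 // splits the work across the ForkJoin common pool
                .mapToLong(i -> (long) i * i)
                .sum();
    }
}
```

The single `parallel()` call is the whole "configuration"; contrast that with Spark, which parallelises across machines but needs a context and a cluster (or local) master to do so.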

For one last time, go back to the definitions and re-read them. What you will notice is that both use the term collection. You probably know or have heard of the Java collections (Vector, Set, List, Queue). That is why it is important to be aware that neither an RDD nor a Stream is actually a Java collection. Both of them are abstract notions of a collection of elements, yet neither of them needs to hold any elements at all. That is their internal beauty: you can reuse them over different "materialised" collections of elements.
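To see that a Stream need not be backed by stored elements at all, consider a conceptually infinite stream (a small sketch of mine, with illustrative names):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class NoStorageDemo {

    // The first n even numbers, drawn from a conceptually infinite stream.
    static List<Integer> firstEvens(int n) {
        return Stream.iterate(0, i -> i + 2) // no backing collection exists anywhere
                .limit(n)                    // laziness is what makes this finite
                .collect(Collectors.toList());
    }
}
```

The stream carries values from a generator function; only the final `collect` materialises anything.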

Going beyond definitions, both concepts - the RDD and the Stream have two types of operations that we can perform on them. For RDD these are called:

  • transformation operations - transform an RDD to another RDD
  • action operations - compute a result from an RDD

whereas for a stream we can use:

  • intermediate operations - transform a Stream to another Stream
  • terminal operations - compute a result from a Stream

I don't think you will be much surprised to learn that in both platforms the intermediate (respectively, transformation) operations are "lazy", while the terminal (respectively, action) operations are "eager" (an idea inspired by lambda calculus and functional programming).
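You can observe this laziness directly by counting how many times an intermediate operation actually runs. A sketch of my own (the class and method names are illustrative):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyDemo {

    // Returns {predicate calls before the terminal op, predicate calls after it}.
    static int[] probe() {
        AtomicInteger calls = new AtomicInteger();
        Stream<String> pipeline = List.of("a", "bb", "ccc").stream()
                .filter(w -> { calls.incrementAndGet(); return w.length() > 1; });

        int before = calls.get(); // still 0: filter is lazy, nothing has run yet
        pipeline.count();         // the terminal operation pulls elements through
        int after = calls.get();  // now 3: the predicate ran once per element

        return new int[]{before, after};
    }
}
```

The same holds for Spark: a chain of transformations builds only a lineage, and nothing is computed until an action such as `collect()` or `count()` is invoked.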

That, however, is not the only architectural design common to both subjects of our evaluation. As you may have noticed, the pipes & filters paradigm is embedded in the internal workings of both structures. Here is a link on the subject that I find rather visually captivating and well documented: Pipes and Filters Pattern (MSDN Microsoft Docs)

Having covered the basics, we are ready to see some code. I have prepared two implementations of how to count "words" (for simplicity, we are going to consider a word to be any string of characters separated by at least one white space from another string of characters) in a book (in particular, James Joyce's Ulysses). One of the implementations is based on Spark and the other on a Java Stream. You can find the code on GitHub; here I'm going to print only the interesting parts.

Let's start with the implementation based on Java Stream:

File: WordsCounterJava.java

package org.ingini.spark.java8;

import java.io.IOException;  
import java.nio.file.Files;  
import java.nio.file.Paths;  
import java.util.Arrays;  
import java.util.TreeMap;

import static java.util.function.Function.identity;  
import static java.util.stream.Collectors.counting;  
import static java.util.stream.Collectors.groupingBy;

public class WordsCounterJava {

    public static final String REGEX = "\\s+";

    public TreeMap<String, Long> count(String source) throws IOException {

        return Files.lines(Paths.get(source))
                .map(line -> line.split(REGEX))
                .flatMap(Arrays::stream)
                .collect(groupingBy(identity(), TreeMap::new, counting()));
    }

}

There is not much complexity here. We simply read the source file line by line and transform each line into a sequence of words (via the map method). Since we have a sequence of words for each line and we have many lines, it is better to flatMap them into a single sequence of words. In the end we group the words by their identity (i.e. the identity of a string is the string itself) and count them. I've used a TreeMap in order to benefit from the natural ordering it provides.
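The grouping collector at the heart of that pipeline works on any Stream of strings, not just file lines. Here is the same idea over an in-memory list, so you can play with it without a file (a self-contained sketch; the class name is my own):

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

import static java.util.function.Function.identity;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

public class GroupingDemo {

    // Same pipeline as WordsCounterJava, but with a list of lines as the source.
    static TreeMap<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .map(line -> line.split("\\s+")) // each line becomes an array of words
                .flatMap(Arrays::stream)         // flatten all arrays into one stream of words
                .collect(groupingBy(identity(), TreeMap::new, counting()));
    }
}
```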

Now to the Spark example:

File: WordsCounterSpark.java

package org.ingini.spark.java8;

import org.apache.spark.SparkConf;  
import org.apache.spark.api.java.JavaPairRDD;  
import org.apache.spark.api.java.JavaRDD;  
import org.apache.spark.api.java.JavaSparkContext;  
import scala.Tuple2;

import java.io.Serializable;  
import java.util.Arrays;  
import java.util.List;


public class WordsCounterSpark implements Serializable {

    public static final String REGEX = "\\s+";

    public List<Tuple2<String, Integer>> count(String source) {

        SparkConf conf = new SparkConf().setAppName("ingini-spark-java8").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> input = sc.textFile(source);

        JavaPairRDD<String, Integer> counts = input.flatMap(line -> Arrays.asList(line.split(REGEX)))
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((x, y) -> x + y)
                .sortByKey();

         return counts.collect();
    }
}

Here things are a bit more complicated. First, we need to set up the SparkConf and the JavaSparkContext. Second, we create a JavaRDD from a text file. It is worth mentioning that this initial RDD is operated on line by line, which is why we split each line into a sequence of words and flatMap them. Then we transform each word into a key-value tuple with a safe base for incremental counting. Once we have done that, we group our key-value tuples by word (reduceByKey) and in the end we sort them in natural order.
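If reduceByKey feels opaque, its per-key semantics (though of course not Spark's distributed execution) can be sketched in plain Java with Map.merge. The class and method names below are mine, for illustration only:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceByKeySketch {

    // Mimics mapToPair(word -> (word, 1)) followed by reduceByKey((x, y) -> x + y).
    static Map<String, Integer> countsOf(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>(); // TreeMap keeps keys sorted, like sortByKey()
        for (String word : words) {
            counts.merge(word, 1, Integer::sum); // insert 1, or combine with the existing count
        }
        return counts;
    }
}
```

Spark applies the same pairwise combining function, but partitions the key-value pairs across the cluster and merges partial results per key.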

As you can see, both implementations are quite similar. Spark requires a bit more configuration and is a bit more verbose, but it is also a lot more powerful. Yet, don't throw away a Java 8 Stream just because Spark is more powerful. As software engineers we need to know what tools exist and when to apply them. It's not always an evident or easy choice to make. Remember to seek a balance between complexity and maintainability (it's not necessarily you who's going to update the code one day). Along these lines, there is no need to kill a fly with a sledgehammer when it would be easier with a flyswatter.

Ivan Hristov


http://ingini.org
