apache-spark

Topics related to apache-spark:

Getting started with apache-spark

Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. A developer should use it when handling large amounts of data, which usually implies memory limitations and/or prohibitive processing time.



Text files and operations in Scala

Shared Variables

Stateful operations in Spark Streaming

Handling JSON in Spark

Unit tests

Window Functions in Spark SQL

Partitions

The number of partitions is critical for an application's performance and/or successful termination.

A Resilient Distributed Dataset (RDD) is Spark's main abstraction. An RDD is split into partitions, meaning that a partition is a part of the dataset: a slice of it, or in other words, a chunk of it.

The greater the number of partitions, the smaller each partition is.

However, notice that a large number of partitions puts a lot of pressure on Hadoop Distributed File System (HDFS), which has to keep a significant amount of metadata.

The number of partitions is related to memory usage, and memoryOverhead issues can be related to this number (personal experience).
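
As a quick illustration, here is a minimal Scala sketch of inspecting and changing the number of partitions of an RDD, assuming the spark-shell's predefined SparkContext sc; the input path is only a placeholder:

val rdd = sc.textFile("hdfs:///some/input")   // placeholder path
println(rdd.getNumPartitions)                 // how many partitions Spark created for this RDD
val more  = rdd.repartition(100)              // full shuffle into 100 partitions
val fewer = more.coalesce(10)                 // reduce to 10 partitions without a full shuffle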


A common pitfall for new users is to transform their RDD into an RDD with only one partition, which usually looks like this:

data = sc.textFile(file)   # load the file as an RDD
data = data.coalesce(1)    # collapse the whole RDD into a single partition

That's usually a very bad idea, since you are telling Spark to put all the data in just one partition! Remember that:

A stage in Spark will operate on one partition at a time (and load the data in that partition into memory).

As a result, you tell Spark to handle all the data at once, which usually results in memory-related errors (Out of Memory, for example) or even a null pointer exception.

So, unless you know what you are doing, avoid repartitioning your RDD into just one partition!

Migrating from Spark 1.6 to Spark 2.0

Introduction to Apache Spark DataFrames

Joins

One thing to note is your available resources versus the size of the data you are joining. This is where your Spark join code might fail with memory errors. For this reason, make sure you configure your Spark jobs properly, depending on the size of the data. Following is an example configuration for a join of 1.5 million records to 200 million.

Using Spark-Shell

spark-shell --executor-memory 32G --num-executors 80 --driver-memory 10g --executor-cores 10

Using Spark Submit

spark-submit --executor-memory 32G --num-executors 80 --driver-memory 10g --executor-cores 10 code.jar
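
For completeness, the join itself is a one-liner once the resources are in place. Below is a minimal Scala sketch, assuming a Spark 2.x SparkSession named spark; the table paths and the join column "id" are hypothetical:

val small = spark.read.parquet("hdfs:///tables/small")   // ~1.5 million rows (placeholder path)
val large = spark.read.parquet("hdfs:///tables/large")   // ~200 million rows (placeholder path)
val joined = large.join(small, Seq("id"))                // inner join on the shared "id" column
joined.write.parquet("hdfs:///tables/joined")            // placeholder output path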

Spark Launcher

Spark Launcher can help a developer poll the status of a submitted Spark job. There are basically eight statuses that can be polled. They are listed below with their meanings:

/** The application has not reported back yet. */
UNKNOWN(false),
/** The application has connected to the handle. */
CONNECTED(false),
/** The application has been submitted to the cluster. */
SUBMITTED(false),
/** The application is running. */
RUNNING(false),
/** The application finished with a successful status. */
FINISHED(true),
/** The application finished with a failed status. */
FAILED(true),
/** The application was killed. */
KILLED(true),
/** The Spark Submit JVM exited with a unknown status. */
LOST(true);
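
The boolean next to each state indicates whether it is a final state, which is what SparkAppHandle.State.isFinal() checks. A minimal Scala sketch of polling these states with SparkLauncher follows; the jar path, main class and master are placeholders, and it assumes the spark-launcher artifact is on the classpath:

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val handle: SparkAppHandle = new SparkLauncher()
  .setAppResource("/path/to/code.jar")   // placeholder application jar
  .setMainClass("com.example.Main")      // placeholder main class
  .setMaster("yarn")                     // placeholder master
  .startApplication()                    // launches spark-submit and returns a handle

// Poll until a final state (FINISHED, FAILED, KILLED or LOST) is reached.
while (!handle.getState.isFinal) {
  println(s"Current state: ${handle.getState}")
  Thread.sleep(5000)
}
println(s"Final state: ${handle.getState}")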

Configuration: Apache Spark SQL

Spark DataFrame

How to ask an Apache Spark-related question?

Calling scala jobs from pyspark

Error message 'sparkR' is not recognized as an internal or external command or '.\bin\sparkR' is not recognized as an internal or external command

Used reference from r-bloggers

Client mode and Cluster Mode