spark-mergejoin (homepage)

Robust and scalable join operators using sort-merge algorithm (high data skew, low cardinality, etc)

@hindog / (0)

In the standard distribution, Spark includes RDD join operators that must collect all values across all rows on both sides of the join for any given key into an in-memory buffer before it can output the joined values for that key. Because of this, you might experience scalability issues when joining large datasets or datasets that exhibit a low-cardinality or skewed distribution of values.  This library solves these issues by spilling to disk (only) when necessary to avoid OOM yet still performs very comparable to the standard join operators in situations were no spilling is necessary.

For example, this will fail (assuming -Xmx512M) because the standard join operators collect all values for a particular key in memory and you'll get an OutOfMemoryError:

val rdd1 = sc.parallelize(0 to 10000000).map(v => "key1" -> v)
val rdd2 = sc.parallelize(0 to 10000000).map(v => "key1" -> v)
rdd1.join(rdd2).take(1)

Here is the same join using the "mergeJoin" operator provided by this package (there's also leftOuterJoin, rightOuterJoin, fullOuterJoin):

val rdd1 = sc.parallelize(0 to 10000000).map(v => "key1" -> v)
val rdd2 = sc.parallelize(0 to 10000000).map(v => "key1" -> v)
rdd1.mergeJoin(rdd2).take(1)

This will succeed because it will spill to disk when necessary, but will only do so when necessary.  Because of this, performance should be comparable to the standard join operators, except in the spill case, you'll still get a successful join rather than OutOfMemoryError. 

See the homepage link for more detailed information.


Tags

  • 1|core

How to

Include this package in your Spark Applications using:

spark-shell, pyspark, or spark-submit

> $SPARK_HOME/bin/spark-shell --packages com.hindog.spark:spark-mergejoin_2.11:2.0.1

sbt

In your sbt build file, add:

libraryDependencies += "com.hindog.spark" % "spark-mergejoin_2.11" % "2.0.1"

Maven

In your pom.xml, add:
<dependencies>
  <!-- list of dependencies -->
  <dependency>
    <groupId>com.hindog.spark</groupId>
    <artifactId>spark-mergejoin_2.11</artifactId>
    <version>2.0.1</version>
  </dependency>
</dependencies>

Releases

Version: 2.0.1 ( 86c61f | zip | jar ) / Date: 2017-04-04 / License: Apache-2.0 / Scala version: 2.11

Version: 2.0.0 ( eefd16 | zip | jar ) / Date: 2017-01-19 / License: Apache-2.0 / Scala version: 2.11

Version: 1.6.0 ( 500bb2 | zip | jar ) / Date: 2017-01-19 / License: Apache-2.0 / Scala version: 2.11

Version: 1.5.0 ( e93b28 | zip | jar ) / Date: 2017-01-19 / License: Apache-2.0 / Scala version: 2.11