spark-skewjoin (homepage)
Joins for skewed datasets in Spark
@tresata
This library adds the skewJoin operation to RDD[(K, V)] where possible (certain implicit typeclasses are required for K and V). A skew join is just like a normal join, except that keys with large numbers of values are not processed by a single task but are instead spread out across many tasks.
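As a rough sketch of what usage looks like (the Dsl import below and the exact implicits required for the key type, such as an Ordering and an Algebird CMSHasher, are assumptions; check the project README for the authoritative API):

import org.apache.spark.{SparkConf, SparkContext}
// Assumed import: the DSL that adds skewJoin to pair RDDs.
import com.tresata.spark.skewjoin.Dsl._

object SkewJoinExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("skewjoin-example"))

    // The left side has a heavily skewed key: "hot" carries far more values than "cold".
    val left  = sc.parallelize((1 to 1000000).map(i => ("hot", i)) ++ Seq(("cold", 1)))
    val right = sc.parallelize(Seq(("hot", "a"), ("cold", "b")))

    // skewJoin behaves like a normal inner join, but values for heavy keys
    // are spread across many tasks instead of being handled by a single one.
    val joined = left.skewJoin(right)
    println(joined.count())

    sc.stop()
  }
}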
How to
Include this package in your Spark applications using:
spark-shell, pyspark, or spark-submit
> $SPARK_HOME/bin/spark-shell --packages com.tresata:spark-skewjoin_2.10:0.2.0
sbt
In your sbt build file, add:
libraryDependencies += "com.tresata" % "spark-skewjoin_2.10" % "0.2.0"
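Since both Scala 2.10 and 2.11 artifacts are listed under Releases, sbt's cross-version operator can be used instead of hard-coding the suffix (a standard sbt convention, not specific to this package):

libraryDependencies += "com.tresata" %% "spark-skewjoin" % "0.2.0"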
Maven
In your pom.xml, add:
<dependencies>
  <!-- list of dependencies -->
  <dependency>
    <groupId>com.tresata</groupId>
    <artifactId>spark-skewjoin_2.10</artifactId>
    <version>0.2.0</version>
  </dependency>
</dependencies>
Releases
Version: 0.2.0-s_2.10 ( e37803 | zip | jar ) / Date: 2015-11-13 / License: Apache-2.0 / Scala version: 2.10
Version: 0.2.0-s_2.11 ( e37803 | zip | jar ) / Date: 2015-11-13 / License: Apache-2.0 / Scala version: 2.11
Version: 0.1.0 ( b6d9a9 | zip | jar ) / Date: 2015-08-29 / License: Apache-2.0 / Scala version: 2.10