datafu
A collection of general APIs/UDAFs for Spark
datafu-spark contains a number of Spark APIs and a "Scala-Python bridge" that makes it easier to call Scala code from Python, and vice versa.
Here are some examples of things you can do with it:

- "Dedup" a table: remove duplicates based on a key and an ordering (typically a date-updated field), keeping only the most recently updated record (see the Scala sketch after this list).
- Join a table that has a numeric column with a table that has a range, matching each number to the range that contains it.
- Do a skewed join between tables, where the small table is still too big to fit in memory.
- Count distinct up to: an efficient implementation for when you only need to verify that at least a certain number of distinct rows appear in a table.
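As a quick taste of the API, below is a minimal sketch of the dedup case in Scala, adapted from the datafu-spark guide. The sample data is hypothetical, and the dedupWithOrder method name follows the current guide; it may differ in older releases.

import datafu.spark.DataFrameOps._   // implicitly adds the datafu methods to DataFrame
import spark.implicits._             // for toDF and $ (already in scope in spark-shell)

// Hypothetical sample data: one row per update of a user record.
val users = Seq(
  ("alice", "2022-01-01"),
  ("alice", "2022-03-15"),
  ("bob",   "2022-02-10")
).toDF("user_id", "last_updated")

// Keep only the most recently updated row per user_id.
users.dedupWithOrder($"user_id", $"last_updated".desc).show()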
How to
Include this package in your Spark applications using:
spark-shell, pyspark, or spark-submit
> $SPARK_HOME/bin/spark-shell --packages org.apache.datafu:datafu-spark_2.11:1.6.1
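The same --packages flag works with pyspark and spark-submit. Once the package is loaded, the Scala-Python bridge exposes the same methods to Python. The sketch below assumes the pyspark_utils.df_utils module and dedup_with_order wrapper shown in the datafu-spark guide, so treat the exact names and signature as assumptions for this version.

> $SPARK_HOME/bin/pyspark --packages org.apache.datafu:datafu-spark_2.11:1.6.1

# Sketch only: pyspark_utils.df_utils and dedup_with_order follow the
# datafu-spark guide and may differ between releases.
from pyspark_utils.df_utils import PySparkDFUtils

df_utils = PySparkDFUtils()

# Hypothetical sample data; spark is the SparkSession predefined in the pyspark shell.
users = spark.createDataFrame(
    [("alice", "2022-01-01"), ("alice", "2022-03-15"), ("bob", "2022-02-10")],
    ["user_id", "last_updated"])

# Keep only the most recently updated row per user_id.
df_utils.dedup_with_order(users, users.user_id, [users.last_updated.desc()]).show()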
sbt
In your sbt build file, add:
libraryDependencies += "org.apache.datafu" % "datafu-spark_2.11" % "1.6.1"
Maven
In your pom.xml, add:

<dependencies>
  <!-- list of dependencies -->
  <dependency>
    <groupId>org.apache.datafu</groupId>
    <artifactId>datafu-spark_2.11</artifactId>
    <version>1.6.1</version>
  </dependency>
</dependencies>
Releases
Version: 1.6.1 (commit 09a685) / Date: 2022-06-02 / License: Apache-2.0 / Scala version: 2.11