datafu
A collection of general APIs/UDAFs for Spark
datafu-spark contains a number of Spark APIs and a "Scala-Python bridge" that makes it easier to call Scala code from Python, and vice versa.
Here are some examples of things you can do with it:

- "Dedup" a table: remove duplicates based on a key and an ordering (typically a date-updated field), keeping only the most recently updated record (see the Scala sketch after this list).
- Join a table that has a numeric column with a table that has a range, matching each number to the range that contains it.
- Do a skewed join between tables, where the small table is still too big to fit in memory.
- Count distinct up to: an efficient implementation for when you only need to verify that at least a certain number of distinct rows appear in a table.
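As a quick taste of the API, below is a minimal sketch of the dedup case in Scala, adapted from the datafu-spark guide. The sample data is hypothetical, and the dedupWithOrder method name follows the current guide; it may differ in older releases.

import datafu.spark.DataFrameOps._   // implicitly adds the datafu methods to DataFrame
import spark.implicits._             // for toDF and $ (already in scope in spark-shell)

// Hypothetical sample data: one row per update of a user record.
val users = Seq(
  ("alice", "2022-01-01"),
  ("alice", "2022-03-15"),
  ("bob",   "2022-02-10")
).toDF("user_id", "last_updated")

// Keep only the most recently updated row per user_id.
users.dedupWithOrder($"user_id", $"last_updated".desc).show()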
How to
Include this package in your Spark applications using:
spark-shell, pyspark, or spark-submit
> $SPARK_HOME/bin/spark-shell --packages org.apache.datafu:datafu-spark_2.11:1.6.1
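The same --packages flag works with pyspark and spark-submit. Once the package is loaded, the Scala-Python bridge exposes the same methods to Python. The sketch below assumes the pyspark_utils.df_utils module and dedup_with_order wrapper shown in the datafu-spark guide, so treat the exact names and signature as assumptions for this version.

> $SPARK_HOME/bin/pyspark --packages org.apache.datafu:datafu-spark_2.11:1.6.1

# Sketch only: pyspark_utils.df_utils and dedup_with_order follow the
# datafu-spark guide and may differ between releases.
from pyspark_utils.df_utils import PySparkDFUtils

df_utils = PySparkDFUtils()

# Hypothetical sample data; spark is the SparkSession predefined in the pyspark shell.
users = spark.createDataFrame(
    [("alice", "2022-01-01"), ("alice", "2022-03-15"), ("bob", "2022-02-10")],
    ["user_id", "last_updated"])

# Keep only the most recently updated row per user_id.
df_utils.dedup_with_order(users, users.user_id, [users.last_updated.desc()]).show()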
sbt
In your sbt build file, add:
libraryDependencies += "org.apache.datafu" % "datafu-spark_2.11" % "1.6.1"
Maven
In your pom.xml, add:

<dependencies>
  <!-- list of dependencies -->
  <dependency>
    <groupId>org.apache.datafu</groupId>
    <artifactId>datafu-spark_2.11</artifactId>
    <version>1.6.1</version>
  </dependency>
</dependencies>
Releases
Version: 1.6.1 (commit 09a685) / Date: 2022-06-02 / License: Apache-2.0 / Scala version: 2.11