spark-dynamodb (homepage)

Plug-and-play implementation of an Apache Spark custom data source for AWS DynamoDB.

We have published a short article about the project; check it out here: https://www.audienceproject.com/blog/tech/sparkdynamodb-using-aws-dynamodb-data-source-apache-spark/


Features
- Distributed, parallel scan with lazy evaluation
- Throughput control by rate limiting to a target fraction of the provisioned table/index capacity
- Schema discovery to suit your needs:
  - Dynamic inference
  - Static analysis of case class
- Column and filter pushdown
- Global secondary index support
- Write support
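As a sketch of how these features fit together, the snippet below reads a table with dynamic schema inference, applies column and filter pushdown, and writes results back. The table names, the implicits-based dynamodb read/write methods, and the "targetCapacity" option name are taken from the project's documented style but are not verified here; check the README of the version you use.

```scala
import org.apache.spark.sql.SparkSession
import com.audienceproject.spark.dynamodb.implicits._

object DynamoDbExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-dynamodb-example")
      .getOrCreate()

    // Distributed, parallel scan with dynamic schema inference.
    // "targetCapacity" rate-limits the scan to a fraction of the
    // table's provisioned read capacity (assumed option name).
    val df = spark.read
      .option("targetCapacity", "0.5")
      .dynamodb("SomeTable") // hypothetical table name

    // Column and filter pushdown: only the projected columns and the
    // filter predicate are sent to DynamoDB where possible.
    df.select("id", "price").where(df("price") > 100).show()

    // Write support: persist a DataFrame back to another table.
    df.write.dynamodb("SomeOtherTable") // hypothetical table name

    spark.stop()
  }
}
```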


Getting The Dependency

The library is available from Maven Central (https://mvnrepository.com/artifact/com.audienceproject/spark-dynamodb).
Add the dependency in SBT as "com.audienceproject" %% "spark-dynamodb" % "latest".
Spark is used in the library as a "provided" dependency, which means Spark has to be installed separately on the environment where the application runs, as is the case on AWS EMR.
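A minimal build.sbt fragment reflecting the above might look as follows. Replace "latest" with a concrete version from Maven Central; the Spark version shown is a placeholder, not a recommendation from the project.

```scala
// build.sbt -- "latest" and the Spark version are placeholders
libraryDependencies += "com.audienceproject" %% "spark-dynamodb" % "latest"

// Spark is a "provided" dependency: it is supplied by the runtime
// environment (e.g. AWS EMR) rather than bundled into your artifact.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"
```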


Acknowledgements

Usage of parallel scan and rate limiter inspired by work in https://github.com/traviscrawford/spark-dynamodb


Tags

  • AWS
  • dynamodb

How to

This package doesn't have any releases published in the Spark Packages repo, nor with Maven coordinates supplied. You may have to build this package from source, or it may simply be a script. To use this Spark Package, please follow the instructions in the README.

Releases

No releases yet.