spark-pipeline-utils

Utility classes to extend and generalize Spark's ML pipeline framework

SQL code is messy and quickly becomes hard to manage, debug and interpret. Whether or not you're training a machine learning model, organizing your code into a modular, configurable framework makes it much easier to manage, generalize and share. These are all good things, especially for production-quality codebases.

In particular, this package adds the following features:

* support for Transformer-only pipelines (i.e. ETL pipelines)
* support for aggregations as pipeline stages, and multi-aggregation pipelines
* support for windowing functions as pipeline stages, and pipelines of multiple windowing functions
* support for "exploding" transformers (i.e. using the explode function to expand rows in a DataFrame)
* support for running multiple pipelines in parallel, and then re-joining their results based on common columns
* a few handy transformers that expand upon what's already provided in the Spark ML API, including:
  * column selection, dropping and renaming
  * wrapping any function into a transformer stage (the current ML framework only provides a 1-to-1, i.e. unary, transformer)
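
To make the ideas in the list above concrete, here is a minimal sketch in plain Python of a transformer-only pipeline that chains an "exploding" stage, an aggregation stage and a column-renaming stage. It does not use this package's API or Spark itself: a list of dicts stands in for a DataFrame, and every name in it (`Pipeline`, `FunctionTransformer`, `explode_tags`, `count_by_tag`, `rename_count`) is a hypothetical chosen for illustration only.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# A list of dicts stands in for a DataFrame (assumption for illustration only).
Row = Dict[str, object]
Table = List[Row]


class FunctionTransformer:
    """Hypothetical stage that wraps any table -> table function."""

    def __init__(self, fn: Callable[[Table], Table]):
        self.fn = fn

    def transform(self, table: Table) -> Table:
        return self.fn(table)


class Pipeline:
    """Hypothetical transformer-only pipeline: no fit step, stages run in order."""

    def __init__(self, stages):
        self.stages = list(stages)

    def transform(self, table: Table) -> Table:
        for stage in self.stages:
            table = stage.transform(table)
        return table


def explode_tags(table: Table) -> Table:
    # "Exploding" stage: one output row per element of the list-valued "tags" column.
    return [{"user": row["user"], "tag": tag} for row in table for tag in row["tags"]]


def count_by_tag(table: Table) -> Table:
    # Aggregation stage: group rows by "tag" and count them, so the row count changes.
    counts = defaultdict(int)
    for row in table:
        counts[row["tag"]] += 1
    return [{"tag": tag, "n": n} for tag, n in sorted(counts.items())]


def rename_count(table: Table) -> Table:
    # Column-renaming stage: "n" becomes "count".
    return [{"tag": row["tag"], "count": row["n"]} for row in table]


etl = Pipeline([
    FunctionTransformer(explode_tags),
    FunctionTransformer(count_by_tag),
    FunctionTransformer(rename_count),
])

rows = [
    {"user": "a", "tags": ["x", "y"]},
    {"user": "b", "tags": ["y"]},
]
result = etl.transform(rows)
print(result)
```

The point of the sketch is that once every step, including row-expanding and row-collapsing ones, exposes the same `transform` interface, an ETL job becomes an ordered list of small, reusable stages rather than one large query.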

How to

This package has no releases published in the Spark Packages repo and no Maven coordinates supplied. You may have to build it from source, or it may simply be a script. To use this Spark Package, please follow the instructions in the README.

Releases

No releases yet.