Structured Streaming State Tools for Apache Spark
@HeartSaVioR
Spark State Tools provides features for offline manipulation of Structured Streaming state in an existing query.

The features provided as of now are:
* Show state information which you'll need to provide to use the features below
  * state operator information from a checkpoint
  * state schema from a streaming query
* Create a savepoint from an existing checkpoint of a Structured Streaming query
  * You can pick a specific batch (if it exists in the metadata) to create the savepoint from
* Read state as a batch source of Spark SQL
* Write a DataFrame to state as a batch sink of Spark SQL
  * With the ability to write state, you can rescale state (repartition), perform simple schema evolution, etc.
* Migrate state format from old to new
  * migrating Streaming Aggregation from ver 1 to 2
  * migrating FlatMapGroupsWithState from ver 1 to 2
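As an illustration of the batch-source feature above, reading state back as a DataFrame might look like the following sketch. The data source name `state`, the option keys (`checkpointLocation`, `version`, `operatorId`), and the example state schema are assumptions for illustration only, not the tool's confirmed API; consult the project source for the actual names.

```scala
// Hypothetical sketch: read Structured Streaming state as a batch DataFrame.
// Option keys and the "state" format name below are illustrative assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object ReadStateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-state-sketch")
      .master("local[*]")
      .getOrCreate()

    // State rows are key/value pairs; you supply the schema yourself,
    // matching the state schema of your stateful operator (see the
    // "state schema from streaming query" feature above).
    val stateSchema = new StructType()
      .add("key", new StructType().add("groupKey", StringType))
      .add("value", new StructType().add("count", LongType))

    val stateDf = spark.read
      .format("state")                                     // assumed source name
      .schema(stateSchema)
      .option("checkpointLocation", "/path/to/checkpoint") // assumed option key
      .option("version", "5")      // assumed: committed batch ID to read
      .option("operatorId", "0")   // assumed: stateful operator ID in the query
      .load()

    stateDf.show()
    spark.stop()
  }
}
```

Writing a repartitioned DataFrame back through the corresponding batch sink is how rescaling state would be achieved, per the feature list above.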
Include this package in your Spark applications using:

spark-shell, pyspark, or spark-submit
> $SPARK_HOME/bin/spark-shell --packages net.heartsavior.spark:spark-state-tools_2.11:0.3.0
In your sbt build file, add:
libraryDependencies += "net.heartsavior.spark" % "spark-state-tools_2.11" % "0.3.0"
Maven

In your pom.xml, add:

    <dependencies>
      <!-- list of dependencies -->
      <dependency>
        <groupId>net.heartsavior.spark</groupId>
        <artifactId>spark-state-tools_2.11</artifactId>
        <version>0.3.0</version>
      </dependency>
    </dependencies>