spark-state-tools (homepage)

Structured Streaming State Tools for Apache Spark

@HeartSaVioR

Spark State Tools provides features for offline manipulation of Structured Streaming state on an existing query.

The features provided as of now are:

* Show state information which you'll need to provide to use the features below
** state operator information from a checkpoint
** state schema from a streaming query
* Create a savepoint from an existing checkpoint of a Structured Streaming query
** You can pick a specific batch (if it exists in the metadata) to create the savepoint
* Read state as a batch source of Spark SQL (see the sketch after this list)
* Write a DataFrame to state as a batch sink of Spark SQL
** With the ability to write state, you can rescale state (repartition), perform simple schema evolution, etc.
* Migrate state format from old to new
** migrate Streaming Aggregation state from version 1 to 2
** migrate FlatMapGroupsWithState state from version 1 to 2
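
The read and write features plug into the standard Spark SQL batch source/sink API. Below is a minimal sketch of reading one operator's state at a given batch and writing it back under a new checkpoint with a different partition count. The format name ("state"), the option keys (checkpointLocation, version, operatorId), and the example schema are assumptions for illustration only, not verified names; check the project README for the exact keys supported by your version.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("state-tools-sketch").getOrCreate()

// Schema of the state rows for the target operator; derive it from your
// streaming query (see "state schema from streaming query" above).
// This key/value layout is purely illustrative.
val stateSchema = new StructType()
  .add("key", new StructType().add("groupKey", StringType))
  .add("value", new StructType().add("count", LongType))

// Read the state of one stateful operator at a given batch as a plain DataFrame.
val stateDf = spark.read
  .format("state")                                      // assumed source name
  .schema(stateSchema)
  .option("checkpointLocation", "/path/to/checkpoint")  // assumed option key
  .option("version", "10")                              // assumed option key: batch to read
  .option("operatorId", "0")                            // assumed option key
  .load()

// Repartition and write back under a new checkpoint to rescale the state.
stateDf
  .repartition(50)
  .write
  .format("state")                                      // assumed sink name
  .option("checkpointLocation", "/path/to/new-checkpoint")
  .option("version", "10")
  .option("operatorId", "0")
  .save()

The same pattern (a different schema plus a projection before the write) is how the simple schema evolution mentioned above would be expressed; the migration commands have their own entry points documented in the project README.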


Tags

  • data source
  • structured streaming
  • state

How to

Include this package in your Spark Applications using:

spark-shell, pyspark, or spark-submit

> $SPARK_HOME/bin/spark-shell --packages net.heartsavior.spark:spark-state-tools_2.11:0.3.0

sbt

In your sbt build file, add:

libraryDependencies += "net.heartsavior.spark" % "spark-state-tools_2.11" % "0.3.0"

Maven

In your pom.xml, add:
<dependencies>
  <!-- list of dependencies -->
  <dependency>
    <groupId>net.heartsavior.spark</groupId>
    <artifactId>spark-state-tools_2.11</artifactId>
    <version>0.3.0</version>
  </dependency>
</dependencies>

Releases

Version: 0.3.0 ( acd602 | zip | jar ) / Date: 2020-05-21 / License: Apache-2.0 / Scala version: 2.11