Structured Streaming State Tools for Apache Spark
@HeartSaVioR
Spark State Tools provides features for offline manipulation of Structured Streaming state in an existing query.

The features provided as of now are:
* Show state information which you'll need to provide to use the features below
  * state operator information from a checkpoint
  * state schema from a streaming query
* Create a savepoint from an existing checkpoint of a Structured Streaming query
  * You can pick a specific batch (if it exists in the metadata) to create the savepoint from
* Read state as a batch source of Spark SQL
* Write a DataFrame to state as a batch sink of Spark SQL
  * With the ability to write state, you can rescale state (repartition), perform simple schema evolution, etc.
* Migrate state format from old to new
  * migrating Streaming Aggregation from ver 1 to 2
  * migrating FlatMapGroupsWithState from ver 1 to 2
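As an illustration of the batch-source feature above, reading state back as a DataFrame might look like the following sketch. The data source name `state`, the option keys (`checkpointLocation`, `version`, `operatorId`), and the example state schema are assumptions for illustration only, not the tool's confirmed API; consult the project source for the actual names.

```scala
// Hypothetical sketch: read Structured Streaming state as a batch DataFrame.
// Option keys and the "state" format name below are illustrative assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object ReadStateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-state-sketch")
      .master("local[*]")
      .getOrCreate()

    // State rows are key/value pairs; you supply the schema yourself,
    // matching the state schema of your stateful operator (see the
    // "state schema from streaming query" feature above).
    val stateSchema = new StructType()
      .add("key", new StructType().add("groupKey", StringType))
      .add("value", new StructType().add("count", LongType))

    val stateDf = spark.read
      .format("state")                                     // assumed source name
      .schema(stateSchema)
      .option("checkpointLocation", "/path/to/checkpoint") // assumed option key
      .option("version", "5")      // assumed: committed batch ID to read
      .option("operatorId", "0")   // assumed: stateful operator ID in the query
      .load()

    stateDf.show()
    spark.stop()
  }
}
```

Writing a repartitioned DataFrame back through the corresponding batch sink is how rescaling state would be achieved, per the feature list above.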
Include this package in your Spark applications using:

spark-shell, pyspark, or spark-submit
> $SPARK_HOME/bin/spark-shell --packages net.heartsavior.spark:spark-state-tools_2.11:0.3.0
In your sbt build file, add:
libraryDependencies += "net.heartsavior.spark" % "spark-state-tools_2.11" % "0.3.0"
Maven

In your pom.xml, add:

    <dependencies>
      <!-- list of dependencies -->
      <dependency>
        <groupId>net.heartsavior.spark</groupId>
        <artifactId>spark-state-tools_2.11</artifactId>
        <version>0.3.0</version>
      </dependency>
    </dependencies>