spark-compaction (homepage)
Spark tool to handle file compaction.
@KeithSSmith
When writing data into HDFS, it can end up spread across a large number of small files which, if left unchecked, will place unnecessary strain on the HDFS NameNode. To handle this, it is good practice to run a compaction job on directories that contain many small files, reducing the resource strain on the NameNode by ensuring HDFS blocks are filled efficiently. This type of compaction is commonly done with MapReduce or on Hive tables and partitions; this tool is designed to accomplish the same task using Spark.
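The core idea can be sketched as follows. This is a minimal illustration of the technique, not the package's actual API: a hypothetical helper that picks an output file count so each output file roughly fills an HDFS block, which a Spark job can then use with `coalesce`.

```scala
// Hypothetical helper illustrating the compaction math; the package's
// real implementation may differ.
object CompactionMath {
  // Default HDFS block size: 128 MB (configurable in a real cluster).
  val BlockSize: Long = 128L * 1024 * 1024

  // Ceiling division: how many output files are needed so each one
  // fills (at most) one HDFS block.
  def targetFileCount(totalBytes: Long, blockSize: Long = BlockSize): Int =
    math.max(1, math.ceil(totalBytes.toDouble / blockSize).toInt)
}

// In spark-shell you could then compact a directory of small files like:
//   val df = spark.read.text("hdfs:///data/small-files")
//   val n  = CompactionMath.targetFileCount(totalInputBytes) // size from HDFS
//   df.coalesce(n).write.text("hdfs:///data/compacted")
```

For example, 300 MB of input with a 128 MB block size yields three output files instead of potentially thousands of small ones.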
How to
Include this package in your Spark Applications using:
spark-shell, pyspark, or spark-submit
> $SPARK_HOME/bin/spark-shell --packages com.github.KeithSSmith:spark-compaction:1.0.0
sbt
In your sbt build file, add:
libraryDependencies += "com.github.KeithSSmith" % "spark-compaction" % "1.0.0"
Maven
In your pom.xml, add:

<dependencies>
  <!-- list of dependencies -->
  <dependency>
    <groupId>com.github.KeithSSmith</groupId>
    <artifactId>spark-compaction</artifactId>
    <version>1.0.0</version>
  </dependency>
</dependencies>
Releases
Version: 1.0.0 ( 0f6935 | zip | jar ) / Date: 2016-04-22 / License: Apache-2.0