Spark tool to handle file compaction.
@KeithSSmith
When writing data into HDFS, it is easy to end up with a large number of small files which, if left unchecked, will put unnecessary strain on the HDFS NameNode. To handle this situation, it is good practice to run a compaction job on directories that contain many small files, reducing the resource strain on the NameNode by ensuring HDFS blocks are filled efficiently. This type of compaction is commonly done with MapReduce or on Hive tables and partitions; this tool is designed to accomplish the same task using Spark.
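The sizing idea behind this kind of compaction can be sketched with simple arithmetic: choose the number of output files so that each one approaches the HDFS block size. The following is a minimal illustration of that calculation, not the tool's actual code; the object and method names are hypothetical, and the 128 MB block size is an assumed default.

```scala
// Hypothetical sketch of compaction sizing: pick an output file count so
// each compacted file fills roughly one HDFS block.
object CompactionSizing {
  // Given the total input size and the HDFS block size, return how many
  // output files to coalesce down to (always at least 1).
  def targetFileCount(totalInputBytes: Long, blockSizeBytes: Long): Int =
    math.max(1, math.ceil(totalInputBytes.toDouble / blockSizeBytes).toInt)

  def main(args: Array[String]): Unit = {
    val blockSize  = 128L * 1024 * 1024      // assumed 128 MB HDFS block size
    val totalInput = 10000L * 1024 * 1024    // e.g. 10,000 small files of ~1 MB each
    // 10,000 MB / 128 MB rounds up to 79 output files instead of 10,000 inputs.
    println(CompactionSizing.targetFileCount(totalInput, blockSize))
  }
}
```

In a Spark job, a count like this would typically feed `coalesce(n)` (or `repartition(n)`) on the input data before writing it back out.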
Include this package in your Spark Applications using:
spark-shell, pyspark, or spark-submit
> $SPARK_HOME/bin/spark-shell --packages com.github.KeithSSmith:spark-compaction:1.0.0
sbt

In your sbt build file, add:
libraryDependencies += "com.github.KeithSSmith" % "spark-compaction" % "1.0.0"
Maven

In your pom.xml, add:
<dependencies>
  <!-- list of dependencies -->
  <dependency>
    <groupId>com.github.KeithSSmith</groupId>
    <artifactId>spark-compaction</artifactId>
    <version>1.0.0</version>
  </dependency>
</dependencies>