spark-df-profiling
Create HTML profiling reports from Apache Spark DataFrames
@julioasotodv
Generates profile reports from an Apache Spark DataFrame. It is based on pandas_profiling, but works on Spark DataFrames rather than pandas ones.
For each column, the following statistics are presented in an interactive HTML report, where relevant for the column type:
Essentials: type, unique values, missing values
Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Most frequent values
Histogram
All operations are done efficiently: no Python UDFs or .map() transformations are used at all; only Spark SQL's Catalyst optimizer and the Tungsten execution engine (with code generation) are used to compute all statistics.
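For example, once the package is available (see the install options under "How to" below), profiling a DataFrame is a one-liner. A minimal PySpark sketch, assuming the ProfileReport class and to_file method follow the pandas_profiling API that this package is modeled on:

from pyspark.sql import SparkSession

import spark_df_profiling

spark = SparkSession.builder.appName("profiling-demo").getOrCreate()

# Any Spark DataFrame can be profiled; a tiny inline one for illustration.
df = spark.createDataFrame(
    [(1, "a", 3.0), (2, "b", 4.5), (3, None, 6.1)],
    ["id", "label", "value"],
)

# Statistics are computed with Spark SQL expressions (no Python UDFs),
# so the heavy lifting stays on the executors.
report = spark_df_profiling.ProfileReport(df)

# Write the interactive HTML report to disk.
report.to_file("profile.html")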
How to
Include this package in your Spark applications using:
spark-shell, pyspark, or spark-submit
> $SPARK_HOME/bin/spark-shell --packages julioasotodv:spark-df-profiling:1.1.2
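The same --packages flag works with any of the three launchers; for example, to start an interactive PySpark session with the package available:

> $SPARK_HOME/bin/pyspark --packages julioasotodv:spark-df-profiling:1.1.2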
sbt
If you use the sbt-spark-package plugin, in your sbt build file, add:
spDependencies += "julioasotodv/spark-df-profiling:1.1.2"
Otherwise,
resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"

libraryDependencies += "julioasotodv" % "spark-df-profiling" % "1.1.2"
Maven
In your pom.xml, add:

<dependencies>
  <!-- list of dependencies -->
  <dependency>
    <groupId>julioasotodv</groupId>
    <artifactId>spark-df-profiling</artifactId>
    <version>1.1.2</version>
  </dependency>
</dependencies>

<repositories>
  <!-- list of other repositories -->
  <repository>
    <id>SparkPackagesRepo</id>
    <url>https://repos.spark-packages.org/</url>
  </repository>
</repositories>
Releases
Version: 1.1.2 (commit fabb60) / Date: 2016-07-26 / License: Apache-2.0