spark-df-profiling

Create HTML profiling reports from Apache Spark DataFrames

Generates profile reports from an Apache Spark DataFrame. It is based on pandas_profiling, but works on Spark DataFrames instead of pandas DataFrames.

For each column, the following statistics (if relevant for the column type) are presented in an interactive HTML report:

  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram

All operations are done efficiently, which means that no Python UDFs or .map() transformations are used at all; only Spark SQL's Catalyst optimizer (Tungsten) and code generation are used to retrieve all statistics.
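
Once the package is available to the driver (for example via the --packages flag shown under "How to" below), generating a report looks roughly like this. This is a minimal sketch: it assumes the pandas_profiling-style ProfileReport API that the project is modeled on, a Spark 2.x SparkSession, and placeholder file paths.

# Minimal sketch: ProfileReport mirrors pandas_profiling's API; exact
# signatures may differ between versions. File paths are placeholders.
from pyspark.sql import SparkSession

import spark_df_profiling

spark = SparkSession.builder.appName("profiling-example").getOrCreate()

# Any Spark DataFrame can be profiled; a CSV with inferred types is one example.
df = spark.read.csv("/path/to/data.csv", header=True, inferSchema=True)

# Build the profile and write it out as a standalone HTML report.
report = spark_df_profiling.ProfileReport(df)
report.to_file("/tmp/profile_report.html")

Following pandas_profiling's behavior, the report object should also render inline when evaluated in a Jupyter notebook.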


Tags

  • pyspark
  • tools

How to

Include this package in your Spark Applications using:

spark-shell, pyspark, or spark-submit

> $SPARK_HOME/bin/spark-shell --packages julioasotodv:spark-df-profiling:1.1.2
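
The same --packages flag works with the other launchers, for example:

> $SPARK_HOME/bin/pyspark --packages julioasotodv:spark-df-profiling:1.1.2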

sbt

If you use the sbt-spark-package plugin, in your sbt build file, add:

spDependencies += "julioasotodv/spark-df-profiling:1.1.2"

Otherwise,

resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"

libraryDependencies += "julioasotodv" % "spark-df-profiling" % "1.1.2"

Maven

In your pom.xml, add:

<dependencies>
  <!-- list of dependencies -->
  <dependency>
    <groupId>julioasotodv</groupId>
    <artifactId>spark-df-profiling</artifactId>
    <version>1.1.2</version>
  </dependency>
</dependencies>
<repositories>
  <!-- list of other repositories -->
  <repository>
    <id>SparkPackagesRepo</id>
    <url>https://repos.spark-packages.org/</url>
  </repository>
</repositories>

Releases

Version: 1.1.2 (commit fabb60) / Date: 2016-07-26 / License: Apache-2.0