Complexity metrics for big data problems.
@JMailloH
Data mining algorithms for big data are now widely available, but there are no metrics specifically aimed at the complexity and redundancy of large datasets. We therefore pose the following question: do classification algorithms really need so much data?
To address this question, we propose two new metrics and implement several state-of-the-art metrics, adapting them to big data. This repository is an open-source package of complexity metrics. It includes two original proposals for studying the density and complexity of large datasets:
ND: Neighborhood Density. This metric returns the percentage difference between the Euclidean distance calculated with all available data and the one calculated with a randomly chosen half of the data.
DTP: Decision Tree Progression. This metric returns the percentage difference in accuracy between a Decision Tree trained on all of the data and one trained after randomly discarding half of it.
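Both proposals compare a value computed on the full dataset with the same value computed on a random half. The sketch below illustrates that pattern for ND in plain Python. It is not the package's Spark implementation: the function names are hypothetical, and it assumes that "the Euclidean distance" means the average distance from each point to its nearest neighbor, which the description above does not state explicitly.

```python
import math
import random


def mean_nearest_neighbor_distance(points):
    """Average Euclidean distance from each point to its closest other point.

    Brute force, O(n^2): fine for a toy example, not for big data.
    """
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


def percent_difference(full_value, half_value):
    """Change of the half-data value relative to the full-data value, in percent."""
    if full_value == 0:
        # Assumption: treat an all-duplicates dataset as "no change" unless
        # the half sample somehow differs.
        return 0.0 if half_value == 0 else math.inf
    return 100.0 * (half_value - full_value) / full_value


def neighborhood_density(points, seed=0):
    """ND sketch: compare nearest-neighbor distances on all data vs. a random half."""
    if len(points) < 4:
        raise ValueError("need at least 4 points so the half sample has neighbors")
    rng = random.Random(seed)  # seeded so the random half is reproducible
    half = rng.sample(points, len(points) // 2)
    full_value = mean_nearest_neighbor_distance(points)
    half_value = mean_nearest_neighbor_distance(half)
    return percent_difference(full_value, half_value)


if __name__ == "__main__":
    # 100 evenly spaced points on a line: dropping half can only widen gaps.
    line = [(float(i), 0.0) for i in range(100)]
    print(neighborhood_density(line, seed=0))
```

A small ND value suggests that the half sample is about as dense as the full data, so the discarded half was largely redundant. DTP would follow the same full-versus-half comparison, with the accuracy of a trained Decision Tree in place of the distance.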
The package also includes several state-of-the-art metrics, designed and developed to run on large datasets:
F1: Maximum Fisher's discriminant ratio
F2: Volume of overlapping region
F3: Maximum individual feature efficiency
F4: Collective feature efficiency
C1: Entropy of class portions
C2: Imbalance ratio
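For orientation, three of the metrics listed above (F1, C1, and C2) can be sketched on a small in-memory dataset. This is a minimal illustration using common textbook definitions, not the package's distributed code. The function names are hypothetical, F1 is restricted to two classes, C1 is taken as the entropy normalized by the logarithm of the number of classes, and C2 is taken as the majority-to-minority class size ratio. Some formulations in the literature rescale these values (for example, reporting one minus the normalized entropy), so the package's output may differ.

```python
import math
from collections import Counter


def _mean_and_variance(values):
    """Population mean and variance of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def fisher_ratio_max(rows, labels):
    """F1 sketch: max over features of (mean_a - mean_b)^2 / (var_a + var_b).

    Assumes exactly two classes.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles two-class problems only")
    class_a = [r for r, y in zip(rows, labels) if y == classes[0]]
    class_b = [r for r, y in zip(rows, labels) if y == classes[1]]
    best = 0.0
    for f in range(len(rows[0])):
        mean_a, var_a = _mean_and_variance([r[f] for r in class_a])
        mean_b, var_b = _mean_and_variance([r[f] for r in class_b])
        gap = (mean_a - mean_b) ** 2
        spread = var_a + var_b
        if spread == 0:
            # Constant within both classes: perfectly separable if means differ.
            ratio = math.inf if gap > 0 else 0.0
        else:
            ratio = gap / spread
        best = max(best, ratio)
    return best


def class_entropy(labels):
    """C1 sketch: entropy of class proportions, normalized to [0, 1]."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    n = len(labels)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))


def imbalance_ratio(labels):
    """C2 sketch: size of the largest class divided by size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


if __name__ == "__main__":
    rows = [(0.0, 5.0), (1.0, 5.0), (10.0, 5.0), (11.0, 5.0)]
    labels = [0, 0, 1, 1]
    print(fisher_ratio_max(rows, labels))
    print(class_entropy(labels))
    print(imbalance_ratio(labels))
```

In this toy dataset the first feature separates the classes cleanly and the second carries no information, so F1 is driven entirely by the first feature.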
To learn more about the metrics, the experiments carried out, and the conclusions obtained, please consult the associated GitHub repository.
Include this package in your Spark Applications using:
spark-shell, pyspark, or spark-submit
> $SPARK_HOME/bin/spark-shell --packages JMailloH:ComplexityMetrics:1.0
If you use the sbt-spark-package plugin, in your sbt build file, add:
spDependencies += "JMailloH/ComplexityMetrics:1.0"
Otherwise, add the resolver and the dependency directly:
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies += "JMailloH" % "ComplexityMetrics" % "1.0"
Maven
In your pom.xml, add:
<dependencies>
  <!-- list of dependencies -->
  <dependency>
    <groupId>JMailloH</groupId>
    <artifactId>ComplexityMetrics</artifactId>
    <version>1.0</version>
  </dependency>
</dependencies>
<repositories>
  <!-- list of other repositories -->
  <repository>
    <id>SparkPackagesRepo</id>
    <url>http://dl.bintray.com/spark-packages/maven</url>
  </repository>
</repositories>