A Nearest Neighbor Classifier for High-Speed Big Data Streams with Instance Selection
@sramirez
Here we present an efficient nearest neighbor solution for classifying fast, massive data streams using Apache Spark. It consists of a distributed case-base and an instance selection method that improves its performance and effectiveness. A distributed metric tree (based on M-trees) organizes the case-base and thereby speeds up neighbor searches. This distributed tree consists of a top-tree (in the master node) that routes searches through the first levels, and several leaf nodes (in the slave nodes) that resolve the searches in the subsequent levels through a completely parallel scheme.
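The routing idea can be illustrated with a minimal single-process sketch in plain Python. This is not the package's Scala/Spark implementation: the names `TwoLevelIndex`, `insert`, `knn`, and `classify` are hypothetical, the "top-tree" is reduced to a flat list of pivots, and the "leaves" are ordinary lists searched one after another rather than in parallel on worker nodes. It assumes Euclidean distance and uses covering-radius pruning, the triangle-inequality test that M-trees rely on, to skip leaves that cannot contain a closer neighbor.

```python
import heapq
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class TwoLevelIndex:
    """Toy stand-in for a top-tree (routing pivots) plus leaf partitions."""

    def __init__(self, pivots):
        self.pivots = list(pivots)
        self.leaves = [[] for _ in self.pivots]  # (point, label) pairs per leaf
        self.radii = [0.0] * len(self.pivots)    # covering radius of each leaf
        self._seq = 0                            # tie-breaker for heap entries

    def insert(self, point, label):
        # Route the new case to the leaf owned by its closest pivot.
        dists = [euclidean(point, p) for p in self.pivots]
        i = min(range(len(dists)), key=dists.__getitem__)
        self.leaves[i].append((point, label))
        self.radii[i] = max(self.radii[i], dists[i])

    def knn(self, query, k):
        """Return (sorted [(distance, label)], number of leaves searched)."""
        order = sorted((euclidean(query, p), i) for i, p in enumerate(self.pivots))
        best = []  # max-heap emulated with negated distances
        visited = 0
        for d_pivot, i in order:
            # Triangle inequality: every point in leaf i is at least
            # d_pivot - radius away from the query, so the leaf can be
            # skipped once k closer candidates are already known.
            if len(best) == k and d_pivot - self.radii[i] > -best[0][0]:
                continue
            visited += 1
            for point, label in self.leaves[i]:
                d = euclidean(query, point)
                self._seq += 1
                if len(best) < k:
                    heapq.heappush(best, (-d, self._seq, label))
                elif d < -best[0][0]:
                    heapq.heapreplace(best, (-d, self._seq, label))
        neighbors = sorted((-nd, label) for nd, _, label in best)
        return neighbors, visited


def classify(index, query, k=3):
    """Majority vote among the k nearest stored cases."""
    neighbors, _ = index.knn(query, k)
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Two pivots, each collecting a small cluster of labeled cases.
index = TwoLevelIndex(pivots=[(0.0, 0.0), (10.0, 10.0)])
for point, label in [((0, 1), "a"), ((1, 0), "a"), ((1, 1), "a"),
                     ((9, 10), "b"), ((10, 9), "b"), ((9, 9), "b")]:
    index.insert(point, label)

print(classify(index, (0.5, 0.5)))
```

In this toy data set the query lies well inside the first cluster, so the pruning test lets the search skip the second leaf entirely. In a distributed setting the leaves that survive routing would be searched concurrently instead of in a loop.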
Include this package in your Spark Applications using:
spark-shell, pyspark, or spark-submit
> $SPARK_HOME/bin/spark-shell --packages sramirez:spark-IS-streaming:0.8
If you use the sbt-spark-package plugin, in your sbt build file, add:
spDependencies += "sramirez/spark-IS-streaming:0.8"
Otherwise, add:
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies += "sramirez" % "spark-IS-streaming" % "0.8"
Maven
In your pom.xml, add:
<dependencies>
  <!-- list of dependencies -->
  <dependency>
    <groupId>sramirez</groupId>
    <artifactId>spark-IS-streaming</artifactId>
    <version>0.8</version>
  </dependency>
</dependencies>
<repositories>
  <!-- list of other repositories -->
  <repository>
    <id>SparkPackagesRepo</id>
    <url>http://dl.bintray.com/spark-packages/maven</url>
  </repository>
</repositories>