spark-IS-streaming

A Nearest Neighbor Classifier for High-Speed Big Data Streams with Instance Selection

@sramirez

Here we present an efficient nearest neighbor solution for classifying fast, massive data streams using Apache Spark. It consists of a distributed case-base and an instance selection method that improves its performance and effectiveness. A distributed metric tree (based on M-trees) organizes the case-base and thereby speeds up neighbor searches. This distributed tree consists of a top-tree (on the master node) that routes searches through the first levels, and several leaf nodes (on the slave nodes) that resolve the searches at the remaining levels in a fully parallel scheme.
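
The routing idea described above can be illustrated with a minimal, single-machine sketch in plain Python. It is not this package's API or its actual M-tree implementation: the class `RoutedCaseBase`, its methods, and the sample data are all hypothetical, the "top-tree" is reduced to a flat list of pivot points, and each "leaf" is an in-memory list searched by brute force. Because a query is sent to a single leaf, the result is approximate whenever the true neighbors lie in another leaf.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class RoutedCaseBase:
    """Hypothetical stand-in for a top-tree plus leaf nodes.

    The router (flat pivot list) plays the role of the top-tree on the
    master; each leaf list plays the role of a subtree held by a worker.
    """

    def __init__(self, pivots):
        self.pivots = list(pivots)
        self.leaves = [[] for _ in self.pivots]  # one case list per pivot

    def route(self, point):
        """Return the index of the leaf whose pivot is closest to point."""
        return min(range(len(self.pivots)),
                   key=lambda i: euclidean(point, self.pivots[i]))

    def insert(self, point, label):
        """Store a labeled case in the leaf chosen by the router."""
        self.leaves[self.route(point)].append((point, label))

    def classify(self, query, k=3):
        """Majority vote over the k nearest cases in the routed leaf."""
        leaf = self.leaves[self.route(query)]
        if not leaf:
            return None  # no cases stored in this leaf yet
        nearest = sorted(leaf, key=lambda case: euclidean(query, case[0]))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Assumed toy data: two well-separated clusters, one pivot per cluster.
cb = RoutedCaseBase(pivots=[(0.0, 0.0), (10.0, 10.0)])
for point, label in [((0.1, 0.2), "a"), ((0.3, 0.1), "a"),
                     ((9.8, 10.1), "b"), ((10.2, 9.9), "b")]:
    cb.insert(point, label)

print(cb.classify((0.0, 0.1)))
print(cb.classify((9.9, 9.9)))
```

In a distributed setting the leaves would live on separate workers, so many queries routed to different leaves can be resolved in parallel; the sketch only shows the two-stage "route, then search locally" structure.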


Tags

  • streaming
  • machine learning
  • instance selection
  • reduction

How to

Include this package in your Spark Applications using:

spark-shell, pyspark, or spark-submit

> $SPARK_HOME/bin/spark-shell --packages sramirez:spark-IS-streaming:0.8

sbt

If you use the sbt-spark-package plugin, in your sbt build file, add:

spDependencies += "sramirez/spark-IS-streaming:0.8"

Otherwise,

resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"

libraryDependencies += "sramirez" % "spark-IS-streaming" % "0.8"

Maven

In your pom.xml, add:
<dependencies>
  <!-- list of dependencies -->
  <dependency>
    <groupId>sramirez</groupId>
    <artifactId>spark-IS-streaming</artifactId>
    <version>0.8</version>
  </dependency>
</dependencies>
<repositories>
  <!-- list of other repositories -->
  <repository>
    <id>SparkPackagesRepo</id>
    <url>https://repos.spark-packages.org/</url>
  </repository>
</repositories>

Releases

Version: 0.8 (commit 581a5e) / Date: 2017-01-27 / License: Apache-2.0 / Scala version: 2.10