A community index of third-party packages for Apache Spark.

Showing packages 1 - 50 out of 512

Another, hopefully better, implementation of ALS on Spark (already merged into MLlib)

@mengxr / Latest release: 0.1.0 (2014-11-27) / BSD 3-Clause / (1)

  • 3|ml
  • 2|mllib
  • 2|recommendation


An example project for doing grid search in MLlib

@spark-ml / Latest release: 0.0.1 (2014-11-27) / BSD 3-Clause / (2)

  • 1|ml
  • 1|example
  • 1|examples


Integration utilities for using Spark with Apache Avro data

@databricks / Latest release: 4.0.0-s_2.11 (2017-10-30) / Apache-2.0 / (13)

  • 6|sql
  • 4|input
  • 4|avro


Redshift Data Source for Apache Spark

@databricks / Latest release: 3.0.0-preview1 (2016-11-01) / Apache-2.0 / (3)

  • 2|sql
  • 2|data source
  • 2|redshift


High Performance Kafka Consumer for Spark Streaming. Supports Multi Topic Fetch, Kafka Security. Reliable offset management in ZooKeeper. No data loss. No dependency on HDFS and WAL. Built-in PID rate controller. Supports Message Handler. Offset Lag checker.

@dibbhatt / Latest release: 2.1.0 (2019-08-28) / Apache-2.0 / (7)

  • 4|streaming
  • 3|kafka


Large-scale neural data analysis with Spark

@freeman-lab / Latest release: 0.4.1 (2014-11-27) / BSD 3-Clause / (6)

  • 3|neuroscience
  • 2|python
  • 2|machine learning


Spark launch script for Microsoft Azure

@sigmoidanalytics / No release yet / (10)

  • 1|Azure
  • 1|spark
  • 1|Microsoft


Pig on Apache Spark

@sigmoidanalytics / No release yet / (9)

  • 1|streaming
  • 1|spark
  • 1|pig


Spark launch script for Google Compute Engine

@sigmoidanalytics / No release yet / (8)

  • 1|Google Compute Engine
  • 1|spark
  • 1|GCE


REST job server for Spark

@spark-jobserver / No release yet / (3)

  • 1|application
  • 1|REST
  • 1|Mesos


Gaussian Mixture Model implementation in PySpark

@FlytxtRnD / Latest release: 0.1 (2015-04-07) / EPL-1.0 / (5)

  • 1|python
  • 1|mllib


Spark SQL CSV data source

@databricks / Latest release: 1.5.0-s_2.11 (2016-09-07) / Apache-2.0 / (10)

  • 4|csv
  • 3|sql
  • 2|DataSource


An efficient updatable key-value store for Apache Spark

@amplab / Latest release: 0.4.0 (2017-01-11) / Apache-2.0 / (1)

  • 2|core
  • 2|kv
  • 1|anothertag


Performance tests for Spark

@databricks / No release yet / (1)


KillrWeather is a reference application (in progress) showing how to easily leverage and integrate Apache Spark, Apache Cassandra, and Apache Kafka for fast, streaming computations on time series data in asynchronous Akka event-driven environments.

@killrweather / No release yet / (1)

  • 1|streaming


Locality Sensitive Hashing for Apache Spark

@mrsqueeze / No release yet / (0)

  • 1|mllib
  • 1|lsh


Zeppelin, a web-based notebook that enables interactive data analytics.

@NFLabs / No release yet / (3)

  • 1|Applications
  • 1|notebook
  • 1|interactive


Integration utilities for using Spark with Apache HBase data

@haosdent / No release yet / (1)

  • 1|hbase


Spark SQL DBF Library

@mraad / No release yet / (0)

  • 1|sql


A Scala example of using Spark to read data stored in HBase, and an example converter for Python

@GenTang / No release yet / (3)

  • 1|python
  • 1|hbase


A Clojure library for Apache Spark: fast, fully featured, and developer-friendly

@gorillalabs / Latest release: 1.0.0 (2014-12-31) / EPL-1.0 / (3)

  • 2|clojure
  • 1|API


A kernel that enables applications to interact with Apache Spark.

@ibm-et / No release yet / (0)

  • 1|ipython
  • 1|foundation
  • 1|interactive


Learn the PySpark API through pictures and simple examples

@jkthompson / No release yet / (0)

  • 2|tutorials
  • 2|examples


Connecting Apache Spark with different data stores

@Stratio / Latest release: 0.7.0-RC1 (2015-01-14) / Apache-2.0 / (20)

  • 6|database
  • 6|mongo
  • 6|cassandra


Streaming CEP Engine Powered by Spark Streaming & Siddhi

@Stratio / Latest release: 0.6.2 (2015-01-14) / Apache-2.0 / (19)

  • 5|spark streaming
  • 5|cep
  • 4|complex event processing


This project generalizes the Spark MLlib K-Means clusterer to support arbitrary distance functions

@derrickburns / No release yet / (3)

  • 1|clustering
  • 1|mllib
  • 1|machine learning


Sparkling Water provides H2O algorithms inside a Spark cluster

@h2oai / Latest release: 1.4.3 (2015-07-06) / Apache-2.0 / (2)

  • 1|h2o
  • 1|algorithms
  • 1|machine learning


Visualize streaming machine learning in Spark

@freeman-lab / No release yet / (1)

  • 1|streaming
  • 1|machine learning
  • 1|visualization


Apache Camel Streaming Consumer

@synsys / Latest release: 1.0.0 (2015-01-26) / Apache-2.0 / (0)

  • 1|streaming
  • 1|consumer
  • 1|camel


Connect Spark to HBase for reading and writing data with ease

@nerdammer / Latest release: 1.0.3 (2016-04-20) / Apache-2.0 / (3)

  • 1|streaming
  • 1|hbase
  • 1|library


Base classes to use when writing tests with Spark

@holdenk / Latest release: 2.2.2_0.11.0 (2018-12-23) / Apache-2.0 / (10)

  • 3|testing
  • 1|streaming
  • 1|tools


An external PySpark module that works like R's read.csv or pandas' read_csv, with automatic type inference and null value handling. Parses CSV data into a SchemaRDD. No installation required; simply include pyspark_csv.py via SparkContext.

@seahboonsiew / No release yet / (1)

  • 2|python
  • 2|csv
  • 1|sql


MongoDB data source for Spark SQL

@Stratio / Latest release: 0.12.0 (2016-08-31) / Apache-2.0 / (14)

  • 5|MongoDB
  • 5|Spark SQL
  • 2|sql


PySpark Cassandra brings back the fun in working with Cassandra data in PySpark.

@TargetHolding / Latest release: 0.3.5 (2016-03-30) / Apache-2.0 / (1)

  • 1|python
  • 1|spark
  • 1|sql


DBSCAN clustering algorithm on top of Apache Spark

@alitouka / No release yet / (0)

  • 1|clustering
  • 1|ml
  • 1|dbscan


A Spark Package Template

@brkyvz / Latest release: 1.2-s_2.10 (2016-05-25) / Apache-2.0 / (1)

  • 1|python
  • 1|demo
  • 1|template


Sbt plugin for Spark packages

@databricks / Latest release: 0.2.4 (2016-07-15) / Apache-2.0 / (3)

  • 1|tools
  • 1|sbt


Use Cascading Taps and Scalding DSL with Spark

@tresata / Latest release: 0.5.0-s_2.10 (2015-11-13) / Apache-2.0 / (0)


Secondary sort and streaming reduce for Spark

@tresata / Latest release: 0.4.0-s_2.11 (2015-11-03) / Apache-2.0 / (0)

  • 1|core


Low level integration of Spark and Kafka

@tresata / Latest release: 0.6.0-s_2.10 (2015-11-13) / Apache-2.0 / (0)

  • 1|streaming


Connects Spark to Cassandra

@datastax / Latest release: 2.4.0-s_2.11 (2018-11-29) / Apache-2.0 / (14)

  • 3|spark
  • 3|cassandra
  • 2|nosql


A command line tool for Spark packages

@databricks / Latest release: 0.3.0 (2015-03-17) / Apache-2.0 / (1)

  • 1|tools


Data-Driven Spark allows quick data exploration based on Apache Spark

@FRosner / No release yet / (0)


Power BI API adapter for Apache Spark

@granturing / Latest release: 1.5.0_0.0.7 (2015-09-13) / Apache-2.0 / (0)

  • 2|streaming
  • 1|sql
  • 1|realtime


Spark Streaming, Machine Learning and meetup.com streaming API.

@actions / No release yet / (1)

  • 1|ml
  • 1|example
  • 1|streaming


Pure Python package used for testing Spark Packages

@brkyvz / Latest release: 0.4.2 (2016-02-14) / Apache-2.0 / (0)


Feature selection based on information gain: maximum relevancy minimum redundancy

@wxhC3SC6OPm8M1HXboMy / No release yet / (0)

  • 1|mllib


Handy routine to import CSV files as tables in Spark SQL

@wxhC3SC6OPm8M1HXboMy / No release yet / (0)


Spark connector for SequoiaDB

@SequoiaDB / Latest release: 1.12-s_2.11 (2015-03-30) / Apache-2.0 / (2)

  • 2|sequoiadb
  • 2|nosql
  • 2|sql


Use Apache Spark straight from the Browser

@andypetrella / Latest release: v0.4.0 (2015-03-29) / Apache-2.0 / (2)

  • 1|notebook
  • 1|charts
  • 1|interactive