Publishing your Spark Package in the Spark Packages repository

If you wish to publish your Spark Package on the Spark Packages repository without going through the hassle of publishing on Maven Central, you need to supply a release artifact. Publishing your package on the Spark Packages repository has several advantages:

  1. Easy integration with Spark: Spark users can include your package in their applications with ease using the --packages $PACKAGE_NAME:$PACKAGE_VERSION argument in spark-submit, spark-shell, or pyspark (see the example command after this list).
  2. Language interoperability: Does your package provide a Python API while the core of your code is in Scala, and you don't know how to distribute it properly? By setting up the release artifact properly, your package will work with both Python and Scala/Java.
  3. Scala/Java API compatibility: Did you write your code against Spark 1.2 and are you wondering whether it works against Spark 1.1? We run Scala/Java API compliance checks against minor Spark releases from 1.0 onwards. This allows users to run your package with compatible versions of Spark.
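
For example, once the 0.2 release of `databricks/spark-avro` is published with the POM settings described below, users could pull it into a shell session roughly as follows. The coordinates follow the $GROUP:$ARTIFACT:$VERSION pattern, and the exact version shown here is illustrative.

> $SPARK_HOME/bin/spark-shell --packages databricks:spark-avro:0.2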

The Release Artifact

The Release Artifact is a zip file that includes a jar file and a pom file. The name of the artifact must be in the format $GITHUB_REPO_NAME-$VERSION.zip. Similarly, the names of the jar and the pom must be $GITHUB_REPO_NAME-$VERSION.jar and $GITHUB_REPO_NAME-$VERSION.pom, respectively. For example, the release artifact for the 0.2 release of the Spark Package databricks/spark-avro would be called spark-avro-0.2.zip, and its contents would be spark-avro-0.2.jar and spark-avro-0.2.pom.

In order to generate the release artifact, we strongly suggest that you use the spDist command of the sbt-spark-package plugin if your package contains any Scala or Java code (possibly alongside Python), or the zip command of the spark-package command line tool if your package consists of pure Python code.
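
As a minimal sketch of the sbt route, assuming the sbt-spark-package plugin is already added to your build, the release artifact can be produced by running the spDist command from the project root; the resulting $GITHUB_REPO_NAME-$VERSION.zip is typically written under the target directory (the exact path depends on your project settings).

> sbt spDist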

The Jar

The jar file should contain your compiled class files. If your project contains Python code, it should also include the compiled Python files (.pyc). The main Python package must be in the root directory of the jar. In addition, if your Python package has Python dependencies, they must be listed in `requirements.txt`, which must also be in the root directory of the jar. For example, here are the contents of the jar for `brkyvz/demo-scala-python`:

> jar tf demo-scala-python_2.10-0.1.jar
    META-INF/MANIFEST.MF
    com/
    com/brkyvz/
    com/brkyvz/spark/
    com/brkyvz/spark/WordCounter$$anonfun$count$1.class
    com/brkyvz/spark/WordCounter$$anonfun$count$2.class
    com/brkyvz/spark/WordCounter.class
    requirements.txt
    setup.pyc
    tests.pyc
    wrdcntr/
    wrdcntr/__init__.pyc
    wrdcntr/wrdcntr.pyc

The jar can be generated by the spPackage command of the sbt-spark-package plugin. If you decide to use the spark-package command line tool and your package contains a `src/main/scala` or `src/main/java` directory, the zip command will ask you to supply a jar containing the compiled classes.
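
As a quick sanity check, a sketch of this flow from sbt is to build the jar with spPackage and then list its contents, as in the listing above, to confirm that the main Python package and requirements.txt sit at the root. Where exactly sbt writes the jar, and its precise name, depend on your project name, version, and Scala version, so treat the filename below as illustrative.

> sbt spPackage
> jar tf demo-scala-python_2.10-0.1.jar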

The POM

The POM is where you declare the metadata of your package. The POM must contain the groupId, artifactId, and version. The version must match the one used in the names of the jar, pom, and release artifact. In addition, the groupId must be the organization name, and the artifactId must be the repository name of the GitHub repository where your package is hosted. For example, even though organization is defined as com.databricks in the build.sbt of `databricks/spark-avro`, the groupId must be set to databricks in the POM.

In addition, sbt appends the Scala binary version (e.g., 2.10) to the artifactId by default. For example, when compiling `databricks/spark-avro` with Scala 2.10.4, sbt makes the artifactId spark-avro_2.10. Similarly, the artifactId is spark-avro_2.11 when compiled with Scala 2.11.5. This is not a valid format for publishing on the Spark Packages Repository: the artifactId must be the name of the GitHub repository, in lowercase.

Here is how sbt creates the POM for the 0.2 release of `databricks/spark-avro` compiled with Scala 2.11. This is wrong!

<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>0.2</version>

Here is how it should be.

<groupId>databricks</groupId>
<artifactId>spark-avro</artifactId>
<version>0.2</version>

If you want to support multiple versions of Scala, you may append the Scala binary version to the version number, separated by a hyphen and prefixed with s_. For example, if the 0.2 release of `databricks/spark-avro` was made with Scala 2.11, then in order to make a 0.2 release with Scala 2.10, use <version>0.2-s_2.10</version>.
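
A sketch of one way to cut both releases with sbt, assuming your build is configured so that the non-default Scala build carries the -s_2.10 suffix in its version, is to run the packaging command once per Scala version using sbt's ++ syntax and then check the <version> entry in each generated POM:

> sbt ++2.11.5 spDist
> sbt ++2.10.4 spDist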

The sbt-spark-package plugin handles all of this for you; therefore, we highly recommend that you use the spDist command to generate the release artifact directly, or spMakePom to generate just the POM. If your package consists of pure Python code, the spark-package command line tool will generate a POM with the correct entries as part of the zip command.
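
If you only want to inspect the POM, for example to double-check the groupId and artifactId before cutting a release, a minimal sketch is to run spMakePom and open the generated file (its exact location under the target directory may vary with your sbt and plugin versions):

> sbt spMakePom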