Publishing your Spark Package in the Spark Packages repository
If you wish to publish your Spark Package on the Spark Packages repository without going through the hassle of publishing on Maven Central, you need to supply a release artifact. Publishing your package on the Spark Packages repository has many advantages:
- Easy integration with Spark: Spark users can include your package in their applications with ease, using the `--packages $PACKAGE_NAME:$PACKAGE_VERSION` argument in spark-submit, spark-shell, or pyspark (as shown below).
- Language interoperability: Does your package provide a Python API, while the core of your code is in Scala, and you don't know how to properly distribute it? By properly setting up the release artifact, your package will work with both Python and Scala/Java.
- Scala/Java API compatibility: Did you write your code against Spark 1.2, and are you wondering whether it works against Spark 1.1? We run Scala/Java API compliance checks against minor Spark releases after 1.0 (inclusive). This allows users to run your package with compatible versions of Spark.
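For example, a user could pull in the 0.2 release of `databricks/spark-avro` (the example used throughout this page) in the shell; the coordinates follow the `groupId:artifactId:version` convention described in the POM section below:
> spark-shell --packages databricks:spark-avro:0.2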
The Release Artifact
The release artifact is a zip file that includes a jar file and a pom file. The name of the artifact must be in the format `$GITHUB_REPO_NAME-$VERSION.zip`. Similarly, the names of the jar and the pom must be `$GITHUB_REPO_NAME-$VERSION.jar` and `$GITHUB_REPO_NAME-$VERSION.pom`, respectively. For example, the release artifact for the 0.2 release of the Spark Package `databricks/spark-avro` is called `spark-avro-0.2.zip`, and its contents are `spark-avro-0.2.jar` and `spark-avro-0.2.pom`.
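As a quick sanity check, listing the contents of the artifact should show exactly those two files. For the `spark-avro` 0.2 example:
> jar tf spark-avro-0.2.zip
spark-avro-0.2.jar
spark-avro-0.2.pom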
In order to generate the release artifact, we strongly suggest that you use the `spDist` command of the sbt-spark-package plugin if your package contains any Scala or Java code (and possibly Python), or the `zip` command of the spark-package command line tool if your package consists of pure Python code.
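For a Scala/Java package, a minimal sketch of generating the artifact with the plugin (assuming the plugin is already configured in your build) looks like this; the exact output path under `target/` depends on your build settings:
> sbt spDist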
The Jar
The jar file should contain your compiled class files. If your project contains Python code, then it should also include the compiled Python files (`.pyc`). The main Python package must be in the root directory of the jar. In addition, if your Python package has Python dependencies, they must be listed in a `requirements.txt` file, which must also be in the root directory of the jar. For example, here are the contents of the jar for `brkyvz/demo-scala-python`:
> jar tf demo-scala-python_2.10-0.1.jar
META-INF/MANIFEST.MF
com/
com/brkyvz/
com/brkyvz/spark/
com/brkyvz/spark/WordCounter$$anonfun$count$1.class
com/brkyvz/spark/WordCounter$$anonfun$count$2.class
com/brkyvz/spark/WordCounter.class
requirements.txt
setup.pyc
tests.pyc
wrdcntr/
wrdcntr/__init__.pyc
wrdcntr/wrdcntr.pyc
The jar can be generated by the `spPackage` command of the sbt-spark-package plugin. If you decide to use the spark-package command line tool and your package contains a `src/main/scala` or `src/main/java` directory, the `zip` command will ask you to supply a jar containing the compiled classes.
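For instance, producing just the jar with the plugin (again assuming it is set up in your build) is a single command:
> sbt spPackage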
The POM
The POM is where you declare the metadata of your package. The POM must contain the `groupId`, `artifactId`, and `version`. The `version` must match the names of the jar, pom, and release artifact. In addition, the `groupId` must be the organization name, and the `artifactId` must be the repository name of the GitHub repository where your package is hosted. For example, even though `organization` is defined as `com.databricks` in the `build.sbt` of `databricks/spark-avro`, the `groupId` must be set to `databricks` in the POM.
In addition, sbt appends the binary Scala version to the `artifactId` by default. For example, when compiling `databricks/spark-avro` with Scala 2.10.4, sbt makes the `artifactId` `spark-avro_2.10`. Similarly, the `artifactId` is `spark-avro_2.11` when compiled with Scala 2.11.5. This is not a valid format for publishing on the Spark Packages Repository. The `artifactId` must be the name of the GitHub repository, in lowercase.
Here is how sbt creates the POM for the 0.2 release of `databricks/spark-avro` when compiled with Scala 2.11. This is wrong!
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>0.2</version>
Here is how it should be.
<groupId>databricks</groupId>
<artifactId>spark-avro</artifactId>
<version>0.2</version>
If you want to support multiple versions of Scala, you may append the Scala version to the `version` number with a hyphen, using the prefix `s`. For example, if the 0.2 release of `databricks/spark-avro` was made with Scala 2.11, then in order to make a 0.2 release with Scala 2.10, use
<version>0.2-s_2.10</version>
The sbt-spark-package plugin handles all of this for you; therefore, we highly recommend that you use the `spDist` command to generate the release artifact directly, or `spMakePom` to generate just the POM.
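A minimal sketch, assuming the plugin is declared in `project/plugins.sbt`:
> sbt spMakePom
The generated POM will use the `groupId`/`artifactId` format shown above.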
If your package consists of only Python code, the spark-package command line tool will generate a POM with the correct entries during the `zip` command.
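As a rough sketch for a pure-Python package (the flag names below are assumptions for illustration; consult the tool's own documentation for the exact interface):
> spark-package zip --folder . --name <org>/<repo> --version <version>
# flags above are illustrative; the command produces the release zip along with a generated POM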