Spark-xml-utils provides the ability to filter documents based on an XPath expression, return specific nodes for an XPath/XQuery expression, or transform documents using an XSLT stylesheet.
@elsevierlabs-os
The spark-xml-utils library was developed because some big datasets contain a large amount of XML, and I felt this data could be better served by some helpful XML utilities. These include the ability to filter documents based on an XPath expression, return specific nodes for an XPath/XQuery expression, or transform documents using an XSLT stylesheet. By providing basic wrappers around Saxon, the library exposes XPath, XSLT, and XQuery functionality that any Spark application can readily use.
Spark-xml-utils is not meant for processing a single large XML record that is gigabytes in size. However, if you have many XML records (we have millions), each in the megabytes or smaller, then it should be a handy tool.
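To make the filter and evaluate operations concrete, here is a minimal sketch of the idea applied to a handful of small records. It does not use the spark-xml-utils API. It relies only on Python's standard library `xml.etree.ElementTree`, which supports a limited subset of XPath (the library itself wraps Saxon, which is far more capable). The sample records and the helper names `matches` and `evaluate` are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample records. In a Spark job, each string would be one
# element of an RDD (one small XML document per record).
records = [
    "<article><title>Spark and XML</title><year>2015</year></article>",
    "<article><title>Older work</title><year>2009</year></article>",
    "<article><title>More Spark</title><year>2015</year></article>",
]


def matches(xml_text, path):
    """Filter step: True if the path selects at least one node in the record."""
    root = ET.fromstring(xml_text)
    return root.find(path) is not None


def evaluate(xml_text, path):
    """Evaluate step: return the text of every node the path selects."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall(path)]


# Keep only the records whose <year> is 2015, then pull out their titles.
# The [.='text'] predicate needs Python 3.7 or later.
kept = [r for r in records if matches(r, "./year[.='2015']")]
titles = [t for r in kept for t in evaluate(r, "./title")]
print(titles)
```

Because each record is parsed independently, the same two functions could in principle be passed to an RDD's `filter` and `flatMap`, which is the usage pattern that suits many small documents rather than one very large one.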
Include this package in your Spark Applications using:
spark-shell, pyspark, or spark-submit
> $SPARK_HOME/bin/spark-shell --packages elsevierlabs-os:spark-xml-utils:1.8.0
If you use the sbt-spark-package plugin, in your sbt build file, add:
spDependencies += "elsevierlabs-os/spark-xml-utils:1.8.0"
Otherwise, add:
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies += "elsevierlabs-os" % "spark-xml-utils" % "1.8.0"
Maven
In your pom.xml, add:
<dependencies>
  <!-- list of dependencies -->
  <dependency>
    <groupId>elsevierlabs-os</groupId>
    <artifactId>spark-xml-utils</artifactId>
    <version>1.8.0</version>
  </dependency>
</dependencies>
<repositories>
  <!-- list of other repositories -->
  <repository>
    <id>SparkPackagesRepo</id>
    <url>http://dl.bintray.com/spark-packages/maven</url>
  </repository>
</repositories>