aut


The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives built around Apache Spark. This toolkit is part of the Archives Unleashed Project.

The toolkit grew out of an earlier project called Warcbase. The following article gives an overview of Warcbase, much of which still applies to the toolkit:

Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives. ACM Journal on Computing and Cultural Heritage, 10(4), Article 22, 2017.


Tags

  • pyspark
  • tools
  • Web archives
  • Digital Humanities

How to

Include this package in your Spark Applications using:

spark-shell, pyspark, or spark-submit

> $SPARK_HOME/bin/spark-shell --packages io.archivesunleashed:aut:0.18.0

sbt

In your sbt build file, add:

libraryDependencies += "io.archivesunleashed" % "aut" % "0.18.0"

Maven

In your pom.xml, add:
<dependencies>
  <!-- list of dependencies -->
  <dependency>
    <groupId>io.archivesunleashed</groupId>
    <artifactId>aut</artifactId>
    <version>0.18.0</version>
  </dependency>
</dependencies>
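
The three snippets above all name the same artifact, written as groupId:artifactId:version. The following Python sketch is only an illustration of that mapping and is not part of the toolkit: the helper names (parse_coordinate, spark_command, sbt_line, maven_block) are hypothetical, and the output formats are copied from the examples on this page.

```python
from typing import NamedTuple


class Coordinate(NamedTuple):
    """A Maven-style coordinate: groupId, artifactId, version."""
    group_id: str
    artifact_id: str
    version: str


def parse_coordinate(text: str) -> Coordinate:
    """Split 'groupId:artifactId:version' into its three parts."""
    parts = text.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected groupId:artifactId:version, got {text!r}")
    return Coordinate(*parts)


def spark_command(coord: Coordinate, tool: str = "spark-shell") -> str:
    """Build the launcher prefix for spark-shell, pyspark, or spark-submit.

    spark-submit also needs an application jar or script after these options.
    """
    if tool not in ("spark-shell", "pyspark", "spark-submit"):
        raise ValueError(f"unsupported tool: {tool}")
    return f"$SPARK_HOME/bin/{tool} --packages {':'.join(coord)}"


def sbt_line(coord: Coordinate) -> str:
    """Build the sbt libraryDependencies line for the coordinate."""
    return (
        f'libraryDependencies += "{coord.group_id}" '
        f'% "{coord.artifact_id}" % "{coord.version}"'
    )


def maven_block(coord: Coordinate) -> str:
    """Build the <dependency> element to place inside <dependencies>."""
    return "\n".join([
        "<dependency>",
        f"  <groupId>{coord.group_id}</groupId>",
        f"  <artifactId>{coord.artifact_id}</artifactId>",
        f"  <version>{coord.version}</version>",
        "</dependency>",
    ])


if __name__ == "__main__":
    aut = parse_coordinate("io.archivesunleashed:aut:0.18.0")
    print(spark_command(aut, "pyspark"))
    print(sbt_line(aut))
    print(maven_block(aut))
```

Changing only the version string (for example to 0.17.0) regenerates all three forms consistently, which is the main reason to treat the coordinate as a single value.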

Releases

Version: 0.18.0 (95e5f0) / Date: 2019-08-21 / License: Apache-2.0 / Scala version: 2.11

Version: 0.17.0 (694382) / Date: 2019-07-18 / License: Apache-2.0 / Scala version: 2.11