soda

Solr Dictionary Annotator (Microservice for Spark)

A REST microservice that annotates entities in text using lexicons (dictionaries, controlled vocabularies) stored in Solr. It provides two major annotation modes, exact and fuzzy. Exact matching uses the SolrTextTagger plugin, which wraps Lucene's FST technology to provide a fast, low-memory matcher. Fuzzy matching chunks incoming text with OpenNLP and matches normalized versions of the resulting phrases against similarly pre-normalized lexicon entries. The microservice is language agnostic and can be called from Spark using Scala, Java, or Python (or other supported languages in the future). For a given document, the service returns zero or more annotations as 4-tuples (begin, end, coveredText, confidence). The system has been tested with dictionaries of more than 8M entries and can be scaled horizontally to meet capacity as needed.
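The fuzzy mode described above can be illustrated with a minimal, self-contained Python sketch. It is not the service's implementation: the function names (normalize, build_index, annotate), the normalization rule (lowercase, drop punctuation, sort tokens), and the confidence values are all assumptions made for illustration, and a simple sliding window of tokens stands in for OpenNLP chunking. The sketch only shows the general idea of matching normalized phrases against a pre-normalized lexicon and emitting (begin, end, coveredText, confidence) tuples.

```python
import re
from typing import Dict, List, Tuple

# (begin, end, coveredText, confidence), mirroring the 4-tuple described above.
Annotation = Tuple[int, int, str, float]


def normalize(phrase: str) -> str:
    """Hypothetical normalization: lowercase, keep alphanumeric tokens, sort them."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    return " ".join(sorted(tokens))


def build_index(lexicon: List[str]) -> Dict[str, str]:
    """Pre-normalize every lexicon entry once, keyed by its normalized form."""
    return {normalize(entry): entry for entry in lexicon}


def annotate(text: str, index: Dict[str, str], max_tokens: int = 3) -> List[Annotation]:
    """Match every window of up to max_tokens tokens against the index.

    The sliding window is a stand-in for phrase chunking; the confidence
    values (1.0 for a verbatim match, 0.5 otherwise) are illustrative only.
    """
    spans = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    annotations: List[Annotation] = []
    for i in range(len(spans)):
        for j in range(i, min(i + max_tokens, len(spans))):
            begin, end = spans[i][0], spans[j][1]
            covered = text[begin:end]
            entry = index.get(normalize(covered))
            if entry is None:
                continue
            confidence = 1.0 if covered == entry else 0.5
            annotations.append((begin, end, covered, confidence))
    return annotations


if __name__ == "__main__":
    index = build_index(["heart attack", "aspirin"])
    for annotation in annotate("Heart attack treated with aspirin", index):
        print(annotation)
```

Because the lexicon is normalized once up front, each candidate phrase costs a single dictionary lookup at annotation time. A real deployment would keep the lexicon in Solr rather than in an in-memory dictionary.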


Tags

  • application
  • tools

How to

This package has no releases published in the Spark Packages repository and no Maven coordinates supplied. You may have to build it from source, or it may simply be a script. To use this Spark Package, follow the instructions in the README.

Releases

No releases yet.