spark-dirty-cat

Similarity encoding of dirty categorical variables (strings)

@rakutentech

DirtyCat (Scala) is a package that leverages Spark ML to perform large-scale machine learning and provides an alternative way to encode string variables. It is largely based on the original Python code.
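To make the idea of similarity encoding concrete, here is a minimal sketch in plain Python. It is not this package's API: the function names `ngrams`, `ngram_similarity`, and `similarity_encode` are hypothetical, and the choice of character 3-grams with a Jaccard-style overlap is an assumption about one common way to measure string similarity. Each input string is encoded as a vector of its similarities to a list of reference categories, so near-duplicate spellings end up with similar vectors instead of unrelated one-hot columns.

```python
def ngrams(s, n=3):
    """Return the set of character n-grams of s (lowercased, space-padded)."""
    padded = f" {s.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        # Both strings are too short to produce any n-gram: treat as identical.
        return 1.0
    return len(ga & gb) / len(ga | gb)


def similarity_encode(values, categories, n=3):
    """Encode each value as its similarity to every reference category."""
    return [[ngram_similarity(v, c, n) for c in categories] for v in values]


if __name__ == "__main__":
    # Hypothetical example data: one clean spelling and one typo.
    categories = ["police officer", "teacher"]
    values = ["police officer", "police oficer"]
    for value, row in zip(values, similarity_encode(values, categories)):
        print(value, [round(x, 2) for x in row])
```

An exact match gets a similarity of 1.0 to its own category, while a misspelling such as "police oficer" still scores high against "police officer" and low against "teacher". In a Spark setting the same computation would typically be applied per row to a string column, but how this package does that is described in its README, not here.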


  • ml
  • machine learning
  • pyspark
  • scala

How to

This package has no releases published in the Spark Packages repo and no Maven coordinates supplied. You may have to build it from source, or it may simply be a script. To use this Spark Package, follow the instructions in the README.


No releases yet.