Efficient string matching in Apache Spark

無奈伤痛 asked on 2020-11-22 16:38

Using an OCR tool I extracted texts from screenshots (about 1-5 sentences each). However, when manually verifying the extracted text, I noticed several errors that occur from time to time (e.g. "Hello there 7l | real|y like Spark!" instead of "Hello there! I really like Spark."). Is there an efficient way to match each extracted text against a collection of known-good texts in Spark?

1 Answer
  • 2020-11-22 16:55

    I wouldn't use Spark in the first place, but if you are really committed to this particular stack, you can combine a bunch of ML transformers to get the best matches. You'll need a Tokenizer (or split); with an empty pattern, RegexTokenizer emits one token per character, so everything downstream works on character n-grams:

    import org.apache.spark.ml.feature.RegexTokenizer
    
    val tokenizer = new RegexTokenizer()
      .setPattern("")
      .setInputCol("text")
      .setMinTokenLength(1)
      .setOutputCol("tokens")
    

    NGram (for example a 3-gram):

    import org.apache.spark.ml.feature.NGram
    
    val ngram = new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams")
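
    To see what the first two stages produce, here is a quick illustrative run (it assumes a SparkSession named spark, whose implicits provide toDF here and below; the outputs in the comments are indicative):

    import spark.implicits._

    // Character tokens, lowercased by RegexTokenizer's default.
    val sample = Seq("Spark!").toDF("text")
    val withTokens = tokenizer.transform(sample)  // tokens: [s, p, a, r, k, !]

    // Character 3-grams, joined with spaces by NGram.
    ngram.transform(withTokens).select("ngrams").show(false)
    // [s p a, p a r, a r k, r k !]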
    

    Vectorizer (for example CountVectorizer or HashingTF):

    import org.apache.spark.ml.feature.HashingTF
    
    val vectorizer = new HashingTF().setInputCol("ngrams").setOutputCol("vectors")
    

    and LSH:

    import org.apache.spark.ml.feature.{MinHashLSH, MinHashLSHModel}
    
    // Increase numHashTables in practice.
    val lsh = new MinHashLSH().setInputCol("vectors").setOutputCol("lsh")
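
    To make that comment concrete, an explicit setting might look like this (5 is an arbitrary illustrative value, and tunedLsh is just a name for this sketch; more hash tables improve recall at extra cost):

    val tunedLsh = new MinHashLSH()
      .setNumHashTables(5)
      .setInputCol("vectors")
      .setOutputCol("lsh")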
    

    Combine everything with a Pipeline:

    import org.apache.spark.ml.Pipeline
    
    val pipeline = new Pipeline().setStages(Array(tokenizer, ngram, vectorizer, lsh))
    

    Fit on example data:

    val query = Seq("Hello there 7l | real|y like Spark!").toDF("text")
    val db = Seq(
      "Hello there 😊! I really like Spark ❤️!",
      "Can anyone suggest an efficient algorithm"
    ).toDF("text")

    val model = pipeline.fit(db)