Simplest method for text lemmatization in Scala and Spark

Asked by 梦谈多话 on 2020-12-30 13:49

I want to use lemmatization on a text file:

surprise heard thump opened door small seedy man clasping package wrapped.

upgrading system found review spring

What is the simplest way to do this in Scala and Spark?
3 Answers
  • 2020-12-30 14:40

    There is a function from the book Advanced Analytics with Spark, from the chapter about lemmatization:

      val plainText = sc.parallelize(List("Sentence to be processed."))

      val stopWords = Set("stopWord")

      import java.util.Properties

      import edu.stanford.nlp.pipeline._
      import edu.stanford.nlp.ling.CoreAnnotations._

      import scala.collection.JavaConversions._
      import scala.collection.mutable.ArrayBuffer
    
      def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
        // build a CoreNLP pipeline with tokenizer, sentence splitter, POS tagger and lemmatizer
        val props = new Properties()
        props.put("annotators", "tokenize, ssplit, pos, lemma")
        val pipeline = new StanfordCoreNLP(props)
        // run the pipeline over the input text
        val doc = new Annotation(text)
        pipeline.annotate(doc)
        val lemmas = new ArrayBuffer[String]()
        val sentences = doc.get(classOf[SentencesAnnotation])
        for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
          val lemma = token.get(classOf[LemmaAnnotation])
          // drop very short tokens and stop words, keep everything else in lower case
          if (lemma.length > 2 && !stopWords.contains(lemma)) {
            lemmas += lemma.toLowerCase
          }
        }
        lemmas
      }
    
      val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
      lemmatized.foreach(println)
    

    Now just apply this to every line with a mapper:

    val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
    

    EDIT:

    I added the following line to the code:

    import scala.collection.JavaConversions._
    

    This import is needed because otherwise sentences is a Java list rather than a Scala collection, and the for-comprehension would not compile. The code should now compile without problems.
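
    Note that scala.collection.JavaConversions is deprecated in newer Scala versions. If you prefer explicit conversions, a minimal sketch of the same loop using scala.collection.JavaConverters and .asScala would look like this (same logic, only the conversion is spelled out):

      import scala.collection.JavaConverters._

      // explicit Java-to-Scala conversions instead of the implicit ones
      for (sentence <- sentences.asScala;
           token <- sentence.get(classOf[TokensAnnotation]).asScala) {
        val lemma = token.get(classOf[LemmaAnnotation])
        if (lemma.length > 2 && !stopWords.contains(lemma)) {
          lemmas += lemma.toLowerCase
        }
      }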

    I used Scala 2.10.4 and the following stanford.nlp dependencies:

    <dependency>
      <groupId>edu.stanford.nlp</groupId>
      <artifactId>stanford-corenlp</artifactId>
      <version>3.5.2</version>
    </dependency>
    <dependency>
      <groupId>edu.stanford.nlp</groupId>
      <artifactId>stanford-corenlp</artifactId>
      <version>3.5.2</version>
      <classifier>models</classifier>
    </dependency>
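
    If you build with sbt rather than Maven, the equivalent dependencies would look roughly like this (a sketch assuming the same 3.5.2 version):

      libraryDependencies ++= Seq(
        "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2",
        "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2" classifier "models"
      )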
    

    You can also look at the stanford.nlp page; there are a lot of examples (in Java): http://nlp.stanford.edu/software/corenlp.shtml

    EDIT:

    mapPartitions version:

    Although I don't know whether it will speed up the job significantly, it creates the pipeline only once per partition.

      def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
        val doc = new Annotation(text)
        pipeline.annotate(doc)
        val lemmas = new ArrayBuffer[String]()
        val sentences = doc.get(classOf[SentencesAnnotation])
        for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
          val lemma = token.get(classOf[LemmaAnnotation])
          if (lemma.length > 2 && !stopWords.contains(lemma)) {
            lemmas += lemma.toLowerCase
          }
        }
        lemmas
      }
    
      val lemmatized = plainText.mapPartitions(p => {
        // build the (expensive) pipeline once per partition and reuse it for every line
        val props = new Properties()
        props.put("annotators", "tokenize, ssplit, pos, lemma")
        val pipeline = new StanfordCoreNLP(props)
        p.map(q => plainTextToLemmas(q, stopWords, pipeline))
      })
      lemmatized.foreach(println)
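
    One more optional tweak (a sketch, not from the book): the stopWords set is captured in the closure and shipped with every task, so for a large stop-word list you could broadcast it once with sc.broadcast. The variable names below are only for illustration:

      // broadcast the stop-word set so it is shipped to each executor only once
      val stopWordsBc = sc.broadcast(stopWords)

      val lemmatized = plainText.mapPartitions(p => {
        val props = new Properties()
        props.put("annotators", "tokenize, ssplit, pos, lemma")
        val pipeline = new StanfordCoreNLP(props)
        p.map(q => plainTextToLemmas(q, stopWordsBc.value, pipeline))
      })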
    
  • 2020-12-30 14:42

    I think @user52045 has the right idea. The only modification I would make would be to use mapPartitions instead of map -- this allows you to only do the potentially expensive pipeline creation once per partition. This may not be a huge hit on a lemmatization pipeline, but it will be extremely important if you want to do something that requires a model, like the NER portion of the pipeline.

    def plainTextToLemmas(text: String, stopWords: Set[String], pipeline:StanfordCoreNLP): Seq[String] = {
      val doc = new Annotation(text)
      pipeline.annotate(doc)
      val lemmas = new ArrayBuffer[String]()
      val sentences = doc.get(classOf[SentencesAnnotation])
      for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
        val lemma = token.get(classOf[LemmaAnnotation])
        if (lemma.length > 2 && !stopWords.contains(lemma)) {
          lemmas += lemma.toLowerCase
        }
      }
      lemmas
    }
    
    val lemmatized = plainText.mapPartitions(strings => {
      val props = new Properties()
      props.put("annotators", "tokenize, ssplit, pos, lemma")
      val pipeline = new StanfordCoreNLP(props)
      strings.map(string => plainTextToLemmas(string, stopWords, pipeline))
    })
    lemmatized.foreach(println)
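
    For completeness, here is a rough end-to-end sketch of how this could be driven from an actual text file; the input path, output path and local master setting are illustrative placeholders, and it reuses the imports and plainTextToLemmas defined above:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("Lemmatizer").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val stopWords = Set("a", "an", "the")        // example stop words
    val lines = sc.textFile("input.txt")         // one record per line of the file

    val lemmatized = lines.mapPartitions { it =>
      val props = new Properties()
      props.put("annotators", "tokenize, ssplit, pos, lemma")
      val pipeline = new StanfordCoreNLP(props)  // created once per partition
      it.map(line => plainTextToLemmas(line, stopWords, pipeline))
    }

    // join the lemmas of each line back into a single string before saving
    lemmatized.map(_.mkString(" ")).saveAsTextFile("lemmas-output")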
    
  • 2020-12-30 14:42

    I would suggest using the Stanford CoreNLP wrapper for Apache Spark, as it gives an official API for the basic CoreNLP functions such as lemmatization, tokenization, etc.

    I have used it for lemmatization on a Spark DataFrame.

    Link: https://github.com/databricks/spark-corenlp
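
    As a rough sketch of what that looks like, based on the project's README (the ssplit and lemma column functions and the example DataFrame are assumptions to check against the wrapper's current API):

    import org.apache.spark.sql.functions._
    import com.databricks.spark.corenlp.functions._  // ssplit, tokenize, lemma, ner, ...

    import spark.implicits._

    val input = Seq(
      (1, "Surprise heard thump opened door small seedy man clasping package wrapped.")
    ).toDF("id", "text")

    // split each document into sentences, then lemmatize every sentence
    val lemmas = input
      .select(explode(ssplit(col("text"))).as("sen"))
      .select(lemma(col("sen")).as("lemmas"))

    lemmas.show(truncate = false)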
