How to combine n-grams into one vocabulary in Spark?

旧巷少年郎 2021-02-04 14:00

Wondering if there is a built-in Spark feature to combine 1-, 2-, n-gram features into a single vocabulary. Setting n=2 in NGram followed by invocation

1 answer
  • 2021-02-04 14:52

    You can train separate NGram and CountVectorizer models, one pair per n-gram order, and merge their output vectors using VectorAssembler.

    from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler
    from pyspark.ml import Pipeline
    
    
    def build_ngrams(inputCol="tokens", n=3):
        # One NGram transformer per order 1..n, all reading the same token column.
        ngrams = [
            NGram(n=i, inputCol=inputCol, outputCol="{0}_grams".format(i))
            for i in range(1, n + 1)
        ]
    
        # One CountVectorizer per order, so each order gets its own vocabulary.
        vectorizers = [
            CountVectorizer(inputCol="{0}_grams".format(i),
                outputCol="{0}_counts".format(i))
            for i in range(1, n + 1)
        ]
    
        # Concatenate the per-order count vectors into a single feature vector.
        assembler = [VectorAssembler(
            inputCols=["{0}_counts".format(i) for i in range(1, n + 1)],
            outputCol="features"
        )]
    
        return Pipeline(stages=ngrams + vectorizers + assembler)
    

    Example usage:

    df = spark.createDataFrame([
      (1, ["a", "b", "c", "d"]),
      (2, ["d", "e", "d"])
    ], ("id", "tokens"))
    
    build_ngrams().fit(df).transform(df) 
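
If you also need the single vocabulary that lines up with the assembled `features` vector, a minimal sketch is to walk the fitted pipeline's stages and concatenate each `CountVectorizerModel.vocabulary` in order. The helper name `combined_vocabulary` is made up for illustration. It assumes the VectorAssembler's `inputCols` follow the same order as the CountVectorizer stages, as they do in the pipeline above.

```python
def combined_vocabulary(pipeline_model):
    """Concatenate the vocabularies of all fitted stages that expose one.

    Hypothetical helper: expects an object with a `stages` list, such as a
    fitted pyspark.ml.PipelineModel. NGram and VectorAssembler stages have no
    `vocabulary` attribute and are skipped.
    """
    vocab = []
    for stage in pipeline_model.stages:
        terms = getattr(stage, "vocabulary", None)
        if terms is not None:
            # Positions are appended in stage order, which is assumed to match
            # the order in which VectorAssembler concatenates the count vectors.
            vocab.extend(terms)
    return vocab
```

With the pipeline from the answer this would be called as `combined_vocabulary(build_ngrams().fit(df))`. Under the ordering assumption above, position `i` of the returned list should then be the n-gram counted at index `i` of `features`.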
    