What is the best way to remove accents with Apache Spark dataframes in PySpark?

Asked 2020-12-06 16:20 by 上瘾入骨i

I need to remove accents from characters in Spanish and other languages in a number of different datasets.

I already wrote a function based on the code provided in this post.
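
For reference, a typical version of such a function looks like this (a sketch using unicodedata in a plain Python UDF; not necessarily the exact code from that post):

import unicodedata

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def strip_accents(s):
    # Decompose characters (NFD) and drop the combining marks.
    if s is None:
        return None
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))

strip_accents_udf = udf(strip_accents, StringType())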

4 Answers
  • 2020-12-06 16:53

    Here's my implementation. Apart from accents I also remove special characters, because I needed to pivot and save a table, and you can't save a table whose column names contain any of the characters " ,;{}()\n\t=\/".

    
    import re
    
    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType, StringType, StructType, StructField
    from unidecode import unidecode
    
    spark = SparkSession.builder.getOrCreate()
    data = [(1, "  \\ / \\ {____} aŠdá_ \t =  \n () asd ____aa 2134_ 23_"), (1, "N"), (2, "false"), (2, "1"), (3, "NULL"),
            (3, None)]
    schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
    df = spark.createDataFrame(data, schema)
    df.show()
    
    for col_name in ["txt"]:
        tmp_dict = {}
        # Collect the distinct values, compute the cleaned form of each,
        # then replace them all at once with na.replace.
        for col_value in [row[0] for row in df.select(col_name).distinct().toLocalIterator()
                          if row[0] is not None]:
            # Replace forbidden characters with "_" (a raw string avoids fragile escaping).
            new_col_value = re.sub(r"[ ,;{}()\n\t=\\/]", "_", col_value)
            # Collapse runs of "_" and strip them from both ends.
            new_col_value = re.sub("_+", "_", new_col_value).strip("_")
            # Transliterate accented characters to plain ASCII.
            new_col_value = unidecode(new_col_value)
            tmp_dict[col_value] = new_col_value.lower()
        df = df.na.replace(to_replace=tmp_dict, subset=[col_name])
    df.show()
    

    If you can't access external libraries (like me), you can replace unidecode with:

    # The commas in both strings map to themselves and merely keep the two
    # strings visually aligned; every other character maps to the unaccented
    # counterpart at the same position.
    new_col_value = new_col_value.translate(str.maketrans(
                        "ä,ö,ü,ẞ,á,ä,č,ď,é,ě,í,ĺ,ľ,ň,ó,ô,ŕ,š,ť,ú,ů,ý,ž,Ä,Ö,Ü,ẞ,Á,Ä,Č,Ď,É,Ě,Í,Ĺ,Ľ,Ň,Ó,Ô,Ŕ,Š,Ť,Ú,Ů,Ý,Ž",
                        "a,o,u,s,a,a,c,d,e,e,i,l,l,n,o,o,r,s,t,u,u,y,z,A,O,U,S,A,A,C,D,E,E,I,L,L,N,O,O,R,S,T,U,U,Y,Z"))
    
  • 2020-12-06 17:00

    One possible improvement is to build a custom Transformer that handles Unicode normalization, plus a corresponding Python wrapper. It should reduce the overall overhead of passing data between the JVM and Python, and it doesn't require any modifications to Spark itself or access to private APIs.

    On JVM side you'll need a transformer similar to this one:

    package net.zero323.spark.ml.feature
    
    import java.text.Normalizer
    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.param._
    import org.apache.spark.ml.util._
    import org.apache.spark.sql.types.{DataType, StringType}
    
    class UnicodeNormalizer (override val uid: String)
      extends UnaryTransformer[String, String, UnicodeNormalizer] {
    
      def this() = this(Identifiable.randomUID("unicode_normalizer"))
    
      private val forms = Map(
        "NFC" -> Normalizer.Form.NFC, "NFD" -> Normalizer.Form.NFD,
        "NFKC" -> Normalizer.Form.NFKC, "NFKD" -> Normalizer.Form.NFKD
      )
    
      val form: Param[String] = new Param(this, "form", "unicode form (one of NFC, NFD, NFKC, NFKD)",
        ParamValidators.inArray(forms.keys.toArray))
    
      def setForm(value: String): this.type = set(form, value)
    
      def getForm: String = $(form)
    
      setDefault(form -> "NFKD")
    
      override protected def createTransformFunc: String => String = {
        val normalizerForm = forms($(form))
        (s: String) => Normalizer.normalize(s, normalizerForm)
      }
    
      override protected def validateInputType(inputType: DataType): Unit = {
        require(inputType == StringType, s"Input type must be string type but got $inputType.")
      }
    
      override protected def outputDataType: DataType = StringType
    }
    

    Corresponding build definition (adjust Spark and Scala versions to match your Spark deployment):

    name := "unicode-normalization"
    
    version := "1.0"
    
    crossScalaVersions := Seq("2.11.12", "2.12.8")
    
    organization := "net.zero323"
    
    val sparkVersion = "2.4.0"
    
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion,
      "org.apache.spark" %% "spark-sql" % sparkVersion,
      "org.apache.spark" %% "spark-mllib" % sparkVersion
    )
    

    On the Python side you'll need a wrapper similar to this one:

    from pyspark.ml.param.shared import *
    # from pyspark.ml.util import keyword_only  # in Spark < 2.0
    from pyspark import keyword_only 
    from pyspark.ml.wrapper import JavaTransformer
    
    class UnicodeNormalizer(JavaTransformer, HasInputCol, HasOutputCol):
    
        @keyword_only
        def __init__(self, form="NFKD", inputCol=None, outputCol=None):
            super(UnicodeNormalizer, self).__init__()
            self._java_obj = self._new_java_obj(
                "net.zero323.spark.ml.feature.UnicodeNormalizer", self.uid)
            self.form = Param(self, "form",
                "unicode form (one of NFC, NFD, NFKC, NFKD)")
            # kwargs = self.__init__._input_kwargs  # in Spark < 2.0
            kwargs = self._input_kwargs
            self.setParams(**kwargs)
    
        @keyword_only
        def setParams(self, form="NFKD", inputCol=None, outputCol=None):
            # kwargs = self.setParams._input_kwargs  # in Spark < 2.0
            kwargs = self._input_kwargs
            return self._set(**kwargs)
    
        def setForm(self, value):
            return self._set(form=value)
    
        def getForm(self):
            return self.getOrDefault(self.form)
    

    Build Scala package:

    sbt +package
    

    Include it when you start the shell or submit a job. For example, for a Spark build with Scala 2.11:

    bin/pyspark --jars path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar \
     --driver-class-path path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar
    

    and you should be ready to go. All that is left is a little bit of regexp magic:

    from pyspark.sql.functions import regexp_replace
    
    normalizer = UnicodeNormalizer(form="NFKD",
        inputCol="text", outputCol="text_normalized")
    
    df = sc.parallelize([
        (1, "Maracaibó"), (2, "New York"),
        (3, "   São Paulo   "), (4, "~Madrid")
    ]).toDF(["id", "text"])
    
    (normalizer
        .transform(df)
        .select(regexp_replace("text_normalized", r"\p{M}", ""))
        .show())
    
    ## +--------------------------------------+
    ## |regexp_replace(text_normalized,\p{M},)|
    ## +--------------------------------------+
    ## |                             Maracaibo|
    ## |                              New York|
    ## |                          Sao Paulo   |
    ## |                               ~Madrid|
    ## +--------------------------------------+
    

    Please note that this follows the same conventions as the built-in text transformers and is not null safe. You can easily correct for that by checking for null in createTransformFunc.
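
    If you'd rather not touch the Scala side, one workaround is to substitute a placeholder for nulls on the Python side before transforming (a minimal sketch, assuming an empty string is an acceptable stand-in):

    from pyspark.sql.functions import coalesce, lit

    # Replace nulls so the transformer never sees null input.
    df_safe = df.withColumn("text", coalesce(df["text"], lit("")))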

  • 2020-12-06 17:02

    Another way of doing it, using the Python Unicode database:

    import unicodedata
    import sys
    
    from pyspark.sql.functions import translate, regexp_replace
    
    def make_trans():
        matching_string = ""
        replace_string = ""

        # Scan the Unicode range for characters whose name contains "WITH"
        # (e.g. "LATIN SMALL LETTER A WITH ACUTE") and map each one to its
        # base character (e.g. "LATIN SMALL LETTER A").
        for i in range(ord(" "), sys.maxunicode):
            name = unicodedata.name(chr(i), "")
            if "WITH" in name:
                try:
                    base = unicodedata.lookup(name.split(" WITH")[0])
                    matching_string += chr(i)
                    replace_string += base
                except KeyError:
                    pass

        return matching_string, replace_string
    
    def clean_text(c):
        matching_string, replace_string = make_trans()
        return translate(
            # First strip combining marks, then transliterate precomposed
            # accented characters via the translation table.
            regexp_replace(c, r"\p{M}", ""),
            matching_string, replace_string
        ).alias(c)
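
    One caveat: make_trans scans the whole Unicode range on every clean_text call, which takes a few seconds on the driver. Since the result never changes, you can compute it once and reuse it (a sketch):

    # Build the translation tables once instead of on every call.
    MATCHING_STRING, REPLACE_STRING = make_trans()

    def clean_text(c):
        return translate(
            regexp_replace(c, r"\p{M}", ""),
            MATCHING_STRING, REPLACE_STRING
        ).alias(c)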
    

    Now let's test it:

    df = sc.parallelize([
        (1, "Maracaibó"), (2, "New York"),
        (3, "   São Paulo   "), (4, "~Madrid"),
        (5, "São Paulo"), (6, "Maracaibó")
    ]).toDF(["id", "text"])
    
    df.select(clean_text("text")).show()
    ## +---------------+
    ## |           text|
    ## +---------------+
    ## |      Maracaibo|
    ## |       New York|
    ## |   Sao Paulo   |
    ## |        ~Madrid|
    ## |      Sao Paulo|
    ## |      Maracaibo|
    ## +---------------+
    

    Credit to @zero323.

  • 2020-12-06 17:14

    This solution is Python only, but it is only useful if the number of possible accents is low (e.g. a single language like Spanish) and the character replacements are specified manually.

    There seems to be no built-in way to do what you asked for directly without UDFs; however, you can chain many regexp_replace calls to replace each possible accented character. I tested the performance of this solution, and it turns out that it only runs faster if you have a very limited set of accents to replace. In that case it can be faster than UDFs because it is evaluated outside of Python.

    from pyspark.sql.functions import col, regexp_replace
    
    accent_replacements_spanish = [
        (u'á', 'a'), (u'Á', 'A'),
        (u'é', 'e'), (u'É', 'E'),
        (u'í', 'i'), (u'Í', 'I'),
        (u'ó|ò', 'o'), (u'Ó', 'O'),
        (u'ú|ü', 'u'), (u'Ú|Ű', 'U'),
        (u'ñ', 'n'),
        # see http://stackoverflow.com/a/18123985/3810493 for other characters

        # this will convert all remaining non-ASCII characters to a question mark:
        ('[^\x00-\x7F]', '?')
    ]
    
    def remove_accents(column):
        r = col(column)
        for a, b in accent_replacements_spanish:
            r = regexp_replace(r, a, b)
        return r.alias('remove_accents(' + column + ')')
    
    df = sqlContext.createDataFrame([['Olà'], ['Olé'], ['Núñez']], ['str'])
    df.select(remove_accents('str')).show()
    

    I haven't compared the performance with the other answers, and this function is not as general, but it is at least worth considering because you don't need to add Scala or Java to your build process.
