spark extract columns from string

面向向阳花 2021-01-23 09:42

I need help parsing a string that contains a value for each attribute. Below is my sample string:

otherPartofString Name= Type=<1Ac4>         


        
1 Answer
  • 2021-01-23 10:20

    As we discussed, to use the str_to_map function on your sample data, we can set up pairDelim and keyValueDelim as follows:

    pairDelim: '(?i)>? *(?=Name|Type|SqVal|conn ID|conn Loc|dest|$)'
    keyValueDelim: '=<?'
    

    Here pairDelim is case-insensitive ((?i)): it matches an optional > followed by zero or more spaces, and a lookahead requires that the next text be one of the predefined keys (the alternation is generated dynamically with '|'.join(keys)) or the end-of-string anchor $. keyValueDelim is an = followed by an optional <.

    from pyspark.sql import functions as F
    
    df = spark.createDataFrame([                                               
       ("otherPartofString Name=<Series VR> Type=<1Ac4> SqVal=<34> conn ID=<2>",),   
       ("otherPartofString Name=<Series X> Type=<1B3> SqVal=<34> conn ID=<2> conn Loc=sfo dest=chc bridge otherpartofString..",)
    ],["value"])
    
    keys = ["Name", "Type", "SqVal", "conn ID", "conn Loc", "dest"]
    
    # on Spark 3.0+, uncomment the following conf to avoid the duplicate map key error
    #spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
    
    df.withColumn("m", F.expr("str_to_map(value, '(?i)>? *(?={}|$)', '=<?')".format('|'.join(keys)))) \
        .select([F.col('m')[k].alias(k) for k in keys]) \
        .show()
    +---------+----+-----+-------+--------+--------------------+
    |     Name|Type|SqVal|conn ID|conn Loc|                dest|
    +---------+----+-----+-------+--------+--------------------+
    |Series VR|1Ac4|   34|      2|    null|                null|
    | Series X| 1B3|   34|      2|     sfo|chc bridge otherp...|
    +---------+----+-----+-------+--------+--------------------+
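
    To see how the two delimiters carve up a row without starting a Spark session, here is a minimal pure-Python sketch. It assumes that Python's re module (3.7 or later) treats these particular constructs ((?i), the lookahead, $) the same way as the Java regex engine used by Spark. The helper str_to_map_sketch is a hypothetical stand-in, not Spark's implementation, and it simply skips the empty pieces that zero-width matches of pairDelim leave behind.

```python
import re

keys = ["Name", "Type", "SqVal", "conn ID", "conn Loc", "dest"]

# Same two delimiters as above, built from the key list.
pair_delim = r"(?i)>? *(?={}|$)".format("|".join(keys))
kv_delim = r"=<?"


def str_to_map_sketch(text):
    """Hypothetical helper: rough pure-Python stand-in for str_to_map."""
    result = {}
    for piece in re.split(pair_delim, text):
        if not piece:
            # zero-width matches of pair_delim leave empty pieces; skip them
            continue
        # split each piece into key and value at the first key/value delimiter
        parts = re.split(kv_delim, piece, maxsplit=1)
        # a piece without a delimiter becomes a key with no value
        result[parts[0]] = parts[1] if len(parts) > 1 else None
    return result


row = "otherPartofString Name=<Series VR> Type=<1Ac4> SqVal=<34> conn ID=<2>"
parsed = str_to_map_sketch(row)
print({k: parsed.get(k) for k in keys})
```

    The leading text otherPartofString ends up as a key with no value, which is why the answer selects only the known keys from the map afterwards.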
    

    The value of the last mapped key needs some post-processing, since there is no anchor or pattern that separates it from the unrelated text that follows. This can affect any key, not only dest, whenever that key happens to come last. Please let me know if you can specify a pattern for the values.
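
    As an illustration of such post-processing, the sketch below keeps only the first whitespace-delimited token of a value. This rests on an assumption the question does not confirm: that a value written without angle brackets never contains spaces. The function name trim_bare_value is hypothetical, and it should be applied only to the unbracketed trailing value, since a bracketed value such as Series VR legitimately contains a space.

```python
def trim_bare_value(value):
    """Hypothetical clean-up: keep only the first whitespace-delimited token.

    Assumes (not confirmed by the question) that unbracketed values
    never contain spaces.
    """
    if value is None:
        return None
    tokens = value.split()
    # fall back to the original value when it is empty or all whitespace
    return tokens[0] if tokens else value


print(trim_bare_value("chc bridge otherpartofString.."))
```

    In PySpark the same idea could probably be expressed with F.regexp_extract on the mapped column, but the right pattern depends on what the real values look like.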

    Edit: If the map approach turns out to be inefficient for case-insensitive lookups, since it requires some expensive pre-processing, try extracting each key directly with regexp_extract:

    ptn = '|'.join(keys)
    df.select("*", *[F.regexp_extract('value', r'(?i)\b{0}=<?([^=>]+?)>? *(?={1}|$)'.format(k,ptn), 1).alias(k) for k in keys]).show()
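
    The per-key pattern can be checked in plain Python before running it on a cluster. This is a sketch under the assumption that Python's re and Spark's Java regex agree on these constructs. The helper extract is hypothetical and returns an empty string when nothing matches, which is how regexp_extract behaves for a non-matching row.

```python
import re

keys = ["Name", "Type", "SqVal", "conn ID", "conn Loc", "dest"]
ptn = "|".join(keys)


def extract(text, key):
    """Hypothetical stand-in for F.regexp_extract(..., 1)."""
    # key, '=', optional '<', then the shortest value that is followed by
    # an optional '>', optional spaces, and either another key or the end
    pattern = r"(?i)\b{0}=<?([^=>]+?)>? *(?={1}|$)".format(key, ptn)
    match = re.search(pattern, text)
    return match.group(1) if match else ""


row = ("otherPartofString Name=<Series X> Type=<1B3> SqVal=<34> "
       "conn ID=<2> conn Loc=sfo dest=chc bridge otherpartofString..")
print({k: extract(row, k) for k in keys})
```

    The lazy quantifier +? is what stops each value at the next key. For the last key there is no next key, so the value runs to the end of the string.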
    

    If angle brackets are used only when a value, or the key immediately after it, contains non-word characters, the extraction can be simplified with some pre-processing that wraps every bare value in < and >:

    # wrap bare values such as =sfo in angle brackets (=<sfo>); raw string keeps \w intact
    df.withColumn('value', F.regexp_replace('value', r'=(\w+)', '=<$1>')) \
        .select("*", *[F.regexp_extract('value', r'(?i)\b{0}=<([^>]+)>'.format(k), 1).alias(k) for k in keys]) \
        .show()
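
    The effect of the wrapping step is easy to inspect in plain Python. The sketch below is only an illustration with a hypothetical helper name. Note that Python writes the backreference as \1, whereas Spark's regexp_replace uses $1.

```python
import re


def wrap_bare_values(text):
    """Hypothetical helper: wrap every unbracketed word value in < and >."""
    # '=<...' is left alone because '<' is not a word character
    return re.sub(r"=(\w+)", r"=<\1>", text)


row = ("otherPartofString Name=<Series X> conn ID=<2> "
       "conn Loc=sfo dest=chc bridge otherpartofString..")
wrapped = wrap_bare_values(row)
print(wrapped)

# once every value is bracketed, a value is simply everything up to the next '>'
dest = re.search(r"(?i)\bdest=<([^>]+)>", wrapped).group(1)
print(dest)
```

    The limitation follows from \w+: a bare value containing a non-word character is cut short, for example dest=new-york becomes dest=<new>-york. That is why this variant only applies when such values are already bracketed in the input.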
    

    Edit 2: Added a dictionary to handle key aliases:

    keys = ["Name", "Type", "SqVal", "ID", "Loc", "dest"]
    
    # aliases are matched case-insensitively; list them only for keys that have any
    key_aliases = {
        'Type': [ 'ThisType', 'AnyName' ],
        'ID': ['conn ID'],
        'Loc': ['conn Loc']
    }
    
    # set up regex pattern for each key differently
    key_ptns = [ (k, '|'.join([k, *key_aliases[k]]) if k in key_aliases else k) for k in keys ]  
    #[('Name', 'Name'),
    # ('Type', 'Type|ThisType|AnyName'),
    # ('SqVal', 'SqVal'),
    # ('ID', 'ID|conn ID'),
    # ('Loc', 'Loc|conn Loc'),
    # ('dest', 'dest')]  
    
    # wrap bare values in angle brackets first; raw string keeps \w intact
    df.withColumn('value', F.regexp_replace('value', r'=(\w+)', '=<$1>')) \
        .select("*", *[F.regexp_extract('value', r'(?i)\b(?:{0})=<([^>]+)>'.format(p), 1).alias(k) for k,p in key_ptns]) \
        .show()
    +--------------------+---------+----+-----+---+---+----+
    |               value|     Name|Type|SqVal| ID|Loc|dest|
    +--------------------+---------+----+-----+---+---+----+
    |otherPartofString...|Series VR|1Ac4|   34|  2|   |    |
    |otherPartofString...| Series X| 1B3|   34|  2|sfo| chc|
    +--------------------+---------+----+-----+---+---+----+
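
    The alias handling can also be tried in plain Python. The sketch below rebuilds the same key patterns and runs them on a made-up row that spells Type as thistype. That row is an assumption for illustration and does not come from the question's data. The helper extract_all is hypothetical.

```python
import re

keys = ["Name", "Type", "SqVal", "ID", "Loc", "dest"]
key_aliases = {
    "Type": ["ThisType", "AnyName"],
    "ID": ["conn ID"],
    "Loc": ["conn Loc"],
}

# one alternation per output column: the key itself plus any aliases
key_ptns = [(k, "|".join([k, *key_aliases.get(k, [])])) for k in keys]


def extract_all(text):
    """Hypothetical helper: wrap bare values, then extract each key or alias."""
    wrapped = re.sub(r"=(\w+)", r"=<\1>", text)
    out = {}
    for key, alternation in key_ptns:
        match = re.search(r"(?i)\b(?:{0})=<([^>]+)>".format(alternation), wrapped)
        # mimic regexp_extract: empty string when nothing matches
        out[key] = match.group(1) if match else ""
    return out


# made-up row: 'thistype' is an alias of Type, SqVal is absent
row = "otherPartofString Name=<Series X> thistype=<1B3> conn ID=<2> conn Loc=sfo dest=chc bridge"
print(extract_all(row))
```

    The non-capturing group (?:...) keeps the value in group 1 regardless of how many aliases a key has.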
    