How can I declare a Column as a categorical feature in a DataFrame for use in ml

一个人的身影 2020-12-06 03:50

How can I declare that a given Column in my DataFrame contains categorical information?

I have a Spark SQL DataFrame which I loaded from a …

2 Answers
  • 2020-12-06 04:00

    Hey zero323, I used the same technique to look at the metadata and coded up this Transformer:

    from pyspark.sql import types

    def _transform(self, data):
        maxValues = self.getOrDefault(self.maxValues)
        categoricalCols = self.getOrDefault(self.categoricalCols)

        new_schema = types.StructType(data.schema.fields)
        new_data = data
        for (col, maxVal) in zip(categoricalCols, maxValues):
            # I have not decided if I should make a new column or
            # overwrite the original column
            new_col_name = col + "_categorical"

            new_data = new_data.withColumn(new_col_name,
                                           data[col].astype(types.DoubleType()))

            # metadata for a categorical column
            # (str instead of unicode on Python 3)
            meta = {'ml_attr': {'vals': [str(i) for i in range(maxVal + 1)],
                                'type': 'nominal',
                                'name': new_col_name}}

            # extend the schema so the new column carries the metadata
            new_schema.add(new_col_name, types.DoubleType(), True, meta)

        # rebuild the DataFrame so the schema with metadata takes effect
        return data.sql_ctx.createDataFrame(new_data.rdd, new_schema)
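
    To sanity-check the result you can read the metadata back off the returned DataFrame. A minimal sketch, assuming the transformer instance was configured with categoricalCols=["x2"] and maxValues=[2] (these names are illustrative, not from the original post):

    # hypothetical usage of the Transformer defined above
    out = transformer.transform(df)
    out.schema["x2_categorical"].metadata
    ## {'ml_attr': {'vals': ['0', '1', '2'], 'type': 'nominal', 'name': 'x2_categorical'}}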
    
  • 2020-12-06 04:08

    I would prefer to avoid the hassle of encoding and decoding,

    You cannot really avoid this completely. The required metadata for a categorical variable is actually a mapping between values and indices. Still, there is no need to do it manually, or to create a custom transformer. Let's assume you have a data frame like this:

    import numpy as np
    import pandas as pd
    
    df = sqlContext.createDataFrame(pd.DataFrame({
        "x1": np.random.random(1000),
        "x2": np.random.choice(3, 1000),
        "x4": np.random.choice(5, 1000)
    }))
    

    All you need is an assembler and indexer:

    from pyspark.ml.feature import VectorAssembler, VectorIndexer
    from pyspark.ml import Pipeline
    
    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=df.columns, outputCol="features_raw"),
        VectorIndexer(
            inputCol="features_raw", outputCol="features", maxCategories=10)])
    
    transformed = pipeline.fit(df).transform(df)
    transformed.schema.fields[-1].metadata
    
    ## {'ml_attr': {'attrs': {'nominal': [{'idx': 1,
    ##      'name': 'x2',
    ##      'ord': False,
    ##      'vals': ['0.0', '1.0', '2.0']},
    ##     {'idx': 2,
    ##      'name': 'x4',
    ##      'ord': False,
    ##      'vals': ['0.0', '1.0', '2.0', '3.0', '4.0']}],
    ##    'numeric': [{'idx': 0, 'name': 'x1'}]},
    ##   'num_attrs': 3}}
    

    This example also shows what type of information you have to provide to mark a given element of the vector as a categorical variable:

    {
        'idx': 2,  # Index (position in vector)
        'name': 'x4',  # name
        'ord': False,  # is ordinal?
        # Mapping between value and label
        'vals': ['0.0', '1.0', '2.0', '3.0', '4.0']  
    }
    

    So if you want to build this from scratch, all you have to do is supply a correct schema:

    from pyspark.sql.types import *
    from pyspark.mllib.linalg import VectorUDT  # pyspark.ml.linalg.VectorUDT in Spark 2.x+

    # Let's assume we have only a vector
    raw = transformed.select("features_raw")

    # Dictionary equivalent to transformed.schema.fields[-1].metadata shown above
    meta = transformed.schema.fields[-1].metadata
    schema = StructType([StructField("features", VectorUDT(), metadata=meta)])

    sqlContext.createDataFrame(raw.rdd, schema)
    

    But this is quite inefficient due to the required serialization and deserialization.

    Since Spark 2.2 you can also use the metadata argument of Column.alias:

    from pyspark.sql.functions import col

    df.withColumn("features", col("features").alias("features", metadata=meta))
    
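
    Either way, you can check that the metadata actually round-trips. A minimal sanity check, reusing the meta dictionary and the transformed frame from above:

    with_meta = transformed.withColumn(
        "features", col("features").alias("features", metadata=meta))
    with_meta.schema["features"].metadata == meta
    ## True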

    See also Attach metadata to vector column in Spark
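
    To see the metadata being consumed downstream, you can fit a tree-based estimator, which reads ml_attr to decide which vector slots are categorical. A rough sketch; the label column here is fabricated purely for illustration:

    from pyspark.sql.functions import when
    from pyspark.ml.classification import DecisionTreeClassifier

    # fabricated label, just so the estimator has something to fit
    labeled = transformed.withColumn(
        "label", when(transformed["x2"] == 0, 0.0).otherwise(1.0))

    model = DecisionTreeClassifier(
        featuresCol="features", labelCol="label").fit(labeled)

    # categorical splits print as "in {...}" rather than threshold comparisons
    print(model.toDebugString)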
