How to convert a column of string to numerical?

暖寄归人 2021-01-21 06:53

I have this pandas dataframe from a query:

|    name    |    event    |
----------------------------
| name_1     | event_1     |
| name_1     | event_2     |
|          


        
3 Answers
  • 2021-01-21 07:02

    Some ways of doing it

    1)

    In [366]: pd.crosstab(df.name, df.event)
    Out[366]:
    event   event_1  event_2
    name
    name_1        1        1
    name_2        1        0
    

    2)

    In [367]: df.groupby(['name', 'event']).size().unstack(fill_value=0)
    Out[367]:
    event   event_1  event_2
    name
    name_1        1        1
    name_2        1        0
    

    3)

    In [368]: df.pivot_table(index='name', columns='event', aggfunc=len, fill_value=0)
    Out[368]:
    event   event_1  event_2
    name
    name_1        1        1
    name_2        1        0
    

    4)

    In [369]: df.assign(v=1).pivot(index='name', columns='event', values='v').fillna(0)
    Out[369]:
    event   event_1  event_2
    name
    name_1      1.0      1.0
    name_2      1.0      0.0
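
For anyone who wants to run the snippets above outside an IPython session, here is a minimal self-contained sketch. The sample `df` is an assumption reconstructed from the outputs shown (the question's table is cut off), and the variable names are only illustrative. The pivot_table call uses `aggfunc="size"`, the form that appears in the timing functions further down this page.

```python
import pandas as pd

# Assumed sample data, reconstructed from the outputs shown in this answer
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

# 1) Cross-tabulate the two columns directly
by_crosstab = pd.crosstab(df["name"], df["event"])

# 2) Count each (name, event) pair, then spread events into columns
by_groupby = df.groupby(["name", "event"]).size().unstack(fill_value=0)

# 3) Pivot with a size aggregation; absent pairs are filled with 0
by_pivot = df.pivot_table(index="name", columns="event",
                          aggfunc="size", fill_value=0)

print(by_crosstab)
```

On this sample all three should give the same name-by-event count table.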
    
  • 2021-01-21 07:07

    Option 1
    pir1 and pir1_5

    df.set_index('name').event.str.get_dummies()
    
            event_1  event_2
    name                    
    name_1        1        0
    name_1        0        1
    name_2        1        0
    

    Then you could sum across the index

    # sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
    df.set_index('name').event.str.get_dummies().groupby(level=0).sum()
    
            event_1  event_2
    name                    
    name_1        1        1
    name_2        1        0
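
A self-contained sketch of Option 1, in case it helps to see the two steps separately. The sample `df` is assumed from the outputs shown in this thread, and `dummies` and `counts` are illustrative names. It collapses the index with `groupby(level=0).sum()`, which current pandas releases accept.

```python
import pandas as pd

# Assumed sample data, reconstructed from the outputs in this thread
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

# Step 1: one indicator row per original row, indexed by name
dummies = df.set_index("name")["event"].str.get_dummies()

# Step 2: add up the indicator rows that share the same name
counts = dummies.groupby(level=0).sum()

print(counts)
```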
    

    Option 2
    pir2
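    Or you could take the dot product of the two one-hot matrices

    # dtype=int keeps the result as counts; newer pandas returns bool dummies by default
    pd.get_dummies(df.name, dtype=int).T.dot(pd.get_dummies(df.event, dtype=int))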
    
            event_1  event_2
    name_1        1        1
    name_2        1        0
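
Why the dot product works: if `N` is the rows-by-names indicator matrix and `E` is the rows-by-events indicator matrix, then entry (a, b) of `N.T @ E` sums, over all rows, the product of "this row has name a" and "this row has event b", which is the number of rows with that pair. A minimal sketch on assumed sample data (the variable names are illustrative):

```python
import pandas as pd

# Assumed sample data, reconstructed from the outputs in this thread
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

# Integer indicator matrices; dtype=int avoids boolean arithmetic in the product
names = pd.get_dummies(df["name"], dtype=int)    # shape: rows x distinct names
events = pd.get_dummies(df["event"], dtype=int)  # shape: rows x distinct events

# (names x rows) . (rows x events) -> names x events co-occurrence counts
counts = names.T.dot(events)

print(counts)
```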
    

    Option 3
    pir3
    Advanced Mode

    i, r = pd.factorize(df.name.values)
    j, c = pd.factorize(df.event.values)
    n, m = r.size, c.size
    
    b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
    
    pd.DataFrame(b, r, c)
    
            event_1  event_2
    name_1        1        1
    name_2        1        0
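
The trick in Option 3 is to turn each (name, event) pair into a single integer, `i * m + j`, which is the position that pair would occupy in a flattened n-by-m table. `np.bincount` then counts how often each position occurs, and `reshape` folds the counts back into the table. A step-by-step sketch on assumed sample data, with the intermediate `flat` array named only for illustration:

```python
import numpy as np
import pandas as pd

# Assumed sample data, reconstructed from the outputs in this thread
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

# Integer codes (in order of first appearance) plus the unique labels
i, r = pd.factorize(df["name"].values)   # i = [0, 0, 1]
j, c = pd.factorize(df["event"].values)  # j = [0, 1, 0]
n, m = r.size, c.size

# Row-major cell index of each (name, event) pair in an n x m table
flat = i * m + j                         # [0, 1, 2]

# Count occurrences per cell; minlength covers cells that never occur
b = np.bincount(flat, minlength=n * m).reshape(n, m)

counts = pd.DataFrame(b, index=r, columns=c)
print(counts)
```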
    

    Timing

    res.plot(loglog=True)
    

    res.div(res.min(1), 0)
    
                pir1      pir2  pir3      john1     john2      john3
    10      9.948396  3.399913   1.0  20.478368  4.460466  10.642113
    30      9.350524  2.681178   1.0  16.589248  3.847666   9.168907
    100    11.414536  3.079463   1.0  18.076040  4.277752   9.949305
    300    15.769594  2.940529   1.0  16.745889  3.945470   9.069265
    1000   26.869451  2.617564   1.0  12.789570  3.236390   7.279205
    3000   42.229542  2.099541   1.0   8.716600  2.429847   4.785814
    10000  52.571678  1.716088   1.0   4.597598  1.691989   2.800455
    30000  58.644764  1.469827   1.0   2.818744  1.535012   1.929452
    

    Functions

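    # sum(level=0) was removed in pandas 2.0, so groupby(level=0).sum() is used here;
    # the timings above were measured with the original sum(level=0) forms
    pir1 = lambda df: df.set_index('name').event.str.get_dummies().groupby(level=0).sum()
    pir1_5 = lambda df: pd.get_dummies(df.set_index('name').event).groupby(level=0).sum()
    pir2 = lambda df: pd.get_dummies(df.name, dtype=int).T.dot(pd.get_dummies(df.event, dtype=int))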
    
    def pir3(df):
        i, r = pd.factorize(df.name.values)
        j, c = pd.factorize(df.event.values)
        n, m = r.size, c.size
    
        b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
    
        return pd.DataFrame(b, r, c)
    
    john1 = lambda df: pd.crosstab(df.name, df.event)
    john2 = lambda df: df.groupby(['name', 'event']).size().unstack(fill_value=0)
    john3 = lambda df: df.pivot_table(index='name', columns='event', aggfunc='size', fill_value=0)
    

    Test

    from timeit import timeit
    
    res = pd.DataFrame(
        index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
        columns='pir1 pir2 pir3 john1 john2 john3'.split(),
        dtype=float
    )
    
    for i in res.index:
        d = pd.concat([df] * i, ignore_index=True)
        for j in res.columns:
            stmt = '{}(d)'.format(j)
            setp = 'from __main__ import d, {}'.format(j)
            res.at[i, j] = timeit(stmt, setp, number=100)
    
  • 2021-01-21 07:22

    You are asking for a Pythonic way. I think the Pythonic way here is a technique called one-hot encoding, which is well implemented in libraries such as sklearn. After one-hot encoding, group the dataframe by the first column and apply the sum function.

    Here is the code:

    import pandas as pd
    from sklearn.preprocessing import LabelBinarizer

    # Reproduce the dataframe from the question
    dataset = pd.DataFrame(
        [['name_1', 'event_1'], ['name_1', 'event_2'], ['name_2', 'event_1']],
        columns=['name', 'event'], index=[1, 2, 3])
    enc = LabelBinarizer(neg_label=0)
    # With exactly two classes, fit_transform returns a single column:
    # 1 where event == 'event_2', 0 where event == 'event_1'
    dataset['event_2'] = enc.fit_transform(dataset['event']).ravel()
    # event_1 is the complement of event_2 (only valid with two distinct events)
    dataset['event_1'] = 1 - dataset['event_2']
    # Drop the string column before summing so the labels are not concatenated
    dataset = dataset.drop(columns='event').groupby('name').sum()
    dataset.reset_index(inplace=True)
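
The LabelBinarizer code above relies on there being exactly two distinct events. As a rough sketch of the same "one-hot encode, then group by name and sum" idea for any number of events, the version below uses pandas' own `get_dummies` in place of sklearn. That swap is my assumption, not something this answer prescribes, and the extra `event_3` row is hypothetical, added only to show a third category.

```python
import pandas as pd

# Sample data from the answer, plus one hypothetical event_3 row
dataset = pd.DataFrame(
    [["name_1", "event_1"], ["name_1", "event_2"],
     ["name_2", "event_1"], ["name_2", "event_3"]],
    columns=["name", "event"],
)

# One-hot encode the event column: one integer 0/1 column per distinct event
one_hot = pd.get_dummies(dataset["event"], dtype=int)

# Group the indicator rows by name and sum them to get per-name counts
result = one_hot.groupby(dataset["name"]).sum().reset_index()

print(result)
```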
    

    And the output is:
