How to convert a column of strings to numerical values?

暖寄归人 2021-01-21 06:53

I have this pandas dataframe from a query:

|    name    |    event    |
----------------------------
| name_1     | event_1     |
| name_1     | event_2     |
|          


        
3 Answers
  •  无人及你
    2021-01-21 07:07

    Option 1 (timed below as pir1 and pir1_5)

    df.set_index('name').event.str.get_dummies()
    
            event_1  event_2
    name                    
    name_1        1        0
    name_1        0        1
    name_2        1        0
    

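    Then you could sum across the index by grouping on it (groupby(level=0) replaces the older sum(level=0), which pandas 2.0 removed)

    df.set_index('name').event.str.get_dummies().groupby(level=0).sum()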
    
            event_1  event_2
    name                    
    name_1        1        1
    name_2        1        0
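
    Option 1 can be sketched end to end as follows. The sample frame is an assumption, inferred from the outputs shown in this answer, because the table in the question is cut off.

```python
import pandas as pd

# Assumed sample data, inferred from the outputs shown in this answer;
# the question's own table is truncated.
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

# One-hot encode the events, keeping name as the index.
dummies = df.set_index("name")["event"].str.get_dummies()

# Add up the indicator rows that share the same name.
counts = dummies.groupby(level=0).sum()
print(counts)
```

    Grouping on index level 0 collapses the repeated name_1 rows into a single row of per-event counts.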
    

    Option 2 (timed below as pir2)
    Or you could take the dot product of the two one-hot matrices

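    # dtype=int keeps the dummies numeric; newer pandas defaults to bool, and a bool dot product is not a count
    pd.get_dummies(df.name, dtype=int).T.dot(pd.get_dummies(df.event, dtype=int))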
    
            event_1  event_2
    name_1        1        1
    name_2        1        0
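
    Why the dot product works: if N is the rows-by-names indicator matrix and E is the rows-by-events indicator matrix, then entry (n, e) of N.T @ E is the number of rows where both indicators are 1. A minimal sketch, again assuming the sample frame inferred from the outputs above:

```python
import pandas as pd

# Assumed sample data, inferred from the outputs shown in this answer.
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

# dtype=int keeps the indicators as 0/1 integers so the product adds up to counts.
names = pd.get_dummies(df["name"], dtype=int)    # one row per record, one column per name
events = pd.get_dummies(df["event"], dtype=int)  # one row per record, one column per event

# Entry (n, e) sums, over all records, name-indicator * event-indicator.
counts = names.T.dot(events)
print(counts)
```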
    

    Option 3 (timed below as pir3)
    Advanced mode: factorize both columns and count the (name, event) pairs with np.bincount

    i, r = pd.factorize(df.name.values)
    j, c = pd.factorize(df.event.values)
    n, m = r.size, c.size
    
    b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
    
    pd.DataFrame(b, r, c)
    
            event_1  event_2
    name_1        1        1
    name_2        1        0
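
    The bincount approach is easier to follow with the intermediate values named. This is the same computation on the assumed sample frame, with comments; the variable flat is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Assumed sample data, inferred from the outputs shown in this answer.
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

# factorize maps each distinct label to an integer code, in order of first appearance.
i, r = pd.factorize(df["name"].values)   # i: row code per record, r: unique names
j, c = pd.factorize(df["event"].values)  # j: column code per record, c: unique events
n, m = len(r), len(c)

# Each (i, j) pair gets a unique cell number in an n x m grid laid out row by row.
flat = i * m + j

# bincount tallies how often each cell number occurs; minlength pads unseen cells with 0.
b = np.bincount(flat, minlength=n * m).reshape(n, m)

counts = pd.DataFrame(b, index=r, columns=c)
print(counts)
```

    Because the labels come from factorize, rows and columns appear in order of first appearance rather than sorted order.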
    

    Timing

    res.plot(loglog=True)
    

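    In the table below, each row is divided by its fastest time, so 1.0 marks the fastest function for that size. The row label is the number of times the sample frame was repeated.

    res.div(res.min(axis=1), axis=0)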
    
                pir1      pir2  pir3      john1     john2      john3
    10      9.948396  3.399913   1.0  20.478368  4.460466  10.642113
    30      9.350524  2.681178   1.0  16.589248  3.847666   9.168907
    100    11.414536  3.079463   1.0  18.076040  4.277752   9.949305
    300    15.769594  2.940529   1.0  16.745889  3.945470   9.069265
    1000   26.869451  2.617564   1.0  12.789570  3.236390   7.279205
    3000   42.229542  2.099541   1.0   8.716600  2.429847   4.785814
    10000  52.571678  1.716088   1.0   4.597598  1.691989   2.800455
    30000  58.644764  1.469827   1.0   2.818744  1.535012   1.929452
    

    Functions

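    # groupby(level=0).sum() replaces sum(level=0), which pandas 2.0 removed
    pir1 = lambda df: df.set_index('name').event.str.get_dummies().groupby(level=0).sum()
    pir1_5 = lambda df: pd.get_dummies(df.set_index('name').event).groupby(level=0).sum()
    # dtype=int keeps the dummies numeric so the dot product yields counts
    pir2 = lambda df: pd.get_dummies(df.name, dtype=int).T.dot(pd.get_dummies(df.event, dtype=int))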
    
    def pir3(df):
        i, r = pd.factorize(df.name.values)
        j, c = pd.factorize(df.event.values)
        n, m = r.size, c.size
    
        b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
    
        return pd.DataFrame(b, r, c)
    
    john1 = lambda df: pd.crosstab(df.name, df.event)
    john2 = lambda df: df.groupby(['name', 'event']).size().unstack(fill_value=0)
    john3 = lambda df: df.pivot_table(index='name', columns='event', aggfunc='size', fill_value=0)
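
    Before timing, it may be worth checking that the candidates agree with each other. The sketch below compares three of them on the assumed sample frame. The function names and the normalized helper are hypothetical, chosen only for this example.

```python
import numpy as np
import pandas as pd

# Assumed sample data, inferred from the outputs shown in this answer.
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

def via_crosstab(frame):
    return pd.crosstab(frame["name"], frame["event"])

def via_groupby(frame):
    return frame.groupby(["name", "event"]).size().unstack(fill_value=0)

def via_bincount(frame):
    i, r = pd.factorize(frame["name"].values)
    j, c = pd.factorize(frame["event"].values)
    n, m = len(r), len(c)
    b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
    return pd.DataFrame(b, index=r, columns=c)

def normalized(table):
    # Sort both axes so that label order does not affect the comparison.
    return table.sort_index(axis=0).sort_index(axis=1).to_numpy()

reference = normalized(via_crosstab(df))
all_agree = all(
    np.array_equal(reference, normalized(func(df)))
    for func in (via_groupby, via_bincount)
)
print(all_agree)
```

    Sorting matters because the factorize-based version orders labels by first appearance, while crosstab and groupby return them sorted.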
    

    Test

    from timeit import timeit
    
    res = pd.DataFrame(
        index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
        columns='pir1 pir2 pir3 john1 john2 john3'.split(),
        dtype=float
    )
    
    for i in res.index:
        d = pd.concat([df] * i, ignore_index=True)
        for j in res.columns:
            stmt = '{}(d)'.format(j)
            setp = 'from __main__ import d, {}'.format(j)
            res.at[i, j] = timeit(stmt, setp, number=100)
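
    A self-contained variant of the timing loop, as a sketch: it passes callables to timeit instead of building statement strings, and uses smaller sizes and fewer repetitions so it runs quickly. The labels crosstab and groupby and the chosen sizes are assumptions for illustration, and absolute timings will differ by machine and pandas version.

```python
from timeit import timeit

import pandas as pd

# Assumed sample data, inferred from the outputs shown in this answer.
df = pd.DataFrame({
    "name": ["name_1", "name_1", "name_2"],
    "event": ["event_1", "event_2", "event_1"],
})

# Two of the candidates from the function list, under hypothetical labels.
funcs = {
    "crosstab": lambda d: pd.crosstab(d["name"], d["event"]),
    "groupby": lambda d: d.groupby(["name", "event"]).size().unstack(fill_value=0),
}

sizes = [10, 30]  # how many times the sample frame is repeated
res = pd.DataFrame(index=sizes, columns=list(funcs), dtype=float)

for size in sizes:
    d = pd.concat([df] * size, ignore_index=True)
    for label, func in funcs.items():
        # A callable avoids the "from __main__ import ..." setup string.
        res.at[size, label] = timeit(lambda: func(d), number=10)

# Divide each row by its fastest time so the quickest function reads 1.0.
relative = res.div(res.min(axis=1), axis=0)
print(relative)
```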
    
