Combine two columns of text in pandas dataframe

-上瘾入骨i 2020-11-22 01:32

I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year and quarter. I'd like to create a variable called p

18 Answers
  • 2020-11-22 01:43

    If both columns are strings, you can concatenate them directly:

    df["period"] = df["Year"] + df["quarter"]
    

    If one (or both) of the columns is not string-typed, convert it (them) first:

    df["period"] = df["Year"].astype(str) + df["quarter"]
    

    Beware of NaNs when doing this!
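
    To make the NaN warning concrete, here is a minimal sketch. The sample values are made up for illustration, and replacing missing quarters with an empty string is only one possible policy, not necessarily what the answer's author had in mind.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the second row has no quarter.
df = pd.DataFrame({"Year": [2014, 2015, 2016], "quarter": ["q1", np.nan, "q3"]})

# Plain concatenation propagates the missing value into the result.
period = df["Year"].astype(str) + df["quarter"]

# One option: substitute an empty string for missing quarters before concatenating.
period_filled = df["Year"].astype(str) + df["quarter"].fillna("")

print(period)
print(period_filled)
```

    Whether a missing part should produce a missing result or a partial string depends on the data, so it is worth deciding this explicitly.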


    If you need to join multiple string columns, you can use agg:

    df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)
    

    Where "-" is the separator.
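
    As a small illustration of the agg approach, the sketch below joins three columns. The region column and the sample values are hypothetical, and astype(str) is added on the assumption that not every column is already string-typed.

```python
import pandas as pd

# Hypothetical sample data; "region" is not part of the original question.
df = pd.DataFrame(
    {"Year": [2014, 2015], "quarter": ["q1", "q2"], "region": ["EU", "US"]}
)

cols = ["Year", "quarter", "region"]

# Convert every selected column to str, then join each row with "-".
df["period"] = df[cols].astype(str).agg("-".join, axis=1)

print(df["period"].tolist())
```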

  • 2020-11-22 01:43

    Although @silvado's answer is good, changing df.map(str) to df.astype(str) makes it faster:

    import pandas as pd
    df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
    
    In [131]: %timeit df["Year"].map(str)
    10000 loops, best of 3: 132 us per loop
    
    In [132]: %timeit df["Year"].astype(str)
    10000 loops, best of 3: 82.2 us per loop
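
    Outside IPython, the same comparison can be approximated with the standard timeit module. This is a rough sketch on made-up data with integer years; absolute numbers, and possibly the ranking, will vary with the pandas version and hardware.

```python
import timeit

import pandas as pd

# Hypothetical sample data, repeated to give the timer something to measure.
df = pd.DataFrame({"Year": [2014, 2015] * 1000, "quarter": ["q1", "q2"] * 1000})

# Both calls should produce the same strings.
via_map = df["Year"].map(str)
via_astype = df["Year"].astype(str)

# Time each conversion over a fixed number of runs.
t_map = timeit.timeit(lambda: df["Year"].map(str), number=20)
t_astype = timeit.timeit(lambda: df["Year"].astype(str), number=20)

print(f"map(str):    {t_map:.4f}s")
print(f"astype(str): {t_astype:.4f}s")
```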
    
  • 2020-11-22 01:43

    A more efficient approach is:

    def concat_df_str1(df):
        """ run time: 1.3416s """
        return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)
    

    Here is a timing test:

    import numpy as np
    import pandas as pd
    
    from time import time
    
    
    def concat_df_str1(df):
        """ run time: 1.3416s """
        return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)
    
    
    def concat_df_str2(df):
        """ run time: 5.2758s """
        return df.astype(str).sum(axis=1)
    
    
    def concat_df_str3(df):
        """ run time: 5.0076s """
        df = df.astype(str)
        return df[0] + df[1] + df[2] + df[3] + df[4] + \
               df[5] + df[6] + df[7] + df[8] + df[9]
    
    
    def concat_df_str4(df):
        """ run time: 7.8624s """
        return df.astype(str).apply(lambda x: ''.join(x), axis=1)
    
    
    def main():
        df = pd.DataFrame(np.zeros(1000000).reshape(100000, 10))
        df = df.astype(int)
    
        time1 = time()
        df_en = concat_df_str4(df)
        print('run time: %.4fs' % (time() - time1))
        print(df_en.head(10))
    
    
    if __name__ == '__main__':
        main()
    

    Finally, note that when sum is used (concat_df_str2), the result is not a plain concatenation: it is converted to an integer.
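
    One way to check that an approach really returns strings is to use digits with leading zeros, which only survive as text. The sketch below applies the row-wise join to a tiny made-up frame; the column names a, b and c are hypothetical.

```python
import pandas as pd

# Hypothetical sample data whose first row concatenates to "007".
df = pd.DataFrame({"a": [0, 1], "b": [0, 2], "c": [7, 3]})

# Row-wise join over the string-converted values, as in concat_df_str1.
joined = pd.Series(
    ["".join(row) for row in df.astype(str).to_numpy()],
    index=df.index,
)

print(joined.tolist())
```

    If the leading zeros are gone or the values are not str, the method has converted the result to a number somewhere along the way.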

  • 2020-11-22 01:44

    Small data sets (< 150 rows)

    [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
    

    or slightly slower but more compact:

    df.Year.str.cat(df.quarter)
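
    Note that .str is only available on string data, so a numeric Year has to be converted first. str.cat also accepts a sep argument; the sketch below uses made-up values and a "-" separator purely for illustration.

```python
import pandas as pd

# Hypothetical sample data with an integer Year column.
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})

# Convert Year to str so the .str accessor is available, then concatenate
# with an explicit separator.
df["period"] = df["Year"].astype(str).str.cat(df["quarter"], sep="-")

print(df["period"].tolist())
```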
    

    Larger data sets (> 150 rows)

    df['Year'].astype(str) + df['quarter']
    

    UPDATE: Timing graph Pandas 0.23.4

    Let's test it on a 200K-row DataFrame:

    In [250]: df
    Out[250]:
       Year quarter
    0  2014      q1
    1  2015      q2
    
    In [251]: df = pd.concat([df] * 10**5)
    
    In [252]: df.shape
    Out[252]: (200000, 2)
    

    UPDATE: new timings using Pandas 0.19.0

    Timing without CPU/GPU optimization (sorted from fastest to slowest):

    In [107]: %timeit df['Year'].astype(str) + df['quarter']
    10 loops, best of 3: 131 ms per loop
    
    In [106]: %timeit df['Year'].map(str) + df['quarter']
    10 loops, best of 3: 161 ms per loop
    
    In [108]: %timeit df.Year.str.cat(df.quarter)
    10 loops, best of 3: 189 ms per loop
    
    In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 567 ms per loop
    
    In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 584 ms per loop
    
    In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
    1 loop, best of 3: 24.7 s per loop
    

    Timing using CPU/GPU optimization:

    In [113]: %timeit df['Year'].astype(str) + df['quarter']
    10 loops, best of 3: 53.3 ms per loop
    
    In [114]: %timeit df['Year'].map(str) + df['quarter']
    10 loops, best of 3: 65.5 ms per loop
    
    In [115]: %timeit df.Year.str.cat(df.quarter)
    10 loops, best of 3: 79.9 ms per loop
    
    In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 230 ms per loop
    
    In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 230 ms per loop
    
    In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
    1 loop, best of 3: 9.38 s per loop
    

    Answer contribution by @anton-vbr

  • 2020-11-22 01:45
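
    from functools import reduce

    import numpy as np


    def madd(x):
        """Performs element-wise string concatenation with multiple input arrays.

        Args:
            x: iterable of np.array.

        Returns: np.array.
        """
        arrays = []
        for arr in x:
            arr = np.asarray(arr)
            # Cast anything that is not a NumPy string array (numbers, or
            # object arrays holding Python strings) so np.char.add accepts it.
            if arr.dtype.kind not in "US":
                arr = arr.astype(str)
            arrays.append(arr)
        return reduce(np.char.add, arrays)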
    

    For example:

    import pandas as pd

    data = list(zip([2000]*4, ['q1', 'q2', 'q3', 'q4']))
    df = pd.DataFrame(data=data, columns=['Year', 'quarter'])
    df['period'] = madd([df[col].values for col in ['Year', 'quarter']])
    
    df
    
       Year quarter  period
    0  2000      q1  2000q1
    1  2000      q2  2000q2
    2  2000      q3  2000q3
    3  2000      q4  2000q4
    
  • 2020-11-22 01:46

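    Use .combine_first only if you want to fill missing values in one column with values from the other; it does not concatenate the two columns.

    df['Period'] = df['Year'].combine_first(df['quarter'])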
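
    For comparison, here is a short sketch of what combine_first returns next to an actual concatenation. The data is made up, with one missing Year, and fillna("") is just one way to handle the gap.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the second row has no Year.
df = pd.DataFrame({"Year": ["2014", np.nan], "quarter": ["q1", "q2"]})

# combine_first keeps Year where present and falls back to quarter otherwise.
filled = df["Year"].combine_first(df["quarter"])

# Concatenation joins the two values on every row.
joined = df["Year"].fillna("") + df["quarter"]

print(filled.tolist())
print(joined.tolist())
```

    The first result picks one value per row, while the second combines both, which is what the question asks for.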
    