Get first letter of a string from column

我寻月下人不归 2020-11-29 02:15

I'm fighting with pandas and for now I'm losing. I have a source table similar to this:

import pandas as pd

a=pd.Series([123,22,32,453,45,453,56])
b=pd.Ser         


        
2 Answers
  • 2020-11-29 02:54

    .str.get

    This is the simplest way to apply string methods to a column.

    # Setup
    df = pd.DataFrame({'A': ['xyz', 'abc', 'foobar'], 'B': [123, 456, 789]})
    df
    
            A    B
    0     xyz  123
    1     abc  456
    2  foobar  789
    
    df.dtypes
    
    A    object
    B     int64
    dtype: object
    

    For string (read: object) dtype columns, use:

    df['C'] = df['A'].str[0]
    # Equivalent, using the explicit method:
    df['C'] = df['A'].str.get(0)
    

    .str handles NaNs by returning NaN as the output.
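
As a minimal sketch of that NaN behaviour (the sample Series here is made up for illustration and is not from the question), both `.str[0]` and `.str.get(0)` pass missing values through. An empty string, which has no first character, also comes back as missing:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one missing value and one empty string
s = pd.Series(['xyz', np.nan, 'foobar', ''])

first = s.str[0]       # missing input stays missing
same = s.str.get(0)    # explicit-method spelling of the same operation

print(first.tolist())
print(first.equals(same))
```

A plain `x[0]` in a list comprehension would raise on both of those rows, which is why the list-comprehension variant needs an explicit guard.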

    For numeric columns, an .astype(str) conversion is required beforehand, as shown in @Ed Chum's answer.

    # Note: this does not handle NaNs well. astype(str) turns NaN into
    # the string "nan", so the first character comes back as a lowercase "n".
    df['D'] = df['B'].astype(str).str[0]
    

    df
            A    B  C  D
    0     xyz  123  x  1
    1     abc  456  a  4
    2  foobar  789  f  7
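
If the numeric column may contain NaNs, one possible workaround (a sketch, not part of the original answer) is to mask the result with the original column's `notna()` so that missing rows stay missing. The Series `b` below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with a missing value (dtype becomes float64)
b = pd.Series([123, 456, np.nan])

# Take the first character of the string form of each value
naive = b.astype(str).str[0]

# Keep the extracted character only where the original value was present;
# everywhere else, put NaN back
guarded = naive.where(b.notna())

print(guarded.tolist())
```

Because the NaN forces the column to float, the string forms are "123.0" and "456.0"; the first character is still the leading digit.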
    

    List Comprehension and Indexing

    A simple list comprehension also works well here and, per the timings further down, is faster.

    # For string columns
    df['C'] = [x[0] for x in df['A']]
    
    # For numeric columns
    df['D'] = [str(x)[0] for x in df['B']]
    

    df
            A    B  C  D
    0     xyz  123  x  1
    1     abc  456  a  4
    2  foobar  789  f  7
    

    If your data has NaNs, you will need to handle them with an if/else in the list comprehension:

    import numpy as np
    
    df2 = pd.DataFrame({'A': ['xyz', np.nan, 'foobar'], 'B': [123, 456, np.nan]})
    df2
    
            A      B
    0     xyz  123.0
    1     NaN  456.0
    2  foobar    NaN
    
    # For string columns
    df2['C'] = [x[0] if isinstance(x, str) else np.nan for x in df2['A']]
    
    # For numeric columns
    df2['D'] = [str(x)[0] if pd.notna(x) else np.nan for x in df2['B']]
    
    df2
    
            A      B    C    D
    0     xyz  123.0    x    1
    1     NaN  456.0  NaN    4
    2  foobar    NaN    f  NaN
    

    Let's do some timeit tests on some larger data.

    df_ = df.copy()
    df = pd.concat([df_] * 5000, ignore_index=True) 
    
    %timeit df.assign(C=df['A'].str[0])
    %timeit df.assign(D=df['B'].astype(str).str[0])
    
    %timeit df.assign(C=[x[0] for x in df['A']])
    %timeit df.assign(D=[str(x)[0] for x in df['B']])
    

    12 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    27.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    3.77 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    7.84 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    List comprehensions are roughly 3x faster in both cases (3.77 ms vs 12 ms, and 7.84 ms vs 27.1 ms).
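
The `%timeit` lines above only work in IPython/Jupyter. As a rough sketch of the same comparison in a plain script, the standard-library `timeit` module can be used. The helper names `with_str_accessor` and `with_list_comp` are made up here, and this version times only the extraction step, not the `assign` call, so absolute numbers will differ from those above:

```python
import timeit

import pandas as pd

# Same small frame as in the setup, repeated 5000 times
df_small = pd.DataFrame({'A': ['xyz', 'abc', 'foobar'], 'B': [123, 456, 789]})
df_big = pd.concat([df_small] * 5000, ignore_index=True)


def with_str_accessor():
    # Vectorised string accessor
    return df_big['A'].str[0]


def with_list_comp():
    # Plain Python list comprehension
    return [x[0] for x in df_big['A']]


# Best of 3 repeats, 10 calls each, reported per call
t_str = min(timeit.repeat(with_str_accessor, number=10, repeat=3)) / 10
t_list = min(timeit.repeat(with_list_comp, number=10, repeat=3)) / 10

print(f".str[0]:            {t_str * 1e3:.2f} ms per call")
print(f"list comprehension: {t_list * 1e3:.2f} ms per call")
```

Timings depend on the machine and the pandas version, so treat the ratio as indicative only.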

  • 2020-11-29 03:04

    Cast the column to str with astype(str); you can then perform vectorised slicing through the .str accessor:

    In [29]:
    df['new_col'] = df['First'].astype(str).str[0]
    df
    
    Out[29]:
       First  Second new_col
    0    123     234       1
    1     22    4353       2
    2     32     355       3
    3    453     453       4
    4     45     345       4
    5    453     453       4
    6     56      56       5
    

    If you need to, you can cast the result back to an integer dtype by calling astype(int) on the new column.
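
A short sketch of that round trip, using only the `First` values shown in the output above (the `Second` column is left out for brevity):

```python
import pandas as pd

# 'First' values as shown in the answer's output
df = pd.DataFrame({'First': [123, 22, 32, 453, 45, 453, 56]})

# int -> str, take the first character, then cast back to int
df['new_col'] = df['First'].astype(str).str[0].astype(int)

print(df)
print(df['new_col'].dtype)
```

This assumes every value is a non-negative integer with no NaNs. For a negative number the first character would be "-", and the final `astype(int)` would fail.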
