Multiple values in single column of a pandas DataFrame

北恋 2021-01-14 01:46

I have some data that I'm parsing from XML to a pandas DataFrame. The XML data roughly looks like this:


  

        
1 Answer
  • 2021-01-14 02:30

    Assuming you have enough memory, your task will be more easily accomplished if your DataFrame held one variant per row:

    track_name     variants  time     route_id  stop_id  serial
    "trackname1"   1         "21:23"         5      103       1
    "trackname1"   2         "21:23"         5      103       1
    "trackname1"   3         "21:23"         5      103       1
    "trackname1"   1         "21:26"         5       17       2
    "trackname1"   2         "21:26"         5       17       2
    "trackname1"   3         "21:26"         5       17       2
    ...
    "trackname1"   4         "21:20"         5      103       1
    "trackname1"   5         "21:20"         5      103       1
    ...
    "trackname2"   1         "20:59"         3       45       1
    

    Then you could find "all rows for variant 3 on route_id 5" with

    df.loc[(df['variants']==3) & (df['route_id']==5)]
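
A minimal sketch of that lookup, assuming a small hand-built DataFrame in the one-variant-per-row layout (the sample rows are made up for illustration; only the column names come from the table above):

```python
import pandas as pd

# Hypothetical sample data in the one-variant-per-row layout.
df = pd.DataFrame({
    "track_name": ["trackname1"] * 4 + ["trackname2"],
    "variants": [1, 2, 3, 3, 1],
    "time": ["21:23", "21:23", "21:23", "21:26", "20:59"],
    "route_id": [5, 5, 5, 5, 3],
    "stop_id": [103, 103, 103, 17, 45],
    "serial": [1, 1, 1, 2, 1],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than ==.
mask = (df["variants"] == 3) & (df["route_id"] == 5)
result = df.loc[mask]
print(result)
```

Both comparisons run as vectorized integer operations, so no Python-level loop over the rows is involved.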
    

    If you pack many variants into one row, such as

    "trackname1"   "1,2,3"   "21:23"  "5"       "103"    "1"
    

    then you could find such rows using

    df.loc[(df['variants'].str.contains("3")) & (df['route_id']=="5")]
    

    assuming that the variants are always single digits. If there are also 2-digit variants like "13" or "30", then you would need to pass a more complicated regex pattern to str.contains.
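
One possible pattern for the multi-digit case, sketched under the assumption that the packed strings contain only digits and commas with no surrounding spaces (the sample rows are hypothetical). It matches "3" only when it is a whole comma-separated token:

```python
import pandas as pd

# Hypothetical packed data; "13,30" must NOT count as containing variant 3.
df = pd.DataFrame({
    "variants": ["1,2,3", "13,30", "3", "4,5", "2,3,13"],
    "route_id": ["5", "5", "5", "5", "3"],
})

# "3" must be preceded by start-of-string or a comma, and followed by
# a comma or end-of-string. Non-capturing groups avoid the warning that
# str.contains emits for patterns with capture groups.
pattern = r"(?:^|,)3(?:,|$)"

mask = df["variants"].str.contains(pattern, regex=True) & (df["route_id"] == "5")
result = df.loc[mask]
print(result)
```

A plain `str.contains("3")` would also select the "13,30" row, which is exactly the false positive the anchored pattern avoids.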

    Alternatively, you could use apply to split each variants string on commas and test for membership:

    df['variants'].apply(lambda x: "3" in x.split(','))
    

    but this is very inefficient: it calls a Python function once for every row, and each call splits a string and tests for membership in a list, whereas the one-variant-per-row layout needs only a vectorized integer comparison.

    Thus, to avoid possibly complicated regex or a relatively slow call to apply, I think your best bet is to build the DataFrame with one integer variant per row.
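
Ideally the rows would be built this way while parsing the XML. If the data has already been packed, one way to reshape it afterwards (not part of the original answer, and assuming comma-separated digits with no spaces) is `str.split` followed by `DataFrame.explode`, available in pandas 0.25 and later. The sample rows are hypothetical:

```python
import pandas as pd

# Hypothetical packed frame: several variants per row, everything as strings.
packed = pd.DataFrame({
    "track_name": ["trackname1", "trackname1"],
    "variants": ["1,2,3", "4,5"],
    "time": ["21:23", "21:20"],
    "route_id": ["5", "5"],
    "stop_id": ["103", "103"],
    "serial": ["1", "1"],
})

long_df = (
    packed
    # Turn "1,2,3" into the list ["1", "2", "3"].
    .assign(variants=packed["variants"].str.split(","))
    # Emit one row per list element, repeating the other columns.
    .explode("variants")
    .reset_index(drop=True)
    # Convert the numeric columns so comparisons are integer comparisons.
    .astype({"variants": int, "route_id": int, "stop_id": int, "serial": int})
)

result = long_df.loc[(long_df["variants"] == 3) & (long_df["route_id"] == 5)]
print(long_df)
print(result)
```

This trades memory for speed: the long frame has one row per variant, but every later lookup is a plain vectorized comparison.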
