I have some data that I'm parsing from XML to a pandas DataFrame. The XML data roughly looks like this:
Assuming you have enough memory, your task will be more easily accomplished if your DataFrame holds one variant per row:
track_name variants time route_id stop_id serial
"trackname1" 1 "21:23" 5 103 1
"trackname1" 2 "21:23" 5 103 1
"trackname1" 3 "21:23" 5 103 1
"trackname1" 1 "21:26" 5 17 2
"trackname1" 2 "21:26" 5 17 2
"trackname1" 3 "21:26" 5 17 2
...
"trackname1" 4 "21:20" 5 103 1
"trackname1" 5 "21:20" 5 103 1
...
"trackname2" 1 "20:59" 3 45 1
Then you could find "all rows for variant 3 on route_id 5" with
df.loc[(df['variants']==3) & (df['route_id']==5)]
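To make that concrete, here is a minimal sketch of the long format using a few of the rows from the table above (the values are illustrative, taken from your sample data):

```python
import pandas as pd

# One variant per row; column names as in the table above.
df = pd.DataFrame({
    "track_name": ["trackname1"] * 6,
    "variants":   [1, 2, 3, 1, 2, 3],
    "time":       ["21:23"] * 3 + ["21:26"] * 3,
    "route_id":   [5] * 6,
    "stop_id":    [103, 103, 103, 17, 17, 17],
    "serial":     [1, 1, 1, 2, 2, 2],
})

# All rows for variant 3 on route_id 5 -- a plain vectorized comparison.
result = df.loc[(df["variants"] == 3) & (df["route_id"] == 5)]
```

Here `result` picks out the two variant-3 rows (stops 103 and 17), with no string handling involved.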
If you pack many variants into one row, such as
"trackname1" "1,2,3" "21:23" "5" "103" "1"
then you could find such rows using
df.loc[(df['variants'].str.contains("3")) & (df['route_id']=="5")]
assuming that the variants are always single digits. If there are also 2-digit variants like "13" or "30", then you would need to pass a more complicated regex pattern to str.contains.
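For example, one way to write such a pattern is to require that the "3" be a whole comma-separated token, anchored by the start/end of the string or a comma (a sketch with made-up sample values):

```python
import pandas as pd

s = pd.Series(["1,2,3", "13,30", "3,14", "2,4"])

# Naive substring test: also matches "13" and "30".
naive = s.str.contains("3")

# Anchored pattern: "3" must be an entire comma-separated token.
exact = s.str.contains(r"(?:^|,)3(?:,|$)")
```

`naive` is True for the first three strings, while `exact` is True only for `"1,2,3"` and `"3,14"`.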
Alternatively, you could use apply to split each variant on commas:
df['variants'].apply(lambda x: "3" in x.split(','))
but this is very inefficient: you would now be calling a Python function once for every row, doing string splitting and a list-membership test instead of a single vectorized integer comparison.
Thus, to avoid a possibly complicated regex or a relatively slow call to apply, I think your best bet is to build the DataFrame with one integer variant per row.
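If you have already parsed the XML into packed rows, you don't need to redo the parsing: splitting the string and calling DataFrame.explode converts it to the long format. A sketch, with hypothetical sample values:

```python
import pandas as pd

# Packed form, as it might come out of the XML parse.
packed = pd.DataFrame({
    "track_name": ["trackname1", "trackname2"],
    "variants":   ["1,2,3", "4,5"],
    "time":       ["21:23", "20:59"],
    "route_id":   [5, 3],
})

# Split the comma string into a list, explode to one row per variant,
# and cast to int so later comparisons stay vectorized.
long = packed.assign(variants=packed["variants"].str.split(",")).explode("variants")
long["variants"] = long["variants"].astype(int)
```

The other columns are repeated for each exploded row, giving exactly the one-variant-per-row layout shown at the top.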