Splitting a List inside a Pandas DataFrame

南方客 2021-02-02 01:09

I have a csv file that contains a number of columns. Using pandas, I read this csv file into a dataframe and have a datetime index and five or six other columns.

One of

4 Answers
  • 2021-02-02 01:47

    If you want to stay in pure pandas, you can throw in a tricky groupby and apply, which boils down to a one-liner if you don't count the column rename.

    In [1]: import pandas as pd
    
    In [2]: d = {'date': ['4/1/11', '4/2/11'], 'ts': [[pd.Timestamp('2012-02-29 00:00:00'), pd.Timestamp('2012-03-31 00:00:00'), pd.Timestamp('2012-04-25 00:00:00'), pd.Timestamp('2012-06-30 00:00:00')], [pd.Timestamp('2014-01-31 00:00:00')]]}
    
    In [3]: df = pd.DataFrame(d)
    
    In [4]: df.head()
    Out[4]: 
         date                                                 ts
    0  4/1/11  [2012-02-29 00:00:00, 2012-03-31 00:00:00, 201...
    1  4/2/11                              [2014-01-31 00:00:00]
    
    In [5]: df_new = df.groupby('date').ts.apply(lambda x: pd.DataFrame(x.values[0])).reset_index().drop('level_1', axis = 1)
    
    In [6]: df_new.columns = ['date','ts']
    
    In [7]: df_new.head()
    Out[7]: 
         date         ts
    0  4/1/11 2012-02-29
    1  4/1/11 2012-03-31
    2  4/1/11 2012-04-25
    3  4/1/11 2012-06-30
    4  4/2/11 2014-01-31
    

    Since the goal is to take the value of a column (in this case date) and repeat it for every row you intend to create from the list, it's useful to think in terms of pandas indexing.

    We want the date to become the single index for the new rows, so we use groupby, which puts the desired row value into an index. Then, inside that operation, we want to split only the list belonging to this date, which is what apply does for us.

    apply receives a pandas Series consisting of a single list, and that list can be accessed via .values[0]: .values pushes the sole row of the Series into an array with a single entry, and [0] takes that entry.

    To turn the list into a set of rows that will be passed back to the indexed date, we can just make it a DataFrame. This incurs the penalty of picking up an extra index level, but we end up dropping that. We could make the values an index themselves, but that would preclude duplicate values.

    Once this is passed back out, the result has a multi-index, but reset_index forces it into the row format we want. Then we simply drop the unwanted index column.

    It sounds involved, but really we're just leveraging the natural behaviors of pandas functions to avoid explicitly iterating or looping.
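
    The steps described above can be traced one at a time. This is a minimal sketch that reuses the sample data from the session; the intermediate names `stacked` and `flat` exist only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['4/1/11', '4/2/11'],
    'ts': [
        [pd.Timestamp('2012-02-29'), pd.Timestamp('2012-03-31'),
         pd.Timestamp('2012-04-25'), pd.Timestamp('2012-06-30')],
        [pd.Timestamp('2014-01-31')],
    ],
})

# Step 1: each group is a Series holding one list; wrapping that list in a
# DataFrame yields one row per list element.
stacked = df.groupby('date').ts.apply(lambda x: pd.DataFrame(x.values[0]))

# The result carries a two-level index: the date, plus the position of each
# element inside its list.
print(stacked.index.nlevels)

# Step 2: move both index levels into columns. The unnamed inner level
# becomes a column called 'level_1', which is then dropped.
flat = stacked.reset_index()
df_new = flat.drop('level_1', axis=1)
df_new.columns = ['date', 'ts']
print(df_new)
```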

    Speed-wise, this tends to be pretty good, and since it relies on apply, any parallelization tricks that work with apply work here too.

    Optionally, if you want it to be robust to the same date appearing on multiple rows, each with its own nested list:

    df_new = df.groupby('date').ts.apply(lambda x: pd.DataFrame([item for sublist in x.values for item in sublist]))
    

    at which point the one-liner is getting dense and you should probably move it into a function.
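
    As a rough sketch of what such a function could look like (the name `split_list_column`, its parameters, and the small integer sample data are hypothetical, not part of pandas or of the original question):

```python
import pandas as pd


def split_list_column(df, key, list_col):
    """Return one row per list element, repeating `key` for each element."""
    def flatten(group):
        # A group may contain several rows, each holding its own list,
        # so flatten all of them into a single column.
        return pd.DataFrame([item for sublist in group.values for item in sublist])

    out = df.groupby(key)[list_col].apply(flatten).reset_index()
    # The unnamed inner index level shows up as 'level_1' after reset_index.
    out = out.drop('level_1', axis=1)
    out.columns = [key, list_col]
    return out


# The date '4/1/11' appears on two rows, each with its own list.
sample = pd.DataFrame({
    'date': ['4/1/11', '4/1/11', '4/2/11'],
    'ts': [[1, 2], [3], [4]],
})
result = split_list_column(sample, 'date', 'ts')
print(result)
```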

  • 2021-02-02 01:47

    The way I did it was to split the list into separate columns and then melt the result to put each timestamp in a separate row.

    In [48]: df = pd.DataFrame([[1,2,[1,2,4]],[4,5,[1,3]],],columns=['a','b','TimeStamp'])
        ...: df
    Out[48]: 
       a  b  TimeStamp
    0  1  2  [1, 2, 4]
    1  4  5     [1, 3]
    

    You can convert the column to a list and then back to a DataFrame to split it into columns:

    In [53]: TScolumns = pd.DataFrame(df.TimeStamp.tolist())
        ...: TScolumns
    Out[53]: 
       0  1   2
    0  1  2   4
    1  1  3 NaN
    

    And then splice it onto the original dataframe:

    In [90]: df = df.drop('TimeStamp',axis=1)
    In [58]: split = pd.concat([df, TScolumns], axis=1)
        ...: split
    Out[58]: 
       a  b  0  1   2
    0  1  2  1  2   4
    1  4  5  1  3 NaN
    

    Finally, use melt to get it into the shape you want:

    In [89]: pd.melt(split, id_vars=['a', 'b'], value_name='TimeStamp')
    Out[89]: 
       a  b variable  TimeStamp
    0  1  2        0          1
    1  4  5        0          1
    2  1  2        1          2
    3  4  5        1          3
    4  1  2        2          4
    5  4  5        2        NaN
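
    The melted frame still carries a helper `variable` column and a NaN row for the shorter list. One possible cleanup is sketched below, assuming columns a and b together identify the original row; the names `ts_columns`, `melted`, and `tidy` are only for illustration. Note that the NaN padding turns the values into floats.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, [1, 2, 4]], [4, 5, [1, 3]]],
                  columns=['a', 'b', 'TimeStamp'])

# Same steps as above: spread the lists into columns, then melt.
ts_columns = pd.DataFrame(df.TimeStamp.tolist())
split = pd.concat([df.drop('TimeStamp', axis=1), ts_columns], axis=1)
melted = pd.melt(split, id_vars=['a', 'b'], value_name='TimeStamp')

# Drop the padding rows, restore the original row grouping and list order,
# then discard the helper column.
tidy = (melted.dropna(subset=['TimeStamp'])
              .sort_values(['a', 'b', 'variable'])
              .drop('variable', axis=1)
              .reset_index(drop=True))
print(tidy)
```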
    
  • 2021-02-02 02:07

    This is probably not the best way from a performance perspective, but you can still leverage the itertools module:

    from pandas import DataFrame, Timestamp
    import itertools
    
    d = {'date': ['4/1/11', '4/2/11'], 'ts': [[Timestamp('2012-02-29 00:00:00'), Timestamp('2012-03-31 00:00:00'), Timestamp('2012-04-25 00:00:00'), Timestamp('2012-06-30 00:00:00')], [Timestamp('2014-01-31 00:00:00')]]}
    df = DataFrame(d)
    
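    res = df.to_dict()
    data = []
    for x in res['date'].keys():
      # Pair the row's date with each of its timestamps; fillvalue repeats the date.
      # On Python 3 this is zip_longest (it was izip_longest on Python 2).
      data.append(itertools.zip_longest([res['date'][x]], res['ts'][x], fillvalue=res['date'][x]))
    
    new_data = list(itertools.chain.from_iterable(data))
    df2 = DataFrame(new_data, columns=['date', 'timestamp'])
    print(df2)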
    

    This prints:

         date  timestamp
    0  4/1/11 2012-02-29
    1  4/1/11 2012-03-31
    2  4/1/11 2012-04-25
    3  4/1/11 2012-06-30
    4  4/2/11 2014-01-31
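
    For comparison, pandas 0.25 and later provide a built-in `DataFrame.explode` that produces the same shape without the manual loop. This is a different technique from the itertools approach above, shown here only as a minimal sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['4/1/11', '4/2/11'],
    'ts': [
        [pd.Timestamp('2012-02-29'), pd.Timestamp('2012-03-31'),
         pd.Timestamp('2012-04-25'), pd.Timestamp('2012-06-30')],
        [pd.Timestamp('2014-01-31')],
    ],
})

# explode() emits one row per list element and repeats the other columns.
# The original row index is repeated as well, so it is reset here.
df2 = df.explode('ts').reset_index(drop=True)
print(df2)
```

    The exploded column comes back with object dtype, so a `pd.to_datetime` call may be needed afterwards if datetime operations are required.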
    
  • 2021-02-02 02:09

    This doesn't feel very Pythonic, but it works (provided your CreateDate is unique!)

    apply can return more rows than it receives only when it is used with a groupby, so we're going to use groupby artificially (i.e. group by a column of unique values, so that each group is a single row).

    def splitRows(x):
    
        # Extract the actual list of time-stamps. 
        theList = x.TimeStamps.iloc[0]
    
        # Each row will be a dictionary in this list.
        listOfNewRows = list()
    
        # Iterate over items in list of timestamps, 
        # putting each one in a dictionary to later convert to a row, 
        # then adding the dictionary to a list. 
    
        for i in theList:
            newRow = dict()
            newRow['CreateDate'] = x.CreateDate.iloc[0]
            newRow['TimeStamps'] = i
            listOfNewRows.append(newRow)
    
        # Now convert these dictionaries into rows in a new dataframe and return it. 
        return pd.DataFrame(listOfNewRows)
    
    
    df.groupby('CreateDate', as_index = False, group_keys = False).apply(splitRows)
    

    Follow-up: If CreateDate is NOT unique, you can reset the index into a new column and group by that instead.
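
    A minimal sketch of that follow-up, assuming made-up sample data with a repeated CreateDate. Here `split_rows` is a compact stand-in for `splitRows` above, and selecting the two data columns before `apply` is an addition that keeps the helper key out of each group:

```python
import pandas as pd


def split_rows(x):
    # x is a one-row group; build one output row per timestamp in its list.
    create_date = x.CreateDate.iloc[0]
    return pd.DataFrame(
        [{'CreateDate': create_date, 'TimeStamps': ts}
         for ts in x.TimeStamps.iloc[0]]
    )


df = pd.DataFrame({
    'CreateDate': ['4/1/11', '4/1/11', '4/2/11'],
    'TimeStamps': [[1, 2], [3], [4, 5]],
})

# reset_index() copies the unique row index into a column named 'index',
# which then serves as an artificial one-row-per-group key.
keyed = df.reset_index()
result = (keyed.groupby('index', group_keys=False)[['CreateDate', 'TimeStamps']]
               .apply(split_rows)
               .reset_index(drop=True))
print(result)
```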
