Duplicate row based on value in different column

感动是毒 2020-12-06 05:07

I have a dataframe of transactions. Each row represents a transaction of two items (think of it as a transaction of 2 event tickets or something). I want to duplicate each

3 Answers
  • 2020-12-06 05:52

    You can do this by calling repeat on the index:

    df.loc[df.index.repeat(df.Quantity)]
    Out[448]: 
      Price City Quantity
    1    20  NYC        2
    1    20  NYC        2
    2    30  NYC        2
    2    30  NYC        2
    3     5  NYC        2
    3     5  NYC        2
    4   300   LA        2
    4   300   LA        2
    5    30   LA        2
    5    30   LA        2
    6   100   LA        2
    6   100   LA        2
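
    As a minimal, self-contained sketch of the same idea: the small frame below is made-up illustration data (not the asker's), with varied quantities so the effect is visible. The name `expanded_fresh` is hypothetical; `reset_index(drop=True)` is an optional extra step if the duplicated index labels are not wanted.

```python
import pandas as pd

# Illustrative data only; the Quantity values differ so the repetition is visible.
df = pd.DataFrame(
    {"Price": [20, 300, 30], "City": ["NYC", "LA", "LA"], "Quantity": [1, 2, 3]},
    index=[1, 2, 3],
)

# Repeat each index label Quantity times, then select those labels.
expanded = df.loc[df.index.repeat(df["Quantity"])]

# Optional: replace the duplicated labels with a fresh 0..n-1 index.
expanded_fresh = expanded.reset_index(drop=True)

print(expanded_fresh)
```

    Here `expanded` keeps the original labels (so a label appears once per ticket), while `expanded_fresh` has one unique label per output row.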
    
  • 2020-12-06 05:57

    First, I recreated your data using integers instead of text. I also varied the quantity so that one can more easily understand the problem.

    import pandas as pd

    d = {1: [20, 'NYC', 1], 2: [30, 'NYC', 2], 3: [5, 'SF', 3],
         4: [300, 'LA', 1], 5: [30, 'LA', 2],  6: [100, 'SF', 3]}
    
    columns = ['Price', 'City', 'Quantity']
    
    # create the dataframe (one row per dict key) and name its columns
    df = pd.DataFrame.from_dict(data=d, orient='index').sort_index()
    df.columns = columns
    
    >>> df
       Price City  Quantity
    1     20  NYC         1
    2     30  NYC         2
    3      5   SF         3
    4    300   LA         1
    5     30   LA         2
    6    100   SF         3
    

    I created a new DataFrame by using a nested list comprehension structure.

    # .ix was removed in pandas 1.0; .loc does the same label-based lookup here
    df_new = pd.DataFrame([df.loc[idx] 
                           for idx in df.index 
                           for _ in range(df.loc[idx, 'Quantity'])]).reset_index(drop=True)
    >>> df_new
        Price City  Quantity
    0      20  NYC         1
    1      30  NYC         2
    2      30  NYC         2
    3       5   SF         3
    4       5   SF         3
    5       5   SF         3
    6     300   LA         1
    7      30   LA         2
    8      30   LA         2
    9     100   SF         3
    10    100   SF         3
    11    100   SF         3
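
    One caveat with building a frame from individual rows: each row pulled out of a mixed-type frame is a Series that generally has object dtype, so the numeric columns of the result may come back as object. The sketch below is an assumption-laden illustration (made-up data, hypothetical variable names) that runs the same comprehension and then casts the columns back to the original dtypes.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame(
    {"Price": [20, 30, 5], "City": ["NYC", "NYC", "SF"], "Quantity": [1, 2, 3]},
    index=[1, 2, 3],
)

# One copy of each row per unit of Quantity.
rows = [df.loc[idx] for idx in df.index for _ in range(df.loc[idx, "Quantity"])]
df_new = pd.DataFrame(rows).reset_index(drop=True)

# Cast every column back to the dtype it had in the source frame.
df_new = df_new.astype(df.dtypes.to_dict())

print(df_new.dtypes)
```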
    
  • 2020-12-06 05:57

    How about this approach? I changed your data slightly to call out a sale of 4 tickets.

    We use a helper np.ones() array, suitably sized, and then the key line of code is: a[np.arange(a.shape[1])[:] > a[:,0,np.newaxis]] = 0

    I was shown this technique here: numpy - update values using slicing given an array value

    Then it's simply a call to .stack() and some basic filtering to complete.

    import numpy as np
    import pandas as pd

    d = {'1': ['20', 'NYC', '2'], '2': ['30', 'NYC', '2'], '3': ['5', 'NYC', '2'],
         '4': ['300', 'LA', '2'], '5': ['30', 'LA', '4'],  '6': ['100', 'LA', '2']}
    
    columns = ['Price', 'City', 'Quantity']
    df = pd.DataFrame.from_dict(data=d, orient='index')
    df.columns = columns
    df['Quantity'] = df['Quantity'].astype(int)
    
    # make an array of ones: one row per sale, one column per ticket up to the largest Quantity
    my_ones = np.ones(shape=(len(df), df['Quantity'].max()))
    
    # wrap my_ones in a DataFrame with the same index as df so it can be joined on the right-hand side
    # (there are plenty of other ways to achieve the same outcome)
    df_my_ones = pd.DataFrame(data=my_ones, index=df.index)
    
    df = df.join(df_my_ones)
    

    which looks like:

      Price City  Quantity  0  1  2  3
    1    20  NYC         2  1  1  1  1
    3     5  NYC         2  1  1  1  1
    2    30  NYC         2  1  1  1  1
    5    30   LA         4  1  1  1  1
    4   300   LA         2  1  1  1  1
    

    Now get the Quantity column and the ones into a NumPy array:

    a = df.iloc[:,2:].values
    

    This is the clever bit:

    a[np.arange(a.shape[1])[:] > a[:,0,np.newaxis]] = 0
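
    To see why that line works, here is a small standalone sketch using NumPy only. The array is made-up illustration data laid out the same way as `a`: column 0 holds the quantity and the remaining columns hold ones. Comparing a row of column positions against a column of quantities broadcasts to a full boolean mask, which is True wherever the column position exceeds that row's quantity.

```python
import numpy as np

# Illustrative data: column 0 is the quantity, the other columns are ones.
quantity = np.array([2, 4, 1])
a = np.column_stack([quantity, np.ones((3, quantity.max()), dtype=int)])

# Shape (5,) compared with shape (3, 1) broadcasts to shape (3, 5).
mask = np.arange(a.shape[1]) > a[:, 0, np.newaxis]

# Zero every flag that lies past the row's quantity.
a[mask] = 0

print(a)
# [[2 1 1 0 0]
#  [4 1 1 1 1]
#  [1 1 0 0 0]]
```

    Column 0 is never zeroed because position 0 is never greater than a positive quantity, so each row ends up with exactly as many ones as its quantity.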
    

    Then assign the result back to df:

    df.iloc[:,2:] = a
    

    Now df looks like the following. Notice how the flags past the number in Quantity have been set to zero:

      Price City  Quantity  0  1  2  3
    1    20  NYC         2  1  1  0  0
    3     5  NYC         2  1  1  0  0
    2    30  NYC         2  1  1  0  0
    5    30   LA         4  1  1  1  1
    4   300   LA         2  1  1  0  0
    
    df.set_index(['Price','City','Quantity'],inplace=True)
    df =  df.stack().to_frame()
    df.columns = ['sale_flag']
    df.reset_index(inplace=True)
    # print() calls so this also runs on Python 3
    print(df[['Price','City', 'Quantity']][df['sale_flag'] !=0])
    print(df)
    

    which produces:

       Price City  Quantity
    0     20  NYC         2
    1     20  NYC         2
    4      5  NYC         2
    5      5  NYC         2
    8     30  NYC         2
    9     30  NYC         2
    12    30   LA         4
    13    30   LA         4
    14    30   LA         4
    15    30   LA         4
    16   300   LA         2
    17   300   LA         2
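
    For reference, the steps of this answer can be condensed into one short sketch. This is not the answer's exact code: it uses made-up data, hypothetical names (`flags`, `wide`, `long_form`, `result`), and builds the 0/1 flags directly from a broadcast comparison instead of zeroing a ones array in place.

```python
import numpy as np
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"Price": [20, 30], "City": ["NYC", "LA"], "Quantity": [2, 4]})

# One flag column per possible ticket; 1 where the position is below Quantity.
n = df["Quantity"].max()
flags = (np.arange(n) < df["Quantity"].to_numpy()[:, np.newaxis]).astype(int)

# Join the flags to the right of df, then stack them into one row per flag.
wide = df.join(pd.DataFrame(flags, index=df.index))
long_form = (
    wide.set_index(["Price", "City", "Quantity"])
    .stack()
    .to_frame("sale_flag")
    .reset_index()
)

# Keep only the rows whose flag is set.
result = long_form.loc[long_form["sale_flag"] != 0, ["Price", "City", "Quantity"]]

print(result)
```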
    