pandas three-way joining multiple dataframes on columns

醉梦人生 2020-11-22 08:35

I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.

How can

11 Answers
  • 2020-11-22 08:51

    You could try this if you have three dataframes:

    # Merge multiple dataframes
    import pandas as pd

    # Pass plain lists of rows rather than np.array: an array of mixed
    # strings and numbers would store every value as a string
    df1 = pd.DataFrame([
        ['a', 5, 9],
        ['b', 4, 61],
        ['c', 24, 9]],
        columns=['name', 'attr11', 'attr12'])
    df2 = pd.DataFrame([
        ['a', 5, 19],
        ['b', 14, 16],
        ['c', 4, 9]],
        columns=['name', 'attr21', 'attr22'])
    df3 = pd.DataFrame([
        ['a', 15, 49],
        ['b', 4, 36],
        ['c', 14, 9]],
        columns=['name', 'attr31', 'attr32'])
    
    # Merge the first two frames on 'name', then merge the result with the third
    pd.merge(pd.merge(df1, df2, on='name'), df3, on='name')
    

    Alternatively, as mentioned by cwharland, chain the method form:

    df1.merge(df2,on='name').merge(df3,on='name')
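    With more than three frames the chain gets long. A minimal sketch that folds a list of frames with functools.reduce, assuming every frame has a 'name' column; the frames below are small stand-ins, not the asker's data:

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-in frames; each one shares a 'name' key column.
frames = [
    pd.DataFrame({'name': ['a', 'b', 'c'], 'attr11': [5, 4, 24]}),
    pd.DataFrame({'name': ['a', 'b', 'c'], 'attr21': [5, 14, 4]}),
    pd.DataFrame({'name': ['a', 'b', 'c'], 'attr31': [15, 4, 14]}),
]

# Fold the list pairwise: (frames[0] merge frames[1]) merge frames[2], and so on.
merged = reduce(lambda left, right: pd.merge(left, right, on='name'), frames)
print(merged)
```

    The default how='inner' keeps only names present in every frame; passing how='outer' to pd.merge keeps all of them.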
    
  • 2020-11-22 08:53

    Here is a function that merges a dictionary of dataframes. It suffixes each value column with its dictionary key, so the column names stay in sync with the dictionary, and it can fill in missing values if needed:

    import numpy as np
    import pandas as pd

    def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
      # Iterate over the keys directly; dict.keys() cannot be indexed in Python 3
      for i, key in enumerate(dfDict):
        df0 = dfDict[key]
        cols = list(df0.columns)
        valueCols = [c for c in cols if c not in onCols]
        # Put the join columns first, then the value columns
        df0 = df0[onCols + valueCols]
        # Suffix each value column with the dict key, e.g. col1 -> col1_US
        df0.columns = onCols + [(s + '_' + key) for s in valueCols]
    
        if i == 0:
          outDf = df0
        else:
          outDf = pd.merge(outDf, df0, how=how, on=onCols)
    
      if naFill is not None:
        outDf = outDf.fillna(naFill)
    
      return outDf
    

    OK, let's generate some data and test this:

    def GenDf(size):
      df = pd.DataFrame({'categ1':np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
                          'categ2':np.random.choice(a=['A', 'B'], size=size, replace=True), 
                          'col1':np.random.uniform(low=0.0, high=100.0, size=size), 
                          'col2':np.random.uniform(low=0.0, high=100.0, size=size)
                          })
      df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
      return(df)
    
    
    size = 5
    dfDict = {'US':GenDf(size), 'IN':GenDf(size), 'GER':GenDf(size)}   
    MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)
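    Because GenDf produces random data, the effect of the suffixing and of naFill is easier to see on a fixed input. The sketch below is a compact variant of the same idea, not the function above: merge_df_dict and its parameter names are hypothetical, and it renames columns in place instead of reordering them.

```python
from functools import reduce

import pandas as pd


def merge_df_dict(df_dict, on_cols, how='outer', na_fill=None):
    """Suffix each frame's value columns with its dict key, then merge all frames."""
    renamed = [
        df.rename(columns={c: f'{c}_{key}' for c in df.columns if c not in on_cols})
        for key, df in df_dict.items()
    ]
    out = reduce(lambda left, right: pd.merge(left, right, how=how, on=on_cols), renamed)
    return out if na_fill is None else out.fillna(na_fill)


# Small fixed input: 'a' exists only in US, 'c' only in IN.
df_dict = {
    'US': pd.DataFrame({'categ1': ['a', 'b'], 'col1': [1.0, 2.0]}),
    'IN': pd.DataFrame({'categ1': ['b', 'c'], 'col1': [3.0, 4.0]}),
}
result = merge_df_dict(df_dict, on_cols=['categ1'], na_fill=0)
print(result)
```

    The outer merge keeps 'a', 'b' and 'c'; the cells that have no match (col1_IN for 'a', col1_US for 'c') are filled with 0.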
    
  • 2020-11-22 08:56

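    There is another approach from the pandas documentation (that I don't see here): stacking the rows of one dataframe under another. Note that this appends rows; it does not match rows on a key column the way merge and join do.

    Older pandas versions did this with DataFrame.append, which was removed in pandas 2.0; pd.concat gives the same result:

    >>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
    >>> df
       A  B
    0  1  2
    1  3  4
    >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
    >>> df2
       A  B
    0  5  6
    1  7  8
    >>> pd.concat([df, df2], ignore_index=True)
       A  B
    0  1  2
    1  3  4
    2  5  6
    3  7  8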
    

    ignore_index=True discards the existing index labels and renumbers the rows of the result from 0.

    If the dataframes have different column names, NaN is introduced wherever a row has no value for a column.
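    A small sketch of that last point, using pd.concat and two made-up frames that share only column 'B':

```python
import pandas as pd

# Hypothetical frames: 'A' exists only in df, 'C' only in other.
df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
other = pd.DataFrame([[5, 6], [7, 8]], columns=['B', 'C'])

# Rows are stacked; the result has the union of the columns.
stacked = pd.concat([df, other], ignore_index=True)
print(stacked)
```

    The first two rows have NaN in 'C' and the last two have NaN in 'A', because neither source frame supplies those values.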

  • 2020-11-22 08:57

    Simple Solution:

    If the key column has the same name in every dataframe:

    df1.merge(df2, on='col_name').merge(df3, on='col_name')
    

    If the key columns have different names:

    (df1.merge(df2, left_on='col_name1', right_on='col_name2')
        .merge(df3, left_on='col_name1', right_on='col_name3')
        .drop(columns=['col_name2', 'col_name3'])
        .rename(columns={'col_name1': 'col_name'}))
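    An alternative sketch for the second case renames each key column to a shared name first, so that a plain on= merge works and no duplicate key columns need dropping afterwards. The frames and the value columns x, y, z are made up for illustration:

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
df1 = pd.DataFrame({'col_name1': ['a', 'b'], 'x': [1, 2]})
df2 = pd.DataFrame({'col_name2': ['a', 'b'], 'y': [3, 4]})
df3 = pd.DataFrame({'col_name3': ['a', 'b'], 'z': [5, 6]})

# Rename every key to 'col_name', then merge on that single name.
merged = (
    df1.rename(columns={'col_name1': 'col_name'})
    .merge(df2.rename(columns={'col_name2': 'col_name'}), on='col_name')
    .merge(df3.rename(columns={'col_name3': 'col_name'}), on='col_name')
)
print(merged)
```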
    
  • 2020-11-22 08:57

    You do not need a MultiIndex to perform join operations. You only need to set the index to the column you want to join on (for example, with df.set_index('Name')).

    join operates on the index by default. In your case, you just have to make the Name column the index of each dataframe. Below is an example.

    A tutorial may be useful.

    # Simple example where each dataframe's index holds the names on which
    # to perform the join operations
    import pandas as pd
    import numpy as np
    name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
    df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
    df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'],         index=name)
    df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'],     index=name)
    df = df1.join(df2)
    df = df.join(df3)
    
    # If 'Name' is a regular column rather than the index of your dataframe,
    # you can make it the index
    # 1) Create a column 'Name' based on the previous index
    df1['Name'] = df1.index
    # 2) Set the 'Name' column as the index
    df1 = df1.set_index('Name')
    
    # If the indexes differ, you may need to adjust the how parameter
    gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
    gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
    gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))
    
    gf = gf1.join(gf2, how='outer')
    gf = gf.join(gf3, how='outer')
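    DataFrame.join also accepts a list of frames, so the two join calls can be collapsed into one. A minimal sketch, assuming each frame starts with 'Name' as a regular column (as it would after reading a CSV); the values are made up:

```python
import pandas as pd

names = ['Sophia', 'Emma', 'Isabella']
# Hypothetical frames with 'Name' as an ordinary column.
df1 = pd.DataFrame({'Name': names, 'A': [1, 2, 3]})
df2 = pd.DataFrame({'Name': names, 'D': [4, 5, 6]})
df3 = pd.DataFrame({'Name': names, 'E': [7, 8, 9]})

# Move 'Name' into the index of every frame, then join the rest onto the first.
indexed = [d.set_index('Name') for d in (df1, df2, df3)]
joined = indexed[0].join(indexed[1:])
print(joined)
```

    Joining a list works on the index only, and the value column names must not overlap between frames.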
    
  • 2020-11-22 09:01

    The three dataframes are

    Let's merge these frames using nested pd.merge

    Here we go, we have our merged dataframe.
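    The frames this answer refers to are not shown here, so the following is only a sketch of the nested pd.merge it describes. It assumes three CSV files whose first column is 'name', as in the question; the file contents and the columns age, city and score are hypothetical, and io.StringIO stands in for real file paths:

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the three files.
csv1 = io.StringIO("name,age\nAnn,30\nBob,25\n")
csv2 = io.StringIO("name,city\nAnn,Oslo\nBob,Rome\n")
csv3 = io.StringIO("name,score\nAnn,9\nBob,7\n")

df1, df2, df3 = (pd.read_csv(f) for f in (csv1, csv2, csv3))

# Inner merge of the first two frames, then merge the result with the third.
merged = pd.merge(pd.merge(df1, df2, on='name'), df3, on='name')
print(merged)
```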

    Happy Analysis!!!
