Get summary data columns in new pandas dataframe from existing dataframe based on other column-ID

北慕城南 submitted on 2021-01-29 16:32:05

Question


I want to summarize the data in a dataframe and add the new columns to another dataframe. My data contains apartments with an ID number, and it has surface and volume values for each room in the apartment. What I want is a dataframe that summarizes this and gives me the total surface and volume per apartment. There are two conditions for the original dataframe:

- the dataframe can contain empty cells
- when the values of surface or volume are equal for all of the rows within an ID (so all the same values for the same ID), the data (surface, volume) is not summed; instead, one value/row is passed to the new summary column (example: 'ID 4'), as this could be a mistake in the original dataframe where the government employee inserted the total surface/volume for all the rooms

Initial dataframe 'data':

print(data)

    ID  Surface  Volume
0    2     10.0    25.0
1    2     12.0    30.0
2    2     24.0    60.0
3    2      8.0    20.0
4    4     84.0   200.0
5    4     84.0   200.0
6    4     84.0   200.0
7   52      NaN     NaN
8   52     96.0   240.0
9   95      8.0    20.0
10  95      6.0    15.0
11  95     12.0    30.0
12  95     30.0    75.0
13  95     12.0    30.0

Desired output from 'df':

print(df)
    ID  Surface  Volume
0    2     54.0   135.0
1    4     84.0   200.0  #-> as the values are the same for each row of this ID in the original data, the sum is not taken, but only one of the rows is passed (see the second condition)
2   52     96.0   240.0
3   95     68.0   170.0

Tried code:

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [2,4,52,95]})

data = pd.DataFrame({"ID":  [2,2,2,2,4,4,4,52,52,95,95,95,95,95],
                     "Surface":  [10,12,24,8,84,84,84,np.nan,96,8,6,12,30,12],
                     "Volume":  [25,30,60,20,200,200,200,np.nan,240,20,15,30,75,30]})

print(data)

# Tried something, but no idea how to do this actually:
df["Surface"] = data.groupby("ID").agg(sum)
df["Volume"] = data.groupby("ID").agg(sum)
print(df)


Answer 1:


Two conditions are needed here. The first tests whether each column has exactly one unique value per group, using GroupBy.transform with DataFrameGroupBy.nunique and comparing the result to 1 with eq. The second uses DataFrame.duplicated on each column together with the ID column to flag repeated rows.

Chain both masks with & (bitwise AND), replace the matched values with NaN using DataFrame.mask, and finally aggregate with sum:

cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())

df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()
print(df)
   ID  Surface  Volume
0   2     54.0   135.0
1   4     84.0   200.0
2  52     96.0   240.0
3  95     68.0   170.0
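
To see why chaining the two masks matters, it may help to look at them separately. The sketch below rebuilds the question's data and lists which rows each mask flags; the names only_m2 and flagged are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
    "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
    "Volume": [25, 30, 60, 20, 200, 200, 200, np.nan, 240, 20, 15, 30, 75, 30],
})
cols = ["Surface", "Volume"]

# m1: True on every row of a group whose column holds a single distinct
# non-NaN value (nunique ignores NaN by default).
m1 = data.groupby("ID")[cols].transform("nunique").eq(1)

# m2: True on rows that repeat an earlier (value, ID) pair in that column.
m2 = data[cols].apply(lambda x: x.to_frame().join(data["ID"]).duplicated())

# Row labels flagged by m2 alone, and by both masks together (Surface column).
only_m2 = data.index[m2["Surface"]].tolist()
flagged = data.index[(m1 & m2)["Surface"]].tolist()
print(only_m2)
print(flagged)
```

On this data, m2 alone also flags the second 12.0 in ID 95, which is a genuine repeated room size and has to stay in the sum. Requiring m1 as well limits the blanking to groups where all values are identical, so only the repeated rows of ID 4 are replaced by NaN.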

If you need new columns filled with the aggregated sums, use GroupBy.transform:

cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())

data[cols] = data[cols].mask(m1 & m2).groupby(data["ID"]).transform('sum')
print(data)
    ID  Surface  Volume
0    2     54.0   135.0
1    2     54.0   135.0
2    2     54.0   135.0
3    2     54.0   135.0
4    4     84.0   200.0
5    4     84.0   200.0
6    4     84.0   200.0
7   52     96.0   240.0
8   52     96.0   240.0
9   95     68.0   170.0
10  95     68.0   170.0
11  95     68.0   170.0
12  95     68.0   170.0
13  95     68.0   170.0
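
As an alternative to the mask-based approach, the same rule can be written as a plain per-group function. This is only a sketch, not the answer's method: the helper total_or_single is a hypothetical name introduced here, and it assumes that a group whose non-empty values are all identical should contribute that single value.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
    "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
    "Volume": [25, 30, 60, 20, 200, 200, 200, np.nan, 240, 20, 15, 30, 75, 30],
})
cols = ["Surface", "Volume"]


def total_or_single(s):
    """Hypothetical helper: one value if all non-NaN values match, else the sum."""
    vals = s.dropna()
    if vals.nunique() == 1:
        # All non-empty cells agree: pass a single value instead of summing.
        return vals.iloc[0]
    # Otherwise sum the non-empty cells (an all-NaN group gives 0.0).
    return vals.sum()


df = data.groupby("ID")[cols].agg(total_or_single).reset_index()
print(df)
```

This version is likely slower on large frames because it calls a Python function once per group and column, but the rule is easier to read and adjust than the two chained masks.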


Source: https://stackoverflow.com/questions/61300891/get-summary-data-columns-in-new-pandas-dataframe-from-existing-dataframe-based-o
