Question
Let's say I have some customer data over several dates and I want to see whether, for example, a customer's address has changed over those dates. Ideally, I'd like to copy the two columns where the changes occurred into a new table, or just get a metric for the total number of changes.
So, if I had a table like
Date , Customer , Address
12/31/14, Cust1, 12 Rocky Hill Rd
12/31/15, Cust1, 12 Rocky Hill Rd
12/31/16, Cust1, 14 Rocky Hill Rd
12/31/14, Cust2, 12 Testing Rd
12/31/15, Cust2, 12 Testing Ln
12/31/16, Cust2, 12 Testing Rd
I would end up with a count of two changes: Cust1's change from 12 Rocky Hill Rd to 14 Rocky Hill Rd between 12/31/15 and 12/31/16, and Cust2's change between 12/31/14 and 12/31/15.
Ideally I could get a table like this
Dates , Customer , Change
12/31/15 to 12/31/16, Cust1, 12 Rocky Hill Rd to 14 Rocky Hill Rd
12/31/14 to 12/31/15, Cust2, 12 Testing Rd to 12 Testing Ln
Or even just a total count of changes would be great. Any ideas? Ideally, I'd have many more dates, possibly multiple changes across those dates, and potentially additional columns I'd like to check for changes as well. Really, just a summation of changes to a customer record over some date period for each column would suffice.
I'm new to pandas and not really sure where to start on this.
Edit: As I note on the solution below, I'd like to be able to pass a larger dataframe with more than just an address to detect changes. For example, I've accomplished this in R with something like the following:
# How many changes have occurred (unique values - 1)
UniLen <- function(x){
  x <- length(unique(x))-1
  return(x)
}
# Create a vector of Address Features to check for changes in
Address_Features <- c("AddrLine1", "AddrLine2", "AddrLine3", "CityName", "State", "ZipCodeNum", "County")
# Check for changes in each address 'use this address for description' for each customer
AddressChanges_Detail <- mktData[,c("CustomerNumEID","AddressUniqueRelationDesc",Address_Features)] %>%
  group_by(CustomerNumEID, AddressUniqueRelationDesc) %>%
  summarise_each(funs(UniLen))
# Summarise results (how many changes for each feature)
AddressChanges_Summary <- AddressChanges_Detail[,Address_Features] %>%
  summarise_each(funs(sum))
This lets us count how many changes occur, but I'm missing the date each change occurred and what the feature was changed from and to. It seems the Python solution you've proposed handles that by using .shift instead of just summarizing unique values per group. Ideally I'd like the best of both worlds :).
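For comparison, a rough pandas counterpart of the R summary above might look like the sketch below. It assumes the sample table from the top of the question; the names `features`, `detail` and `summary` are made up for illustration. Note that counting unique values minus one undercounts a value that changes and later reverts: Cust2 goes Rd, Ln, Rd, which is two changes but only two distinct values.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Date": ["12/31/14", "12/31/15", "12/31/16"] * 2,
    "Customer": ["Cust1"] * 3 + ["Cust2"] * 3,
    "Address": ["12 Rocky Hill Rd", "12 Rocky Hill Rd", "14 Rocky Hill Rd",
                "12 Testing Rd", "12 Testing Ln", "12 Testing Rd"],
})

# Columns to check (hypothetical list; only "Address" exists in the sample).
features = ["Address"]

# Per customer: number of distinct values minus one, mirroring UniLen.
detail = df.groupby("Customer")[features].nunique() - 1

# Total per feature across all customers.
summary = detail.sum()
```

Here `detail` holds one row per customer and `summary` one total per feature, so adding more column names to `features` extends the check without other changes.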
Answer 1:
df
Input dataframe
       Date Customer           Address
0  12/31/14    Cust1  12 Rocky Hill Rd
1  12/31/15    Cust1  12 Rocky Hill Rd
2  12/31/16    Cust1  14 Rocky Hill Rd
3  12/31/14    Cust2     12 Testing Rd
4  12/31/15    Cust2     12 Testing Ln
5  12/31/16    Cust2     12 Testing Rd
Address change function:
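import pandas as pd

def changeAdd(x):
    # Keep the last row of each run of identical addresses
    x = x[x.Address != x.shift(-1).Address]
    # Pair each kept row with the previous kept row
    df1 = pd.DataFrame({'Date': x.shift(1).Date + ' to ' + x.Date,
                        'Customer': x.Customer.max(),
                        'Address': x.shift(1).Address + ' to ' + x.Address})
    # The first kept row has no predecessor, so drop it
    return df1[df1.Address.notnull()]

# Apply per customer and flatten the result into one table
dm = df.groupby('Customer')\
       .apply(changeAdd)\
       .reset_index(drop=True)[['Date','Customer','Address']]
dm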
Output dataframe:
                   Date Customer                               Address
0  12/31/15 to 12/31/16    Cust1  12 Rocky Hill Rd to 14 Rocky Hill Rd
1  12/31/14 to 12/31/15    Cust2        12 Testing Rd to 12 Testing Ln
2  12/31/15 to 12/31/16    Cust2        12 Testing Ln to 12 Testing Rd
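The same shift idea can probably be extended to several columns at once, which is what the edit to the question asks for. The following is only a sketch under some assumptions: `find_changes` and its output column names (`Column`, `DateFrom`, `DateTo`, `Old`, `New`) are hypothetical, dates are assumed to be strings in `%m/%d/%y` format, and missing values are not treated specially (a NaN compared with a NaN would count as a change). Each row is compared with the previous row of the same customer, so the reported date range is the pair of adjacent snapshots around the change.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Date": ["12/31/14", "12/31/15", "12/31/16"] * 2,
    "Customer": ["Cust1"] * 3 + ["Cust2"] * 3,
    "Address": ["12 Rocky Hill Rd", "12 Rocky Hill Rd", "14 Rocky Hill Rd",
                "12 Testing Rd", "12 Testing Ln", "12 Testing Rd"],
})

def find_changes(frame, key, date_col, value_cols):
    """Return one row per detected change (hypothetical helper)."""
    # Sort by parsed dates so "previous row" means the previous snapshot.
    # Assumption: dates look like 12/31/14.
    order = pd.to_datetime(frame[date_col], format="%m/%d/%y")
    frame = frame.assign(_order=order).sort_values([key, "_order"])
    grouped = frame.groupby(key, sort=False)
    prev_date = grouped[date_col].shift(1)
    pieces = []
    for col in value_cols:
        prev_val = grouped[col].shift(1)
        # A change: the row has a predecessor for the same key
        # and its value differs from that predecessor's value.
        changed = prev_date.notna() & (frame[col] != prev_val)
        pieces.append(pd.DataFrame({
            key: frame.loc[changed, key],
            "Column": col,
            "DateFrom": prev_date[changed],
            "DateTo": frame.loc[changed, date_col],
            "Old": prev_val[changed],
            "New": frame.loc[changed, col],
        }))
    return pd.concat(pieces, ignore_index=True)

# Detail table: one row per change, with dates and old/new values.
changes = find_changes(df, "Customer", "Date", ["Address"])

# Summary: number of changes per checked column.
counts = changes.groupby("Column").size()
```

Passing more column names in `value_cols` adds their changes to the same long table, and `counts` then gives the per-column totals.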
Source: https://stackoverflow.com/questions/42959330/how-to-tell-if-a-value-changed-over-dimensions-in-pandas