问题
I have a datframe with around 200 features and 3000 rows. These data samples are logged in different time, basically one per month, as shown in the below example in “col101”:
0 col1 (id) col2. col3 …. col100 col101 (date) … col2000 (target value)
1 001 653. 675 …. 343.3 01-02-2017. … 1
2 001 673. 432 …. 387.3 01-03-2017. … 0
3 001 679. 528 …. 401.2 01-04-2017. … 1
4 001 685 223 …. 503.4 01-05-2017. … 1
5 002 343 428 …. 432.5 01-02-2017. … 0
6 002 479. 421 …. 455.3 01-03-2017. … 0
7 … … … …. … …. … ..
Within these features some of are cumulative data so that in every month their values have been increased. For example, col2 and col100 are the cumulative features in my dataframe. So I want to add one more column for each cumulative feature, with the difference with respect to the previous month. So my desired dataframe should be something like this:
0 col1 (id) col2. col2c …. col100 col100c col101 (date) … col2000 (targeva)
1 001 653. 653 …. 343.3 343.3 01-02-2017. … 1
2 001 673. 23 …. 387.3 44 01-03-2017. … 0
3 001 679. 6 …. 401.2 13.9 01-04-2017. … 1
4 001 685 6 …. 503.4 102.2 01-05-2017. … 1
5 002 343 343 …. 432.5 432.5 01-02-2017. … 0
6 002 479. 136 …. 455.3 23.2 01-03-2017. … 0
7 … … … …. … …. … ..
Now, I have two problems here: 1) how can I automatically recognize those cumulative features with 200 features? and how to add that extra feature (e.g., col22c and col100c) for each cumulative attribute? Does anyone know how I can handle this?
回答1:
About differentiating two columns, you can use pandas built-in diff()
function. diff()
calculates the difference of each element compared to the previous one. But note that because the first element doesn't have any previous element, the first element in the result of diff()
would be NaN
. So we use the built-in function dropna()
to drop all NaN
values.
But as for detecting the cumulative columns, I don't think there would be any way. You CAN find all the columns that are always increasing (monotonic), but that doesn't mean they are cumulative necessarily.
Anyway for detecting the monotonic columns, you can first get their diff().dropna()
and then check if all if these values are positive:
df = some_data_frame
col_diff = df['some_column'].diff().dropna()
is_monotonic = all(col_diff > 0)
Note that if you forget the dropna()
, the result of all(col_diff > 0)
would always be False
(because NaN
is a Falsy value)
来源:https://stackoverflow.com/questions/55195045/finding-cumulative-features-in-dataframe