How to scale dataframes consistently with MinMaxScaler() in sklearn

Posted by 佐手、 on 2020-05-28 13:43:53

Question


I have three data frames that are each scaled individually with MinMaxScaler().

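from sklearn.preprocessing import MinMaxScaler

def scale_dataframe(values_to_be_scaled):
    # Cast to float so integer columns are scaled as floating-point values
    values = values_to_be_scaled.astype('float64')
    # A new scaler is fitted on every call, so min/max come from this input only
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled = scaler.fit_transform(values)

    return scaled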

scaled_values = []
for i in range(0,num_df):
    scaled_values.append(scale_dataframe(df[i].values))

The problem I am having is that each dataframe gets scaled according to its own individual set of column min and max values. I need all of my dataframes to scale to the same values as if they all shared the same set of column min and max values for the data overall. Is there a way to accomplish this with MinMaxScaler()? One option would be to make one large dataframe, then scale the dataframe before partitioning, but this would not be ideal.


Answer 1:


Check out the sklearn docs for MinMaxScaler.

As you can see there, MinMaxScaler supports partial_fit(), which updates the running per-column min and max one batch at a time. This allows online or minibatch scaling, and you control the minibatches.

Example:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

a = np.array([[1,2,3]])
b = np.array([[10,20,30]])
c = np.array([[5, 10, 15]])

# Scale on all datasets together in one batch
offline_scaler = MinMaxScaler()
offline_scaler.fit(np.vstack((a, b, c)))                # fit on whole data at once
a_offline_scaled = offline_scaler.transform(a)
b_offline_scaled = offline_scaler.transform(b)
c_offline_scaled = offline_scaler.transform(c)
print('Offline scaled')
print(a_offline_scaled)
print(b_offline_scaled)
print(c_offline_scaled)

# Scale on all datasets together in minibatches
online_scaler = MinMaxScaler()
online_scaler.partial_fit(a)                            # partial fit 1
online_scaler.partial_fit(b)                            # partial fit 2
online_scaler.partial_fit(c)                            # partial fit 3
a_online_scaled = online_scaler.transform(a)
b_online_scaled = online_scaler.transform(b)
c_online_scaled = online_scaler.transform(c)
print('Online scaled')
print(a_online_scaled)
print(b_online_scaled)
print(c_online_scaled)

Output:

Offline scaled
[[ 0.  0.  0.]]
[[ 1.  1.  1.]]
[[ 0.44444444  0.44444444  0.44444444]]
Online scaled
[[ 0.  0.  0.]]
[[ 1.  1.  1.]]
[[ 0.44444444  0.44444444  0.44444444]]
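
Applied to the list of dataframes from the question, the same idea might look like the sketch below. The helper name scale_dataframes, the column names and the sample values are made up for illustration, and the sketch assumes every dataframe has the same numeric columns in the same order. It makes one pass of partial_fit() over all frames to collect the shared min and max, then a second pass to transform each frame.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_dataframes(dataframes, feature_range=(0, 1)):
    """Scale several DataFrames with one shared MinMaxScaler.

    Hypothetical helper, not part of sklearn. Assumes every frame has
    the same numeric columns in the same order.
    """
    scaler = MinMaxScaler(feature_range=feature_range)
    # First pass: accumulate the per-column min and max across all frames.
    for frame in dataframes:
        scaler.partial_fit(frame.to_numpy(dtype="float64"))
    # Second pass: transform each frame with the shared statistics,
    # keeping the original index and column labels.
    scaled = [
        pd.DataFrame(
            scaler.transform(frame.to_numpy(dtype="float64")),
            index=frame.index,
            columns=frame.columns,
        )
        for frame in dataframes
    ]
    return scaled, scaler


# Made-up example data, mirroring the arrays a, b and c above.
df = [
    pd.DataFrame([[1, 2, 3]], columns=["x", "y", "z"]),
    pd.DataFrame([[10, 20, 30]], columns=["x", "y", "z"]),
    pd.DataFrame([[5, 10, 15]], columns=["x", "y", "z"]),
]
scaled_values, shared_scaler = scale_dataframes(df)
print(shared_scaler.data_min_)  # per-column minimum over all frames
print(shared_scaler.data_max_)  # per-column maximum over all frames
```

Returning the fitted scaler alongside the scaled frames lets you call inverse_transform() later, or scale new data with the same min and max.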


Source: https://stackoverflow.com/questions/47732108/how-to-scale-dataframes-consistently-minmaxscaler-sklearn
