My data consists of a mix of continuous and categorical features. Below is a small snippet of what my data looks like in CSV format (consider it as data collected by a su
You may also consider making the categorical variables numerical, e.g. via indicator variables, a procedure also known as one-hot encoding.
Try
from sklearn.preprocessing import OneHotEncoder
and fit it to your categorical data, followed by a numerical estimation method such as linear regression. As long as there aren't too many categories (a city column may have too many levels), this can work well.
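For concreteness, here is a minimal sketch of that pipeline. The file name survey.csv, the target column income, and the numeric column age are assumptions for illustration (only city comes from your description); swap in your own column lists:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("survey.csv")      # hypothetical file name
categorical_cols = ["city"]         # your categorical columns
numeric_cols = ["age"]              # your continuous columns (made up here)

# One-hot encode the categorical columns, pass numeric columns through unchanged
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)

model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
model.fit(df[categorical_cols + numeric_cols], df["income"])  # hypothetical target
```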
As for discretization of continuous variables, you may consider binning with adaptive bin sizes or, equivalently, uniform binning after histogram equalization. numpy.histogram may be helpful here. Also, while Fayyad-Irani discretization isn't implemented in sklearn, feel free to check out sklearn.cluster for adaptive discretizations of your data (even if it is only 1D), e.g. via KMeans.
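As a rough sketch of both ideas on a single continuous column (the array x, the bin counts, and the number of clusters below are placeholder choices, not values from your data):

```python
import numpy as np
from sklearn.cluster import KMeans

x = np.random.default_rng(0).normal(size=1000)   # placeholder 1D feature

# Uniform binning: numpy.histogram gives the edges, digitize assigns bin labels
counts, edges = np.histogram(x, bins=10)
uniform_bins = np.digitize(x, edges[1:-1])

# Adaptive ("equal-count") binning via quantile edges
quantile_edges = np.quantile(x, np.linspace(0, 1, 11))
quantile_bins = np.digitize(x, quantile_edges[1:-1])

# Adaptive 1D discretization via KMeans: cluster labels serve as bin indices
km = KMeans(n_clusters=5, n_init=10, random_state=0)
kmeans_bins = km.fit_predict(x.reshape(-1, 1))
```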