Mixing categorical and continuous data in a Naive Bayes classifier using scikit-learn

逝去的感伤 2020-12-02 05:56

I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Amongst others, I want to use the Naive Bayes classifier

3 Answers
  • 2020-12-02 06:13

    The simple answer: multiply the results. It amounts to the same thing.

    Naive Bayes applies Bayes’ theorem with the “naive” assumption that every pair of features is independent given the class. This means you can compute the class-conditional probability of each feature on its own, without reference to the others, and the algorithm then multiplies the probability from one feature by the probability from the next. The denominator can be ignored, since it is just a normalizer.

    So the right answer is:

    1. Calculate the probability from the categorical variables.
    2. Calculate the probability from the continuous variables.
    3. Multiply the results of steps 1 and 2.
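
A minimal sketch of the three steps above, assuming scikit-learn's CategoricalNB and GaussianNB (the answer does not name specific estimators) and a toy dataset made up for illustration. One detail worth flagging: the posteriors returned by each fitted model already contain the class prior, so multiplying them directly would count the prior twice. The sketch therefore works in log space and subtracts the log prior once before renormalizing.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Toy data (made up for illustration): two integer-coded categorical
# columns and two continuous columns describing the same six samples.
X_cat = np.array([[0, 0], [1, 1], [2, 1], [1, 1], [0, 2], [2, 0]])
X_num = np.array([[180.9, 75.0], [165.2, 61.5], [166.3, 60.3],
                  [173.0, 68.2], [178.4, 71.0], [160.1, 55.4]])
y = np.array([0, 0, 1, 1, 0, 1])

# Steps 1 and 2: one model per feature type.
cat_nb = CategoricalNB().fit(X_cat, y)
num_nb = GaussianNB().fit(X_num, y)

# Step 3: multiplying probabilities is adding log probabilities.
# Both log posteriors include the class prior, so subtract it once.
log_joint = (cat_nb.predict_log_proba(X_cat)
             + num_nb.predict_log_proba(X_num)
             - np.log(num_nb.class_prior_))

# Renormalize so that each row sums to 1 again.
log_post = log_joint - logsumexp(log_joint, axis=1, keepdims=True)
proba = np.exp(log_post)
pred = num_nb.classes_[np.argmax(proba, axis=1)]
```

Here `proba` holds the combined class probabilities and `pred` the predicted labels. Both models see the same `y`, so their empirical class priors agree and either one can supply the correction term.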
  • 2020-12-02 06:14

    You have at least two options:

    • Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person, create the following bins: "very small", "small", "regular", "big", "very big", ensuring that each bin contains approximately 20% of the population of your training set. We don't have any utility to perform this automatically in scikit-learn, but it should not be too complicated to do it yourself. Then fit a single multinomial NB on this categorical representation of your data.

    • Independently fit a Gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform the whole dataset by taking the class assignment probabilities (obtained with the predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)). Finally, fit a new model (e.g. a new Gaussian NB) on the new features.
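
Both options can be sketched roughly as follows. This is an illustration on synthetic data, not the answerer's code, and it rests on a few assumptions: recent scikit-learn releases provide KBinsDiscretizer, which is used here for the percentile binning; CategoricalNB is swapped in for the multinomial model in the first option, because the binned features are category codes rather than counts; and in the second option the categorical column is one-hot encoded before it is passed to MultinomialNB.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB, MultinomialNB
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Synthetic data, made up for illustration: one integer-coded categorical
# column (3 levels) and two continuous columns (height in cm, weight in kg).
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X_cat = rng.integers(0, 3, size=(n, 1))
X_num = np.column_stack([
    rng.normal(165 + 12 * y, 7),
    rng.normal(60 + 15 * y, 8),
])

# Option 1: bin each continuous column into 5 quantile bins (about 20% of
# the training samples per bin), then fit one categorical NB on everything.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X_num).astype(int)
X_all_cat = np.hstack((X_cat, X_binned))
binned_nb = CategoricalNB().fit(X_all_cat, y)
pred_binned = binned_nb.predict(X_all_cat)

# Option 2: fit one model per feature type, then stack their class
# probabilities as new features for a second-stage Gaussian NB.
onehot = OneHotEncoder(handle_unknown="ignore")
X_onehot = onehot.fit_transform(X_cat)
multinomial_nb = MultinomialNB().fit(X_onehot, y)
gaussian_nb = GaussianNB().fit(X_num, y)
multinomial_probas = multinomial_nb.predict_proba(X_onehot)
gaussian_probas = gaussian_nb.predict_proba(X_num)
X_stacked = np.hstack((multinomial_probas, gaussian_probas))
stacked_nb = GaussianNB().fit(X_stacked, y)
pred_stacked = stacked_nb.predict(X_stacked)
```

For brevity, the second-stage model is fit on in-sample probabilities. Generating them out of fold, for example with cross_val_predict(..., method="predict_proba"), is one way to reduce leakage.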

  • 2020-12-02 06:15

    Hope I'm not too late. I recently wrote a library called Mixed Naive Bayes, written in NumPy. It can assume a mix of Gaussian and categorical (multinoulli) distributions on the training data features.

    https://github.com/remykarem/mixed-naive-bayes

    The library is written such that the APIs are similar to scikit-learn's.

    In the example below, let's assume that the first 2 features follow a categorical distribution and the last 2 are Gaussian. When constructing the classifier, just specify categorical_features=[0,1] to indicate that columns 0 and 1 follow a categorical distribution.

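    from mixed_naive_bayes import MixedNB

    # Columns 0 and 1 are categorical (integer-coded); columns 2 and 3 are continuous.
    X = [[0, 0, 180.9, 75.0],
         [1, 1, 165.2, 61.5],
         [2, 1, 166.3, 60.3],
         [1, 1, 173.0, 68.2],
         [0, 2, 178.4, 71.0]]
    y = [0, 0, 1, 1, 0]

    # Treat columns 0 and 1 as categorical; the remaining columns are assumed Gaussian.
    clf = MixedNB(categorical_features=[0, 1])
    clf.fit(X, y)
    clf.predict(X)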
    

    Installable via pip install mixed-naive-bayes. More information on usage is in the README.md file. Pull requests are greatly appreciated :)
