How can I use sklearn.naive_bayes with (multiple) categorical features?

时光说笑 2021-01-31 19:04

I want to learn a Naive Bayes model for a problem where the class is boolean (takes on one of two values). Some of the features are boolean, but other features are categorical a

2 answers
  •  陌清茗
    陌清茗 (original poster)
    2021-01-31 19:35

    Some of the features are boolean, but other features are categorical and can take on a small number of values (~5).

    This is an interesting question, but it is actually more than a single one:

    1. How to deal with a categorical feature in NB.
    2. How to deal with non-homogeneous features in NB (and, as I'll point out in the following, even two categorical features are non-homogeneous).
    3. How to do this in sklearn.

    Consider first a single categorical feature. NB assumes that the features are conditionally independent given the class. Your idea of transforming this feature into several binary variables is exactly that of dummy variables. Clearly, these dummy variables are anything but independent: exactly one of them is 1 in every row. Running a Bernoulli NB on the result nevertheless implicitly assumes independence. While it is known that, in practice, NB does not necessarily break when faced with dependent variables, there is no reason to transform the problem into the worst configuration for NB, especially as multinomial NB is a very easy alternative.
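
    The dependence between dummy variables can be checked directly. The snippet below is a minimal sketch on made-up data (the array `values` and its three levels are assumptions for illustration, not part of the question):

```python
import numpy as np

# Hypothetical categorical feature with 3 levels, encoded as integers 0..2.
values = np.array([0, 1, 2, 1, 0, 2, 2, 1])

# One-hot (dummy) encoding: one column per level.
dummies = np.eye(3, dtype=int)[values]

# Every row sums to 1, so each column is fully determined by the others.
row_sums = dummies.sum(axis=1)

# Columns 0 and 1 are never both 1 in the same row.
co_occurrences = int((dummies[:, 0] * dummies[:, 1]).sum())

print(row_sums, co_occurrences)
```

    Because one column being 1 forces all the others to 0, treating the columns as independent Bernoulli features contradicts the NB assumption by construction.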

    Conversely, suppose that after transforming the single categorical variable into a multi-column dataset using the dummy variables, you use a multinomial NB. The theory for multinomial NB states:

    With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial ... where p_i is the probability that event i occurs. A feature vector ... is then a histogram, with x_i counting the number of times event i was observed in a particular instance. This is the event model typically used for document classification, with events representing the occurrence of a word in a single document (see bag of words assumption).

    So, here, each instance of your single categorical variable is a "length-1 paragraph", and the distribution is exactly multinomial. Specifically, each row has 1 in one position and 0 in all the rest because a length-1 paragraph must have exactly one word, and so those will be the frequencies.

    Note that, from the point of view of sklearn's multinomial NB, the fact that the dataset has 5 columns does not imply an assumption of independence between them.
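
    As a rough illustration of the "length-1 paragraph" view, the sketch below one-hot encodes a single categorical feature and fits `MultinomialNB` on it. The toy arrays `x` and `y` and the choice of 3 levels are assumptions made up for this example. With `alpha=1.0`, the fitted per-class feature probabilities should coincide with the Laplace-smoothed category frequencies computed by hand:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: one categorical feature with 3 levels, boolean class.
x = np.array([0, 0, 1, 2, 2, 2, 1, 0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

n_levels = 3
X_onehot = np.eye(n_levels)[x]  # each row is a "length-1 document"

clf = MultinomialNB(alpha=1.0).fit(X_onehot, y)
fitted = np.exp(clf.feature_log_prob_)  # estimated P(level | class)

# Hand-computed Laplace-smoothed estimate of P(level | class).
counts = np.array(
    [[np.sum((x == k) & (y == c)) for k in range(n_levels)] for c in (0, 1)]
)
manual = (counts + 1.0) / (counts.sum(axis=1, keepdims=True) + n_levels)

print(np.allclose(fitted, manual))
```

    Each row of `fitted` is a single distribution over the levels of the feature, which is why the columns are not being treated as independent features here.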


    Now consider the case where you have a dataset consisting of several features:

    1. Categorical
    2. Bernoulli
    3. Normal

    Under the very assumption of using NB, these variables are conditionally independent given the class. Consequently, you can do the following:

    1. Build a NB classifier for each categorical feature separately, using your dummy variables and a multinomial NB.
    2. Build a single NB classifier for all of the Bernoulli features at once. This works because sklearn's Bernoulli NB is simply a shortcut for several single-feature Bernoulli NBs.
    3. Same as 2 for all the normal features, using a Gaussian NB.

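    By the definition of independence, the class-conditional likelihood of an instance is the product of the likelihoods assigned by these classifiers. If you multiply the classifiers' posterior probabilities instead, the class prior is counted once per classifier, so divide out the extra copies and renormalize.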
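
    One way to put the three steps together is sketched below. The data is randomly generated and the helper `combined_proba` is a hypothetical name introduced for this example, not an sklearn function. It sums the log posteriors of the separate classifiers and subtracts the class log prior once for every model beyond the first, since each fitted model already includes that prior:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# Hypothetical features: one 5-level categorical, two boolean, two continuous.
cat = rng.integers(0, 5, size=n)
X_cat = np.eye(5)[cat]                                # dummy variables
X_bool = rng.integers(0, 2, size=(n, 2))
X_num = rng.normal(loc=y[:, None], scale=1.0, size=(n, 2))

# One multinomial NB per categorical feature, one Bernoulli NB, one Gaussian NB.
models = [
    (MultinomialNB().fit(X_cat, y), X_cat),
    (BernoulliNB().fit(X_bool, y), X_bool),
    (GaussianNB().fit(X_num, y), X_num),
]

def combined_proba(models, log_prior):
    """Combine per-block posteriors under the NB independence assumption."""
    # Summing log posteriors counts the class prior once per model.
    log_post = sum(m.predict_log_proba(X) for m, X in models)
    # Remove the extra copies of the prior, keeping exactly one.
    log_post = log_post - (len(models) - 1) * log_prior
    # Normalize in a numerically stable way.
    log_post = log_post - log_post.max(axis=1, keepdims=True)
    p = np.exp(log_post)
    return p / p.sum(axis=1, keepdims=True)

# Empirical class prior, matching the default prior each model fits.
log_prior = np.log(np.bincount(y, minlength=2) / n)
proba = combined_proba(models, log_prior)
print(proba[:3])
```

    To score new instances, pass the same fitted models paired with the corresponding blocks of the new data. This sketch assumes every model was fitted with its default, empirically estimated class prior; with custom priors the correction term would need to change accordingly.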
