Question
Here is what I am trying to do. I have a CSV file in which column 1 holds people's names (e.g. "Michael Jordan", "Anderson Silva", "Muhammad Ali") and column 2 holds their ethnicity (e.g. English, French, Chinese).
In my code, I load all the data into a pandas data frame, then create two additional data frames: one with only Chinese names and one with only non-Chinese names. From each of these I create a separate list of names.
The three_split function extracts features from each name by splitting it into three-character substrings. For example, "Katy Perry" becomes "kat", "aty", "ty_", "y_p", and so on (the function lowercases the name and replaces spaces with underscores).
Then I train with Naive Bayes and finally test the results.
There aren't any errors when I run my code, but when I test non-Chinese names taken directly from the dataset, expecting the program to return False (not Chinese), it returns True (Chinese) for every name I test. Any idea why?
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.classify import PositiveNaiveBayesClassifier
# Get csv file into data frame
data = pd.read_csv(r"C:\Users\KubiK\Dropbox\Python exercises_KW\_Scraping\BeautifulSoup\FamilySearch.org\FamSearch_Analysis\OddNames_sampleData3.csv",
                   encoding="utf-8")  # raw string so the backslashes are not read as escapes
df = DataFrame(data)
df.columns = ["name", "ethnicity"]
# Recategorize different ethnicities into 1) Chinese or 2) non-Chinese; and then create separate lists
df_chinese = df[(df["ethnicity"] == "chinese") | (df["ethnicity"] == "Chinese")]
chinese_names = list(df_chinese["name"])
df_nonchinese = df[(df["ethnicity"] != "chinese") & (df["ethnicity"] != "Chinese") & (df["ethnicity"].notnull() == True)]
nonchinese_names = list(df_nonchinese["name"])
# Function to split word string into three-character substrings
def three_split(word):
    word = str(word).lower().replace(" ", "_")
    split = 3
    return dict(("contains(%s)" % word[start:start+split], True)
                for start in range(0, len(word)-2))
# Training naive bayes machine learning algorithm
positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, unlabeled_featuresets)
# Testing results
name = "Hubert Gillies" # A non-Chinese name from the dataset
print(classifier.classify(three_split(name)))
# Output: True (wrong; expected False)
Answer 1:
There could be many reasons why you don't get the desired results. Most often it's one of the following:
- Features are not strong enough
- Not enough training data
- Wrong classifier
- Code bugs in NLTK classifiers
For the first three reasons, there's no way to verify or resolve them unless you post a link to your dataset so we can take a look at how to fix it. As for the last reason, there shouldn't be any such bugs in the basic NaiveBayes and PositiveNaiveBayes classifiers.
So the questions to ask are:
- How many training data instances (i.e. rows) do you have?
- Why didn't you normalize your labels (i.e. chinese|Chinese -> chinese) after you've read the dataset before extracting the features?
- What other features could you consider?
- Have you considered using NaiveBayes instead of PositiveNaiveBayes?
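On the label and classifier questions, here is a minimal sketch of what normalizing the labels and switching to the plain NaiveBayesClassifier could look like. The rows list is made-up toy data standing in for your CSV, and the binary True/False label is my own choice; three_split is your function, just written as a dict comprehension. Since you already have explicit non-Chinese examples, they can be passed in as labeled negatives, whereas PositiveNaiveBayesClassifier treats its second argument as unlabeled data.

```python
from nltk.classify import NaiveBayesClassifier


def three_split(word):
    """Map a name to boolean features, one per three-character substring."""
    word = str(word).lower().replace(" ", "_")
    return {"contains(%s)" % word[i:i + 3]: True for i in range(len(word) - 2)}


# Hypothetical toy rows (name, ethnicity) standing in for the real CSV.
# Note the inconsistent casing and stray whitespace in the labels.
rows = [
    ("Wang Xiaoming", "Chinese"),
    ("Zhang Wei", "chinese"),
    ("Li Xiuying", "Chinese "),
    ("Chen Jing", "chinese"),
    ("Hubert Gillies", "English"),
    ("Katy Perry", "English"),
    ("Pierre Dubois", "French"),
    ("Anderson Silva", "Brazilian"),
]

# Normalize each label once, then collapse it to a binary True/False label.
# With the DataFrame from the question, the equivalent normalization would be
# df["ethnicity"].str.strip().str.lower().
labeled_featuresets = [
    (three_split(name), ethnicity.strip().lower() == "chinese")
    for name, ethnicity in rows
]

# Train on explicitly labeled examples (both True and False).
classifier = NaiveBayesClassifier.train(labeled_featuresets)

prediction = classifier.classify(three_split("Hubert Gillies"))
print(prediction)
```

With a real dataset you would hold out part of the rows and measure accuracy on that part (for example with nltk.classify.accuracy) instead of testing on names the classifier was trained on.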
Source: https://stackoverflow.com/questions/29441324/unable-to-use-pandas-and-nltk-to-train-naive-bayes-machine-learning-in-python