Question
I'm a newbie to AI and want to work through the exercise below. Can you please suggest a way to achieve it using Python?
Scenario: I have a list of the business areas of some companies, for example:
1. AI
2. Artificial Intelligence
3. VR
4. Virtual reality
5. Mobile application
6. Desktop softwares
and I want to categorize them as follows:
Technology ---> Category
1. AI ---> Category Artificial Intelligence
2. Artificial Intelligence ---> Category Artificial Intelligence
3. VR ---> Category Virtual Reality
4. Virtual reality ---> Category Virtual Reality
5. Mobile application ---> Category Application
6. Desktop softwares ---> Category Application
That is, when I receive a text like AI or Artificial Intelligence, it must identify AI and Artificial Intelligence as one and the same and put both keywords under the Artificial Intelligence category.
My current approach uses a lookup table, but I want to apply TEXT CLASSIFICATION to the technologies/businesses in the above input using Python, so that I can segregate the technologies without a lookup table.
Please suggest any relevant approach.
Answer 1:
Here's one approach using sklearn. In the past I would have used LabelBinarizer(), but it won't work in a pipeline because it no longer accepts X, y as inputs.
If you are a newbie, pipelines can be a bit confusing, but essentially they just process the data in steps before passing it to a classifier. Here, I am converting X into an n-gram "matrix" (a table) of word and character tokens, and then passing that to a classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion

# Training texts (one per row) and their target categories
X = np.array([['AI'],
              ['Artificial Intelligence'],
              ['VR'],
              ['Virtual Reality'],
              ['Mobile application'],
              ['Desktop softwares']])
y = np.array(['Artificial Intelligence', 'Artificial Intelligence',
              'Virtual Reality', 'Virtual Reality', 'Application', 'Application'])

pipeline = Pipeline(steps=[
    # Combine word-level and character-level n-gram features side by side
    ('union', FeatureUnion([
        ('word_vec', CountVectorizer(binary=True, analyzer='word', ngram_range=(1, 2))),
        ('char_vec', CountVectorizer(analyzer='char', ngram_range=(2, 5)))
    ])),
    # Classify the combined feature matrix
    ('lreg', LogisticRegression())
])

# The vectorizers expect a 1-D sequence of strings, so flatten X first
pipeline.fit(X.ravel(), y)
print(pipeline.predict(['web application', 'web app', 'dog', 'super intelligence']))
Predicts:
['Application' 'Application' 'Virtual Reality' 'Artificial Intelligence']
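Note that 'dog' still gets a category, because the classifier always picks one of the classes it was trained on. 'dog' shares no word or character n-grams with the six training texts, so its feature vector is all zeros and the prediction comes from the model's intercepts alone. One possible way to handle such inputs, sketched below, is to look at predict_proba and fall back to a placeholder when the top probability is low. The helper predict_with_threshold, the 0.5 cutoff, and the 'Unknown' label are assumptions for illustration, not part of the original answer, and the cutoff would need tuning on real data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, FeatureUnion

# Same training data and pipeline as in the answer above
X = ['AI', 'Artificial Intelligence', 'VR', 'Virtual Reality',
     'Mobile application', 'Desktop softwares']
y = ['Artificial Intelligence', 'Artificial Intelligence',
     'Virtual Reality', 'Virtual Reality', 'Application', 'Application']

pipeline = Pipeline(steps=[
    ('union', FeatureUnion([
        ('word_vec', CountVectorizer(binary=True, analyzer='word', ngram_range=(1, 2))),
        ('char_vec', CountVectorizer(analyzer='char', ngram_range=(2, 5)))
    ])),
    ('lreg', LogisticRegression())
])
pipeline.fit(X, y)


def predict_with_threshold(model, texts, threshold=0.5, fallback='Unknown'):
    """Hypothetical helper: return the top class only if it is confident enough."""
    probs = model.predict_proba(texts)       # shape: (n_texts, n_classes)
    labels = model.classes_[probs.argmax(axis=1)]
    confident = probs.max(axis=1) >= threshold
    return [str(label) if ok else fallback
            for label, ok in zip(labels, confident)]


texts = ['web application', 'web app', 'dog', 'super intelligence']
probs = pipeline.predict_proba(texts)

# Show the per-class probabilities behind each prediction
for text, row in zip(texts, probs):
    print(text, dict(zip(pipeline.classes_, row.round(2))))

# 'dog' has no features in common with the training data
dog_features = pipeline.named_steps['union'].transform(['dog'])
print('non-zero features for "dog":', dog_features.nnz)

print(predict_with_threshold(pipeline, texts))
```

With only six training examples the probabilities are rough at best, so treat the threshold as a sanity check rather than a calibrated confidence.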
Source: https://stackoverflow.com/questions/46924600/categories-busineesses-with-text-analytics-in-python