Machine learning with multiple feature types in Python

Posted by 泪湿孤枕 on 2019-12-23 05:42:18

Question


I am able to do some simple machine learning using the scikit-learn and NLTK modules in Python, but I run into problems when training with multiple features that have different value types (number, list of strings, yes/no, etc.). In the following data, I have a word/phrase column from which I extract information to create the other columns (for example, the length column is the character length of 'word/phrase'). The Label column is the label.

Word/phrase   Length   2-letter substring                                 First letter   With space?   Label
take action   10       ['ta', 'ak', 'ke', 'ac', 'ct', 'ti', 'io', 'on']   t              Yes           A
sure          4        ['su', 'ur', 're']                                 s              No            A
That wasn't   10       ['th', 'ha', 'at', 'wa', 'as', 'sn', 'nt']         t              Yes           B
simply        6        ['si', 'im', 'mp', 'pl', 'ly']                     s              No            C
a lot of      6        ['lo', 'ot', 'of']                                 a              Yes           D
said          4        ['sa', 'ai', 'id']                                 s              No            B

Should I combine them into one dictionary per row and then use sklearn's DictVectorizer to hold them in memory? And then treat these features as a single X when training the ML algorithms?


Answer 1:


The majority of machine learning algorithms work with numbers, so you have to transform your categorical values and strings into numbers.
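As a rough illustration of the dictionary approach raised in the question, here is a minimal sketch using scikit-learn's DictVectorizer. The helper names `bigrams` and `to_feature_dict` are made up for this example, and the feature definitions (length counted without spaces, bigrams taken inside each word with apostrophes dropped) are assumptions chosen to match the table above. The decision tree is only a placeholder classifier, not a recommendation.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier


def bigrams(text):
    """Hypothetical helper: two-letter substrings inside each word."""
    out = []
    for word in text.lower().replace("'", "").split():
        out.extend(word[i:i + 2] for i in range(len(word) - 1))
    return out


def to_feature_dict(phrase):
    """Hypothetical helper: one dict of mixed-type features per phrase."""
    features = {
        "length": len(phrase.replace(" ", "")),  # number: passed through as-is
        "first_letter": phrase[0].lower(),       # string: one-hot encoded
        "with_space": int(" " in phrase),        # yes/no: stored as 1/0
    }
    # Each bigram becomes its own indicator key instead of a list value.
    for bg in bigrams(phrase):
        features["bigram=" + bg] = 1
    return features


phrases = ["take action", "sure", "That wasn't", "simply", "a lot of", "said"]
labels = ["A", "A", "B", "C", "D", "B"]

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform([to_feature_dict(p) for p in phrases])

# X is a single numeric matrix, so any scikit-learn estimator can consume it.
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
```

String values such as `first_letter` are expanded into one column per distinct value (for example `first_letter=t`), while numeric values keep a single column, so all feature types end up side by side in one matrix.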

The popular Python machine-learning library scikit-learn has a whole chapter dedicated to data preprocessing. 'Yes/no' values are the easy case: just encode them as 1/0.

Among many other important things, that chapter explains how to preprocess categorical data using OneHotEncoder.
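A small sketch of what OneHotEncoder does with a categorical column, using the 'First letter' values from the question as sample data. The `handle_unknown="ignore"` setting is a choice made for this example, so that letters not seen during fitting become all-zero rows rather than raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column, shaped (n_samples, 1) as the encoder expects.
first_letters = np.array([["t"], ["s"], ["t"], ["s"], ["a"], ["s"]])

encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(first_letters).toarray()

# Categories are sorted, so the columns correspond to 'a', 's', 't'.
print(encoder.categories_[0])
print(one_hot)
```

Each row contains a single 1 in the column of its category, which avoids implying an order among the letters the way integer codes such as a=0, s=1, t=2 would.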

When you work with text, you also have to transform your data into a suitable numeric form. One common feature extraction strategy for text is the tf-idf score, and I wrote a tutorial here.
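For short phrases like the ones in the question, one possible way to apply tf-idf is at the character level. The sketch below uses TfidfVectorizer with character bigrams; the choice of `analyzer="char_wb"` and `ngram_range=(2, 2)` is an assumption for illustration, and it differs slightly from the question's '2-letter substring' column because each word is padded with a space on both sides and apostrophes are kept.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

phrases = ["take action", "sure", "That wasn't", "simply", "a lot of", "said"]

# Character bigrams taken within word boundaries; text is lowercased by default.
tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 2))
tfidf_matrix = tfidf.fit_transform(phrases)  # sparse, one row per phrase

# vocabulary_ maps each bigram to its column index in tfidf_matrix.
print(sorted(tfidf.vocabulary_)[:10])
print(tfidf_matrix.shape)
```

The result is a sparse matrix with one row per phrase. It can be stacked next to the other numeric columns (for example with `scipy.sparse.hstack`) before training.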



Source: https://stackoverflow.com/questions/32685445/machine-learning-with-multiple-feature-types-in-python
