Text classification performance

Submitted by 余生颓废 on 2019-12-08 06:11:21

Question


So I am using the textblob Python library, but its performance is lacking.

I already serialize the classifier and load it before the loop (using pickle).

It currently takes ~0.1 s per classification with the small training set and ~0.3 s with the classifier trained on 33,000 examples. I need to make it faster; is that even possible?

Some code:

# Load trained classifiers before the loop, so we don't pay the cost per record
trained_text_classifiers = load_serialized_classifier_trainings(config["ALL_CLASSIFICATORS"])

# Specify which classifiers are used by which classes
filter_classifiers = get_classifiers_by_resource_names(trained_text_classifiers, config["FILTER_CLASSIFICATORS"])
signal_classifiers = get_classifiers_by_resource_names(trained_text_classifiers, config["SIGNAL_CLASSIFICATORS"])

for (url, headers, body) in iter_warc_records(warc_file, **warc_filters):
    start_time = time.time()
    body_text = strip_html(body)

    # Check if the URL body passes the filters: if yes, index it; if no, ignore it
    if Filter.is_valid(body_text, filter_classifiers):
        print("Indexing", url.url)
        resp = indexer.index_document(body, body_text, signal_classifiers, url=url, headers=headers, links=bool(args.save_linkgraph_domains))
    else:
        print("Filtered out", url.url)
        resp = 0

This is the loop which performs the check on each WARC record's body and metadata.

There are two text classification checks here:

1) In Filter (very small training data):

return trained_text_classifiers.classify(body_text) == "True"

2) In index_document (33,000 training examples):

prob_dist = trained_text_classifier.prob_classify(body)

# Return the probability of spam
return round(prob_dist.prob("spam"), 2)

The classify and prob_classify calls are the methods that take a toll on performance.


Answer 1:


You can use feature selection on your data. Good feature selection can remove up to 90% of the features while preserving classification accuracy. In feature selection you keep only the top features (in a Bag-of-Words model, the most influential words) and train the model on those features. This reduces the dimensionality of your data (and also helps avoid the curse of dimensionality). Here is a good survey: Survey on feature selection.

In Brief:

Two feature selection approaches are available: filtering and wrapping.

The filtering approach is mostly based on information theory; search for "mutual information", "chi2", and similar criteria for this type of feature selection.
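As a sketch of the filtering approach with scikit-learn (the documents and labels below are made-up toy data, not the asker's corpus):

```python
# Filter-style feature selection: score each word with chi2 against the
# labels and keep only the highest-scoring ones.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap pills buy now", "meeting agenda attached",
        "win money fast now", "quarterly report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # bag-of-words matrix, one column per word

# Keep only the 3 words with the highest chi2 score against the labels
selector = SelectKBest(chi2, k=3)
X_reduced = selector.fit_transform(X, labels)
print(X.shape, "->", X_reduced.shape)  # far fewer columns to train on
```

On a real corpus you would pick k (or a score threshold) by validating classification performance on held-out data.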

The wrapping approach uses the classification algorithm itself to estimate the most important features: for example, you select a subset of words and evaluate the classification performance (recall, precision).

Some other approaches can be useful as well. LSA and LSI can improve both classification performance and run time: https://en.wikipedia.org/wiki/Latent_semantic_analysis

You can use scikit-learn for feature selection and LSA:

http://scikit-learn.org/stable/modules/feature_selection.html

http://scikit-learn.org/stable/modules/decomposition.html



Source: https://stackoverflow.com/questions/37969425/text-classification-performance
