I have just started working on a classification problem. It's a two-class problem: my trained machine learning model will have to predict whether to allow a URL or block i
I assume you do not have access to the content behind the URL, so you can only extract features from the URL string itself. If you do have access, it makes more sense to use the content.
Here are some features I would try. See this paper for more ideas:
All URL components. For example, this page has the following URL:
http://stackoverflow.com/questions/26456904/how-to-classify-urls-what-are-urls-features-how-to-select-and-extract-features
Tokens that occur in different parts of a URL should carry different weight in the classification. In this case, the last part, once tokenized, contributes the most useful features for this page (e.g., classify, urls, select, extract, features). The components of this URL are:
* stackoverflow
* com
* questions
* 26456904
* how to classify urls what are urls features how to select and extract features
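
The component split above can be sketched in a few lines of Python. This is only an illustration, not a definitive feature extractor: the function name `url_tokens` and the `"host"`/`"path"` labels are my own choices, and it assumes that splitting the host on dots and the path on non-alphanumeric characters is good enough for your data. Each token is tagged with the part of the URL it came from, so a classifier can weight the same word differently depending on where it occurs.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Return (component, token) pairs extracted from a URL string.

    Hypothetical helper: the component labels "host" and "path" are
    illustrative names, not part of any library.
    """
    parsed = urlparse(url)
    features = []

    # Host: hostname is already lowercased by urlparse; split on dots
    # and drop a leading "www" if present.
    for tok in (parsed.hostname or "").split("."):
        if tok and tok != "www":
            features.append(("host", tok))

    # Path: split on "/" first, then break each segment on any run of
    # non-alphanumeric characters (hyphens, underscores, dots, ...).
    for segment in parsed.path.lower().split("/"):
        for tok in re.split(r"[^a-z0-9]+", segment):
            if tok:
                features.append(("path", tok))

    return features


url = ("http://stackoverflow.com/questions/26456904/"
       "how-to-classify-urls-what-are-urls-features-"
       "how-to-select-and-extract-features")
tokens = url_tokens(url)
for component, token in tokens:
    print(component, token)
```

The resulting pairs can be joined into strings such as `path=classify` and fed to a bag-of-words vectorizer, which keeps host tokens and path tokens as separate features.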