How to classify URLs? What are URL features? How to select and extract features from a URL?

天涯浪人 2021-02-02 03:32

I have just started working on a classification problem. It's a two-class problem: my trained machine learning model will have to predict whether to allow a URL or block i

1 Answer
  • 2021-02-02 04:14

    I assume you do not have access to the content behind the URL, so you can only extract features from the URL string itself. Otherwise, it makes more sense to use the content of the page.

    Here are some features I would try. See this paper for more ideas:

    1. All URL components. For example, this page has the following URL:

      http://stackoverflow.com/questions/26456904/how-to-classify-urls-what-are-urls-features-how-to-select-and-extract-features

    Tokens that occur in different parts of a URL should carry different weight in the classification. In this case, the last part, once tokenized, contributes the most useful features for this page (e.g., classify, urls, select, extract, features).

     * stackoverflow
     * com
     * questions
     * 26456904
     * how to classify urls what are urls features how to select and extract features
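
The component split above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the function name `url_tokens` and the choice to tag each token with the component it came from (host, path, query) are assumptions, made so that a model can weight the same word differently depending on where it appears.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL into lowercase tokens, each tagged with its component.

    Hypothetical helper: the (component, token) tagging scheme is an
    assumption for illustration, not something the answer prescribes.
    """
    parts = urlsplit(url)
    tokens = []

    # Host: split on dots, e.g. "stackoverflow.com" -> "stackoverflow", "com".
    for tok in (parts.hostname or "").split("."):
        if tok:
            tokens.append(("host", tok))

    # Path and query: split on any run of non-alphanumeric characters.
    for name, text in (("path", parts.path), ("query", parts.query)):
        for tok in re.split(r"[^A-Za-z0-9]+", text):
            if tok:
                tokens.append((name, tok.lower()))

    return tokens


if __name__ == "__main__":
    example = (
        "http://stackoverflow.com/questions/26456904/"
        "how-to-classify-urls-what-are-urls-features-"
        "how-to-select-and-extract-features"
    )
    for component, token in url_tokens(example):
        print(component, token)
```

The tagged tokens can then be turned into bag-of-words style features, for instance by joining each pair into a single string such as `path=classify`.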
    
    2. The length of the URL.
    3. n-grams (2-grams as examples below)
      • stackoverflow-com
      • com-questions
      • questions-26456904
      • 26456904-how
      • how-to
      • ....
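
A minimal sketch of the length and n-gram features, assuming the URL has already been split into a flat list of tokens. The helper name `token_ngrams` and the hyphen used to join tokens are illustrative choices that mirror the examples above; they are not part of any library.

```python
def token_ngrams(tokens, n=2):
    """Return hyphen-joined n-grams over a list of URL tokens.

    Hypothetical helper; yields an empty list when there are
    fewer than n tokens.
    """
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    url = (
        "http://stackoverflow.com/questions/26456904/"
        "how-to-classify-urls-what-are-urls-features-"
        "how-to-select-and-extract-features"
    )
    # Length feature: the number of characters in the raw URL string.
    print("length:", len(url))

    # First few tokens of the example URL, as listed in the answer.
    tokens = ["stackoverflow", "com", "questions", "26456904", "how", "to"]
    for gram in token_ngrams(tokens, n=2):
        print(gram)
```

Larger values of `n` capture longer phrases but produce sparser features, so 2-grams or 3-grams are a reasonable place to start.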