Domain name regex

孤街浪徒 2021-01-21 19:15

I'm trying to extract the domain name from a URL. For example, from:

x <- "https://stackoverflow.com/questions/ask"

I want to get: stackoverflow.com

3 answers
  •  旧时难觅i
    2021-01-21 19:50

    Extracting a TLD is not as simple as you might think. There's a maintained list of what are deemed "public TLDs", i.e. the suffixes that are, effectively, true top-level domains (co.uk as well as com, for example). I work with these every day (mining domains for cybersecurity).

    We've got a tldextract R package (more info here) that does a great job parsing these for further data mining. You can use parse_url from httr to extract the hostname component, then run our tldextract function over it:

    library(httr)
    library(rvest)
    library(tldextract)
    
    # get some URLs - I encourage you to bump up "10" to "100" or more to see how
    # tldextract deals with "public TLDs"
    pg <- read_html("http://httparchive.org/urls.php?start=1&end=10")
    
    # clean up the <pre> output and make it a character vector
    urls <- pg %>% html_nodes("pre") %>% html_text() %>% strsplit("\n") %>% unlist
    urls <- urls[urls != ""] # that site has a blank first line we don't need
    
    # extract the hostname part
    urls <- as.character(unlist(sapply(lapply(urls, parse_url), "[", "hostname")))
    urls
    
    ##  [1] "www.google.com"    "www.facebook.com"  "www.youtube.com"  
    ##  [4] "www.yahoo.com"     "www.baidu.com"     "www.wikipedia.org"
    ##  [7] "www.amazon.com"    "www.twitter.com"   "www.qq.com"       
    ## [10] "www.taobao.com"
    
    # extract the TLDs
    tlds <- tldextract(urls)
    tlds
    
    ##                 host subdomain    domain tld
    ## 1     www.google.com       www    google com
    ## 2   www.facebook.com       www  facebook com
    ## 3    www.youtube.com       www   youtube com
    ## 4      www.yahoo.com       www     yahoo com
    ## 5      www.baidu.com       www     baidu com
    ## 6  www.wikipedia.org       www wikipedia org
    ## 7     www.amazon.com       www    amazon com
    ## 8    www.twitter.com       www   twitter com
    ## 9         www.qq.com       www        qq com
    ## 10    www.taobao.com       www    taobao com
    
    # piece together what we need: domain plus TLD
    sprintf("%s.%s", tlds$domain, tlds$tld)
    
    ##  [1] "google.com"    "facebook.com"  "youtube.com"   "yahoo.com"    
    ##  [5] "baidu.com"     "wikipedia.org" "amazon.com"    "twitter.com"  
    ##  [9] "qq.com"        "taobao.com"
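    
    To see why a plain regex falls short, here is a minimal base-R sketch (no packages needed). The helper names extract_host and naive_domain are made up for this illustration and are not part of httr or tldextract; the URL is the one from the question. It assumes well-formed http(s)-style URLs and simply keeps the last two labels of the hostname:

    ```r
    # Hypothetical helper: pull the hostname out of a URL with regular expressions.
    extract_host <- function(url) {
      host <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)  # drop the scheme, e.g. "https://"
      host <- sub("[/?#].*$", "", host)                    # drop path, query and fragment
      host <- sub("^.*@", "", host)                        # drop any "user:pass@" part
      host <- sub(":[0-9]+$", "", host)                    # drop a trailing port
      tolower(host)
    }

    # Hypothetical helper: naive "domain" = last two dot-separated labels.
    # This is an assumption that only holds for single-label TLDs such as com or org.
    naive_domain <- function(host) {
      labels <- strsplit(host, ".", fixed = TRUE)
      vapply(labels, function(p) paste(tail(p, 2), collapse = "."), character(1))
    }

    x <- "https://stackoverflow.com/questions/ask"
    naive_domain(extract_host(x))
    ## [1] "stackoverflow.com"

    naive_domain("www.google.com")
    ## [1] "google.com"

    naive_domain("www.bbc.co.uk")
    ## [1] "co.uk"
    ```

    The last result is the reason for the public-TLD list: under co.uk the domain you want is bbc.co.uk, and a last-two-labels rule has no way of knowing that.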
    
