I'm trying to extract the domain name from a URL. For example, going from:
x <- "https://stackoverflow.com/questions/ask"
to: stackoverflow.com
TLD extraction is not as simple as you might think. There's a nice list of what are deemed "public TLDs", i.e. suffixes that are, effectively, true top-level domains (co.uk, for example, behaves like one even though it has two labels). I work with these every day (mining domains for cybersecurity).
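To make the pitfall concrete, here is a minimal base R sketch comparing a naive "last two labels" split with a suffix-aware lookup. The names naive_domain, suffix_aware_domain and toy_suffixes are hypothetical and exist only for this illustration; toy_suffixes is a hand-picked handful of entries, not the real list, and the sketch ignores the wildcard and exception rules the real list contains.

```r
# Toy illustration only: a hypothetical, hand-picked subset of suffixes,
# NOT the real public suffix list.
toy_suffixes <- c("com", "org", "uk", "co.uk")

# Naive approach: keep the last two dot-separated labels of a hostname.
naive_domain <- function(host) {
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  paste(utils::tail(labels, 2), collapse = ".")
}

# Suffix-aware approach: find the longest known suffix, then keep the
# single label in front of it.
suffix_aware_domain <- function(host, suffixes = toy_suffixes) {
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  n <- length(labels)
  if (n < 2) return(NA_character_)
  # Walk from the longest candidate suffix to the shortest.
  for (i in 2:n) {
    candidate <- paste(labels[i:n], collapse = ".")
    if (candidate %in% suffixes) {
      return(paste(labels[(i - 1):n], collapse = "."))
    }
  }
  NA_character_  # no known suffix matched
}

naive_domain("www.google.com")         # "google.com"
naive_domain("www.bbc.co.uk")          # "co.uk" (a suffix, not a domain)
suffix_aware_domain("www.bbc.co.uk")   # "bbc.co.uk"
```

The naive split only works while every suffix has exactly one label, which is why a maintained suffix list is worth using instead of a regex.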
We've got a tldextract R package (more info here) that does a great job parsing these for further data mining. You can use parse_url from httr to extract the hostname component, then run our tldextract function over it:
library(httr)
library(rvest)
library(tldextract)
# get some URLs - I encourage you to bump up "10" to "100" or more to see how
# tldextract deals with "public TLDs"
pg <- read_html("http://httparchive.org/urls.php?start=1&end=10")
# clean up the output and make it a character list
urls <- pg %>% html_nodes("pre") %>% html_text() %>% strsplit("\n") %>% unlist()
urls <- urls[urls != ""] # that site has a blank first line we don't need
# extract the hostname part
urls <- as.character(unlist(sapply(lapply(urls, parse_url), "[", "hostname")))
urls
## [1] "www.google.com" "www.facebook.com" "www.youtube.com"
## [4] "www.yahoo.com" "www.baidu.com" "www.wikipedia.org"
## [7] "www.amazon.com" "www.twitter.com" "www.qq.com"
## [10] "www.taobao.com"
# extract the TLDs
tlds <- tldextract(urls)
tlds
##                 host subdomain    domain tld
## 1     www.google.com       www    google com
## 2   www.facebook.com       www  facebook com
## 3    www.youtube.com       www   youtube com
## 4      www.yahoo.com       www     yahoo com
## 5      www.baidu.com       www     baidu com
## 6  www.wikipedia.org       www wikipedia org
## 7     www.amazon.com       www    amazon com
## 8    www.twitter.com       www   twitter com
## 9         www.qq.com       www        qq com
## 10    www.taobao.com       www    taobao com
# piece what we need together
sprintf("%s.%s", tlds$domain, tlds$tld)
## [1] "google.com" "facebook.com" "youtube.com" "yahoo.com"
## [5] "baidu.com" "wikipedia.org" "amazon.com" "twitter.com"
## [9] "qq.com" "taobao.com"
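Applied to the single URL from the question, the same steps can be wrapped up roughly as follows. This is a sketch, not part of either package: url_to_domain is a hypothetical helper name, and it assumes httr and tldextract are installed and that tldextract can use the suffix data it ships with.

```r
library(httr)
library(tldextract)

# Hypothetical helper (not provided by httr or tldextract):
# URL -> "domain.tld", using the same steps as above.
url_to_domain <- function(url) {
  host  <- parse_url(url)$hostname   # e.g. "stackoverflow.com"
  parts <- tldextract(host)          # data frame: host, subdomain, domain, tld
  sprintf("%s.%s", parts$domain, parts$tld)
}

x <- "https://stackoverflow.com/questions/ask"
url_to_domain(x)
```

For the question's URL this should return the stackoverflow.com asked for; hosts whose suffix is not in the list would need separate handling.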