问题
Given website addresses, e.g.
http://www.example.com/page1/#
https://subdomain.example2.co.uk/asdf?retrieve=2
How do I return the root domain in R
, e.g.
example.com
example2.co.uk
For my purposes I would define the root domain to have structure
example_name.public_suffix
where example_name excludes "www" and public_suffix is on the list here:
https://publicsuffix.org/list/effective_tld_names.dat
Is this still the best regex based solution:
https://stackoverflow.com/a/8498629/2109289
What about something in R
that parses root domain based off the public suffix list, something like:
http://simonecarletti.com/code/publicsuffix/
Edited: Adding extra info based on Richard's comment
Using XML::parseURI
seems to return the stuff between the first "//" and "/". e.g.
> parseURI("http://www.blog.omegahat.org:8080/RCurl/index.html")$server
[1] "www.blog.omegahat.org"
Thus, the question reduces to having an R
function that can return the public suffix from the URI, or implementing the following algorithm on the public suffix list:
- Match domain against all rules and take note of the matching ones.
- If no rules match, the prevailing rule is "*".
- If more than one rule matches, the prevailing rule is the one which is an exception rule.
- If there is no matching exception rule, the prevailing rule is the one with the most labels.
- If the prevailing rule is a exception rule, modify it by removing the leftmost label.
- The public suffix is the set of labels from the domain which directly match the labels of the prevailing rule (joined by dots).
- The registered or registrable domain is the public suffix plus one additional label.
回答1:
There are two tasks here. The first is parsing the URL to get the host name, which can be done with the httr package's parse_url
function:
host <- parse_url("https://subdomain.example2.co.uk/asdf?retrieve=2")$hostname
host
# [1] "subdomain.example2.co.uk"
The second is extracting the organizational domain (or root domain, top private domain--whatever you want to call it). This can be done using the tldextract package (which is inspired by the Python package of the same name and uses Mozilla's public suffix list):
domain.info <- tldextract(host)
domain.info
# host subdomain domain tld
# 1 subdomain.example2.co.uk subdomain example2 co.uk
tldextract
returns a data frame, with a row for each domain you give it, but you can easily paste together the relevant parts:
paste(domain.info$domain, domain.info$tld, sep=".")
# [1] "example2.co.uk"
回答2:
Somthing lik this should help
> strsplit(gsub("http://|https://|www\\.", "", "http://www.example.com/page1/#"), "/")[[c(1, 1)]]
[1] "example.com"
> strsplit(gsub("http://|https://|www\\.", "", "https://subdomain.example2.co.uk/asdf?retrieve=2"), "/")[[c(1, 1)]]
[1] "subdomain.example2.co.uk"
来源:https://stackoverflow.com/questions/26291025/return-root-domain-from-url-in-r