Does R have any package for parsing out the parts of a URL?


I have a list of URLs that I would like to parse and normalize.

I'd like to be able to split each address into parts so that I can identify which site an address such as "www.google.com/test/index.asp" belongs to.

6 Answers

    Since parse_url() uses regular expressions anyway, we may as well reinvent the wheel and create a single regular expression replacement in order to build a sweet and fancy gsub call.
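
    For reference, parse_url() itself (assuming it refers to httr::parse_url(), whose protocol regex is borrowed below) already splits a URL into named components. A minimal sketch:

    # assumes the httr package is installed
    library(httr)
    u <- parse_url("http://www.google.com/test/index.asp")
    u$scheme    # "http"
    u$hostname  # "www.google.com"
    u$path      # "test/index.asp"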

    Let's see. A URL consists of a protocol, a "netloc" which may include username, password, hostname and port components, and a remainder which we happily strip away. Let's assume for now that there is no username, no password, and no port.

    • ^(?:(?:[[:alpha:]+.-]+)://)? will match the protocol header (copied from parse_url()); we strip this away if we find it
    • Also, a potential www. prefix is stripped away, but not captured: (?:www\\.)?
    • Anything up to the subsequent slash will be our fully qualified host name, which we capture: ([^/]+)
    • The rest we ignore: .*$

    Now we plug together the regexes above, and the extraction of the hostname becomes:

    PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"
    PREFIX_REGEX <- "(?:www\\.)?"
    HOSTNAME_REGEX <- "([^/]+)"
    REST_REGEX <- ".*$"
    URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)
    domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)
    

    To also handle ports, change the host name regex to match (but not capture) an optional port; a demo follows the snippet:

    HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"
    
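    Rebuilding URL_REGEX with this pattern, the port is stripped as well (a quick sketch; the example URL below is just an illustrative input, not from the question):

    HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"
    URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)
    domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)
    domain.name("http://www.example.com:8080/test")  # "example.com"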

    And so forth and so on, until we finally arrive at an RFC-compliant regular expression for parsing URLs. However, for home use, the above should suffice:

    > domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
                    "http://test.com/?ex"))
    [1] "test.server.com" "google.com"      "test.com"       
    
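    As a taste of the "so forth": an optional user:password@ part could be stripped with one more non-capturing group in front of the host name. A hedged sketch (USERINFO_REGEX is a name introduced here for illustration, using the port-aware HOSTNAME_REGEX from above):

    USERINFO_REGEX <- "(?:[^/@]+@)?"  # user or user:password plus "@"; matched, not captured
    URL_REGEX <- paste0(PROTOCOL_REGEX, USERINFO_REGEX, PREFIX_REGEX,
                        HOSTNAME_REGEX, REST_REGEX)
    domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)
    domain.name("ftp://user:pass@www.example.org:21/pub")  # "example.org"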
