Does R have any package for parsing out the parts of a URL?

无人及你 2020-12-30 06:32

I have a list of URLs that I would like to parse and normalize.

I'd like to be able to split each address into parts so that I can identify "www.google.com/test/index.asp" and "google.com/somethingelse" as coming from the same site.

6 Answers
  • 2020-12-30 06:32

If you like tldextract, one option is to use the version hosted on App Engine:

    require(RJSONIO)
    test <- c("test.server.com/test", "www.google.com/test/index.asp", "http://test.com/?ex")
    lapply(paste0("http://tldextract.appspot.com/api/extract?url=", test), fromJSON)
    [[1]]
       domain subdomain       tld 
     "server"    "test"     "com" 
    
    [[2]]
       domain subdomain       tld 
     "google"     "www"     "com" 
    
    [[3]]
       domain subdomain       tld 
       "test"        ""     "com" 
    
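    If you want the pieces side by side in one table, a small sketch (reusing the test vector above; the row-binding step is my addition, not part of the service's API):

    # Bind the per-URL results from the hosted service into one data frame
    parts <- lapply(paste0("http://tldextract.appspot.com/api/extract?url=", test),
                    fromJSON)
    do.call(rbind, lapply(parts, function(p) as.data.frame(as.list(p))))
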
  • 2020-12-30 06:48

    Building upon R_Newbie's answer, here's a function that will extract the server name from a (vector of) URLs, stripping away a www. prefix if it exists, and gracefully ignoring a missing protocol prefix.

    domain.name <- function(urls) {
        require(httr)
        require(plyr)
        # Rebuild "hostname/path" for each URL; for scheme-less URLs,
        # parse_url() leaves hostname empty and puts everything in path.
        paths <- laply(urls, function(u) with(parse_url(u),
                                              paste0(hostname, "/", path)))
        # Drop an optional leading slash and "www.", keep up to the first "/".
        gsub("^/?(?:www\\.)?([^/]+).*$", "\\1", paths)
    }
    

    The parse_url function is used to extract the path argument, which is further processed by gsub. The /? and (?:www\.)? parts of the regular expression match an optional leading slash followed by an optional www., and the [^/]+ matches everything after that but before the first slash; this is captured and used as the replacement text in the gsub call.
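    To see the intermediate step, here is what the hostname/path reconstruction yields for a scheme-less URL (a quick illustration: without a protocol, parse_url() leaves hostname empty and puts everything into path, which is why the pattern allows a leading slash):

    library(httr)
    with(parse_url("www.google.com/test/index.asp"),
         paste0(hostname, "/", path))
    ## [1] "/www.google.com/test/index.asp"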

    > domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
                    "http://test.com/?ex"))
    [1] "test.server.com" "google.com"      "test.com"       
    
  • 2020-12-30 06:55

    Since parse_url() uses regular expressions anyway, we may as well reinvent the wheel and condense the whole thing into a single, sweet and fancy gsub call.

    Let's see. A URL consists of a protocol, a "netloc" (which may include username, password, hostname, and port components), and a remainder which we happily strip away. Let's assume for now that there is no username, password, or port.

    • ^(?:(?:[[:alpha:]+.-]+)://)? will match the protocol header (copied from parse_url()); we strip it away if present
    • Also, a potential www. prefix is stripped away, but not captured: (?:www\\.)?
    • Anything up to the subsequent slash will be our fully qualified host name, which we capture: ([^/]+)
    • The rest we ignore: .*$

    Now we plug together the regexes above, and the extraction of the hostname becomes:

    PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"
    PREFIX_REGEX <- "(?:www\\.)?"
    HOSTNAME_REGEX <- "([^/]+)"
    REST_REGEX <- ".*$"
    URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)
    domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)
    

    Change the host name regex to include (but not capture) the port:

    HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"
    
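    Rebuilding URL_REGEX with the port-aware host pattern, a quick check (the example URL is made up):

    URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)
    domain.name("http://www.example.com:8080/index.html")
    ## [1] "example.com"
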

    And so forth and so on, until we finally arrive at an RFC-compliant regular expression for parsing URLs. However, for home use, the above should suffice:

    > domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
                    "http://test.com/?ex"))
    [1] "test.server.com" "google.com"      "test.com"       
    
  • 2020-12-30 06:56

    I'd forgo a package and use regex for this.

    EDIT: reformulated after the robot attack from Dason...

    x <- c("talkstats.com", "www.google.com/test/index.asp", 
        "google.com/somethingelse", "www.stackoverflow.com",
        "http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-1??0&sp=-1&sk=")
    
    # Strip "http://", keep the part before the first "/", then drop "www."
    parser <- function(x) gsub("www\\.", "", sapply(strsplit(gsub("http://", "", x), "/"), "[[", 1))
    parser(x)
    
    # Group the original URLs by their parsed domain
    lst <- lapply(unique(parser(x)), function(var) x[parser(x) %in% var])
    names(lst) <- unique(parser(x))
    lst
    
    ## $talkstats.com
    ## [1] "talkstats.com"
    ## 
    ## $google.com
    ## [1] "www.google.com/test/index.asp" "google.com/somethingelse"     
    ## 
    ## $stackoverflow.com
    ## [1] "www.stackoverflow.com"
    ## 
    ## $bing.com
    ## [1] "http://www.bing.com/search?q=google.com&go=&qs=n&form=QBLH&pq=google.com&sc=8-1??0&sp=-1&sk="
    

    This may need to be extended depending on the structure of the data.
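    For example, to also cope with https:// and other schemes, one possible extension (my sketch, not part of the original) strips any protocol prefix instead of the literal "http://":

    # Sketch: like parser(), but removes any scheme such as https:// or ftp://
    parser2 <- function(x) {
        gsub("www\\.", "",
             sapply(strsplit(gsub("^[[:alpha:]]+://", "", x), "/"), "[[", 1))
    }
    parser2("https://www.google.com/test/index.asp")
    ## [1] "google.com"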

  • 2020-12-30 06:59

    You can use the parse_url() function from the R package httr:

    library(httr)
    parse_url("http://google.com/")
    
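    The result comes back as a named list, so individual components are easy to pull out (a minimal sketch):

    library(httr)
    u <- parse_url("http://google.com/")
    u$scheme    # "http"
    u$hostname  # "google.com"
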

    You can get more details here: http://cran.r-project.org/web/packages/httr/httr.pdf

  • 2020-12-30 06:59

    There's also the urltools package now, which is considerably faster:

    urltools::url_parse(c("www.google.com/test/index.asp", 
                          "google.com/somethingelse"))
    
    ##   scheme         domain port           path parameter fragment
    ## 1        www.google.com      test/index.asp                   
    ## 2            google.com       somethingelse                   
    
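    If you only need the host part, urltools also ships component accessors such as domain() (a small sketch, using the same vector as above):

    library(urltools)
    urls <- c("www.google.com/test/index.asp", "google.com/somethingelse")
    domain(urls)
    ## [1] "www.google.com" "google.com"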