Best ways of parsing a URL using C?

死守一世寂寞 2020-11-27 15:36

I have a URL like this:

http://192.168.0.1:8080/servlet/rece

I want to parse the URL to get the values:

IP: 192.168.0.1
Por         


        
10 answers
  • 2020-11-27 15:41

    I wrote this

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    typedef struct
    {
        /* Each field is a separately malloc'ed string that the caller frees. */
        char* protocol;
        char* site;
        char* port;
        char* path;
    } URL_INFO;
    /* Copy the n bytes starting at s into a new NUL-terminated string. */
    static char* copy_range(const char* s, size_t n)
    {
        char* out = (char*)malloc(n + 1);
        if (out)
        {
            memcpy(out, s, n);
            out[n] = '\0';
        }
        return out;
    }
    URL_INFO* split_url(URL_INFO* info, const char* url)
    {
        if (!info || !url)
            return NULL;
        /* The protocol is everything before "://", when that separator exists. */
        const char* sep = strstr(url, "://");
        const char* site = sep ? sep + 3 : url;
        /* The site and the optional port end at the first '/'. */
        size_t site_port_len = strcspn(site, "/");
        const char* path = site + site_port_len;
        /* A ':' before the path introduces the port. */
        const char* colon = (const char*)memchr(site, ':', site_port_len);
        if (colon)
        {
            info->site = copy_range(site, (size_t)(colon - site));
            info->port = copy_range(colon + 1, (size_t)(path - colon - 1));
        }
        else
        {
            info->site = copy_range(site, site_port_len);
            info->port = copy_range("80", 2);
        }
        info->path = *path ? copy_range(path, strlen(path)) : copy_range("/", 1);
        if (!info->site || !info->port || !info->path)
            return NULL;
        /* Without a protocol, assume http on port 80 and tcp on any other port. */
        const char* protocol = strcmp(info->port, "80") == 0 ? "http" : "tcp";
        size_t protocol_len = strlen(protocol);
        if (sep)
        {
            protocol = url;
            protocol_len = (size_t)(sep - url);
        }
        info->protocol = copy_range(protocol, protocol_len);
        return info->protocol ? info : NULL;
    }
    

    Test

    int main()
    {
        URL_INFO info;
        split_url(&info, "ftp://192.168.0.1:8080/servlet/rece");
        printf("Protocol: %s\nSite: %s\nPort: %s\nPath: %s\n", info.protocol, info.site, info.port, info.path);
        return 0;
    }
    

    Output

    Protocol: ftp
    Site: 192.168.0.1
    Port: 8080
    Path: /servlet/rece
    
  • 2020-11-27 15:44

    Write a custom parser or use one of the string replace functions to replace the separator ':' and then use sscanf().
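
One way to read this suggestion is sketched below. C has no standard string-replace function, so the sketch uses a small loop to turn every ':' and '/' in the scheme and host part into a space and then lets sscanf() split on whitespace. The helper name parse_by_replacing and the buffer sizes are assumptions made for illustration, not part of the answer.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the answer above): copies url, blanks out
   every ':' and '/' before the path, then lets sscanf() split on whitespace.
   Returns 1 on success and 0 if the URL lacks a scheme, port, or path. */
int parse_by_replacing(const char *url, char scheme[16], char host[64],
                       int *port, char path[128])
{
    char buf[256];
    size_t len = strlen(url);
    if (len >= sizeof buf)
        return 0;                      /* too long for the scratch buffer */
    memcpy(buf, url, len + 1);

    /* Find where the path starts so its own '/' characters survive. */
    char *authority = strstr(buf, "://");
    if (!authority)
        return 0;
    authority += 3;
    char *slash = strchr(authority, '/');
    if (!slash)
        return 0;
    snprintf(path, 128, "%s", slash);  /* keep the path, leading '/' included */
    *slash = '\0';

    /* Replace the remaining separators with spaces. */
    for (char *p = buf; *p; ++p)
        if (*p == ':' || *p == '/')
            *p = ' ';

    /* buf now looks like "http   192.168.0.1 8080". */
    return sscanf(buf, "%15s %63s %5d", scheme, host, port) == 3;
}
```

The path is copied out before the replacement because it contains '/' characters of its own. A URL without an explicit port makes the final sscanf() match only two fields, so the helper reports failure rather than guessing a default.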

  • 2020-11-27 15:47

    I wrote a simple program using sscanf that can parse very basic URLs.

    #include <stdio.h>
    
    int main(void)
    {
        const char text[] = "http://192.168.0.2:8888/servlet/rece";
        char ip[100];
        int port = 80;
        char page[100];
        /* sscanf returns the number of fields it stored; all three must match. */
        if (sscanf(text, "http://%99[^:]:%5d/%99[^\n]", ip, &port, page) != 3)
        {
            fprintf(stderr, "could not parse \"%s\"\n", text);
            return 1;
        }
        printf("ip = \"%s\"\n", ip);
        printf("port = \"%d\"\n", port);
        printf("page = \"%s\"\n", page);
        return 0;
    }
    
    ./urlparse
    ip = "192.168.0.2"
    port = "8888"
    page = "servlet/rece"
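
The format string above only matches when a port is present. A minimal sketch of one way to make the port optional follows: try the pattern with a port first, then fall back to a pattern without one. The helper name parse_http_url and the default port of 80 are assumptions made for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: parses "http://host[:port]/page".
   Returns 1 on success and 0 otherwise; *port is 80 when no port is given. */
int parse_http_url(const char *text, char host[100], int *port, char page[100])
{
    /* First try the form with an explicit port. Excluding '/' from the host
       scanset keeps the host from swallowing the path when no port exists. */
    if (sscanf(text, "http://%99[^:/]:%5d/%99[^\n]", host, port, page) == 3)
        return 1;

    /* Fall back to the form without a port. */
    *port = 80;
    return sscanf(text, "http://%99[^:/]/%99[^\n]", host, page) == 2;
}
```

Both patterns require a '/' after the host or port, so a URL with no path at all, such as "http://example.com:8080", is rejected by this sketch.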
    
  • 2020-11-27 15:50

    Pure sscanf() based solution:

    //Code
    #include <stdio.h>
    
    int
    main (void)
    {
        const char *uri = "http://192.168.0.1:8080/servlet/rece";
        /* The field widths in the format string keep sscanf() inside these buffers. */
        char ip_addr[64], path[100];
        int port;
        
        int uri_scan_status = sscanf(uri, "%*[^:]%*[:/]%63[^:]:%d%99s", ip_addr, &port, path);
        
        printf("[info] URI scan status : %d\n", uri_scan_status);
        if( uri_scan_status == 3 )
        {   
            printf("[info] IP Address : '%s'\n", ip_addr);
            printf("[info] Port: '%d'\n", port);
            printf("[info] Path : '%s'\n", path);
        }
        
        return 0;
    }
    
    

    However, keep in mind that this solution is tailor-made for URIs of the form [protocol_name]://[ip_address]:[port][/path]. To learn more about the components of URI syntax, see RFC 3986.

    Now let's break down the format string: "%*[^:]%*[:/]%63[^:]:%d%99s"

    • %*[^:] skips the protocol/scheme (e.g. http, https, ftp).

      It matches from the start of the string up to, but not including, the first : character. The * right after the % tells sscanf() to discard the match instead of storing it.

    • %*[:/] skips the separator that sits between the protocol and the IP address, i.e. ://

    • %63[^:] captures the text after the separator up to the next :, which is the IP address. The width 63 keeps the match within the 64-byte ip_addr buffer.

    • :%d matches the : that ended the IP address and reads the number right after it, which is the port number.

    • %99s captures the rest of the string up to the first whitespace, which is the path of the resource. The width 99 keeps it within the path buffer.
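
As a small illustration of the assignment-suppression point, the sketch below drops the * from the first conversion so that the scheme is stored instead of skipped. The helper name scan_uri, the buffer sizes, and the field widths are assumptions added for this example; the widths are there so that long input cannot overflow the buffers.

```c
#include <stdio.h>

/* Hypothetical variant of the format string above: without the '*',
   %15[^:] stores the scheme instead of discarding it.
   Returns 1 when scheme, host, port and path were all read, else 0. */
int scan_uri(const char *uri, char scheme[16], char host[64],
             int *port, char path[100])
{
    /* %*[:/] is still suppressed, so it does not count toward the result. */
    int n = sscanf(uri, "%15[^:]%*[:/]%63[^:/]:%5d%99s",
                   scheme, host, port, path);
    return n == 4;
}
```

Suppressed conversions are not included in the value sscanf() returns, which is why four stored fields are expected here even though the format contains five conversions.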

  • 2020-11-27 15:53

    libcurl now has a curl_url_get() function that can extract the host, path, and other parts of a URL.

    Example code: https://curl.haxx.se/libcurl/c/parseurl.html

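    /* h is a CURLU handle created with curl_url() that already holds the URL,
       stored with curl_url_set(h, CURLUPART_URL, url, 0) */
    char *host;
    /* extract host name from the parsed URL */
    CURLUcode uc = curl_url_get(h, CURLUPART_HOST, &host, 0);
    if(!uc) {
      printf("Host name: %s\n", host);
      curl_free(host);
    }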
    
  • 2020-11-27 15:55

    This may be late, but what I have used is the http_parser_parse_url() function, together with the macros it requires, extracted from the Joyent HTTP parser library. That worked well and came to about 600 lines of code.
