Parsing URI parameter and keyword value pairs

情话喂你 2021-01-22 12:45

I would like to parse the parameter and keyword values from URIs/URLs in a text file. Parameters without values should also be included. Python is fine, but I am open to suggestions.

3 Answers
  • 2021-01-22 13:42

    You don't need to dive into the fragile world of regex.

    urlparse.parse_qsl() is the tool for the job, and urllib.quote() helps to escape special characters. These are the Python 2 module names; in Python 3 both functions live in urllib.parse:

    from urllib import quote
    from urlparse import parse_qsl, urlparse
    
    
    with open('links.txt') as f:
        for url in f:
            params = parse_qsl(urlparse(url.strip()).query, keep_blank_values=True)
            for key, value in params:
                print "%s=%s" % (key, quote(value))
    

    Prints:

    date=2012-11-20
    l=user
    x=0
    id=1
    page=http%3A//domain.com/page.html
    unique=123456
    refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob%20test%201.21%20some%26file%3Dname
    text=
    l=adm
    y=5
    id=2
    page=http%3A//support.domain.com/downloads/index.asp
    unique=12345
    view=month
    date=2011-12-10
    

    Hope that helps.
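For Python 3, the same approach might look like the sketch below. The helper name iter_query_pairs and the sample line are assumptions made for illustration (the sample stands in for one line of links.txt, and the bare flag parameter is hypothetical); only parse_qsl, urlparse and quote come from the answer, imported here from urllib.parse.

```python
from urllib.parse import parse_qsl, quote, urlparse


def iter_query_pairs(lines):
    """Yield (key, value) pairs from the query string of each URL line."""
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        # keep_blank_values=True keeps parameters such as "text=" and "flag".
        yield from parse_qsl(urlparse(url).query, keep_blank_values=True)


# Hypothetical sample line standing in for the contents of links.txt.
sample = [
    "www2.domain.edu/folder/page.php?l=user&x=0"
    "&page=http%3A//domain.com/page.html&text=&flag\n"
]

pairs = list(iter_query_pairs(sample))
for key, value in pairs:
    # parse_qsl decodes the values; quote() re-escapes them for printing.
    print("%s=%s" % (key, quote(value)))
```

With keep_blank_values=True, both "text=" and the bare "flag" come back with an empty string as the value, which covers the "parameters without values" requirement.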

  • 2021-01-22 13:42

    I would use a regular expression like this (first code then explanation):

    import re

    # s is the text to scan, e.g. the contents of the links file
    pairs = re.findall(r'(\w+)=(.*?)(?:\n|&)', s, re.S)
    for k, v in pairs:
        print('{0} = {1}'.format(k, v))
    

    The first line is where the action happens. The regular expression finds all occurrences of a word followed by an equal sign and then a string terminated by either an & or a newline character. The returned pairs is a list of tuples, where each tuple contains the word (the keyword) and the value. I didn't capture the = sign; instead I print it in the loop. Note that a final pair followed by neither an & nor a newline is not matched.

    Explaining the regex:

    \w+ means one or more word chars. The parentheses around it mean that it is captured and returned as part of the result.

    = - the equal sign that must follow the word

    .*? - zero or more chars matched in a non-greedy manner, that is, only up to the first newline or & sign, which is what \n|& designates. The (?:...) group is non-capturing, so the \n or & is not returned.

    Since we capture 2 things in the regex - the keyword and everything after the = sign, a list of 2-tuples is returned.

    The re.S flag tells the regex engine to let the match-all character . match the newline character as well, that is, to allow a match to span multiple lines (which is not the default behavior).
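As a quick check of the behavior described above, here is a small self-contained sketch. The sample text s is hypothetical, and the second pattern (with the added $ alternative) is an assumed variant rather than part of the answer; it shows one way to also pick up a last pair that has no trailing newline.

```python
import re

# Hypothetical sample text: two URLs, the second without a trailing newline.
s = (
    "www2.domain.edu/page.php?l=user&x=0&text=\n"
    "www2.domain.edu/page.php?l=adm&y=5"
)

# The pattern from the answer: the value must be followed by "\n" or "&".
pairs = re.findall(r'(\w+)=(.*?)(?:\n|&)', s, re.S)

# Assumed variant: also accept the end of the string as a terminator.
pairs_with_end = re.findall(r'(\w+)=(.*?)(?:\n|&|$)', s, re.S)

for k, v in pairs_with_end:
    print('{0} = {1}'.format(k, v))
```

With this input the original pattern returns the pairs for l, x, text and l, but not y=5, because nothing follows the 5; the variant returns all five.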

  • 2021-01-22 13:45

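    You can use a regular expression to extract the pairs. Note that the pattern below requires a leading &, so the first parameter (l=user, which follows the ?) does not appear in the result.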

    >>> import re
    >>> url = 'www2.domain.edu/folder/folder/page.php?l=user&x=0&id=1&page=http%3A//domain.com/page.html&unique=123456&refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname&text='
    >>> p = re.compile('.*?&(.*?)=(.*?)(?=&|$)')
    >>> m = p.findall(url)
    >>> m
    [('x', '0'), ('id', '1'), ('page', 'http%3A//domain.com/page.html'), ('unique', '123456'), ('refer', 'http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname'), ('text', '')]
    

    You can even use a dict comprehension to package all the data together.

    >>> dic = {k:v for k,v in m}
    >>> dic
    {'text': '', 'page': 'http%3A//domain.com/page.html', 'x': '0', 'unique': '123456', 'id': '1', 'refer': 'http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname'}
    

    And then if all you want to do is print them out:

    >>> for k,v in dic.iteritems():
    ...     print k,'-->',v
    
    text --> 
    page --> http%3A//domain.com/page.html
    x --> 0
    unique --> 123456
    id --> 1
    refer --> http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
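A Python 3 sketch of the same idea follows, using an assumed alternative pattern that accepts either ? or & before the key so that the first parameter is included as well. The shortened URL and the names pattern_all, all_pairs and params are illustrative only; the first pattern is the one from the answer.

```python
import re

# Shortened, hypothetical URL in the same shape as the one in the answer.
url = 'www2.domain.edu/folder/folder/page.php?l=user&x=0&id=1&text='

# Pattern from the answer: every key must be preceded by "&".
p = re.compile('.*?&(.*?)=(.*?)(?=&|$)')
m = p.findall(url)

# Assumed variant: a key may follow "?" or "&"; a value runs up to the next "&".
pattern_all = re.compile(r'[?&]([^=&]+)=([^&]*)')
all_pairs = pattern_all.findall(url)

# dict() on a list of 2-tuples does the same job as the dict comprehension.
params = dict(all_pairs)
for k, v in params.items():  # items() replaces Python 2's iteritems()
    print(k, '-->', v)
```

The variant still expects every key to be followed by an = sign, so a bare parameter with no = at all would need parse_qsl or a further change to the pattern.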
    