Different behavior between re.finditer and re.findall

逝去的感伤 2021-01-31 04:59

I am using the following code:

CARRIS_REGEX=r'<th>(\d+)</th><th>([\s\w\.\-]+)</th><th>(\d+:\d+)</th><th>(\d+m)</th>'


        
4 Answers
  • 2021-01-31 05:29

    I can't reproduce this here. I have tried it with both Python 2.7 and 3.1.

    One difference between finditer and findall is that the former returns an iterator of regex match objects, whereas the latter returns a list: a list of tuples of the captured groups when the pattern has more than one capturing group, a list of strings for that group when it has exactly one, and a list of the entire matches when it has none.

    So

    import re
    CARRIS_REGEX=r'<th>(\d+)</th><th>([\s\w\.\-]+)</th><th>(\d+:\d+)</th><th>(\d+m)</th>'
    pattern = re.compile(CARRIS_REGEX, re.UNICODE)
    mailbody = open("test.txt").read()
    for match in pattern.finditer(mailbody):
        print(match)
    print()
    for match in pattern.findall(mailbody):
        print(match)
    

    prints

    <_sre.SRE_Match object at 0x00A63758>
    <_sre.SRE_Match object at 0x00A63F98>
    <_sre.SRE_Match object at 0x00A63758>
    <_sre.SRE_Match object at 0x00A63F98>
    <_sre.SRE_Match object at 0x00A63758>
    <_sre.SRE_Match object at 0x00A63F98>
    <_sre.SRE_Match object at 0x00A63758>
    <_sre.SRE_Match object at 0x00A63F98>
    
    ('790', 'PR. REAL', '21:06', '04m')
    ('758', 'PORTAS BENFICA', '21:10', '09m')
    ('790', 'PR. REAL', '21:14', '13m')
    ('758', 'PORTAS BENFICA', '21:21', '19m')
    ('790', 'PR. REAL', '21:29', '28m')
    ('758', 'PORTAS BENFICA', '21:38', '36m')
    ('758', 'SETE RIOS', '21:49', '47m')
    ('758', 'SETE RIOS', '22:09', '68m')
    

    If you want the same output from finditer as you're getting from findall, you need

    for match in pattern.finditer(mailbody):
        print(match.groups())  # groups() already returns a tuple
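
The shapes described above can be checked with a small sketch. The `sample` string is a hypothetical stand-in for the contents of test.txt (two rows taken from the output shown earlier), and the two shorter patterns are made up for illustration only.

```python
import re

# Hypothetical sample standing in for the contents of test.txt.
sample = (
    "<th>790</th><th>PR. REAL</th><th>21:06</th><th>04m</th>"
    "<th>758</th><th>PORTAS BENFICA</th><th>21:10</th><th>09m</th>"
)

# Illustrative patterns with zero, one, and four capturing groups.
no_groups = re.compile(r"<th>\d+:\d+</th>")
one_group = re.compile(r"<th>(\d+:\d+)</th>")
many_groups = re.compile(
    r"<th>(\d+)</th><th>([\s\w.\-]+)</th><th>(\d+:\d+)</th><th>(\d+m)</th>"
)

whole_matches = no_groups.findall(sample)   # list of full-match strings
single_group = one_group.findall(sample)    # list of strings for group 1
group_tuples = many_groups.findall(sample)  # list of tuples, one per match

# finditer yields match objects; groups() recovers the same tuples.
from_finditer = [m.groups() for m in many_groups.finditer(sample)]

print(whole_matches)
print(single_group)
print(group_tuples)
print(from_finditer == group_tuples)
```

With more than one group, `findall` and `[m.groups() for m in finditer(...)]` should give the same tuples; the difference is only that `finditer` keeps the full match object around.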
    
  • 2021-01-31 05:38

    I took this example from the "Regular expression operations" page of the Python 2.* documentation and describe it here in detail, with some modifications. To explain the whole example, let's start with a string variable,

    text = "He was carefully disguised but captured quickly by police."
    

    and a compiled regular expression pattern:

    regEX = r"\w+ly"
    pattern = re.compile(regEX)
    

    \w matches any word character (alphanumeric or underscore) and + matches one or more of the preceding token, so the whole pattern selects any run of word characters that ends in ly. Only two words in the text ('carefully' and 'quickly') satisfy this regular expression.

    Before moving on to re.findall() and re.finditer(), let's see what the Python 2.* documentation says about re.search().

    Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

    The following lines give a basic understanding of re.search().

    search = pattern.search(text)
    print(search)
    print(type(search))
    
    #output
    <re.Match object; span=(7, 16), match='carefully'>
    <class 're.Match'>
    

    This returns a match object (called MatchObject in the Python 2.* documentation and shown as re.Match in the output above), which supports 13 methods and attributes according to the Python 2.* documentation. Its span() method returns the start and end positions of the matched word in text (7 and 16 in the example above). re.search() considers only the very first match; if there is no match at all, it returns None.

    Before turning to the question itself, let's see what the Python 2.* documentation says about re.finditer().
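
As a small sketch of the match object's accessors, the snippet below reads the matched text and its positions, and shows the None result when nothing matches. The `\d+` pattern is an assumption added only to produce a non-matching case.

```python
import re

text = "He was carefully disguised but captured quickly by police."
pattern = re.compile(r"\w+ly")

first = pattern.search(text)
matched_word = first.group()    # the text of the whole match
start, end = first.span()       # same values as first.start(), first.end()
same_slice = text[start:end]    # slicing with the span gives the match back

# Illustrative non-matching pattern: the text contains no digits.
missing = re.search(r"\d+", text)

print(matched_word, start, end, same_slice)
print(missing)
```

Because `search()` can return None, code that calls `.group()` on its result usually checks for None first.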

    Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

    The next lines give a basic understanding of re.finditer().

    finditer = pattern.finditer(text)
    print(finditer)
    print(type(finditer))
    
    #output
    <callable_iterator object at 0x040BB690>
    <class 'callable_iterator'>
    

    The example above gives us an iterator object, which has to be looped over. This is obviously not yet the result we want, so let's loop over finditer and see what is inside the iterator.

    for anObject in finditer:
        print(anObject)
        print(type(anObject))
        print()
    
    #output
    <re.Match object; span=(7, 16), match='carefully'>
    <class 're.Match'>
    
    <re.Match object; span=(40, 47), match='quickly'>
    <class 're.Match'>
    

    These results are very similar to the re.search() result we got earlier, but the output now contains a second match, <re.Match object; span=(40, 47), match='quickly'>. As the Python 2.* documentation quoted above says, re.search() scans the string for the first location where the pattern produces a match, whereas re.finditer() scans the string for all locations where the pattern matches, and its match objects carry more detail than the plain strings returned by re.findall().

    Here is what the Python 2.* documentation says about re.findall().
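
To make the "more detail" concrete, here is a minimal sketch that collects each matched word together with its position from the iterator. The `details` name is just an illustrative choice.

```python
import re

text = "He was carefully disguised but captured quickly by police."
pattern = re.compile(r"\w+ly")

# Each match object knows both what matched and where it matched.
details = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]

for word, start, end in details:
    print(f"{word!r} found at {start}-{end}")
```

The positions are what re.findall() cannot give you, since it returns only the matched strings.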

    Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

    Let's see what happens with re.findall().

    findall = pattern.findall(text)
    print(findall)
    print(type(findall))
    
    #output
    ['carefully', 'quickly']
    <class 'list'>
    

    The output contains only the matched words from text, as plain strings; if nothing matches, findall() returns an empty list. Each string in the list is the same text that appears as match='...' in the corresponding match object, which is what its group() method returns.

    Here is the full code, which I tried in Python 3.7.
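
A short sketch of that relationship, assuming a pattern without capturing groups: the findall() list equals the group() values collected from finditer(). The `no_match_text` string is a made-up input used to show the no-match behavior of all three calls.

```python
import re

text = "He was carefully disguised but captured quickly by police."
pattern = re.compile(r"\w+ly")

words_from_findall = pattern.findall(text)
words_from_finditer = [m.group() for m in pattern.finditer(text)]

# Made-up input with no word ending in "ly".
no_match_text = "No adverbs here."
empty_list = pattern.findall(no_match_text)          # empty list
empty_iter = list(pattern.finditer(no_match_text))   # iterator yields nothing
no_search = pattern.search(no_match_text)            # None

print(words_from_findall == words_from_finditer)
print(empty_list, empty_iter, no_search)
```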

    import re
    
    text = "He was carefully disguised but captured quickly by police."
    
    regEX = r"\w+ly"
    pattern = re.compile(regEX)
    
    search = pattern.search(text)
    print(search)
    print(type(search))
    print()
    
    findall = pattern.findall(text)
    print(findall)
    print(type(findall))
    print()
    
    finditer = pattern.finditer(text)
    print(finditer)
    print(type(finditer))
    print()
    for anObject in finditer:
        print(anObject)
        print(type(anObject))
        print()
    
  • 2021-01-31 05:50

    You can't make them behave the same way, because they're different. If you really want to create a list of results from finditer, then you could use a list comprehension:

    >>> [match for match in pattern.finditer(mailbody)]
    [...]
    

    In general, use a for loop to access the matches returned by re.finditer:

    >>> for match in pattern.finditer(mailbody):
    ...     ...
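
One practical consequence, sketched below: the iterator returned by finditer can be consumed only once, so store the matches in a list if you need them again. Since mailbody is not available here, the sketch assumes the example sentence and pattern from the answer above.

```python
import re

# Assumed stand-ins for mailbody and pattern.
text = "He was carefully disguised but captured quickly by police."
pattern = re.compile(r"\w+ly")

matches = pattern.finditer(text)
first_pass = [m.group() for m in matches]
second_pass = [m.group() for m in matches]  # iterator is already exhausted

# list() keeps the match objects so they can be reused and indexed.
stored = list(pattern.finditer(text))
last_word = stored[-1].group()

print(first_pass, second_pass, last_word)
```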
    
  • 2021-01-31 05:52

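    re.findall(pattern, string)

    findall() returns all non-overlapping matches of pattern in string as a list of strings (or a list of tuples if the pattern has more than one capturing group).

    re.finditer(pattern, string)

    finditer() returns an iterator that yields a match object for each non-overlapping match. In CPython its type is named callable_iterator, but the object is an iterator, not something you call.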

    In both functions, the string is scanned from left to right and matches are returned in order found.
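
A minimal check of the two return types, assuming the example sentence and pattern used in the earlier answer. It also shows that the iterator is lazy: next() fetches one match without building the whole list.

```python
import re
from collections.abc import Iterator

# Assumed example data, borrowed from the earlier answer.
text = "He was carefully disguised but captured quickly by police."
pattern = re.compile(r"\w+ly")

result_list = pattern.findall(text)
result_iter = pattern.finditer(text)

is_list = isinstance(result_list, list)
is_iterator = isinstance(result_iter, Iterator)
is_callable = callable(result_iter)   # the iterator itself cannot be called

# Pull just the first match; the rest of the string is scanned on demand.
first_match = next(result_iter)

print(is_list, is_iterator, is_callable)
print(first_match.group())
```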
