Regex with lookahead does not match in Python

醉梦人生 2021-01-22 16:19

I have written a regex pattern intended to capture one date and one number from a sentence, but it does not match.

My code is:

txt = \'Την 02/12/2013 καταχωρήθηκ         


        
1 Answer
  • 2021-01-22 16:51

    Issues:

    • \.+ matches one or more dots, you need to use .+ (no escaping)
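    • (?=(κωδικ.\s?αριθμ.\s?καταχ.ριση.)|(κ\.?α\.?κ\.?:?\s*))(?P<KEK_number>\d+) can never match. A positive lookahead is zero-width: it checks the text at the current position without consuming it, so \d+ must then match at that same position. The lookahead requires the text there to start with κ, while \d+ requires a digit, and both cannot hold at once. You need to convert the lookahead to a consuming pattern.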
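Both issues can be reproduced in isolation. The snippet below is a minimal sketch, not the asker's original code: the sample strings and the names lookahead and consuming are made up for illustration, and only the short κ.α.κ. alternative from the question is used.

```python
import re

# Issue 1: an escaped dot matches only literal dots.
print(bool(re.search(r'a\.+b', 'a...b')))    # literal dots between a and b
print(bool(re.search(r'a\.+b', 'a xyz b')))  # no dots, so no match
print(bool(re.search(r'a.+b', 'a xyz b')))   # unescaped dot matches any char

# Issue 2: a lookahead is zero-width, so \d+ has to match at the very
# position where the lookahead demanded a κ. That is impossible.
text = 'Κ.Α.Κ.: 110035'  # hypothetical sample fragment
lookahead = re.compile(r'(?=κ\.?α\.?κ\.?:?\s*)(\d+)', re.I)
consuming = re.compile(r'(?:κ\.?α\.?κ\.?:?\s*)(\d+)', re.I)

print(lookahead.search(text))            # None
print(consuming.search(text).group(1))   # 110035
```

The only difference between the two compiled patterns is (?= versus (?:, which is exactly the change the fix relies on.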

    I suggest fixing your pattern as

    p = re.compile(r'''Την\s? # matches Την with a possible space afterwards
    (?P<KEK_date>\d{2}/\d{2}/\d{4}) # matches a date of the given format and captures it with a named group
    .+ # allows for an arbitrary sequence of characters
    (?:κωδικ.\s?αριθμ.\s?καταχ.ριση.|κ\.?α\.κ\.:?)\s+ # consumes either of the two labels, then whitespace
    (?P<KEK_number>\d+) # captures a sequence of digits''', re.I | re.X)
    

    See the regex demo

    Details

    • Την\s? - Την string and an optional whitespace
    • (?P<KEK_date>\d{2}/\d{2}/\d{4}) - Group "KEK_date": a date pattern, 2 digits, /, 2 digits, / and 4 digits
    • .+ - 1 or more chars other than line break chars, as many as possible (greedy)
    • (?:κωδικ.\s?αριθμ.\s?καταχ.ριση.|κ\.?α\.κ\.:?) - either of
      • κωδικ.\s?αριθμ.\s?καταχ.ριση. - κωδικ, any char, an optional whitespace, αριθμ, any one char, an optional whitespace, καταχ, any 1 char, ριση and any 1 char (but line break char)
      • | - or
      • κ\.?α\.κ\.:? - κ, an optional ., α, a ., κ, a . and then an optional :
    • \s+ - 1+ whitespaces
    • (?P<KEK_number>\d+) - Group "KEK_number": 1+ digits
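One consequence of the .+ bullet above: because . stops at line breaks, the pattern fails when the sentence is wrapped across lines between the date and the number. The sketch below shows this, assuming a made-up wrapped sentence; the names PATTERN, wrapped, single_line and dotall are illustrative only. Adding re.S (re.DOTALL) is one way to let . cross line breaks.

```python
import re

# Same pattern as in the answer, with comments dropped (still verbose mode).
PATTERN = r'''Την\s?
(?P<KEK_date>\d{2}/\d{2}/\d{4})
.+
(?:κωδικ.\s?αριθμ.\s?καταχ.ριση.|κ\.?α\.κ\.:?)\s+
(?P<KEK_number>\d+)'''

# Hypothetical input: the sentence is broken over two lines.
wrapped = 'Την 02/12/2013 καταχωρήθηκε στο Γ.Ε.ΜΗ.\nμε Κ.Α.Κ.: 110035'

single_line = re.compile(PATTERN, re.I | re.X)
dotall = re.compile(PATTERN, re.I | re.X | re.S)

print(single_line.search(wrapped) is not None)  # False: .+ cannot cross \n
print(dotall.search(wrapped) is not None)       # True: re.S lets . match \n
```

With re.S the greedy .+ can run to the end of the whole text before backtracking, so on long multi-record input a lazy .+? may be the safer choice.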

    See a Python demo:

    import re
    txt = 'Την 02/12/2013 καταχωρήθηκε στο Γενικό Εμπορικό Μητρώο της Υπηρεσίας Γ.Ε.ΜΗ. του Επιμελητηρίου Βοιωτίας, με κωδικόαριθμό καταχώρισης Κ.Α.Κ.: 110035'
    p = re.compile(r'''Την\s? # matches Την with a possible space afterwards
    (?P<KEK_date>\d{2}/\d{2}/\d{4}) # matches a date of the given format and captures it with a named group
    .+ # allows for an arbitrary sequence of characters
    (?:κωδικ.\s?αριθμ.\s?καταχ.ριση.|κ\.?α\.κ\.:?)\s+ # consumes either of the two labels, then whitespace
    (?P<KEK_number>\d+) # captures a sequence of digits''', re.I | re.X)
    print(p.findall(txt)) # => [('02/12/2013', '110035')]
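Since the pattern uses named groups, the values can also be read by name instead of by tuple position. A minimal sketch using re.search and groupdict() on the same sentence follows; the None check and the variable name m are additions for illustration.

```python
import re

txt = ('Την 02/12/2013 καταχωρήθηκε στο Γενικό Εμπορικό Μητρώο της '
       'Υπηρεσίας Γ.Ε.ΜΗ. του Επιμελητηρίου Βοιωτίας, με κωδικόαριθμό '
       'καταχώρισης Κ.Α.Κ.: 110035')

p = re.compile(r'''Την\s?                          # Την and an optional whitespace
(?P<KEK_date>\d{2}/\d{2}/\d{4})                    # the date, captured by name
.+                                                 # anything up to the label
(?:κωδικ.\s?αριθμ.\s?καταχ.ριση.|κ\.?α\.κ\.:?)\s+  # either label, then whitespace
(?P<KEK_number>\d+)                                # the number, captured by name
''', re.I | re.X)

m = p.search(txt)
if m is not None:  # search returns None when nothing matches
    print(m.group('KEK_date'))    # 02/12/2013
    print(m.group('KEK_number'))  # 110035
    print(m.groupdict())
```

findall is convenient when a text contains several such sentences; search with named groups reads more clearly when only the first match is needed.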
    