Negative lookahead assertion not working in Python

Submitted by 本小妞迷上赌 on 2019-11-29 02:20:39

(Darn, Jon beat me. Oh well, you can look at the examples anyway)

Like the other guys have said, regex is not the best tool for this job. If you are working with filepaths, take a look at os.path.

As for filtering files you don't want, you can do if 'thumb' not in filename: ... once you have dissected the path (where filename is a str).
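A minimal sketch of that approach, assuming POSIX-style paths; the helper name is_not_thumbnail is made up for illustration. It takes the last path component with os.path.basename and applies the substring test to that alone, so a directory that happens to contain 'thumb' does not affect the result:

```python
import os.path


def is_not_thumbnail(path):
    """Return True if the file name itself does not contain 'thumb'."""
    # Only look at the last path component, not the directories above it.
    filename = os.path.basename(path)
    return 'thumb' not in filename


paths = ['/tmp/somewhere/thumb', '/tmp/thumbs/file1.jpg', '/tmp/file1-thumb.jpg']
kept = [p for p in paths if is_not_thumbnail(p)]
print(kept)
```

Whether a directory named thumbs should disqualify a file is a design choice; if it should, test the whole path instead of the basename.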

And for posterity, here are my thoughts on those regexes. r".*(?!thumb).*" does not work because .* is greedy: it consumes the whole string first, so the lookahead is only tested at the very end, where nothing follows and it trivially succeeds. Take a look at this:

>>> re.search('(.*)((?!thumb))(.*)', '/tmp/somewhere/thumb').groups()
('/tmp/somewhere/thumb', '', '')
>>> re.search('(.*?)((?!thumb))(.*)', '/tmp/somewhere/thumb').groups()
('', '', '/tmp/somewhere/thumb')
>>> re.search('(.*?)((?!thumb))(.*?)', '/tmp/somewhere/thumb').groups()
('', '', '')

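The last one looks strange, but it follows the same logic: both quantifiers are lazy and nothing forces them to consume anything, so every group matches the empty string at position 0.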
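To see where the lookahead is actually being tested, you can ask the match object for the offset of the group that wraps it. This is only an illustrative sketch reusing the patterns from the session above; the variable names are mine:

```python
import re

path = '/tmp/somewhere/thumb'

# Group 2 wraps the lookahead, so its start offset is where it was tested.
greedy = re.search(r'(.*)((?!thumb))(.*)', path)
lazy = re.search(r'(.*?)((?!thumb))(.*)', path)

greedy_pos = greedy.start(2)  # end of the string: nothing left to look at
lazy_pos = lazy.start(2)      # start of the string: '/tmp...' is not 'thumb'

print(greedy_pos == len(path), lazy_pos == 0)
```

In neither case does the lookahead ever sit directly in front of 'thumb', which is why it never rejects anything.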

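The other regex (r"^(?!.*thumb).*") works because .* is inside the lookahead, so you don't have any issues with characters being stolen. You don't even need the ^ if you use re.match, which already anchors at the start of the string. With re.search you do need it; otherwise the search just slides forward to the first position after which 'thumb' no longer occurs: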

>>> re.search('((?!.*thumb))(.*)', '/tmp/somewhere/thumb').groups()
('', 'humb')
>>> re.search('^((?!.*thumb))(.*)', '/tmp/somewhere/thumb').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
>>> re.match('((?!.*thumb))(.*)', '/tmp/somewhere/thumb').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'

Could someone please explain why r".*(?!thumb).*" does not work but r"^(?!.*thumb).*" does?

The first will always match, because the first .* consumes the entire string, so the negative lookahead is tested at the end, where nothing follows and it cannot fail. The second is a bit convoluted: anchored at the start of the line, the lookahead scans ahead for any run of characters followed by 'thumb'. If 'thumb' is present anywhere, the lookahead fails and so does the entire match, because the line does begin with something followed by 'thumb'.
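A quick way to check this for yourself is to run both patterns over a couple of names. This is a small sketch; the constant names and sample file names are made up for illustration:

```python
import re

ALWAYS = re.compile(r".*(?!thumb).*")     # the pattern that "does not work"
ANCHORED = re.compile(r"^(?!.*thumb).*")  # the pattern that does

names = ["file1.jpg", "file1-thumb.jpg"]

# The first pattern accepts everything, thumbnail or not.
always_hits = [n for n in names if ALWAYS.match(n)]

# The second rejects any name with 'thumb' somewhere after the start.
anchored_hits = [n for n in names if ANCHORED.match(n)]

print(always_hits)
print(anchored_hits)
```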

Number two is more easily written as:

  • 'thumb' not in string
  • not re.search('thumb', string) (instead of match)

Also as I mentioned in the comments, your question says:

filenames not containing the word "thumb"

So you may wish to consider whether something like "thumbs up", which contains thumb only as part of a longer word, is supposed to be excluded.
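If only the standalone word should count, one option is a word-boundary pattern. This is a sketch under that assumption, not necessarily what the question needs; contains_word_thumb is a hypothetical helper name:

```python
import re

# \b matches between a word character (letter, digit, underscore) and a
# non-word character, so 'thumb' must not be glued to other word characters.
THUMB_WORD = re.compile(r"\bthumb\b")


def contains_word_thumb(text):
    """Return True if 'thumb' appears as a whole word in text."""
    return THUMB_WORD.search(text) is not None


samples = ["thumbs up", "file1-thumb.jpg", "file1_thumb.jpg", "thumbnail.jpg"]
results = {s: contains_word_thumb(s) for s in samples}
print(results)
```

Note that an underscore counts as a word character, so file1_thumb.jpg is not treated as containing the word, whereas the plain 'thumb' in string test reports True for all four samples.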

Ignoring all the bits about regular expressions, your task seems relatively simple:

  • given: a list of image filenames
  • to do: create a new list of the filenames not containing the word "thumb", i.e. only target the non-thumbnail images (for use with PIL, the Python Imaging Library).

Assuming you have a list of filenames that looks something like this:

filenames = [ 'file1.jpg', 'file1-thumb.jpg', 'file2.jpg', 'file2-thumb.jpg' ]

Then you can get a list of files not containing the word thumb like this:

not_thumb_filenames = [filename for filename in filenames if 'thumb' not in filename]

That's what we call a list comprehension, and it is essentially shorthand for:

not_thumb_filenames = []
for filename in filenames:
    if 'thumb' not in filename:
        not_thumb_filenames.append(filename)

Regular expressions aren't really necessary for this simple task.
