Regex lookaround construct in Java: advise on optimization needed

Submitted by 亡梦爱人 on 2019-12-12 13:37:34

Question


I am trying to extract filenames from a comma-separated list such as:

text.txt,temp_doc.doc,template.tmpl,empty.zip

I am using Java's regex implementation. The output requirements are as follows:

  1. Display only filenames and not their respective extensions
  2. Exclude files that begin with "temp_"

It should look like:

text

template

empty

So far I have managed to write a more or less satisfactory regex that handles the first requirement:

[^\\.,]++(?=\\.[^,]*+,?+)

I believe the best way to make it comply with the second requirement is to use lookaround constructs, but I am not sure how to write a reliable and optimized expression. The following regex does seem to do what is required, but it is obviously a flawed solution, if for no other reason than that it relies on an explicit maximum filename length.

(?!temp_|emp_|mp_|p_|_)(?<!temp_\\w{0,50})[^\\.,]++(?=\\.[^,]*+,?+)

P.S. I've been studying regexes only for a few days, so please don't laugh at this newbie-style overcomplicated code :)


Answer 1:


How about this:

Pattern regex = Pattern.compile(
    "\\b        # Start at word boundary\n" +
    "(?!temp_)  # Exclude words starting with temp_\n" +
    "[^,]+      # Match one or more characters except comma\n" +
    "(?=\\.)    # until the last available dot", 
    Pattern.COMMENTS);

This also allows dots within filenames.
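As a rough illustration of how the pattern above could be applied, here is a minimal sketch that collects every match with Matcher.find(). The class name BoundaryLookaheadDemo and the helper names() are hypothetical, and the pattern is written on one line without Pattern.COMMENTS; it is otherwise the same expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class BoundaryLookaheadDemo {
    // Same expression as above, without the free-spacing comments.
    static final Pattern NAME = Pattern.compile("\\b(?!temp_)[^,]+(?=\\.)");

    // Returns every substring the pattern matches, in order.
    static List<String> names(String csv) {
        List<String> found = new ArrayList<>();
        Matcher m = NAME.matcher(csv);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [text, template, empty]
        System.out.println(names("text.txt,temp_doc.doc,template.tmpl,empty.zip"));
    }
}
```

Because the match must start at a word boundary, a name such as -draft.txt would be reported as draft rather than -draft.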




Answer 2:


  • Display only filenames and not their respective extensions
  • Exclude files that begin with "temp_"

One variant would be like this:

(?:^|,)(?!temp_)((?:(?!\.[^.]*(?:,|$)).)+)

This allows

  • file names that do not begin with a "word character" (Tim Pietzcker's \b-based solution does not)
  • file names that contain a dot (something like file.name.ext will be matched as file.name)

But actually, this is really complex. You'll be better off writing a small function that splits the input at the commas and strips the extension from the parts.
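The split-and-strip approach suggested above might look like the following minimal sketch. The class name SplitBaseNames and the method baseNames() are hypothetical, and two details are assumptions rather than stated requirements: the extension is taken to start at the last dot, and entries with no dot (or only a leading dot) are kept unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; a sketch of the non-regex alternative.
class SplitBaseNames {
    static List<String> baseNames(String csv) {
        List<String> result = new ArrayList<>();
        for (String entry : csv.split(",")) {
            // Skip empty entries and files that begin with "temp_".
            if (entry.isEmpty() || entry.startsWith("temp_")) {
                continue;
            }
            // Assumption: the extension starts at the last dot.
            int dot = entry.lastIndexOf('.');
            result.add(dot > 0 ? entry.substring(0, dot) : entry);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [text, template, empty]
        System.out.println(baseNames("text.txt,temp_doc.doc,template.tmpl,empty.zip"));
    }
}
```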

Anyway, here's the tear-down:

(?:^|,)        # filename start: either start of the string or comma
(?!temp_)      # negative look-ahead: disallow filenames starting with "temp_"
(              # match group 1 (will contain your file name)
  (?:          #   non-capturing group (matches one allowed character)
    (?!        #     negative look-ahead (not followed by):
      \.       #       a dot
      [^.]*    #       any number of non-dots (this matches the extension)
      (?:,|$)  #       filename-end (either end of string or comma)
    )          #     end negative look-ahead
    .          #     this character is valid, match it
  )+           #   end non-capturing group, repeat
)              # end group 1

http://rubular.com/r/4jeHhsDuJG
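For use from Java, the same expression needs its backslashes doubled inside a string literal, and the file name is read from group 1 rather than from the whole match (the whole match may include the leading comma). A minimal sketch, with the hypothetical class name TemperedDotDemo and helper names():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class TemperedDotDemo {
    static final Pattern NAME =
        Pattern.compile("(?:^|,)(?!temp_)((?:(?!\\.[^.]*(?:,|$)).)+)");

    // Collects group 1 (the name without its extension) of every match.
    static List<String> names(String csv) {
        List<String> found = new ArrayList<>();
        Matcher m = NAME.matcher(csv);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [text, template, empty]
        System.out.println(names("text.txt,temp_doc.doc,template.tmpl,empty.zip"));
    }
}
```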




Answer 3:


Another option:

(?:temp_[^,.]*|([^,.]*))\.[^,]*

That pattern will match all file names, but will capture only valid names.

  • If at the current position the pattern can match temp_file.ext, it matches it and does not capture.
  • If it cannot match temp_, it tries to match ([^,.]*)\.[^,]*, and captures the file's name.

You can see an example here: http://www.rubular.com/r/QywiDgFxww
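In Java, a group that did not participate in a match is returned as null by Matcher.group(int), so the "match everything, capture only valid names" idea can be sketched by skipping matches whose group 1 is null. The class name CaptureOrSkipDemo and the helper names() are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class CaptureOrSkipDemo {
    static final Pattern FILE =
        Pattern.compile("(?:temp_[^,.]*|([^,.]*))\\.[^,]*");

    static List<String> names(String csv) {
        List<String> found = new ArrayList<>();
        Matcher m = FILE.matcher(csv);
        while (m.find()) {
            // group(1) is null when the temp_ alternative matched.
            if (m.group(1) != null) {
                found.add(m.group(1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [text, template, empty]
        System.out.println(names("text.txt,temp_doc.doc,template.tmpl,empty.zip"));
    }
}
```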



Source: https://stackoverflow.com/questions/11817249/regex-lookaround-construct-in-java-advise-on-optimization-needed
