Negative lookbehind in a regex with an optional prefix

送分小仙女 submitted on 2021-01-27 15:37:25


We are using the following regex to recognize URLs (derived from this gist by John Gruber). It is executed in Scala using scala.util.matching, which in turn uses java.util.regex:


This version has escaped forward slashes, for Rubular:


Previously the front end only sent plaintext to the back end, but now it allows users to create anchor tags for URLs. The back end therefore needs to recognize URLs except for those that are already in anchor tags. I initially tried to accomplish this with a negative lookbehind, ignoring URLs with an href=" prefix:

(?i)\b((?<!href=")((?:https?: ... etc

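The problem is that our URL regex is very liberal: it recognizes a full http(s):// URL as well as shorter, scheme-less forms of the same address. Given

 <a href="">Google</a>

the negative lookbehind rejects the full http(s):// URL, but the regex then still recognizes the shorter forms contained in it. I'm wondering if there's a succinct way to tell the regex to ignore those shorter forms when they are substrings of an ignored http(s):// URL.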
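To make the failure concrete, here is a minimal sketch. The url pattern in it is a much-simplified stand-in for the real regex, and the object name, method name and example.com address are placeholders assumed for illustration only.

```scala
import scala.util.matching.Regex

object LookbehindDemo {
  // Simplified stand-in for the liberal URL pattern (an assumption, not the real regex):
  // optional scheme, dotted host name, optional path.
  private val url =
    """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*\.[a-z]{2,6}\b(?:/[^\s"<>]*)?"""

  // Negative lookbehind placed at the start of the capturing group, as in the question.
  val pattern: Regex = ("(?i)\\b((?<!href=\")" + url + ")").r

  // Returns everything captured by group 1.
  def matches(text: String): List[String] =
    pattern.findAllMatchIn(text).map(_.group(1)).toList
}
```

With this sketch, LookbehindDemo.matches("""<a href="http://www.example.com">Home</a>""") returns List("www.example.com"): the lookbehind blocks the attempt that starts right after href=", but the engine simply moves on and matches the scheme-less tail of the same URL.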

At present I'm using a filter on the URL regex matches (code is in Scala). It also ignores URLs in link text (<a href=""></a>) by skipping URLs with a > prefix and a </a> suffix. I'd rather stick with the filter if doing this in a regex would make an already complicated regex even more unreadable.

urlPattern.findAllMatchIn(text).toList.filter(m => {
  val start: Int = m.start(1)
  val end: Int = m.end(1)
  // URL is an attribute value: the six characters before it are href="
  val isHref: Boolean = (start - 6 >= 0) &&
    text.substring(start - 6, start) == """href=""""
  // URL is link text: preceded by ">" and followed by the four characters "</a>"
  val isAnchor: Boolean = (start - 1 >= 0 && end + 4 <= text.length &&
    text.substring(start - 1, start) == ">" &&
    text.substring(end, end + 4) == "</a>")
  !(isHref || isAnchor) && Option(


<a href=\S+|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))


<a href=(?:(?!<\/a>).)*<\/a>|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))

Try this. What it essentially does is:

  1. Consumes every <a href=...>...</a> link, so that nothing inside it can be matched later.

  2. Does not capture it, so it never appears in a group.

  3. Processes the rest as before.

See demo.
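A rough Scala sketch of this consume-and-discard alternation, for illustration only. The url pattern is a simplified stand-in rather than the full regex above, and SkipAnchors and plainUrls are hypothetical names.

```scala
import scala.util.matching.Regex

object SkipAnchors {
  // Simplified stand-in for the liberal URL pattern (an assumption, not the real regex).
  private val url =
    """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*\.[a-z]{2,6}\b(?:/[^\s"<>]*)?"""

  // Alternative 1 consumes a whole anchor element without capturing anything.
  // Alternative 2 captures a URL outside of anchors in group 1.
  val pattern: Regex = ("(?i)<a href=(?:(?!</a>).)*</a>|\\b(" + url + ")").r

  // Group 1 is null whenever alternative 1 matched, so Option(...) drops those matches.
  def plainUrls(text: String): List[String] =
    pattern.findAllMatchIn(text).toList.flatMap(m => Option(m.group(1)))
}
```

Because the first alternative swallows the anchor including its link text, both the href value and a URL used as link text are skipped, while matches outside anchors still land in group 1.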


It seems that you want to ignore not only the shorter forms that are substrings of an ignored http(s):// URL, but any substring fragment of a previously ignored section. In that case, you can use a bit of code to work around this. Please see the regex:

(a href=")?(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@))))

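I'm not good at Scala, but you can probably do this:

import scala.util.matching.Regex

// Pass the pattern as a plain String (no .r); "unwanted" names the first group
val links = new Regex("""(a href=")?(?i)\b(((?:https?:... """, "unwanted")
val unwanted = for (o <- links findAllMatchIn text) yield o group "unwanted"

If the unwanted group is null for a match, then that match is useful.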
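Here is a small self-contained sketch of the optional-prefix idea. It assumes a simplified stand-in URL pattern instead of the full regex, reads the groups by number rather than by name, and uses the hypothetical names OptionalPrefix and plainUrls.

```scala
import scala.util.matching.Regex

object OptionalPrefix {
  // Simplified stand-in for the liberal URL pattern (an assumption, not the real regex).
  private val url =
    """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*\.[a-z]{2,6}\b(?:/[^\s"<>]*)?"""

  // Group 1: the optional, unwanted prefix. Group 2: the URL itself.
  val links: Regex = ("(?i)(a href=\")?\\b(" + url + ")").r

  // Keep only the matches in which the prefix group did not participate.
  def plainUrls(text: String): List[String] =
    links.findAllMatchIn(text).toList
      .filter(m => m.group(1) == null)
      .map(m => m.group(2))
}
```

A match that begins with a href=" consumes the whole URL after it, so no fragment of that URL can be matched again. Note that this sketch does not skip a URL that appears as link text between > and </a>.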

If you need to do a replacement, you can work around this by adding an alternative:

a href="(?i)\b(?:(?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))|((?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))))

The second part of the regex, after the pipe |, is a capturing group. When replacing with this regex, you can refer to that group as \1 ($1 in a Java or Scala replacement string).
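One caveat: replacing every match with the first group alone would blank out the href matches, because the group is empty for them. A possible way around that, sketched here under assumptions, is a replacement function that returns href matches unchanged. The url pattern is a simplified stand-in, and wrapping plain URLs in an anchor is just a hypothetical replacement chosen for illustration.

```scala
import scala.util.matching.Regex

object ReplacePlainUrls {
  // Simplified stand-in for the liberal URL pattern (an assumption, not the real regex).
  private val url =
    """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*\.[a-z]{2,6}\b(?:/[^\s"<>]*)?"""

  // Alternative 1: a URL behind a href=" is consumed without capturing.
  // Alternative 2: any other URL is captured in group 1.
  val pattern: Regex = ("(?i)a href=\"\\b(?:" + url + ")|\\b(" + url + ")").r

  // Hypothetical replacement: wrap plain URLs in an anchor, leave href matches as they are.
  def linkify(text: String): String =
    pattern.replaceAllIn(text, m => {
      val u = m.group(1)
      // quoteReplacement keeps any $ or \ in the result from being read as a group reference
      Regex.quoteReplacement(
        if (u == null) m.matched
        else s"""<a href="$u">$u</a>"""
      )
    })
}
```

The function form of replaceAllIn lets the code decide per match, which a fixed replacement string cannot do.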

Similar question:

  • Regex Pattern to Match, Excluding when... / Except between


How about just adding the <a href= part as an optional group, then when checking your matching, you only return those matches in which that group is empty?

