Extracting email addresses in an HTML block in Ruby/Rails

I am creating a parser that guards against spamming and email harvesting. It processes a block of text that comes from TinyMCE, so it may or may not have HTML tags in it.

I've tried regexes and so far this has been successful:

/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

The problem is that I need to ignore all email addresses inside mailto hrefs. For example:

<a href="mailto:test@mail.com">test@mail.com</a>

should only return the second email address.

To give some background on what I'm doing: I'm reversing the email addresses in a block, so the above example would look like this:

<a href="mailto:test@mail.com">moc.liam@tset</a>

The problem with my current regex is that it also replaces the address in the href. Is there a way to do this with a single regex, or do I have to check for one and then the other? Can I do this with gsub alone, or do I have to use Nokogiri/Hpricot to parse the mailtos? Thanks in advance!

Here are my references, by the way:

so.com/questions/504860/extract-email-addresses-from-a-block-of-text

so.com/questions/1376149/regexp-for-extracting-a-mailto-address

I'm also testing with this:

http://rubular.com/

Edit

Here's my current helper code:

def email_obfuscator(text)
  # Wrap every matched address in a span and reverse its characters.
  text.gsub(/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i) { |m|
    "<span class='anti-spam'>#{m.reverse}</span>"
  }
end

which results in this:

<a target="_self" href="mailto:<span class='anti-spam'>moc.liamg@tset</span>"><span class="anti-spam">moc.liamg@tset</span></a>

Another option if lookbehind doesn't work:

/\b(mailto:)?([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

This matches every email address; you can then check whether the first captured group is "mailto:" and, if so, skip that match.
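A minimal sketch of how that check could be wired into gsub. The constant and method names here are made up for illustration; the span markup is copied from the helper in the question. When the first group captured "mailto:", the block returns the match unchanged, so the href is left alone.

```ruby
# Hypothetical constant name; the pattern is the one suggested above.
EMAIL_WITH_OPTIONAL_MAILTO =
  /\b(mailto:)?([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

# Hypothetical helper name; the markup mirrors the question's email_obfuscator.
def obfuscate_skipping_mailto(text)
  text.gsub(EMAIL_WITH_OPTIONAL_MAILTO) do |match|
    prefix  = Regexp.last_match(1) # "mailto:" or nil
    address = Regexp.last_match(2) # the bare email address
    if prefix
      match # inside a mailto: link, so return it untouched
    else
      "<span class='anti-spam'>#{address.reverse}</span>"
    end
  end
end

html = '<a href="mailto:test@mail.com">test@mail.com</a>'
puts obfuscate_skipping_mailto(html)
# <a href="mailto:test@mail.com"><span class='anti-spam'>moc.liam@tset</span></a>
```

Because the optional group consumes "mailto:" together with the whole address, the regex cannot restart partway through an address that sits in an href.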

Would this work?

/\b(?<!mailto:)[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

The (?<!mailto:) is a negative lookbehind: it rejects a match whenever the text immediately before the match position is mailto:

I don't have Ruby set up at work, unfortunately, but it worked with PHP when I tested it...
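For anyone who wants to check it in Ruby, here is a rough sketch. It assumes Ruby 1.9 or later, since the 1.8 regex engine does not support lookbehind. The constant and method names are invented for the example, and the span markup comes from the question's helper.

```ruby
# Hypothetical constant name; the pattern is the lookbehind version above.
NOT_AFTER_MAILTO = /\b(?<!mailto:)[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

# Hypothetical helper name; same markup as the question's email_obfuscator.
def obfuscate_with_lookbehind(text)
  text.gsub(NOT_AFTER_MAILTO) { |m| "<span class='anti-spam'>#{m.reverse}</span>" }
end

html = '<a href="mailto:test@mail.com">test@mail.com</a>'
puts obfuscate_with_lookbehind(html)
# <a href="mailto:test@mail.com"><span class='anti-spam'>moc.liam@tset</span></a>

# Caveat: the lookbehind only guards the position right after "mailto:".
# If the local part contains a dot, the engine can start a match at that dot.
p 'mailto:first.last@mail.com'.scan(NOT_AFTER_MAILTO)
# => [".last@mail.com"]
```

So this works for the example in the question, but an href with a dotted local part can still be partially matched. The optional (mailto:)? group pattern avoids that, because it consumes the prefix and the full address in one match.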

Why not just store all the matched emails in an array and remove any duplicates? You can do this easily with the Ruby standard library, and (I imagine) it's probably quicker and more maintainable than adding more complexity to your regex.

emails = ["email_one@example.com", "email_one@example.com", "email_two@example.com"]
emails.uniq # => ["email_one@example.com", "email_two@example.com"]

Source: https://stackoverflow.com/questions/2782031/extracting-email-addresses-in-an-html-block-in-ruby-rails

Tags: ruby-on-rails, ruby, regex, html-parsing, email-integration