How to tokenize, scan or split this string of email addresses

随声附和 submitted on 2019-12-11 04:42:44

Question


For Simple Java Mail I'm trying to deal with a somewhat free-format list of delimited email addresses. Note that I'm specifically not validating, just getting the individual addresses out of the list. For this use case the addresses can be assumed to be valid.

Here is an example of a valid input:

"name@domain.com,Sixpack, Joe 1 <name@domain.com>, Sixpack, Joe 2 <name@domain.com> ;Sixpack, Joe, 3<name@domain.com> , nameFoo@domain.com,nameBar@domain.com;nameBaz@domain.com;"

So there are two basic forms, "name@domain.com" and "Joe Sixpack <name@domain.com>", which can appear in a comma / semicolon delimited string, ignoring white space padding. The problem is that the names can contain delimiters as valid characters.

The following array shows the data needed (trailing spaces or delimiters would not be a big problem):

["name@domain.com",
"Sixpack, Joe 1 <name@domain.com>",
"Sixpack, Joe 2 <name@domain.com>",
"Sixpack, Joe, 3<name@domain.com>",
"nameFoo@domain.com",
"nameBar@domain.com",
"nameBaz@domain.com"]

I can't think of a clean way to deal with this. Any suggestions on how I can reliably recognize whether a comma is part of a name or is a delimiter?


Final solution (variation on the accepted answer):

var string = "name@domain.com,Sixpack, Joe 1 <name@domain.com>, Sixpack, Joe 2 <name@domain.com> ;Sixpack, Joe, 3<name@domain.com> , nameFoo@domain.com,nameBar@domain.com;nameBaz@domain.com;"

// recognize value tails and replace the delimiters there, disambiguating delimiters
const result = string
  .replace(/(@.*?>?)\s*[,;]/g, "$1<|>")
  .replace(/<\|>$/,"") // remove trailing delimiter
  .split(/\s*<\|>\s*/) // split on delimiter including surrounding whitespace

console.log(result)

Or in Java:

public static String[] extractEmailAddresses(String emailAddressList) {
    return emailAddressList
            .replaceAll("(@.*?>?)\\s*[,;]", "$1<|>")
            .replaceAll("<\\|>$", "")
            .split("\\s*<\\|>\\s*");
}
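
To see what each step of the Java method above contributes, here is a minimal runnable sketch. The class name ExtractEmailAddressesDemo and the helper markDelimiters are hypothetical names introduced only for illustration; the regular expressions are the ones from the method above.

```java
import java.util.Arrays;

class ExtractEmailAddressesDemo {

    // Step 1 on its own: every ',' or ';' that closes an address tail
    // ("@..." optionally followed by '>') becomes the temporary marker "<|>".
    // Delimiters inside display names are untouched because no '@' precedes them.
    static String markDelimiters(String emailAddressList) {
        return emailAddressList.replaceAll("(@.*?>?)\\s*[,;]", "$1<|>");
    }

    // Steps 2 and 3: drop a trailing marker, then split on the marker,
    // swallowing whitespace on either side of it.
    static String[] extractEmailAddresses(String emailAddressList) {
        return markDelimiters(emailAddressList)
                .replaceAll("<\\|>$", "")
                .split("\\s*<\\|>\\s*");
    }

    public static void main(String[] args) {
        String input = "name@domain.com,Sixpack, Joe 1 <name@domain.com>, "
                + "Sixpack, Joe 2 <name@domain.com> ;Sixpack, Joe, 3<name@domain.com> , "
                + "nameFoo@domain.com,nameBar@domain.com;nameBaz@domain.com;";
        System.out.println(markDelimiters(input));
        System.out.println(Arrays.toString(extractEmailAddresses(input)));
    }
}
```

For the example input this should yield the seven entries listed in the question. Note that the approach relies on the marker "<|>" never occurring in the input itself.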

Answer 1:


Using Java's replaceAll and split functions (mimicked in JavaScript below), I would lock onto what you know ends an item (the ".com"), replace the separator characters that follow it with a unique temporary token (a UUID or something like <|>), and then split on that token.

Here is a JavaScript example, but Java's replaceAll and split can do the same job.

var string = "name@domain.com,Joe Sixpack <name@domain.com>, Sixpack, Joe <name@domain.com> ;Sixpack, Joe<name@domain.com> , name@domain.com,name@domain.com;name@domain.com;"


const result = string.replace(/(\.com>?)[\s,;]+/g, "$1<|>").replace(/<\|>$/,"").split("<|>")
console.log(result)
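
For reference, a rough Java translation of the snippet above might look like the following. The class and method names (DotComSplitDemo, splitOnDotCom) are made up for this sketch. One difference from JavaScript is worth calling out: Java's String.split takes a regular expression, so the marker has to be quoted, otherwise "<|>" would be read as "split on '<' or '>'". As in the JavaScript version, this only recognizes addresses ending in ".com".

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class DotComSplitDemo {

    // Temporary marker; assumed never to appear in the input.
    private static final String MARKER = "<|>";

    static String[] splitOnDotCom(String list) {
        return list
                // ".com" (optionally followed by '>') ends an item; replace the
                // run of separators and whitespace after it with the marker
                .replaceAll("(\\.com>?)[\\s,;]+", "$1<|>")
                // remove the marker left behind by a trailing delimiter
                .replaceAll("<\\|>$", "")
                // split takes a regex in Java, so quote the marker
                .split(Pattern.quote(MARKER));
    }

    public static void main(String[] args) {
        String input = "name@domain.com,Joe Sixpack <name@domain.com>, "
                + "Sixpack, Joe <name@domain.com> ;Sixpack, Joe<name@domain.com> , "
                + "name@domain.com,name@domain.com;name@domain.com;";
        System.out.println(Arrays.toString(splitOnDotCom(input)));
    }
}
```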



Answer 2:


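Since you are not validating, I assume that the email addresses are valid. Based on this assumption, I look for an email address followed by ; or , so that I know where each entry ends. Each match keeps its trailing delimiter and any surrounding whitespace, which the question says is acceptable.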

    var string = "name@domain.com,Sixpack, Joe 1 <name@domain.com>, Sixpack, Joe 2 <name@domain.com> ;Sixpack, Joe, 3<name@domain.com> , nameFoo@domain.com,nameBar@domain.com;nameBaz@domain.com;"



    const result = string.match(/(.*?@.*?\..*?)[,;]/g)
    console.log(result)
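
A possible Java counterpart of this matching approach is sketched below, using Matcher.find in a loop. Two things here are my own adjustments rather than part of the snippet above: the pattern also accepts end of input in place of the final delimiter, so a list without a trailing ; or , does not lose its last entry, and each captured entry is trimmed. The names MatchAddressesDemo and matchAddresses are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchAddressesDemo {

    // Lazily match up to an '@', then up to a '.', then up to the next
    // delimiter (or the end of the input, which is an added assumption).
    private static final Pattern ENTRY =
            Pattern.compile("(.*?@.*?\\..*?)(?:[,;]|$)");

    static List<String> matchAddresses(String list) {
        List<String> result = new ArrayList<>();
        Matcher matcher = ENTRY.matcher(list);
        while (matcher.find()) {
            // group(1) excludes the delimiter; trim removes the padding
            result.add(matcher.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "name@domain.com,Sixpack, Joe 1 <name@domain.com>, "
                + "Sixpack, Joe 2 <name@domain.com> ;Sixpack, Joe, 3<name@domain.com> , "
                + "nameFoo@domain.com,nameBar@domain.com;nameBaz@domain.com;";
        matchAddresses(input).forEach(System.out::println);
    }
}
```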



Answer 3:


This pattern works for your provided examples:

([^@,;\s]+@[^@,;\s]+)|(?:^|\s*[,;])(?:\s*)(.*?)<([^@,;\s]+@[^@,;\s]+)>

([^@,;\s]+@[^@,;\s]+)   # email defined by an @ with connected chars except ',' ';' and white-space
|                       # OR
(?:^|\s*[,;])(?:\s*)    # start of line OR 0 or more spaces followed by a separator, then 0 or more white-space chars
(.*?)                   # name
<([^@,;\s]+@[^@,;\s]+)> # email enclosed by lt-gt
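
To try this pattern from Java, something along these lines could work. It is only a sketch: it assumes the first anchor in the second alternative is ^ (start of line), as the annotation describes, and the names AddressPatternDemo and extract are invented for the example. Because the pattern captures the display name and the address separately, the sketch reassembles named entries as "name <email>", which normalizes the spacing before the '<'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressPatternDemo {

    private static final Pattern ADDRESS = Pattern.compile(
            // group 1: bare email
            "([^@,;\\s]+@[^@,;\\s]+)"
            // OR: start of line / separator, then group 2 (name) and group 3 (email)
            + "|(?:^|\\s*[,;])(?:\\s*)(.*?)<([^@,;\\s]+@[^@,;\\s]+)>");

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = ADDRESS.matcher(input);
        while (matcher.find()) {
            if (matcher.group(1) != null) {
                // first alternative matched: a bare address
                result.add(matcher.group(1));
            } else {
                // second alternative matched: rebuild "name <email>"
                result.add(matcher.group(2).trim() + " <" + matcher.group(3) + ">");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "name@domain.com,Sixpack, Joe 1 <name@domain.com>, "
                + "Sixpack, Joe 2 <name@domain.com> ;Sixpack, Joe, 3<name@domain.com> , "
                + "nameFoo@domain.com,nameBar@domain.com;nameBaz@domain.com;";
        extract(input).forEach(System.out::println);
    }
}
```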




Source: https://stackoverflow.com/questions/45825426/how-to-tokenize-scan-or-split-this-string-of-email-addresses
