Best way to parse string of email addresses

悲哀的现实 2021-02-14 04:10

I am working with some email header data, and for the to:, from:, cc:, and bcc: fields the email address(es) can be expressed in a number of different ways:

F         


        
13 Answers
  • 2021-02-14 04:49

    Here is the solution I came up with to accomplish this:

    // Requires: using System.Collections.Generic;
    string str = "Last, First <name@domain.com>, name@domain.com, First Last <name@domain.com>, \"First Last\" <name@domain.com>";

    List<string> addresses = new List<string>();
    int atIdx = -1;      // index of the last '@' seen so far
    int lastComma = 0;   // start index of the entry currently being scanned
    for (int c = 0; c < str.Length; c++)
    {
        if (str[c] == '@')
            atIdx = c;

        // A comma ends an entry only if an '@' has appeared since the entry began;
        // otherwise it is part of a "Last, First" display name.
        if (str[c] == ',' && atIdx >= lastComma)
        {
            addresses.Add(str.Substring(lastComma, c - lastComma).Trim());
            lastComma = c + 1;   // skip the separating comma
        }
    }

    // Whatever follows the last separating comma is the final entry.
    // This also covers input with no comma, or with only a "Last, First" comma.
    if (lastComma < str.Length)
        addresses.Add(str.Substring(lastComma).Trim());
    
  • 2021-02-14 04:53

    Here is how I would do it:

    • You can try to standardize the data as much as possible i.e. get rid of such things as the < and > symbols and all of the commas after the '.com.' You will need the commas that separate the first and last names.
    • After getting rid of the extra symbols, put every grouped email record in a list as a string. You can use the .com to determine where to split the string if need be.
    • Once the email records are in the list of strings, you can further split each record using only whitespace as the delimiter.
    • The final step is to determine what is the first name, what is the last name, etc. This would be done by checking the 3 components for: a comma, which would indicate that it is the last name; a . which would indicate the actual address; and whatever is left is the first name. If there is no comma, then the first name is first, last name is second, etc.

      I don't know if this is the most concise solution, but it would work and does not require any advanced programming techniques.
  • 2021-02-14 04:57

    I decided that I was going to draw a line in the sand at two restrictions:

    1. The To and Cc headers have to be csv parseable strings.
    2. Anything MailAddress couldn't parse, I'm just not going to worry about it.

    I also decided I'm just interested in email addresses and not display name, since display name is so problematic and hard to define, whereas email address I can validate. So I used MailAddress to validate my parsing.

    I treated the To and Cc headers like a csv string, and again, anything not parseable in that way I don't worry about it.

    private string GetProperlyFormattedEmailString(string emailString)
    {
        var emailStringParts = CSVProcessor.GetFieldsFromString(emailString);
        var validAddresses = new List<string>();

        foreach (var part in emailStringParts)
        {
            // MailAddress throws (for example FormatException) when the part is not
            // an email address; the exception is left to propagate to the caller.
            var address = new MailAddress(part);
            validAddresses.Add(address.Address);
        }

        return string.Join(",", validAddresses);
    }
    

    EDIT

    Further research has shown me that my assumptions are good. Reading through RFC 2822 pretty much shows that the To, Cc, and Bcc fields are csv-parseable. So yes, it's hard and there are a lot of gotchas, as with any csv parsing, but if you have a reliable way to parse csv fields (TextFieldParser in the Microsoft.VisualBasic.FileIO namespace is one, and is what I used here), then you are golden.

    Edit 2

    Apparently they don't need to be valid CSV strings; the quotes really mess things up. So your csv parser has to be fault tolerant. I made it try to parse the string and, if that fails, strip all quotes and try again:

    public static string[] GetFieldsFromString(string csvString)
        {
            using (var stringAsReader = new StringReader(csvString))
            {
                using (var textFieldParser = new TextFieldParser(stringAsReader))
                {
                    SetUpTextFieldParser(textFieldParser, FieldType.Delimited, new[] {","}, false, true);
    
                    try
                    {
                        return textFieldParser.ReadFields();
                    }
                    catch (MalformedLineException ex1)
                    {
                        //assume it's not parseable due to double quotes, so we strip them all out and take what we have
                        var sanitizedString = csvString.Replace("\"", "");
    
                        using (var sanitizedStringAsReader = new StringReader(sanitizedString))
                        {
                            using (var textFieldParser2 = new TextFieldParser(sanitizedStringAsReader))
                            {
                                SetUpTextFieldParser(textFieldParser2, FieldType.Delimited, new[] {","}, false, true);
    
                                try
                                {
                                    return textFieldParser2.ReadFields().Select(part => part.Trim()).ToArray();
                                }
                                catch (MalformedLineException ex2)
                                {
                                    return new string[] {csvString};
                                }
                            }
                        }
                    }
                }
            }
        }
    

    The one thing it won't handle is quoted accounts in an email i.e. "Monkey Header"@stupidemailaddresses.com.

    And here's the test:

    [Subject(typeof(CSVProcessor))]
    public class when_processing_an_email_recipient_header
    {
        static string recipientHeaderToParse1 = @"""Lastname, Firstname"" <firstname_lastname@domain.com>" + "," +
                                               @"<testto@domain.com>, testto1@domain.com, testto2@domain.com" + "," +
                                               @"<testcc@domain.com>, test3@domain.com" + "," +
                                               @"""""Yes, this is valid""""@[emails are hard to parse!]" + "," +
                                               @"First, Last <name@domain.com>, name@domain.com, First Last <name@domain.com>"
                                               ;
    
        static string[] results1;
        static string[] expectedResults1;
    
        Establish context = () =>
        {
            expectedResults1 = new string[]
            {
                @"Lastname",
                @"Firstname <firstname_lastname@domain.com>",
                @"<testto@domain.com>",
                @"testto1@domain.com",
                @"testto2@domain.com",
                @"<testcc@domain.com>",
                @"test3@domain.com",
                @"Yes",
                @"this is valid@[emails are hard to parse!]",
                @"First",
                @"Last <name@domain.com>",
                @"name@domain.com",
                @"First Last <name@domain.com>"
            };
        };
    
        Because of = () =>
        {
            results1 = CSVProcessor.GetFieldsFromString(recipientHeaderToParse1);
        };
    
        It should_parse_the_email_parts_properly = () => results1.ShouldBeLike(expectedResults1);
    }
    
  • 2021-02-14 04:57

    I use the following regular expression in Java to extract the email string from an RFC-compliant email address:

    [A-Za-z0-9]+[A-Za-z0-9._-]+@[A-Za-z0-9]+[A-Za-z0-9._-]+[.][A-Za-z0-9]{2,3}
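    The pattern itself is not Java-specific. As a minimal sketch, here is the same expression applied with .NET's Regex class to pull every address out of a header value. The class and method names (EmailExtractor, Extract) are made up for illustration. Keep in mind that the pattern is narrower than the RFC: it needs at least two characters before the @, and it only consumes two or three characters of the top-level domain, so longer TLDs would be cut short.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class EmailExtractor
{
    // Same pattern as above, compiled once and reused.
    private static readonly Regex EmailPattern = new Regex(
        @"[A-Za-z0-9]+[A-Za-z0-9._-]+@[A-Za-z0-9]+[A-Za-z0-9._-]+[.][A-Za-z0-9]{2,3}");

    // Returns every substring of the header that matches the pattern,
    // in the order the matches appear.
    public static string[] Extract(string header)
    {
        return EmailPattern.Matches(header)
                           .Cast<Match>()
                           .Select(m => m.Value)
                           .ToArray();
    }
}
```

    Calling EmailExtractor.Extract on a header value discards display names and angle brackets and keeps only the matched address text.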
    
  • 2021-02-14 04:58

    You could use regular expressions to try to separate this out, try this guy:

    ^(?<name1>[a-zA-Z0-9]+?),? (?<name2>[a-zA-Z0-9]+?),? (?<address1>[a-zA-Z0-9.@_<>-]+?)$
    

    This will match: Last, First test@test.com; Last, First <test@test.com>; First last test@test.com; First Last <test@test.com>. You can add another optional group at the end of the regex to pick up the trailing segment of First, Last <name@domain.com>, name@domain.com, that is, the address that follows the one enclosed in angle brackets.
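    For what it's worth, here is a rough sketch of reading the named groups back out in C#. NameAddressMatcher and TryMatch are hypothetical names. The character class in this sketch lists @ explicitly and puts the hyphen last so that it is a literal hyphen rather than a range; treat that as my adjustment, not a requirement.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class NameAddressMatcher
{
    private static readonly Regex Pattern = new Regex(
        @"^(?<name1>[a-zA-Z0-9]+?),? (?<name2>[a-zA-Z0-9]+?),? (?<address1>[a-zA-Z0-9.@_<>-]+?)$");

    // Tries to split a single "name name address" entry into its three parts.
    // Angle brackets around the address are stripped from the result.
    public static bool TryMatch(string entry, out string name1, out string name2, out string address)
    {
        Match m = Pattern.Match(entry);
        name1 = m.Success ? m.Groups["name1"].Value : string.Empty;
        name2 = m.Success ? m.Groups["name2"].Value : string.Empty;
        address = m.Success ? m.Groups["address1"].Value.Trim('<', '>') : string.Empty;
        return m.Success;
    }
}
```

    This only handles one entry at a time with exactly two name tokens, so the header still has to be split into entries first.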

    Hope this helps somewhat!

    EDIT:

    and of course you can add more characters to each of the sections to accept quotations etc for whatever format is being read in. As sjbotha mentioned, this could be difficult as the string that is submitted is not necessarily in a set format.

    This link can give you more information about matching AND validating email addresses using regular expressions.

  • 2021-02-14 04:59

    The clean and short solution is to use MailAddressCollection:

    var collection = new MailAddressCollection();
    collection.Add(addresses);
    

    This approach parses a list of addresses separated by commas (,) and validates it according to the RFC. It throws FormatException if the addresses are invalid. As suggested in other posts, if you need to deal with invalid addresses you have to pre-process or parse the value yourself; otherwise I recommend using what .NET offers, without resorting to reflection.

    Sample:

    var collection = new MailAddressCollection();
    collection.Add("Joe Doe <doe@example.com>, postmaster@example.com");
    
    foreach (var addr in collection)
    {
      // addr.DisplayName, addr.User, addr.Host
    }
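    If you would rather not let the FormatException escape, a small wrapper along these lines is one option. This is a sketch only: AddressHeaderParser and TryParseAddresses are names I made up, and treating a blank header as a failure is my assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Mail;

// Hypothetical helper, not part of .NET.
public static class AddressHeaderParser
{
    // Returns false instead of throwing when the header cannot be parsed.
    public static bool TryParseAddresses(string header, out List<MailAddress> result)
    {
        result = new List<MailAddress>();

        // Assumption: a null or blank header counts as "nothing to parse".
        if (string.IsNullOrWhiteSpace(header))
            return false;

        try
        {
            var collection = new MailAddressCollection();
            collection.Add(header);          // parses the comma-separated list
            result.AddRange(collection);
            return true;
        }
        catch (FormatException)
        {
            // At least one entry was not a valid address.
            result.Clear();
            return false;
        }
    }
}
```

    The caller can then decide whether to fall back to manual pre-processing when the method returns false.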
    