Distinct by part of the string in LINQ

梦想与她 submitted on 2019-12-12 12:39:03

Question


Given this collection:

var list = new [] {
    "1.one",
    "2. two",
    "no number",
    "2.duplicate",
    "300. three hundred",
    "4-ignore this"};

How can I get the subset of items that start with a number followed by a dot (regex @"^\d+(?=\.)"), keeping only one item per distinct number? That is:

{"1.one", "2. two", "300. three hundred"}

UPDATE:

My attempt was to pass an IEqualityComparer to the Distinct method. I borrowed this GenericCompare class and tried the following code, to no avail:

var pattern = @"^\d+(?=\.)";
var comparer = new GenericCompare<string>(s => Regex.Match(s, pattern).Value);
list.Where(f => Regex.IsMatch(f, pattern)).Distinct(comparer);

Answer 1:


If you fancy a LINQ approach, you can add a named capture group to the regex, filter the items that match it, group by the captured number, and finally take the first string in each group. I like the readability of this solution, but I wouldn't be surprised if there is a more efficient way of eliminating the duplicates; let's see if somebody else comes up with a different approach.

Something like this:

list.Where(s => regex.IsMatch(s))
    .GroupBy(s => regex.Match(s).Groups["num"].Value)
    .Select(g => g.First())

You can give it a try with this sample:

using System;
using System.Linq;
using System.Text.RegularExpressions;

public class Program
{
    private static readonly Regex regex = new Regex(@"^(?<num>\d+)\.", RegexOptions.Compiled);

    public static void Main()
    {
        var list = new [] {
            "1.one",
            "2. two",
            "no number",
            "2.duplicate",
            "300. three hundred",
            "4-ignore this"
        };

        var distinctWithNumbers = list.Where(s => regex.IsMatch(s))
                                      .GroupBy(s => regex.Match(s).Groups["num"].Value)
                                      .Select(g => g.First());

        distinctWithNumbers.ToList().ForEach(Console.WriteLine);
        Console.ReadKey();
    }       
}

You can try this approach in this fiddle

As pointed out by @orad in the comments, MoreLinq provides a DistinctBy() extension method that can replace grouping and then taking the first item of each group to eliminate the duplicates:

var distinctWithNumbers = list.Where(s => regex.IsMatch(s))
                              .DistinctBy(s => regex.Match(s).Groups["num"].Value);

Try it in this fiddle
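
Since .NET 6, System.Linq ships its own Enumerable.DistinctBy, so on newer runtimes the same idea should work without the MoreLinq package. The following is a minimal sketch under that assumption; the names DistinctByDemo, NumberPrefix and DistinctByLeadingNumber are made up for illustration, and the projection to an anonymous type is only there so the regex runs once per item.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DistinctByDemo
{
    // Same pattern as in the answer above: leading digits captured as "num", followed by a dot.
    private static readonly Regex NumberPrefix = new Regex(@"^(?<num>\d+)\.", RegexOptions.Compiled);

    // Returns one item per leading number; items without a "number." prefix are dropped.
    public static List<string> DistinctByLeadingNumber(IEnumerable<string> items)
    {
        return items.Select(s => new { Text = s, Match = NumberPrefix.Match(s) }) // one regex run per item
                    .Where(x => x.Match.Success)
                    .DistinctBy(x => x.Match.Groups["num"].Value) // assumption: .NET 6+ built-in, not MoreLinq
                    .Select(x => x.Text)
                    .ToList();
    }

    public static void Main()
    {
        var list = new[] { "1.one", "2. two", "no number", "2.duplicate", "300. three hundred", "4-ignore this" };
        foreach (var item in DistinctByLeadingNumber(list))
        {
            Console.WriteLine(item);
        }
    }
}
```

In practice the current implementation yields the first item seen for each key, in source order, which is what the question asks for.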

EDIT

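If you want to use your comparer, you need to implement GetHashCode so that it uses the expression as well. Distinct buckets items by hash code and only calls Equals on items whose hash codes match, so two strings with the same number must also produce the same hash code: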

public int GetHashCode(T obj)
{
    return _expr.Invoke(obj).GetHashCode();
}

Then you can use the comparer with a lambda function that takes a string and gets the number using the regex:

var comparer = new GenericCompare<string>(s => regex.Match(s).Groups["num"].Value);
var distinctWithNumbers = list.Where(s => regex.IsMatch(s)).Distinct(comparer); 

I have created another fiddle with this approach.
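
For reference, here is a rough sketch of what such a key-based comparer could look like as a whole. The GenericCompare class from the question is not reproduced in this thread, so KeyComparer below is a hypothetical stand-in rather than the original code; only the _expr field name is taken from the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the GenericCompare class mentioned in the question.
public sealed class KeyComparer<T> : IEqualityComparer<T>
{
    private readonly Func<T, object> _expr;

    public KeyComparer(Func<T, object> expr)
    {
        _expr = expr ?? throw new ArgumentNullException(nameof(expr));
    }

    // Two items are equal when their projected keys are equal.
    public bool Equals(T x, T y)
    {
        return object.Equals(_expr(x), _expr(y));
    }

    // Must agree with Equals: hash the projected key, not the item itself.
    public int GetHashCode(T obj)
    {
        var key = _expr(obj);
        return key == null ? 0 : key.GetHashCode();
    }
}

public static class ComparerDemo
{
    private static readonly Regex NumberPrefix = new Regex(@"^(?<num>\d+)\.", RegexOptions.Compiled);

    public static List<string> DistinctByLeadingNumber(IEnumerable<string> items)
    {
        var comparer = new KeyComparer<string>(s => NumberPrefix.Match(s).Groups["num"].Value);
        return items.Where(s => NumberPrefix.IsMatch(s)).Distinct(comparer).ToList();
    }

    public static void Main()
    {
        var list = new[] { "1.one", "2. two", "no number", "2.duplicate", "300. three hundred", "4-ignore this" };
        DistinctByLeadingNumber(list).ForEach(Console.WriteLine);
    }
}
```

If GetHashCode hashed the whole string instead, "2. two" and "2.duplicate" would almost always land in different buckets, Equals would never be called for that pair, and both items would survive Distinct.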

Using lookahead regex

You can use any of these approaches with the lookahead regex @"^\d+(?=\.)".

Just replace the lambda expression that reads the "num" group, s => regex.Match(s).Groups["num"].Value, with one that reads the whole match, s => regex.Match(s).Value

Updated fiddle here.
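
The reason the whole match can be used directly is that a lookahead is zero-width: it checks for the dot without consuming it. A small sketch of the difference between the two patterns (LookaheadDemo and WholeMatch are illustrative names, not part of the answers above):

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Returns the text of the whole match, or an empty string when the pattern does not match.
    public static string WholeMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        const string input = "300. three hundred";

        // Lookahead: the dot is required but is not part of the match.
        Console.WriteLine(WholeMatch(@"^\d+(?=\.)", input));      // 300

        // Named group followed by a literal dot: the whole match includes the dot...
        Console.WriteLine(WholeMatch(@"^(?<num>\d+)\.", input));  // 300.

        // ...so the digits have to be read from the "num" group instead.
        Console.WriteLine(Regex.Match(input, @"^(?<num>\d+)\.").Groups["num"].Value); // 300
    }
}
```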




Answer 2:


(I could mark this as answer too)

This solution works without duplicate regex runs:

var regex = new Regex(@"^\d+(?=\.)", RegexOptions.Compiled);
list.Select(i => {
    var m = regex.Match(i);
    return new KeyValuePair<int, string>( m.Success ? Int32.Parse(m.Value) : -1, i );
})
.Where(i => i.Key > -1)
.GroupBy(i => i.Key)
.Select(g => g.First().Value);

Run it in this fiddle.




Answer 3:


Your solution is good enough.

You can also use LINQ query syntax and the let keyword to avoid running the regex twice per item:

var result =
        from kvp in
        (
            from s in list
            let m = regex.Match(s)
            where m.Success
            select new KeyValuePair<int, string>(int.Parse(m.Value), s)
        )
        group kvp by kvp.Key into gr
        select gr.First().Value;
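
One thing to keep in mind when choosing between the answers: grouping by the parsed int and grouping by the matched text are not quite the same. With an int key, "2" and "02" collapse into one number, while with a string key they stay distinct. The sketch below illustrates this with made-up input and illustrative names (KeyTypeDemo, DistinctByText, DistinctByNumber); note also that int.Parse would throw on a digit run too long for an int.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyTypeDemo
{
    private static readonly Regex NumberPrefix = new Regex(@"^\d+(?=\.)", RegexOptions.Compiled);

    // Groups by the matched digits as text: "2" and "02" are different keys.
    public static List<string> DistinctByText(IEnumerable<string> items)
    {
        return items.Select(s => new { Text = s, Match = NumberPrefix.Match(s) })
                    .Where(x => x.Match.Success)
                    .GroupBy(x => x.Match.Value)
                    .Select(g => g.First().Text)
                    .ToList();
    }

    // Groups by the parsed number: "2" and "02" collapse into one key.
    public static List<string> DistinctByNumber(IEnumerable<string> items)
    {
        return items.Select(s => new { Text = s, Match = NumberPrefix.Match(s) })
                    .Where(x => x.Match.Success)
                    .GroupBy(x => int.Parse(x.Match.Value))
                    .Select(g => g.First().Text)
                    .ToList();
    }

    public static void Main()
    {
        // Made-up input, chosen only to show the difference between the two keys.
        var items = new[] { "2. two", "02. also two", "3. three" };
        Console.WriteLine(string.Join(" | ", DistinctByText(items)));   // 2. two | 02. also two | 3. three
        Console.WriteLine(string.Join(" | ", DistinctByNumber(items))); // 2. two | 3. three
    }
}
```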



Answer 4:


Something like this should work to filter the items, although it does not remove the duplicate numbers:

List<string> c = new List<string>()
{
    "1.one",
    "2. two",
    "no number",
    "2.duplicate",
    "300. three hundred",
    "4-ignore this"
};

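// Keeps only the items that start with a number followed by a dot.
// Duplicate numbers are not removed here.
var matches = c.Where(i =>
{
    var match = Regex.Match(i, @"^\d+(?=\.)");
    return match.Success;
});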


Source: https://stackoverflow.com/questions/25513372/distinct-by-part-of-the-string-in-linq
