What's the most efficient way to check for duplicates in an array of data using Perl?

花落未央 2020-12-05 14:26

I need to see if there are duplicates in an array of strings; what's the most time-efficient way of doing it?

7 Answers
  • 2020-12-05 14:55

    Similar to @Schwern's second solution, but this checks for duplicates a little earlier, from within sort's comparison function:

    use strict;
    use warnings;
    
    # print a value whenever sort compares it against an equal neighbour
    my @sorted = sort { print "dup = $a$/" if $a eq $b; $a cmp $b } @ARGV;
    

    It won't be as fast as the hashing solutions, but it requires less memory and is pretty darn cute.

  • 2020-12-05 14:58

    Please don't ask about the most time-efficient way to do something unless you have a specific requirement, such as "I have to dedupe a list of 100,000 integers in under a second." Otherwise, you're worrying about how long something takes for no reason.

  • 2020-12-05 15:01

    If you need the uniquified array anyway, it is fastest to use the heavily optimized List::MoreUtils module and then compare the result to the original:

    use strict;
    use warnings;
    use List::MoreUtils 'uniq';
    
    my @array = qw(1 1 2 3 fibonacci!);
    my @array_uniq = uniq @array;
    # wrap everything in print's parens; "print (...) . ..." would discard the trailing string
    print(((scalar(@array) == scalar(@array_uniq)) ? "no dupes" : "dupes") . " found!\n");
    

    Or if the list is large and you want to bail as soon as a duplicate entry is found, use a hash:

    my %uniq_elements;
    foreach my $element (@array)
    {
        die "dupe found!" if $uniq_elements{$element}++;
    }
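
    If dying isn't appropriate, the same early-exit idea can be wrapped in a small helper that returns a boolean instead. This is a rough sketch; has_dupes is a made-up name, not part of any module:

    sub has_dupes {
        my %seen;
        for (@_) {
            return 1 if $seen{$_}++;   # bail out on the first repeated element
        }
        return 0;
    }
    
    my @words = qw(a b c a);
    print has_dupes(@words) ? "dupes found\n" : "no dupes found\n";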
    
  • 2020-12-05 15:04

    One of the things I love about Perl is its ability to almost read like English. It just sort of makes sense.

    use strict;
    use warnings;
    
    my @array = qw/yes no maybe true false false perhaps no/;
    
    my %seen;
    
    foreach my $string (@array) {
    
        next unless $seen{$string}++;
        print "'$string' is duplicated.\n";
    }
    

    Output

    'false' is duplicated.
    'no' is duplicated.

  • 2020-12-05 15:08

    Not a direct answer, but this will return an array without duplicates:

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    my @arr = ('a','a','a','b','b','c');
    my %count;
    my @arr_no_dups = grep { !$count{$_}++ } @arr;   # keep only the first occurrence of each element
    
    print @arr_no_dups, "\n";
    
  • 2020-12-05 15:15

    Turning the array into a hash is the fastest way (it's O(n)), though it is memory inefficient. Using a for loop is a bit faster than grep, but I'm not sure why; a grep equivalent is sketched after the loop below for comparison.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    my @array = qw(a b a c b);   # example data; the original snippet assumes @array already exists
    
    my %count;
    my %dups;
    for (@array) {
        $dups{$_}++ if $count{$_}++;
    }
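
    For comparison, a grep version of the same duplicate detection (a sketch, assuming the same @array) looks like this:

    my %seen;
    my @dup_occurrences = grep { $seen{$_}++ } @array;   # every occurrence after the first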
    

    A memory-efficient way is to sort the array in place and iterate through it looking for equal adjacent entries.

    # not exactly sort in place, but Perl does a decent job optimizing it
    @array = sort @array;
    
    my $last;
    my %dups;
    for my $entry (@array) {
        $dups{$entry}++ if defined $last and $entry eq $last;
        $last = $entry;
    }
    

    This is O(n log n) because of the sort, but it only needs to store the duplicates rather than a second copy of the data in %count. Worst-case memory usage is still O(n) (when everything is duplicated), but if your array is large and there aren't a lot of duplicates, you'll win.

    Theory aside, benchmarking shows the latter starts to lose on large arrays (over a million elements or so) with a high percentage of duplicates.
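
    If you want to reproduce that kind of comparison yourself, a minimal sketch using the core Benchmark module might look like this (the data set size and duplicate rate are made up for illustration):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);
    
    # roughly a million entries with lots of duplicates
    my @array = map { int rand 1000 } 1 .. 1_000_000;
    
    cmpthese(-3, {
        hash => sub {
            my (%count, %dups);
            for (@array) { $dups{$_}++ if $count{$_}++ }
        },
        sort => sub {
            my @sorted = sort @array;
            my ($last, %dups);
            for my $entry (@sorted) {
                $dups{$entry}++ if defined $last and $entry eq $last;
                $last = $entry;
            }
        },
    });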
