PowerShell 2 and .NET: Optimize for extremely large hash tables?

情书的邮戳 2021-01-14 12:15

I am dabbling in PowerShell and completely new to .NET.

I am running a PS script that starts with an empty hash table. The hash table will grow to at least 15,000 to

3 answers
  • 2021-01-14 12:28

    So it's a few weeks later, and I wasn't able to come up with the perfect solution. A friend at Google suggested splitting the hash into several smaller hashes. He suggested that each time I went to look up a key, I'd have several misses until I found the right "bucket", but he said the read penalty wouldn't be nearly as bad as the write penalty when the collision algorithm ran to insert entries into the (already giant) hash table.

    I took this idea one step further. I split the hash into 16 smaller buckets. When inserting an email address as a key into the data structures, I first compute a hash of the email address itself and take it mod 16 to get a consistent value between 0 and 15. I then use that calculated value as the "bucket" number.

    So instead of using one giant hash, I actually have a 16-element array, whose elements are hash tables of email addresses.

    Building the in-memory representation of my "master list" of 20,000+ email addresses, using split-up hash table buckets, is now roughly 10 times faster.

    Accessing all of the data in the hashes has no noticeable speed delays. This is the best solution I've been able to come up with so far. It's slightly ugly, but the performance improvement speaks for itself.
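
    The bucket scheme described above might look something like the sketch below. The original code was not posted, so the function names (Get-EmailBucket, Add-Email, Test-Email) and the sample addresses are hypothetical. It assumes keys are lower-cased before hashing, because String.GetHashCode() is case-sensitive while PowerShell's @{} tables are not, and it clears the sign bit because the hash code can be negative.

    ```powershell
    # Sixteen independent hash tables, one per bucket.
    $buckets = @(1..16 | ForEach-Object { @{} })

    # Hypothetical helper: picks the bucket that owns a given address.
    function Get-EmailBucket {
        param([object[]]$Buckets, [string]$Email)
        # Clear the sign bit so the remainder is always in the range 0..15.
        $hash = $Email.ToLowerInvariant().GetHashCode() -band 0x7FFFFFFF
        return $Buckets[$hash % $Buckets.Length]
    }

    # Hypothetical helper: inserts an address unless it is already present.
    function Add-Email {
        param([object[]]$Buckets, [string]$Email)
        $bucket = Get-EmailBucket $Buckets $Email
        if (-not $bucket.ContainsKey($Email)) { $bucket.Add($Email, $true) }
    }

    # Hypothetical helper: looks in exactly one bucket, so a read costs one probe.
    function Test-Email {
        param([object[]]$Buckets, [string]$Email)
        return (Get-EmailBucket $Buckets $Email).ContainsKey($Email)
    }

    Add-Email $buckets 'alice@example.com'
    Add-Email $buckets 'bob@example.com'
    Add-Email $buckets 'Alice@Example.com'   # duplicate, differs only in case

    # Sum the per-bucket counts to get the number of unique addresses.
    $total = 0
    foreach ($b in $buckets) { $total += $b.Count }
    ```

    Because the bucket number is derived from the key, a lookup never has to try several buckets before finding the right one.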

  • 2021-01-14 12:39

    You're going to spend a lot of the CPU time re-allocating the internal 'arrays' in the Hashtable. Have you tried the .NET constructor for Hashtable that takes a capacity?

    $t = New-Object Hashtable 20000
    ...
    if (!($t.ContainsKey($emailString))) { 
        $t.Add($emailString, $emailString) 
    }
    

    My version uses the same $emailString for the key & value, no .NET boxing of $true to an [object] just as a placeholder. The non-null string will evaluate to $true in PowerShell 'if' conditionals, so other code where you check shouldn't change. Your use of '+= @{...}' would be a big no-no in performance sensitive .NET code. You might be allocating a new Hashtable per email just by using the '@{}' syntax, which could be wasting a lot of time.

    Your approach of breaking up the very large collection into a (relatively small) number of smaller collections is called 'sharding'. You should use the Hashtable constructor that takes a capacity even if you're sharding by 16.

    Also, @Larold is right: if you're not looking up the email addresses, then use 'New-Object Collections.ArrayList 20000' to create a pre-allocated list.

    Also, the collections grow exponentially (factor of 1.5 or 2 on each 'growth'). The effect of this is that you should be able to reduce how much you pre-allocate by an order of magnitude, and if the collections resize once or twice per 'data load' you probably won't notice. I would bet it is the first 10-20 generations of 'growth' that are taking the time.
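
    The cost of '+=' mentioned above can be checked directly. As far as I know, adding two hash tables in PowerShell builds a new table and leaves the original untouched, so every '+=' copies all existing entries. A minimal check, with made-up variable names and a made-up address:

    ```powershell
    $original = @{}
    $alias = $original            # second reference to the same table

    $original += @{ 'a@example.com' = $true }

    # $original now points at a freshly built table; $alias still sees the old, empty one.
    $sameObject    = [object]::ReferenceEquals($original, $alias)
    $aliasCount    = $alias.Count
    $originalCount = $original.Count
    ```

    Calling $t.Add(...) on the existing table avoids both the throwaway '@{}' literal and the copy.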

  • 2021-01-14 12:44

    I performed some basic tests using Measure-Command, using a set of 20 000 random words.

    The individual results are shown below, but in summary it appears that adding to one hashtable by first allocating a new hashtable with a single entry is incredibly inefficient :) Although there were some minor efficiency gains among options 2 through 5, in general they all performed about the same.

    If I were to choose, I might lean toward option 5 for its simplicity (just a single Add call per string), but all the alternatives I tested seem viable.

    $chars = [char[]]('a'[0]..'z'[0])
    $words = 1..20KB | foreach {
      $count = Get-Random -Minimum 15 -Maximum 35
      -join (Get-Random $chars -Count $count)
    }
    
    # 1) Original, adding to hashtable with "+=".
    #     TotalSeconds: ~800
    Measure-Command {
      $h = @{}
      $words | foreach { if( $h[$_] -ne $true ) { $h += @{ $_ = $true } } }
    }
    
    # 2) Using sharding among sixteen hashtables.
    #     TotalSeconds: ~3
    Measure-Command {
      [hashtable[]]$hs = 1..16 | foreach { @{} }
      $words | foreach {
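        # GetHashCode() may be negative; PowerShell then indexes from the end
        # of the array, so the lookup stays valid and consistent per key.
        $h = $hs[$_.GetHashCode() % 16]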
        if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) }
      }
    }
    
    # 3) Using ContainsKey and Add on a single hashtable.
    #     TotalSeconds: ~3
    Measure-Command {
      $h = @{}
      $words | foreach { if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) } }
    }
    
    # 4) Using ContainsKey and Add on a hashtable constructed with capacity.
    #     TotalSeconds: ~3
    Measure-Command {
      $h = New-Object Collections.Hashtable( 21KB )
      $words | foreach { if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) } }
    }
    
    # 5) Using HashSet<string> and Add.
    #     TotalSeconds: ~3
    Measure-Command {
      $h = New-Object Collections.Generic.HashSet[string]
      $words | foreach { $null = $h.Add( $_ ) }
    }
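
    One caveat with option 5, sketched under the assumption that email addresses should match regardless of case: HashSet[string] compares case-sensitively by default, unlike @{}. Passing [StringComparer]::OrdinalIgnoreCase to the constructor is one way to get the same behaviour. The variable names and sample addresses here are made up.

    ```powershell
    # Build the set with a case-insensitive comparer instead of the default ordinal one.
    $seen = New-Object 'System.Collections.Generic.HashSet[string]' ([StringComparer]::OrdinalIgnoreCase)

    $firstAdd  = $seen.Add('alice@example.com')    # Add returns $true when the item is new
    $secondAdd = $seen.Add('Alice@Example.com')    # and $false when it is already present
    $found     = $seen.Contains('ALICE@EXAMPLE.COM')
    ```

    Since Add already reports whether the item was new, no separate ContainsKey-style check is needed before inserting.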
    