Sorting gigantic binary files with C#

Backend · open · 4 answers · 1731 views

别那么骄傲 asked 2021-02-09 14:39

I have a large file, roughly 400 GB in size, generated daily by an external closed system. It is a binary file with the following record format:

    byte[8]    timestamp (ticks)
    byte[4]    n, the payload length
    byte[n]    payload data
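
For concreteness, here is a minimal sketch of reading one record with a BinaryReader, assuming the 8-byte field is a little-endian Int64 tick timestamp and the 4-byte field is the payload length n (both assumptions; adjust to the producer's actual encoding):

    using System.IO;

    // One record: byte[8] ticks, byte[4] length n, byte[n] payload (assumed layout).
    public readonly struct Record
    {
        public Record(long ticks, byte[] payload) { Ticks = ticks; Payload = payload; }
        public long Ticks { get; }
        public byte[] Payload { get; }
    }

    public static class RecordReader
    {
        // Returns null at end of stream.
        public static Record? ReadNext(BinaryReader reader)
        {
            if (reader.BaseStream.Position >= reader.BaseStream.Length)
                return null;

            long ticks = reader.ReadInt64();      // byte[8] timestamp
            int n = reader.ReadInt32();           // byte[4] payload length
            byte[] payload = reader.ReadBytes(n); // byte[n] data
            return new Record(ticks, payload);
        }
    }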


        
4 Answers
  • 2021-02-09 15:09

    A great way to speed up this kind of file access is to memory-map the entire file into your address space and let the OS take care of paging in whatever parts of the file it needs. So do the same thing you are doing now, except read from memory instead of using a BinaryReader/seek/read.

    You've got lots of main memory, so this should provide pretty good performance (as long as you're using a 64-bit OS).
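
    A minimal sketch of that idea with System.IO.MemoryMappedFiles, assuming the record layout from the question and a hypothetical file name ticks.bin:

        using System.IO;
        using System.IO.MemoryMappedFiles;

        class MmapScan
        {
            static void Main()
            {
                long length = new FileInfo("ticks.bin").Length;
                using var mmf = MemoryMappedFile.CreateFromFile("ticks.bin", FileMode.Open);
                using var view = mmf.CreateViewAccessor(); // maps the whole file

                long pos = 0;
                while (pos < length)
                {
                    long ticks = view.ReadInt64(pos); // byte[8] timestamp
                    int n = view.ReadInt32(pos + 8);  // byte[4] payload length
                    // The payload occupies [pos + 12, pos + 12 + n); read it only when needed:
                    // var payload = new byte[n]; view.ReadArray(pos + 12, payload, 0, n);
                    pos += 12 + n;                    // skip to the next record
                }
            }
        }

    In a 64-bit process the whole 400 GB file can be mapped at once; the OS pages data in on demand, so only the pages you actually touch consume physical RAM.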

  • 2021-02-09 15:09

    Use merge sort. It's online and parallelizes well.

    http://en.wikipedia.org/wiki/Merge_sort
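
    As an illustration, here is a sketch of the merge phase of an external merge sort: given run files that are each already sorted by tick, a priority queue (available since .NET 6) repeatedly emits the smallest head record. The file handling and ReadNext helper are assumptions, not part of the answer:

        using System;
        using System.Collections.Generic;
        using System.IO;

        static class KWayMerge
        {
            static void Merge(string[] runFiles, string outputFile)
            {
                var readers = Array.ConvertAll(runFiles, f => new BinaryReader(File.OpenRead(f)));
                using var output = new BinaryWriter(File.Create(outputFile));

                // Heap of (source index, record), ordered by the record's tick.
                var heap = new PriorityQueue<(int Src, long Ticks, byte[] Payload), long>();
                for (int i = 0; i < readers.Length; i++)
                    if (ReadNext(readers[i]) is { } r)
                        heap.Enqueue((i, r.Ticks, r.Payload), r.Ticks);

                while (heap.TryDequeue(out var rec, out _))
                {
                    output.Write(rec.Ticks);
                    output.Write(rec.Payload.Length);
                    output.Write(rec.Payload);
                    // Refill from the run that just lost its head record.
                    if (ReadNext(readers[rec.Src]) is { } next)
                        heap.Enqueue((rec.Src, next.Ticks, next.Payload), next.Ticks);
                }
                foreach (var r in readers) r.Dispose();
            }

            // Reads one record (layout from the question), or null at end of stream.
            static (long Ticks, byte[] Payload)? ReadNext(BinaryReader r)
            {
                if (r.BaseStream.Position >= r.BaseStream.Length) return null;
                long t = r.ReadInt64();
                int n = r.ReadInt32();
                return (t, r.ReadBytes(n));
            }
        }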

  • 2021-02-09 15:16

    If you can learn Erlang or Go, they can be very powerful and scale extremely well, given that you have 24 hardware threads. Use async I/O and a merge sort. And since you have 32 GB of RAM, load as much of the file as you can into memory, sort it there, then write the sorted chunk back to disk.
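
    In C#, that "fill RAM, sort, write back" step might look like the following sketch, which reads records until a memory budget is reached, sorts the chunk by tick, and flushes it as a sorted run file. The budget, file names, and layout details are assumptions:

        using System.Collections.Generic;
        using System.IO;

        static class RunWriter
        {
            // Leave headroom below the 32 GB of physical RAM (assumed budget).
            const long BudgetBytes = 24L * 1024 * 1024 * 1024;

            static List<string> WriteSortedRuns(string inputFile)
            {
                var runs = new List<string>();
                using var reader = new BinaryReader(File.OpenRead(inputFile));
                var chunk = new List<(long Ticks, byte[] Payload)>();
                long used = 0;

                while (reader.BaseStream.Position < reader.BaseStream.Length)
                {
                    long t = reader.ReadInt64();
                    int n = reader.ReadInt32();
                    chunk.Add((t, reader.ReadBytes(n)));
                    used += 12 + n;
                    if (used >= BudgetBytes) { Flush(chunk, runs); used = 0; }
                }
                if (chunk.Count > 0) Flush(chunk, runs);
                return runs;
            }

            static void Flush(List<(long Ticks, byte[] Payload)> chunk, List<string> runs)
            {
                chunk.Sort((a, b) => a.Ticks.CompareTo(b.Ticks)); // in-memory sort by tick
                string name = $"run-{runs.Count}.bin";
                using (var w = new BinaryWriter(File.Create(name)))
                    foreach (var (t, p) in chunk) { w.Write(t); w.Write(p.Length); w.Write(p); }
                runs.Add(name);
                chunk.Clear();
            }
        }

    The resulting run files are exactly what the merge phase in the previous answer consumes, and the in-memory sort could be parallelized across the 24 threads (for example with PLINQ) without changing the structure.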

  • 2021-02-09 15:23

    I would do this in several passes. On the first pass, I would scan the ticks to work out how to distribute them evenly into many (hundreds of?) buckets. If you know ahead of time that the ticks are evenly distributed, you can skip this initial pass. On a second pass, I would split the records into these few hundred separate files of about the same size (these much smaller files represent groups of ticks in the order that you want). Then I would sort each file separately in memory, and finally concatenate the files.

    It is somewhat similar to a bucket sort (sometimes called a hash sort), I think.
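
    A sketch of the distribution pass, under the simplifying assumption that ticks are roughly evenly spread between a known minTick and maxTick (which the first pass would establish); the bucket count and file names are illustrative:

        using System.IO;

        static class BucketSplit
        {
            const int Buckets = 256; // illustrative; tune so each bucket fits in RAM

            static void Distribute(string inputFile, long minTick, long maxTick)
            {
                var writers = new BinaryWriter[Buckets];
                for (int i = 0; i < Buckets; i++)
                    writers[i] = new BinaryWriter(File.Create($"bucket-{i}.bin"));

                using var reader = new BinaryReader(File.OpenRead(inputFile));
                double range = (double)(maxTick - minTick) + 1;

                while (reader.BaseStream.Position < reader.BaseStream.Length)
                {
                    long t = reader.ReadInt64();
                    int n = reader.ReadInt32();
                    byte[] payload = reader.ReadBytes(n);

                    // Every record in bucket i has a smaller tick than every record
                    // in bucket i + 1, so the sorted buckets concatenate in order.
                    int b = (int)((t - minTick) / range * Buckets);
                    writers[b].Write(t); writers[b].Write(n); writers[b].Write(payload);
                }
                foreach (var w in writers) w.Dispose();
                // Each bucket-i.bin is then small enough to sort entirely in memory.
            }
        }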
