Sort a file with huge volume of data given memory constraint

暖寄归人 2020-11-28 21:47

Points:

  • We process thousands of flat files a day, concurrently.
  • Memory constraint is a major issue.
  • We use one thread per file being processed.
12 answers
  • 2020-11-28 22:17

    In spite of your restriction, I would use the embedded database SQLite3. Like you, I work every week with flat files of 10-15 million lines, and it is very fast at importing them and generating sorted data. All you need is a small, free executable (sqlite3.exe). For example, once you have downloaded the .exe file, you can do this at a command prompt:

    C:> sqlite3.exe dbLines.db
    sqlite> create table tabLines(line varchar(5000));
    sqlite> create index idx1 on tabLines(line);
    sqlite> .separator '\r\n'
    sqlite> .import 'FileToImport' TabLines
    

    then:

    sqlite> select * from tabLines order by line;
    
    or save to a file:
    sqlite> .output out.txt
    sqlite> select * from tabLines order by line;
    sqlite> .output stdout
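
    The same idea can be sketched from a program instead of the command-line shell. This is a minimal sketch using Python's standard sqlite3 module, assuming one record per line and UTF-8 text; the function name sort_with_sqlite and the default database path are made up for illustration. It skips the index, since ORDER BY alone is enough to get sorted output.

```python
import sqlite3


def sort_with_sqlite(src_path, dst_path, db_path="dbLines.db"):
    """Load every line of src_path into SQLite, then write the lines back sorted."""
    con = sqlite3.connect(db_path)
    try:
        con.execute("DROP TABLE IF EXISTS tabLines")
        con.execute("CREATE TABLE tabLines(line TEXT)")
        with open(src_path, encoding="utf-8") as src:
            # The generator feeds rows one by one, so the input file is
            # streamed rather than loaded into a Python list first.
            con.executemany(
                "INSERT INTO tabLines(line) VALUES (?)",
                ((raw.rstrip("\r\n"),) for raw in src),
            )
        con.commit()
        with open(dst_path, "w", encoding="utf-8") as dst:
            # SQLite does the sorting; rows are fetched and written one at a time.
            for (line,) in con.execute("SELECT line FROM tabLines ORDER BY line"):
                dst.write(line + "\n")
    finally:
        con.close()
```

    The database file lives on disk, so memory use stays bounded by SQLite's own cache settings rather than by the size of the input.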
    
  • 2020-11-28 22:17

    I would spin up an EC2 cluster and run Hadoop's MergeSort.

    Edit: not sure how much detail you would like, or on what. EC2 is Amazon's Elastic Compute Cloud - it lets you rent virtual servers by the hour at low cost. Here is their website.

    Hadoop is an open-source MapReduce framework designed for parallel processing of large data sets. A job is a good candidate for MapReduce when it can be split into subsets that can be processed individually and then merged together, usually by sorting on keys (i.e. the divide-and-conquer strategy). Here is its website.

    As mentioned by the other posters, external sorting is also a good strategy. I think the way I would decide between the two depends on the size of the data and speed requirements. A single machine is likely going to be limited to processing a single file at a time (since you will be using up available memory). So look into something like EC2 only if you need to process files faster than that.

  • 2020-11-28 22:19

    You can read the file in smaller parts, sort each part, and write it to a temporary file. Then you read two of the temporary files sequentially and merge them into a bigger temporary file, and so on. When only one file is left, your file is sorted. Basically that's the merge sort algorithm performed on external files. It scales quite well to arbitrarily large files but causes some extra file I/O.
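
    The merge step of this approach can be sketched as follows. It is only an illustration, assuming both inputs are already sorted text files in which every line (including the last) ends with a newline; the function name merge_sorted_files is hypothetical.

```python
def merge_sorted_files(path_a, path_b, out_path):
    """Merge two sorted text files into out_path, holding one line of each in memory."""
    with open(path_a, encoding="utf-8") as a, \
         open(path_b, encoding="utf-8") as b, \
         open(out_path, "w", encoding="utf-8") as out:
        line_a, line_b = a.readline(), b.readline()
        # readline() returns "" only at end of file, so this loop runs
        # while both inputs still have a pending line.
        while line_a and line_b:
            # Compare without the newline so it cannot influence the order.
            if line_a.rstrip("\n") <= line_b.rstrip("\n"):
                out.write(line_a)
                line_a = a.readline()
            else:
                out.write(line_b)
                line_b = b.readline()
        # One input is exhausted: copy the rest of the other one unchanged.
        while line_a:
            out.write(line_a)
            line_a = a.readline()
        while line_b:
            out.write(line_b)
            line_b = b.readline()
```

    Calling it repeatedly on pairs of temporary files halves the number of files on each pass, until a single sorted file remains.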

    Edit: If you have some knowledge about the likely distribution of the lines in your files, you can employ a more efficient algorithm (distribution sort). Simplified, you read the original file once and write each line to a temporary file that takes only lines with the same first char (or a certain range of first chars). Then you iterate over all the (now small) temporary files in ascending order, sort each of them in memory and append it directly to the output file. If a temporary file turns out to be too big to sort in memory, you can repeat the same process for it based on the 2nd char of the lines, and so on. So if your first partitioning was good enough to produce small enough files, you will have only 100% I/O overhead regardless of how large the file is, but in the worst case it can become much more than with merge sort, whose performance is stable.
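
    A rough sketch of this distribution sort, including the fallback to the next character when a bucket is still too large. The names distribution_sort and max_lines are hypothetical, and the sketch assumes UTF-8 text with few enough distinct characters per position that one open temporary file per character is acceptable.

```python
import itertools
import os
import tempfile


def distribution_sort(src_path, out, max_lines=100_000, depth=0):
    """Append the lines of src_path to the open text file `out` in sorted order.

    All lines in src_path are assumed to share their first `depth` characters.
    """
    with open(src_path, encoding="utf-8") as src:
        head = list(itertools.islice(src, max_lines + 1))
    if len(head) <= max_lines:
        # Small enough: sort in memory and append to the output.
        for line in sorted(raw.rstrip("\n") for raw in head):
            out.write(line + "\n")
        return
    del head

    tmp_dir = tempfile.mkdtemp()
    buckets = {}  # character at position `depth` -> open temporary file
    with open(src_path, encoding="utf-8") as src:
        for raw in src:
            line = raw.rstrip("\n")
            if len(line) <= depth:
                # The line ends here, so it sorts before every longer line
                # sharing the same prefix: emit it right away.
                out.write(line + "\n")
                continue
            key = line[depth]
            if key not in buckets:
                path = os.path.join(tmp_dir, f"{len(buckets)}.tmp")
                buckets[key] = open(path, "w", encoding="utf-8")
            buckets[key].write(line + "\n")
    for f in buckets.values():
        f.close()

    # Visit the buckets in ascending character order; a bucket that is still
    # too large is partitioned again on the next character.
    for key in sorted(buckets):
        path = buckets[key].name
        distribution_sort(path, out, max_lines, depth + 1)
        os.remove(path)
    os.rmdir(tmp_dir)
```

    With a good first partitioning, each line is written once to a bucket and once to the output, which is the 100% I/O overhead mentioned above.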

  • 2020-11-28 22:30

    It looks like what you are looking for is external sorting.

    Basically, you sort small chunks of the data first, write them back to disk, and then iterate over those sorted chunks to merge them into the fully sorted result.

  • 2020-11-28 22:30

    You could use the following divide-and-conquer strategy:

    Create a function H() that assigns each record in the input file a number. For a record r2 that will be sorted behind a record r1, it must return a number for r2 that is at least as large as the number for r1. Use this function to partition all the records into separate files, each small enough to fit into memory so you can sort it. Once you have done that, you can just concatenate the sorted files to get one large sorted file.

    Suppose you have this input file where each line represents a record

    Alan Smith
    Jon Doe
    Bill Murray
    Johnny Cash
    

    Let's just build H() so that it uses the first letter of the record. You might get up to 26 files, but in this example you will get just 3:

    <file1>
    Alan Smith
    
    <file2>
    Bill Murray
    
    <file10>
    Jon Doe
    Johnny Cash
    

    Now you can sort each individual file, which would swap "Jon Doe" and "Johnny Cash" in <file10>. Now, if you just concatenate the 3 files, you'll have a sorted version of the input.

    Note that you divide first and only conquer (sort) later. However, you make sure to do the partitioning in such a way that the resulting parts, which you need to sort, don't overlap. That makes merging the results much simpler.

    The method by which you implement the partitioning function H() depends very much on the nature of your input data. Once you have that part figured out the rest should be a breeze.
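
    To make the role of H() concrete, here is a small sketch in which the partitioning function is passed in as a parameter. The names h_first_letter and partition_sort are hypothetical, and the sketch assumes records are compared as plain strings and that each partition fits in memory.

```python
import os
import tempfile


def h_first_letter(record):
    """Example H(): 1 for 'A', 2 for 'B', ... so 'J' lands in file10."""
    # Empty records sort first, so they get the smallest number.
    return ord(record[0]) - ord("A") + 1 if record else -ord("A")


def partition_sort(src_path, dst_path, h=h_first_letter):
    """Partition records by h(), sort each partition, concatenate the results."""
    tmp_dir = tempfile.mkdtemp()
    parts = {}  # value of h() -> open partition file
    with open(src_path, encoding="utf-8") as src:
        for raw in src:
            record = raw.rstrip("\n")
            n = h(record)
            if n not in parts:
                path = os.path.join(tmp_dir, f"file{n}")
                parts[n] = open(path, "w", encoding="utf-8")
            parts[n].write(record + "\n")
    for f in parts.values():
        f.close()

    with open(dst_path, "w", encoding="utf-8") as dst:
        # Partitions do not overlap, so visiting them in ascending order of
        # h() and sorting each one yields a fully sorted output.
        for n in sorted(parts):
            path = parts[n].name
            with open(path, encoding="utf-8") as part:
                records = sorted(line.rstrip("\n") for line in part)
            dst.writelines(r + "\n" for r in records)
            os.remove(path)
    os.rmdir(tmp_dir)
```

    Swapping in a different h, for example one based on a numeric key range, changes how evenly the records spread across files without touching the rest of the code.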

  • 2020-11-28 22:31

    As others mentioned, you can process in steps.
    I would like to explain this in my own words (it differs on point 3):

    1. Read the file sequentially, process N records at a time in memory (N is arbitrary, depending on your memory constraint and the number T of temporary files that you want).

    2. Sort the N records in memory, write them to a temp file. Repeat until the whole input has been read, which gives you T temp files.

    3. Open all the T temp files at the same time, but read only one record per file (with buffers, of course). Among these T records, find the smallest, write it to the final file, and advance only in the file it came from.


    Advantages:

    • The memory consumption is as low as you want.
    • You only do twice the disk accesses compared with an everything-in-memory policy. Not bad! :-)

    Example with numbers:

    1. Original file with 1 million records.
    2. Choose to have 100 temp files, so read and sort 10 000 records at a time, and drop these in their own temp file.
    3. Open the 100 temp files at once and read the first record of each into memory.
    4. Compare the first records, write the smallest, and advance in that temp file.
    5. Loop on step 4, one million times.
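
    The steps above can be sketched with Python's heapq.merge, which keeps only one pending record per input while merging. This is an illustration under assumptions, not a tuned implementation: the name external_sort is hypothetical, records are lines compared as plain strings, and the number of temp files is assumed to stay below the operating system's open-file limit.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(src_path, dst_path, chunk_size=10_000):
    """Sort src_path into dst_path holding at most chunk_size records in memory."""
    tmp_dir = tempfile.mkdtemp()
    chunk_paths = []
    with open(src_path, encoding="utf-8") as src:
        while True:
            # Steps 1-2: read the next N records, sort them, write a temp file.
            chunk = [raw.rstrip("\n") for raw in itertools.islice(src, chunk_size)]
            if not chunk:
                break
            chunk.sort()
            path = os.path.join(tmp_dir, f"chunk{len(chunk_paths)}.tmp")
            with open(path, "w", encoding="utf-8") as tmp:
                tmp.writelines(rec + "\n" for rec in chunk)
            chunk_paths.append(path)

    # Step 3: open all temp files; heapq.merge repeatedly picks the smallest
    # pending record and advances only in the file it came from.
    files = [open(p, encoding="utf-8") for p in chunk_paths]
    try:
        readers = [(raw.rstrip("\n") for raw in f) for f in files]
        with open(dst_path, "w", encoding="utf-8") as dst:
            for rec in heapq.merge(*readers):
                dst.write(rec + "\n")
    finally:
        for f in files:
            f.close()
    for p in chunk_paths:
        os.remove(p)
    os.rmdir(tmp_dir)
```

    With 1 million records and chunk_size=10_000 this produces the 100 temp files of the example; a smaller chunk_size lowers memory use at the cost of more files to merge.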

    EDITED

    You mentioned a multi-threaded application, so I wonder ...

    As we have seen from these discussions, using less memory gives less performance, by a dramatic factor in this case. So I would also suggest using only one thread to process one sort at a time, rather than a multi-threaded application.

    If you run ten threads, each with a tenth of the available memory, your performance will be miserable: each sort will run at much, much less than a tenth of its original speed. If you use only one thread, queue the other 9 requests and process them in turn, your global performance will be much better, and you will finish the ten tasks much faster.
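
    One way to queue the sorts and run them in turn is a single-worker executor. This is only a sketch of the idea: run_sorts_in_turn is a hypothetical name, and sort_one stands for whatever function sorts a single file.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sorts_in_turn(sort_one, paths):
    """Run sort_one(path) for every path, one at a time, on a single worker thread."""
    # max_workers=1 means submitted jobs wait in the executor's queue and
    # only one sort uses the memory budget at any moment.
    with ThreadPoolExecutor(max_workers=1) as pool:
        futures = [pool.submit(sort_one, p) for p in paths]
        # Collect results in submission order; result() re-raises any error.
        return [f.result() for f in futures]
```

    The rest of the application can keep submitting files from many threads; only the sorting itself is serialized.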


    After reading this response: "Sort a file with huge volume of data given memory constraint", I suggest you consider the distribution sort it describes. It could be a huge gain in your context.

    The improvement over my proposal is that you don't need to open all the temp files at once; you only open one of them at a time. It saves your day! :-)
