Sort a file with a huge volume of data given a memory constraint

暖寄归人 2020-11-28 21:47

Points:

  • We process thousands of flat files in a day, concurrently.
  • The memory constraint is a major issue.
  • We use one thread per file.
12 Answers
  • 2020-11-28 22:31

    You can use an SQLite file database: load the data into the db and then let it sort and return the results for you. Advantage: no need to worry about writing the best sorting algorithm. Disadvantages: you will need disk space, and processing is slower. https://sites.google.com/site/arjunwebworld/Home/programming/sorting-large-data-files
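    A minimal Java sketch of this idea, assuming the sqlite-jdbc driver is on the classpath and that each input line is one record to be sorted as plain text; the file name sort.db and the table name lines are made up for illustration:

        import java.io.BufferedReader;
        import java.io.BufferedWriter;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;
        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class SqliteFileSort {
            public static void main(String[] args) throws Exception {
                Path input = Paths.get(args[0]);
                Path output = Paths.get(args[1]);
                try (Connection conn = DriverManager.getConnection("jdbc:sqlite:sort.db")) {
                    conn.setAutoCommit(false);
                    try (Statement st = conn.createStatement()) {
                        st.execute("CREATE TABLE IF NOT EXISTS lines (value TEXT)");
                    }
                    // Load the flat file into the table with a batched insert.
                    try (BufferedReader in = Files.newBufferedReader(input);
                         PreparedStatement ins =
                                 conn.prepareStatement("INSERT INTO lines (value) VALUES (?)")) {
                        String line;
                        while ((line = in.readLine()) != null) {
                            ins.setString(1, line);
                            ins.addBatch();
                        }
                        ins.executeBatch();
                    }
                    conn.commit();
                    // SQLite does the sorting; we just stream the ordered rows back out.
                    try (Statement st = conn.createStatement();
                         ResultSet rs = st.executeQuery("SELECT value FROM lines ORDER BY value");
                         BufferedWriter out = Files.newBufferedWriter(output)) {
                        while (rs.next()) {
                            out.write(rs.getString(1));
                            out.newLine();
                        }
                    }
                }
            }
        }

    For really large inputs you would flush the insert batch every few thousand rows instead of building one huge batch, so the loader itself stays within the memory budget.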

  • 2020-11-28 22:34

    If your restriction is only to not use an external database system, you could try an embedded database (e.g. Apache Derby). That way, you get all the advantages of a database without any external infrastructure dependencies.

  • 2020-11-28 22:39

    I know you mentioned not using a database, no matter how light... so maybe this is not an option. But what about HSQLDB in memory: load it, sort it with a query, purge it. Just a thought.

  • 2020-11-28 22:40

    Here is a way to do it without heavy in-Java sorting and without using a DB. Assumptions: you have 1 TB of disk space, and the files contain (or start with) a unique number, but are unsorted.

    Split the input into N smaller files.

    Read those N files one by one, and create one file for each line/number.

    Name each file with the corresponding number. While naming, keep a counter updated with the smallest number seen.

    Now the root folder already holds the files keyed for sorting by name; either pause your program and fire an OS command to sort the files by name, or do it programmatically.

    Once the folder is ordered by file name, start from the counter (the smallest number), take each file one by one, append its number to your OUTPUT file, and close it.

    When you are done you will have one large file with sorted numbers.
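    A minimal Java sketch of the mechanics, assuming non-negative, unique integers (one per line); the class and method names are made up, and names are zero-padded so that listing the directory in name order gives numeric order:

        import java.io.BufferedReader;
        import java.io.BufferedWriter;
        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.List;
        import java.util.stream.Collectors;
        import java.util.stream.Stream;

        public class FilePerNumberSort {
            /** inputChunks: the N unsorted pieces; workDir: scratch directory; output: sorted result. */
            public static void sort(List<Path> inputChunks, Path workDir, Path output) throws IOException {
                Files.createDirectories(workDir);
                for (Path chunk : inputChunks) {
                    try (BufferedReader in = Files.newBufferedReader(chunk)) {
                        String line;
                        while ((line = in.readLine()) != null) {
                            long v = Long.parseLong(line.trim());
                            // Zero-pad so lexicographic (name) order equals numeric order.
                            // Assumes unique values, per the answer; createFile fails on a duplicate.
                            Files.createFile(workDir.resolve(String.format("%019d", v)));
                        }
                    }
                }
                // A name-ordered listing of the directory is the sorted sequence.
                try (Stream<Path> names = Files.list(workDir);
                     BufferedWriter out = Files.newBufferedWriter(output)) {
                    List<Path> ordered = names.sorted().collect(Collectors.toList());
                    for (Path p : ordered) {
                        out.write(Long.toString(Long.parseLong(p.getFileName().toString())));
                        out.newLine();
                    }
                }
            }
        }

    Creating and listing millions of tiny files is expensive in practice, so this sketch only shows the name-as-key trick, not a fast implementation.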

  • 2020-11-28 22:44

    You can do it with only two temp files - source and destination - and as little memory as you want. On the first step your source is the original file; on the last step the destination is the result file.

    On each iteration:

    • read a chunk of data, half the size of the buffer, from the source file into a sliding buffer;
    • sort the whole buffer;
    • write the first half of the buffer to the destination file;
    • shift the second half of the buffer to the beginning and repeat.

    Keep a boolean flag that says whether you had to move any records in the current iteration. If the flag remains false, your file is sorted. If it is raised, repeat the process using the destination file as the source.

    Maximum number of iterations: 2 × (file size) / (buffer size)
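    A minimal Java sketch of one reading of this algorithm, assuming one long value per line (an assumption, not stated above). It copies the input to a scratch file so the original is never overwritten, keeps at most 2 * HALF values in memory, and stops once a pass finds its input already sorted:

        import java.io.BufferedReader;
        import java.io.BufferedWriter;
        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.StandardCopyOption;
        import java.util.Arrays;

        public class SlidingBufferSort {
            static final int HALF = 4096; // half the buffer; tune to the available memory

            public static void sort(Path input, Path output) throws IOException {
                Path a = Files.createTempFile("slide-a", ".txt");
                Path b = Files.createTempFile("slide-b", ".txt");
                Files.copy(input, a, StandardCopyOption.REPLACE_EXISTING);
                boolean alreadySorted;
                do {
                    alreadySorted = onePass(a, b);  // window-sort a into b
                    Path t = a; a = b; b = t;       // next pass reads what was just written
                } while (!alreadySorted);
                Files.move(a, output, StandardCopyOption.REPLACE_EXISTING); // a holds the sorted data
                Files.deleteIfExists(b);
            }

            /** One pass: sort overlapping windows of src into dst; report whether src was already sorted. */
            private static boolean onePass(Path src, Path dst) throws IOException {
                long[] buf = new long[2 * HALF];
                int carried = 0;                    // values shifted over from the previous window
                boolean srcSorted = true;
                long prev = Long.MIN_VALUE;
                try (BufferedReader in = Files.newBufferedReader(src);
                     BufferedWriter out = Files.newBufferedWriter(dst)) {
                    while (true) {
                        int filled = carried;
                        String line;
                        // read a chunk half the size of the buffer
                        while (filled < carried + HALF && (line = in.readLine()) != null) {
                            long v = Long.parseLong(line.trim());
                            if (v < prev) srcSorted = false;
                            prev = v;
                            buf[filled++] = v;
                        }
                        boolean eof = filled < carried + HALF;  // readLine hit end of file
                        Arrays.sort(buf, 0, filled);            // sort the whole buffer
                        int keep = eof ? 0 : HALF;              // carry the top half, flush all at EOF
                        int write = filled - keep;              // the first part goes to the destination
                        for (int i = 0; i < write; i++) {
                            out.write(Long.toString(buf[i]));
                            out.newLine();
                        }
                        System.arraycopy(buf, write, buf, 0, keep); // shift the kept half to the front
                        carried = keep;
                        if (eof) break;
                    }
                }
                return srcSorted;
            }
        }

    Because consecutive windows overlap by HALF records, every adjacent out-of-order pair ends up inside some window, so each pass removes at least one inversion and the loop terminates.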

  • 2020-11-28 22:44

    If you can move forward/backward in a file (seek), and rewrite parts of the file, then you should use bubble sort.

    Scan the rows of the file, keeping only two rows in memory at a time, and swap them if they are not in the right order. Repeat the process until there are no more swaps.
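    A minimal Java sketch of this approach, assuming fixed-width records; here the file is treated as a flat sequence of 8-byte big-endian longs read and written through RandomAccessFile (an assumption, not from the answer above):

        import java.io.File;
        import java.io.IOException;
        import java.io.RandomAccessFile;

        public class FileBubbleSort {
            public static void sort(File file) throws IOException {
                try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                    long records = raf.length() / Long.BYTES;
                    boolean swapped = true;
                    while (swapped) {                       // repeat passes until nothing moves
                        swapped = false;
                        for (long i = 0; i + 1 < records; i++) {
                            raf.seek(i * Long.BYTES);
                            long a = raf.readLong();        // record i
                            long b = raf.readLong();        // record i + 1
                            if (a > b) {                    // out of order: write them back swapped
                                raf.seek(i * Long.BYTES);
                                raf.writeLong(b);
                                raf.writeLong(a);
                                swapped = true;
                            }
                        }
                    }
                }
            }
        }

    Only two values are ever in memory, but the number of seeks grows quadratically with the record count, so this is mainly illustrative.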
