What is the fastest way to read a 10 GB file from disk?

你的背包 2021-02-20 01:19

We need to read and count different types of messages/run some statistics on a 10 GB text file, e.g. a FIX engine log. We use Linux, 32-bit, 4 CPUs, Intel, coding in Perl, but the

13 answers
  • 2021-02-20 01:51

    Perhaps you've already read this forum thread, but if not:

    http://www.perlmonks.org/?node_id=512221

    It describes using Perl to do it line-by-line, and the users seem to think Perl is quite capable of it.
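
    For concreteness, here is a minimal line-by-line counting sketch in Perl (the file name and the "35=" MsgType pattern are assumptions about your log format, not something from that thread):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Count message types in a FIX-style log, one line at a time.
        # 'fix_engine.log' and the 35= (MsgType) regex are assumed; adjust to taste.
        my %count;
        open my $fh, '<', 'fix_engine.log' or die "open: $!";
        while (my $line = <$fh>) {
            # Fields may be SOH- or pipe-delimited.
            $count{$1}++ if $line =~ /\b35=([^\x01|]+)/;
        }
        close $fh;
        printf "%-6s %d\n", $_, $count{$_} for sort keys %count;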

    Oh, is it possible to process the file from a RAID array? If you have several mirrored disks, the read speed can be improved. Contention for disk resources may be why your multi-threaded attempt doesn't work.

    Best of luck.

  • 2021-02-20 01:51

    Parse the file once, reading line by line. Put the results in a table in a decent database. Run as many queries as you wish. Feed the beast regularly with new incoming data.

    Realize that manipulating a 10 GB file, transferring it across the network (even a local one), exploring complicated solutions, etc., all take time.
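
    A rough sketch of that approach with DBI and SQLite (the table layout, file name, and tag-extraction regex are my assumptions, not part of the answer):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # Parse the log once, line by line, and load it into SQLite;
        # run whatever queries you like afterwards.
        my $dbh = DBI->connect('dbi:SQLite:dbname=fixlog.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });
        $dbh->do('CREATE TABLE IF NOT EXISTS messages (msg_type TEXT, line_no INTEGER)');
        my $ins = $dbh->prepare('INSERT INTO messages (msg_type, line_no) VALUES (?, ?)');

        open my $fh, '<', 'fix_engine.log' or die "open: $!";
        while (my $line = <$fh>) {
            next unless $line =~ /\b35=([^\x01|]+)/;   # assumed MsgType field
            $ins->execute($1, $.);
            $dbh->commit if $. % 100_000 == 0;         # commit in batches
        }
        close $fh;
        $dbh->commit;

        # Example query: message counts by type.
        my $rows = $dbh->selectall_arrayref(
            'SELECT msg_type, COUNT(*) FROM messages GROUP BY msg_type');
        print "$_->[0]: $_->[1]\n" for @$rows;
        $dbh->disconnect;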

  • 2021-02-20 01:52

    Basically you need to "divide and conquer": if you have a network of computers, copy the 10 GB file to as many client PCs as possible and have each client PC read a different offset of the file. For a bonus, have EACH PC use multiple threads on top of the distributed reading.
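
    A rough sketch of the per-worker slice read in Perl (the argument handling is illustrative; each worker realigns to a line boundary so no record is split, and on 32-bit you need a Perl built with large-file support):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Each worker processes bytes [$start, $end) of the file, realigned to
        # line boundaries so no record is counted twice or split between workers.
        my ($file, $worker_id, $n_workers) = @ARGV;   # e.g. fix_engine.log 0 4
        my $size  = -s $file or die "cannot stat $file";
        my $chunk = int($size / $n_workers);
        my $start = $worker_id * $chunk;
        my $end   = ($worker_id == $n_workers - 1) ? $size : $start + $chunk;

        open my $fh, '<', $file or die "open: $!";
        if ($start > 0) {
            seek $fh, $start - 1, 0;
            <$fh>;   # discard the partial line; the previous worker owns it
        }
        while (tell($fh) < $end and defined(my $line = <$fh>)) {
            # ... count / collect statistics on $line here ...
        }
        close $fh;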

  • 2021-02-20 01:55

    Have you thought of streaming the file and filtering any interesting results out to a secondary file? (Repeat until you have a manageable-size file.)
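
    For instance, a one-pass filter in Perl (the "interesting" pattern and the file names are placeholders):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Stream the big log once, keeping only the lines of interest,
        # so later passes work on a much smaller file.
        open my $in,  '<', 'fix_engine.log'  or die "open: $!";
        open my $out, '>', 'interesting.log' or die "open: $!";
        while (my $line = <$in>) {
            print {$out} $line if $line =~ /35=(?:8|D)\b/;   # placeholder pattern
        }
        close $in;
        close $out;

    The same idea as a one-liner: perl -ne 'print if /35=(?:8|D)/' fix_engine.log > interesting.log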

  • 2021-02-20 01:56

    Most of the time you will be I/O bound, not CPU bound, so just read this file through normal Perl I/O and process it in a single thread. Unless you can prove that you do more I/O than a single CPU can handle, don't waste your time on anything more. Anyway, you should ask: why on Earth is this in one huge file? Why on Earth don't they split it up in a reasonable way when they generate it? That would be an order of magnitude more worthwhile work. Then you could put the pieces on separate I/O channels and use more CPUs (if you don't use some sort of RAID 0 or NAS or ...).

    Measure, don't assume. Don't forget to flush the caches before each test. Remember that sequential I/O is an order of magnitude faster than random I/O.
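
    A small sketch of what such a measurement might look like (dropping the page cache is Linux-specific and needs root; the file name and per-line work are placeholders):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);

        # Time one sequential, single-threaded pass over the file.
        # Drop the page cache first so the test measures real disk reads,
        # not data still cached from a previous run.
        system('sync; echo 3 > /proc/sys/vm/drop_caches') == 0
            or warn "could not drop caches (not running as root?)\n";

        my $t0    = [gettimeofday];
        my $bytes = 0;
        open my $fh, '<', 'fix_engine.log' or die "open: $!";
        while (my $line = <$fh>) {
            $bytes += length $line;
            # ... per-line processing goes here ...
        }
        close $fh;

        my $secs = tv_interval($t0);
        printf "%.1f MB in %.1f s (%.1f MB/s)\n",
               $bytes / 1e6, $secs, $bytes / 1e6 / $secs;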

  • 2021-02-20 01:56

    Hmmm, but what's wrong with the read() call in C? It usually has a 2 GB limit, so just call it 5 times in sequence. That should be fairly fast.
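
    Since the question is about Perl, the closest equivalent there is sysread, which calls read(2) directly and skips Perl's line-oriented buffering; a rough sketch that loops over large chunks (the 64 MB buffer size is an arbitrary choice):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Read the file in large raw chunks via sysread (a thin wrapper over
        # read(2)); looping also sidesteps any per-call size limit.
        my $chunk_size = 64 * 1024 * 1024;   # 64 MB per call, arbitrary
        my $total      = 0;
        my $buf;

        open my $fh, '<:raw', 'fix_engine.log' or die "open: $!";
        while (my $n = sysread($fh, $buf, $chunk_size)) {
            $total += $n;
            # ... scan $buf here; beware of records split across chunk boundaries ...
        }
        close $fh;
        print "read $total bytes\n";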
