Advice on handling large data volumes

鱼传尺愫 2020-12-14 05:10

So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once.

11 answers
  • 2020-12-14 05:23

    You might want to have a look at the entries in the Wide Finder Project (do a Google search for "wide finder" java).

    The Wide Finder task involves reading over lots of lines in log files, so look at the Java implementations and see what worked and what didn't work there.

  • 2020-12-14 05:25

    If your numerical data is regularly sampled and you need random access, consider storing it in a quadtree.
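    For instance, a bare-bones point quadtree in Java might look like the sketch below. The 2D double coordinates, the fixed node capacity, and the lack of duplicate-point handling are all simplifying assumptions here:

        import java.util.ArrayList;
        import java.util.List;

        // Minimal point quadtree: each node covers a square cell and splits
        // into four children once it holds more than CAPACITY points.
        class QuadTree {
            private static final int CAPACITY = 4;
            private final double cx, cy, half;        // cell center and half-width
            private final List<double[]> points = new ArrayList<>();
            private QuadTree[] children;              // null until subdivided

            QuadTree(double cx, double cy, double half) {
                this.cx = cx; this.cy = cy; this.half = half;
            }

            boolean insert(double x, double y) {
                if (Math.abs(x - cx) > half || Math.abs(y - cy) > half) {
                    return false;                     // point lies outside this cell
                }
                if (children == null) {
                    if (points.size() < CAPACITY) {
                        points.add(new double[] {x, y});
                        return true;
                    }
                    subdivide();                      // over capacity: split the cell
                }
                for (QuadTree child : children) {
                    if (child.insert(x, y)) return true;
                }
                return false;                         // unreachable for in-bounds points
            }

            private void subdivide() {
                double h = half / 2;
                children = new QuadTree[] {
                    new QuadTree(cx - h, cy - h, h), new QuadTree(cx + h, cy - h, h),
                    new QuadTree(cx - h, cy + h, h), new QuadTree(cx + h, cy + h, h)
                };
                for (double[] p : points) {           // redistribute existing points
                    for (QuadTree child : children) {
                        if (child.insert(p[0], p[1])) break;
                    }
                }
                points.clear();
            }
        }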

  • 2020-12-14 05:28

    If at all possible, get the data into a database. Then you can leverage all the indexing, caching, memory pinning, and other functionality available to you there.
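    As a rough sketch of that route, you could bulk-load the files with JDBC batching. The embedded H2 database, the samples table, and the one-value-per-line format are assumptions for illustration:

        import java.io.BufferedReader;
        import java.nio.file.Files;
        import java.nio.file.Paths;
        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;
        import java.sql.Statement;

        public class BulkLoad {
            public static void main(String[] args) throws Exception {
                try (Connection conn = DriverManager.getConnection("jdbc:h2:./data");
                     BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
                    try (Statement st = conn.createStatement()) {
                        st.execute("CREATE TABLE IF NOT EXISTS samples(value DOUBLE)");
                    }
                    conn.setAutoCommit(false);        // commit in chunks, not per row
                    try (PreparedStatement ps =
                             conn.prepareStatement("INSERT INTO samples(value) VALUES (?)")) {
                        String line;
                        int pending = 0;
                        while ((line = in.readLine()) != null) {
                            ps.setDouble(1, Double.parseDouble(line.trim()));
                            ps.addBatch();
                            if (++pending % 10_000 == 0) {
                                ps.executeBatch();    // flush a chunk to the database
                                conn.commit();
                            }
                        }
                        ps.executeBatch();            // flush the remainder
                        conn.commit();
                    }
                }
            }
        }

    Once the data is in, an index on the table gives you the random access that a flat file alone can't.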

  • 2020-12-14 05:30

    So then, what if the processing requires jumping around in the data across multiple files and multiple buffers? Is the constant opening and closing of binary files going to become expensive?

    I'm a big fan of memory-mapped I/O, a.k.a. direct byte buffers. In Java they are called mapped byte buffers and are part of java.nio. (Basically, this mechanism uses the OS's virtual-memory paging system to 'map' your files and present them programmatically as byte buffers. The OS will manage moving the bytes to/from disk and memory auto-magically and very quickly.)

    I suggest this approach because a) it works for me, and b) it will let you focus on your algorithm and let the JVM, OS, and hardware deal with the performance optimization. All too frequently, they know what is best more so than us lowly programmers. ;)

    How would you use MBBs in your context? Just create an MBB for each of your files and read them as you see fit. You will only need to store your results.
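    A minimal sketch of that approach is below. Counting newlines is just placeholder processing; note that a single mapping is limited to Integer.MAX_VALUE bytes, so for files over 2 GB you would map them in windows:

        import java.io.IOException;
        import java.nio.MappedByteBuffer;
        import java.nio.channels.FileChannel;
        import java.nio.file.Paths;
        import java.nio.file.StandardOpenOption;

        public class MappedScan {
            public static void main(String[] args) throws IOException {
                try (FileChannel channel = FileChannel.open(Paths.get(args[0]),
                                                            StandardOpenOption.READ)) {
                    // Map the whole file read-only; the OS pages bytes in as we touch them.
                    MappedByteBuffer buf =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                    long lines = 0;
                    while (buf.hasRemaining()) {
                        if (buf.get() == '\n') lines++;   // placeholder per-byte processing
                    }
                    System.out.println(args[0] + ": " + lines + " lines");
                }
            }
        }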

    BTW: How much data are you dealing with, in GB? If it is more than 3-4 GB, then this won't work for you on a 32-bit machine, as the MBB implementation is dependent on the addressable memory space of the platform architecture. A 64-bit machine & OS will take you to 1 TB or 128 TB of mappable data.

    If you are thinking about performance, then know of Kirk Pepperdine (a somewhat famous Java performance guru). He is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details (NIO Performance Tips) and other Java performance related things.

  • 2020-12-14 05:33

    I strongly recommend leveraging regular expressions and looking into the "new" I/O package, java.nio, for faster input. Then it should go about as quickly as you can realistically expect gigabytes of data to go.
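    For instance, something along these lines, assuming whitespace-separated decimal numbers (the pattern and the summing are placeholders for your real parsing):

        import java.io.BufferedReader;
        import java.io.IOException;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.Files;
        import java.nio.file.Paths;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class RegexScan {
            // Compile once, reuse for every line; matches integers and simple decimals.
            private static final Pattern NUMBER = Pattern.compile("-?\\d+(?:\\.\\d+)?");

            public static void main(String[] args) throws IOException {
                double sum = 0;
                try (BufferedReader in = Files.newBufferedReader(
                        Paths.get(args[0]), StandardCharsets.US_ASCII)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        Matcher m = NUMBER.matcher(line);
                        while (m.find()) {
                            sum += Double.parseDouble(m.group()); // placeholder processing
                        }
                    }
                }
                System.out.println("sum = " + sum);
            }
        }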

  • 2020-12-14 05:34

    This depends a lot on the data in the file. Big mainframes have been doing sequential data processing for a long time, but they don't normally use random access for the data. They just pull it in a line at a time and process that much before continuing.

    For random access, it is often best to build objects with caching wrappers that know where in the file the data they need to construct lives. When needed, they read that data in and construct themselves. This way, when memory is tight, you can just start killing things off without worrying too much about not being able to get them back later.
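    A sketch of such a wrapper, using a SoftReference so the garbage collector can do the "killing off" under memory pressure. The path/offset/length bookkeeping is assumed to come from an earlier indexing pass over the files:

        import java.io.IOException;
        import java.io.RandomAccessFile;
        import java.lang.ref.SoftReference;

        // A record that remembers where its bytes live on disk and caches the
        // loaded payload softly, so it can be reclaimed when memory is tight
        // and transparently re-read later.
        class LazyRecord {
            private final String path;
            private final long offset;
            private final int length;
            private SoftReference<byte[]> cache = new SoftReference<>(null);

            LazyRecord(String path, long offset, int length) {
                this.path = path;
                this.offset = offset;
                this.length = length;
            }

            byte[] data() throws IOException {
                byte[] bytes = cache.get();
                if (bytes == null) {                  // evicted by GC, or never loaded
                    bytes = new byte[length];
                    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                        raf.seek(offset);             // jump straight to our slice
                        raf.readFully(bytes);
                    }
                    cache = new SoftReference<>(bytes); // re-cache for later calls
                }
                return bytes;
            }
        }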
