large-data-volumes

Processing Apache logs quickly

南笙酒味 submitted on 2019-11-29 20:51:08
Question: I'm currently running an awk script to process a large (8.1GB) access-log file, and it's taking forever to finish. In 20 minutes, it wrote 14MB of the (1000 ± 500)MB I expect it to write, and I wonder if I can process it much faster somehow. Here is the awk script: #!/bin/bash awk '{t=$4" "$5; gsub("[\[\]\/]"," ",t); sub(":"," ",t);printf("%s,",$1);system("date -d \""t"\" +%s");}' $1 EDIT: For non-awkers, the script reads each line, gets the date information, modifies it to a format the
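The likely bottleneck is that the script forks a `date` process for every single line; parsing the timestamp in-process removes that cost entirely. A hedged sketch of the same single-pass idea in Java (the class name, the field positions, and the Apache common-log timestamp format are assumptions, not taken from the question):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class LogEpoch {
    // Apache common-log timestamps look like [10/Oct/2019:13:55:36 -0700]
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    // Turn one log line into "ip,epoch-seconds" without forking a process.
    static String toCsv(String line) {
        String[] f = line.split(" ");
        // f[3] is "[10/Oct/2019:13:55:36" and f[4] is "-0700]" in common-log format
        String ts = f[3].substring(1) + " " + f[4].substring(0, f[4].length() - 1);
        return f[0] + "," + OffsetDateTime.parse(ts, FMT).toEpochSecond();
    }

    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(toCsv(line));
            }
        }
    }
}
```

The same fix is available inside awk itself (gawk's built-in mktime avoids the per-line fork); the point is to do the date conversion in the same process that reads the file.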

Handling large records in a Java EE application

南楼画角 submitted on 2019-11-29 02:44:30
There is a table phonenumbers with two columns: id , and number . There are about half a million entries in the table. Database is MySQL . The requirement is to develop a simple Java EE application, connected to that database, that allows a user to download all number values in comma separated style by following a specific URL. If we get all the values in a huge String array and then concatenate them (with commas in between all the values) into a String and then send it down to the user, does that sound like a proper solution? The application is not public and will be used by a limited number of people.
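Rather than materializing half a million numbers as one huge String, it is usually safer to stream each value to the response writer as it comes off the ResultSet, so memory use stays constant. A hedged sketch of the comma-joining core (in the real servlet, `out` would be the response's writer and `values` an iterator over the JDBC ResultSet; both names here are illustrative):

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Iterator;

public class CsvStreamer {
    // Write values as "a,b,c" without ever building the full string in memory;
    // only one value and one separator are held at a time.
    static void writeCsv(Iterator<String> values, Writer out) throws IOException {
        boolean first = true;
        while (values.hasNext()) {
            if (!first) out.write(',');   // comma between values, none trailing
            out.write(values.next());
            first = false;
        }
        out.flush();
    }
}
```

Wired into a servlet, the `doGet` would open a Statement, execute `SELECT number FROM phonenumbers`, and feed rows into `writeCsv` via `response.getWriter()`, so the container streams the output while the query is still being read.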

Advice on handling large data volumes

时光怂恿深爱的人放手 submitted on 2019-11-28 19:55:46
So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once. Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading. Should I load everything into memory all at once? If not, what's a good way of loading the data partially? What are some Java-relevant efficiency tips? Answer (Stu Thompson): So then what if the processing requires jumping around in the data for multiple files and multiple buffers? Is
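For a strictly sequential pass, a plain buffered reader is usually enough: the OS read-ahead keeps the disk busy and the heap only ever holds one line. A hedged sketch (the summing operation and method names are illustrative, standing in for whatever processing the program does per value):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SequentialScan {
    // Stream a multi-gigabyte ASCII file without loading it into memory:
    // BufferedReader keeps I/O sequential and only one line is resident.
    static double sum(Path file) throws IOException {
        double total = 0;
        try (BufferedReader in = Files.newBufferedReader(file)) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String tok : line.trim().split("\\s+")) {
                    if (!tok.isEmpty()) total += Double.parseDouble(tok);
                }
            }
        }
        return total;
    }
}
```

If the follow-up concern (jumping around across files) applies, `java.nio.channels.FileChannel.map` gives random access to a file region without reading the whole file, at the cost of more complex code.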

Efficiently storing 7.300.000.000 rows

自作多情 submitted on 2019-11-28 16:14:22
How would you tackle the following storage and retrieval problem? Roughly 2.000.000 rows will be added each day (365 days/year) with the following information per row: id (unique row identifier) entity_id (takes on values between 1 and 2.000.000 inclusive) date_id (incremented by one each day - will take on values between 1 and 3.650 (ten years: 1*365*10)) value_1 (takes on values between 1 and 1.000.000 inclusive) value_2 (takes on values between 1 and 1.000.000 inclusive) entity_id combined with date_id is unique. Hence, at most one row per entity and date can be added to the table. The
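Before choosing a storage engine, a back-of-envelope sizing from the numbers in the question is worth doing. A hedged calculation (the per-field byte widths are an assumption based on the stated value ranges, and it assumes the surrogate id can be dropped because (entity_id, date_id) is already unique):

```java
public class Sizing {
    public static void main(String[] args) {
        long rowsPerDay = 2_000_000L;
        long days = 3_650L;                 // ten years at 365 days/year
        long rows = rowsPerDay * days;      // the 7.300.000.000 rows in the title

        // Minimal fixed-width row: entity_id 4 bytes (<= 2.000.000),
        // date_id 2 bytes (<= 3.650), value_1 and value_2 4 bytes each.
        long bytesPerRow = 4 + 2 + 4 + 4;
        long totalBytes = rows * bytesPerRow;

        System.out.println(rows);
        System.out.println(totalBytes / (1L << 30)); // GiB, before any indexes
    }
}
```

Roughly 95 GiB of raw data before indexes; since date_id only grows, partitioning by date and treating (entity_id, date_id) as the clustered key are the natural levers.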

Transferring large payloads of data (Serialized Objects) using wsHttp in WCF with message security

自作多情 submitted on 2019-11-28 04:34:59
I have a case where I need to transfer large amounts of serialized object graphs (via NetDataContractSerializer ) over WCF using wsHttp. I'm using message security and would like to continue to do so. Using this setup I would like to transfer a serialized object graph which can sometimes approach around 300MB, but when I try to do so I've started seeing an exception of type System.InsufficientMemoryException. After a little research, it appears that by default in WCF the result of a service call is contained within a single message which contains the serialized data and
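With buffered transfer (the default, and effectively required when message security is in use, since fully streamed transfer relies on transport security), the commonly suggested mitigations are raising the message-size and reader quotas and switching to MTOM encoding so large binary payloads are not base64-inflated. A hedged sketch of what such a binding configuration might look like (the binding name and the 500MB limits are illustrative, not from the question):

```xml
<wsHttpBinding>
  <binding name="largeGraphBinding"
           messageEncoding="Mtom"
           maxReceivedMessageSize="524288000">
    <readerQuotas maxArrayLength="524288000"
                  maxStringContentLength="524288000" />
    <security mode="Message" />
  </binding>
</wsHttpBinding>
```

This only raises the ceiling; for payloads of this size, chunking the object graph across several smaller service calls is often the more robust design.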

Advice on handling large data volumes

◇◆丶佛笑我妖孽 submitted on 2019-11-27 20:36:01
Question: So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once. Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading. Should I load everything into memory all at once? If not, what's a good way of loading the data partially? What are some Java-relevant efficiency tips? Answer 1: So then what if

Designing a web crawler

戏子无情 submitted on 2019-11-27 16:36:27
I have come across an interview question: "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it. How does it all begin in the first place? Say Google started with some hub pages, say hundreds of them (how these hub pages were found in the first place is a different sub-question). As Google follows links from a page and so on, does it keep making a hash table to make sure that it doesn't follow the earlier visited pages? What if the same page has 2 names (URLs), say in these days when we have URL shorteners etc.? I have taken Google
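The hash-table intuition in the question is essentially right: keep a visited set keyed on a canonical form of each URL, and never enqueue a key already in the set. A hedged sketch (the `fetchLinks` and `canonicalize` functions are placeholders for real page fetching and URL normalization; shortened URLs only collide after redirects are resolved, which string normalization alone cannot do):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class Crawler {
    // Breadth-first crawl that never revisits a page: every URL is run
    // through canonicalize before the visited-set check, so two names
    // for the same page collide on one key.
    static List<String> crawl(String seed,
                              Function<String, List<String>> fetchLinks,
                              Function<String, String> canonicalize,
                              int limit) {
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        List<String> order = new ArrayList<>();
        frontier.add(canonicalize.apply(seed));
        while (!frontier.isEmpty() && order.size() < limit) {
            String url = frontier.poll();
            if (!visited.add(url)) continue; // already seen: this breaks cycles
            order.add(url);
            for (String link : fetchLinks.apply(url)) {
                String c = canonicalize.apply(link);
                if (!visited.contains(c)) frontier.add(c);
            }
        }
        return order;
    }
}
```

Real crawlers additionally fingerprint page content (e.g. a hash of the body) to catch the many-URLs-one-page case that URL canonicalization misses.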

How to avoid OOM (Out of memory) error when retrieving all records from huge table?

天大地大妈咪最大 submitted on 2019-11-27 14:57:52
Question: I am given a task to convert a huge table to a custom XML file. I will be using Java for this job. If I simply issue a "SELECT * FROM customer", it may return a huge amount of data, eventually causing OOM. I wonder, is there a way I can process each record immediately once it becomes available, and remove it from memory after that, during the SQL retrieval process? --- edited on 13 Jul 2009 Let me elaborate my question. I have 1 db server and 1 application server. When I issue a select
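Yes: with the MySQL Connector/J driver, a forward-only, read-only Statement with fetch size set to Integer.MIN_VALUE asks the driver to stream the result row by row instead of buffering the whole table in the client. A hedged sketch (a driver-specific convention; the Consumer stands in for whatever writes one XML record and then discards it):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.function.Consumer;

public class StreamCustomers {
    // Process each row as it arrives instead of buffering the whole table.
    // fetchSize = Integer.MIN_VALUE is MySQL Connector/J's signal for
    // row-by-row streaming; other drivers use a positive fetch size.
    static void export(Connection conn, Consumer<String> handleRow)
            throws SQLException {
        try (Statement st = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            st.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = st.executeQuery("SELECT * FROM customer")) {
                while (rs.next()) {
                    // write one XML record, then let the row be garbage-collected
                    handleRow.accept(rs.getString(1));
                }
            }
        }
    }
}
```

One caveat with streaming result sets: the connection is tied up until the ResultSet is fully read or closed, so the export should not share its connection with other work.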

Efficiently storing 7.300.000.000 rows

倾然丶 夕夏残阳落幕 submitted on 2019-11-27 09:37:28
Question: How would you tackle the following storage and retrieval problem? Roughly 2.000.000 rows will be added each day (365 days/year) with the following information per row: id (unique row identifier) entity_id (takes on values between 1 and 2.000.000 inclusive) date_id (incremented by one each day - will take on values between 1 and 3.650 (ten years: 1*365*10)) value_1 (takes on values between 1 and 1.000.000 inclusive) value_2 (takes on values between 1 and 1.000.000 inclusive) entity_id

Is it possible to change argv or do I need to create an adjusted copy of it?

寵の児 submitted on 2019-11-27 04:45:29
My application potentially has a huge number of arguments passed in, and I want to avoid the memory hit of duplicating the arguments into a filtered list. I would like to filter them in place, but I am pretty sure that messing with the argv array itself, or any of the data it points to, is probably not advisable. Any suggestions? Once argv has been passed into the main method, you can treat it like any other C array - change it in place as you like; just be aware of what you're doing with it. The contents of the array don't have an effect on the return code or execution of the program other than