large-data-volumes

Processing Apache logs quickly

南笙酒味 submitted on 2019-11-29 20:51:08
Question: I'm currently running an awk script to process a large (8.1GB) access-log file, and it's taking forever to finish. In 20 minutes, it wrote 14MB of the (1000 ± 500)MB I expect it to write, and I wonder if I can process it much faster somehow. Here is the awk script: #!/bin/bash awk '{t=$4" "$5; gsub("[\[\]\/]"," ",t); sub(":"," ",t);printf("%s,",$1);system("date -d \""t"\" +%s");}' $1 EDIT: For non-awkers, the script reads each line, gets the date information, modifies it to a format the
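The likely bottleneck is that the script forks a `date` process for every single line; parsing the timestamp in-process removes that cost entirely. A hedged sketch of the same single-pass idea in Java (the class name, the field positions, and the Apache common-log timestamp format are assumptions, not taken from the question):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class LogEpoch {
    // Apache common-log timestamps look like [10/Oct/2019:13:55:36 -0700]
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    // Turn one log line into "ip,epoch-seconds" without forking a process.
    static String toCsv(String line) {
        String[] f = line.split(" ");
        // f[3] is "[10/Oct/2019:13:55:36" and f[4] is "-0700]" in common-log format
        String ts = f[3].substring(1) + " " + f[4].substring(0, f[4].length() - 1);
        return f[0] + "," + OffsetDateTime.parse(ts, FMT).toEpochSecond();
    }

    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(toCsv(line));
            }
        }
    }
}
```

The same fix is available inside awk itself (gawk's built-in mktime avoids the per-line fork); the point is to do the date conversion in the same process that reads the file.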

Handling large records in a Java EE application

南楼画角 submitted on 2019-11-29 02:44:30
There is a table phonenumbers with two columns: id , and number . There are about half a million entries in the table. Database is MySQL . The requirement is to develop a simple Java EE application, connected to that database, that allows a user to download all number values in comma separated style by following a specific URL. If we get all the values in a huge String array and then concatenate them (with commas in between all the values) into a String and then send it down to the user, does that sound like a proper solution? The application is not public and will be used by a limited number of people.
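Rather than materializing half a million numbers as one huge String, it is usually safer to stream each value to the response writer as it comes off the ResultSet, so memory use stays constant. A hedged sketch of the comma-joining core (in the real servlet, `out` would be the response's writer and `values` an iterator over the JDBC ResultSet; both names here are illustrative):

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Iterator;

public class CsvStreamer {
    // Write values as "a,b,c" without ever building the full string in memory;
    // only one value and one separator are held at a time.
    static void writeCsv(Iterator<String> values, Writer out) throws IOException {
        boolean first = true;
        while (values.hasNext()) {
            if (!first) out.write(',');   // comma between values, none trailing
            out.write(values.next());
            first = false;
        }
        out.flush();
    }
}
```

Wired into a servlet, the `doGet` would open a Statement, execute `SELECT number FROM phonenumbers`, and feed rows into `writeCsv` via `response.getWriter()`, so the container streams the output while the query is still being read.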

Advice on handling large data volumes

时光怂恿深爱的人放手 submitted on 2019-11-28 19:55:46
So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once. Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading. Should I load everything into memory all at once? If not, what's a good way of loading the data partially? What are some Java-relevant efficiency tips? Answer (Stu Thompson): So then what if the processing requires jumping around in the data for multiple files and multiple buffers? Is
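For a strictly sequential pass, a plain buffered reader is usually enough: the OS read-ahead keeps the disk busy and the heap only ever holds one line. A hedged sketch (the summing operation and method names are illustrative, standing in for whatever processing the program does per value):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SequentialScan {
    // Stream a multi-gigabyte ASCII file without loading it into memory:
    // BufferedReader keeps I/O sequential and only one line is resident.
    static double sum(Path file) throws IOException {
        double total = 0;
        try (BufferedReader in = Files.newBufferedReader(file)) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String tok : line.trim().split("\\s+")) {
                    if (!tok.isEmpty()) total += Double.parseDouble(tok);
                }
            }
        }
        return total;
    }
}
```

If the follow-up concern (jumping around across files) applies, `java.nio.channels.FileChannel.map` gives random access to a file region without reading the whole file, at the cost of more complex code.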

Efficiently storing 7.300.000.000 rows

自作多情 submitted on 2019-11-28 16:14:22
How would you tackle the following storage and retrieval problem? Roughly 2.000.000 rows will be added each day (365 days/year) with the following information per row: id (unique row identifier) entity_id (takes on values between 1 and 2.000.000 inclusive) date_id (incremented by one each day - will take on values between 1 and 3.650 (ten years: 1*365*10)) value_1 (takes on values between 1 and 1.000.000 inclusive) value_2 (takes on values between 1 and 1.000.000 inclusive) entity_id combined with date_id is unique. Hence, at most one row per entity and date can be added to the table. The
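Before choosing a storage engine, a back-of-envelope sizing from the numbers in the question is worth doing. A hedged calculation (the per-field byte widths are an assumption based on the stated value ranges, and it assumes the surrogate id can be dropped because (entity_id, date_id) is already unique):

```java
public class Sizing {
    public static void main(String[] args) {
        long rowsPerDay = 2_000_000L;
        long days = 3_650L;                 // ten years at 365 days/year
        long rows = rowsPerDay * days;      // the 7.300.000.000 rows in the title

        // Minimal fixed-width row: entity_id 4 bytes (<= 2.000.000),
        // date_id 2 bytes (<= 3.650), value_1 and value_2 4 bytes each.
        long bytesPerRow = 4 + 2 + 4 + 4;
        long totalBytes = rows * bytesPerRow;

        System.out.println(rows);
        System.out.println(totalBytes / (1L << 30)); // GiB, before any indexes
    }
}
```

Roughly 95 GiB of raw data before indexes; since date_id only grows, partitioning by date and treating (entity_id, date_id) as the clustered key are the natural levers.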

Transferring large payloads of data (Serialized Objects) using wsHttp in WCF with message security

自作多情 submitted on 2019-11-28 04:34:59
I have a case where I need to transfer large amounts of serialized object graphs (via NetDataContractSerializer ) over WCF using wsHttp. I'm using message security and would like to continue to do so. Using this setup I would like to transfer a serialized object graph which can sometimes approach around 300MB, but when I try to do so I've started seeing an exception of type System.InsufficientMemoryException. After a little research, it appears that by default in WCF the result of a service call is contained within a single message which contains the serialized data and
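With buffered transfer (the default, and effectively required when message security is in use, since fully streamed transfer relies on transport security), the commonly suggested mitigations are raising the message-size and reader quotas and switching to MTOM encoding so large binary payloads are not base64-inflated. A hedged sketch of what such a binding configuration might look like (the binding name and the 500MB limits are illustrative, not from the question):

```xml
<wsHttpBinding>
  <binding name="largeGraphBinding"
           messageEncoding="Mtom"
           maxReceivedMessageSize="524288000">
    <readerQuotas maxArrayLength="524288000"
                  maxStringContentLength="524288000" />
    <security mode="Message" />
  </binding>
</wsHttpBinding>
```

This only raises the ceiling; for payloads of this size, chunking the object graph across several smaller service calls is often the more robust design.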

Advice on handling large data volumes

◇◆丶佛笑我妖孽 submitted on 2019-11-27 20:36:01
Question: So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once. Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading. Should I load everything into memory all at once? If not, what's a good way of loading the data partially? What are some Java-relevant efficiency tips? Answer 1: So then what if

Designing a web crawler

戏子无情 submitted on 2019-11-27 16:36:27
I have come across an interview question: "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it. How does it all begin in the first place? Say Google started with some hub pages, say hundreds of them (how these hub pages were found in the first place is a different sub-question). As Google follows links from a page and so on, does it keep making a hash table to make sure that it doesn't follow the earlier visited pages? What if the same page has 2 names (URLs), say in these days when we have URL shorteners etc.? I have taken Google
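The hash-table intuition in the question is essentially right: keep a visited set keyed on a canonical form of each URL, and never enqueue a key already in the set. A hedged sketch (the `fetchLinks` and `canonicalize` functions are placeholders for real page fetching and URL normalization; shortened URLs only collide after redirects are resolved, which string normalization alone cannot do):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class Crawler {
    // Breadth-first crawl that never revisits a page: every URL is run
    // through canonicalize before the visited-set check, so two names
    // for the same page collide on one key.
    static List<String> crawl(String seed,
                              Function<String, List<String>> fetchLinks,
                              Function<String, String> canonicalize,
                              int limit) {
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        List<String> order = new ArrayList<>();
        frontier.add(canonicalize.apply(seed));
        while (!frontier.isEmpty() && order.size() < limit) {
            String url = frontier.poll();
            if (!visited.add(url)) continue; // already seen: this breaks cycles
            order.add(url);
            for (String link : fetchLinks.apply(url)) {
                String c = canonicalize.apply(link);
                if (!visited.contains(c)) frontier.add(c);
            }
        }
        return order;
    }
}
```

Real crawlers additionally fingerprint page content (e.g. a hash of the body) to catch the many-URLs-one-page case that URL canonicalization misses.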

How to avoid OOM (Out of memory) error when retrieving all records from huge table?

天大地大妈咪最大 submitted on 2019-11-27 14:57:52
Question: I am given a task to convert a huge table to a custom XML file. I will be using Java for this job. If I simply issue a "SELECT * FROM customer", it may return a huge amount of data, eventually causing OOM. I wonder, is there a way I can process each record immediately once it becomes available, and remove it from memory after that, during the SQL retrieval process? --- edited on 13 Jul 2009 Let me elaborate my question. I have 1 db server and 1 application server. When I issue a select
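Yes: with the MySQL Connector/J driver, a forward-only, read-only Statement with fetch size set to Integer.MIN_VALUE asks the driver to stream the result row by row instead of buffering the whole table in the client. A hedged sketch (a driver-specific convention; the Consumer stands in for whatever writes one XML record and then discards it):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.function.Consumer;

public class StreamCustomers {
    // Process each row as it arrives instead of buffering the whole table.
    // fetchSize = Integer.MIN_VALUE is MySQL Connector/J's signal for
    // row-by-row streaming; other drivers use a positive fetch size.
    static void export(Connection conn, Consumer<String> handleRow)
            throws SQLException {
        try (Statement st = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            st.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = st.executeQuery("SELECT * FROM customer")) {
                while (rs.next()) {
                    // write one XML record, then let the row be garbage-collected
                    handleRow.accept(rs.getString(1));
                }
            }
        }
    }
}
```

One caveat with streaming result sets: the connection is tied up until the ResultSet is fully read or closed, so the export should not share its connection with other work.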

Efficiently storing 7.300.000.000 rows

倾然丶 夕夏残阳落幕 submitted on 2019-11-27 09:37:28
Question: How would you tackle the following storage and retrieval problem? Roughly 2.000.000 rows will be added each day (365 days/year) with the following information per row: id (unique row identifier) entity_id (takes on values between 1 and 2.000.000 inclusive) date_id (incremented by one each day - will take on values between 1 and 3.650 (ten years: 1*365*10)) value_1 (takes on values between 1 and 1.000.000 inclusive) value_2 (takes on values between 1 and 1.000.000 inclusive) entity_id

Is it possible to change argv or do I need to create an adjusted copy of it?

寵の児 submitted on 2019-11-27 04:45:29
My application potentially has a huge number of arguments passed in, and I want to avoid the memory hit of duplicating the arguments into a filtered list. I would like to filter them in place, but I am pretty sure that messing with the argv array itself, or any of the data it points to, is probably not advisable. Any suggestions? Once argv has been passed into the main method, you can treat it like any other C array - change it in place as you like; just be aware of what you're doing with it. The contents of the array don't have an effect on the return code or execution of the program other than