I have a CSV file that's 300 MB and has over half a million entries. I want to run reports on this data and make those reports available for download.
Here's the data structure:
I'd recommend using Solr to process your data; it supports indexing CSV data directly.
Solr solves the problems associated with uploading, downloading and searching the data, and its performance is such that the index can easily be updated or rebuilt from scratch.
Read the documentation for how to install Solr. The following is a "kick-start" to make this demo work on Linux:
wget http://www.apache.org/dist/lucene/solr/3.5.0/apache-solr-3.5.0.tgz
tar zxvf apache-solr-3.5.0.tgz
cd apache-solr-3.5.0/example
java -jar start.jar
The Solr admin screen is available at the following URL:
http://localhost:8983/solr/admin/
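To verify from the command line that Solr is up, you can hit the ping handler that ships with the example configuration (assuming the defaults are unchanged):

curl 'http://localhost:8983/solr/admin/ping'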
I wrote a Groovy script to generate a sample CSV file:
new File("data.csv").withWriter { writer ->
writer.println "id;A_s;B_i;C_s;D_s;E_s;F_s"
for (i in 1..500000) {
writer.println "${i};${i*10};${i*20};${i*30};${i*40};${i*50};${i*60}"
}
}
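Assuming the script is saved as gencsv.groovy (the file name here is arbitrary), run it with:

groovy gencsv.groovy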
Example:
id;A_s;B_i;C_s;D_s;E_s;F_s
1;10;20;30;40;50;60
2;20;40;60;80;100;120
3;30;60;90;120;150;180
4;40;80;120;160;200;240
5;50;100;150;200;250;300
6;60;120;180;240;300;360
7;70;140;210;280;350;420
8;80;160;240;320;400;480
9;90;180;270;360;450;540
..
Note:
I performed no customisation to the "out of the box" Solr settings. This means the column types are inferred from the field-name suffix by the example schema's dynamic fields: "_s" maps to a string field and "_i" to an integer field. See the Solr Wiki for details on how to write a custom schema.
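For reference, the typing comes from dynamic field declarations along these lines (quoted from memory of the example schema.xml, so verify against your copy):

<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>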
The Linux curl command is used to load the data.csv file by performing an HTTP POST:
$ curl 'http://localhost:8983/solr/update/csv?separator=;&commit=true' -H 'Content-type:text/plain; charset=utf-8' --data-binary @data.csv
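Because id is the schema's unique key, re-posting the file updates existing documents in place. To rebuild the index from scratch, you can first wipe it via the XML update handler (a sketch, again assuming default settings):

curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type:text/xml' --data-binary '<delete><query>*:*</query></delete>'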
The following query returns the top 50 rows, sorted on column B_i in descending order:
http://localhost:8983/solr/select/?q=*:*&rows=50&sort=B_i+desc&fl=id,A_s,B_i,C_s,D_s,E_s,F_s&wt=csv
Output is CSV formatted:
id,A_s,B_i,C_s,D_s,E_s,F_s
500000,5000000,10000000,15000000,20000000,25000000,30000000
499999,4999990,9999980,14999970,19999960,24999950,29999940
499998,4999980,9999960,14999940,19999920,24999900,29999880
499997,4999970,9999940,14999910,19999880,24999850,29999820
499996,4999960,9999920,14999880,19999840,24999800,29999760
..
REST parameters:

| REST parameter                | Description                  |
|-------------------------------|------------------------------|
| q=*:*                         | Match all documents          |
| rows=50                       | Number of rows to return     |
| sort=B_i+desc                 | Sort on B_i, descending      |
| fl=id,A_s,B_i,C_s,D_s,E_s,F_s | Columns to include in output |
| wt=csv                        | CSV output format            |
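To make the report available for download, as the question asks, the same URL can simply be fetched and saved to a file; for example (report.csv is an arbitrary name):

curl -o report.csv 'http://localhost:8983/solr/select/?q=*:*&rows=50&sort=B_i+desc&fl=id,A_s,B_i,C_s,D_s,E_s,F_s&wt=csv'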
I'm making one-liners today, and Solr seems like overkill. Here's one that solves your problem (assuming no header row):
sort -t\; -k2,2 < test.csv | head -50
Add -n if you need the second field sorted numerically instead of lexicographically, and -r for descending order.
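Combined, a sketch that matches the Solr example's top-50-by-B_i report (still assuming no header row):

sort -t\; -k2,2nr test.csv | head -50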