Create report and upload to server for download


I have a CSV file that's 300 MB and has over half a million entries. I want to run reports against it and make those reports available for download.

Here's the data structure:

2 Answers
  • 2020-12-11 14:01

    I'd recommend using Solr to process your data; it supports indexing CSV data directly.

    Solr solves the problems associated with uploading, downloading and searching the data. Its performance is such that the index can easily be updated or rebuilt from scratch.

    Example

    Install Solr

    Read the documentation for how to install Solr. The following is a "kick-start" to make this demo work on Linux:

    wget http://www.apache.org/dist/lucene/solr/3.5.0/apache-solr-3.5.0.tgz
    tar zxvf apache-solr-3.5.0.tgz
    cd apache-solr-3.5.0/example
    java -jar start.jar
    

    The Solr admin screen is available at the following URL:

    http://localhost:8983/solr/admin/
    

    Generate sample data

    I wrote a Groovy script to generate a sample CSV file:

    new File("data.csv").withWriter { writer ->
        writer.println "id;A_s;B_i;C_s;D_s;E_s;F_s"
    
        for (i in 1..500000) {
            writer.println "${i};${i*10};${i*20};${i*30};${i*40};${i*50};${i*60}"
        }
    }
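
    Assuming the script is saved as generate.groovy (the file name is arbitrary), it can be run with the groovy command:

    groovy generate.groovy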
    

    Example data

    id;A_s;B_i;C_s;D_s;E_s;F_s
    1;10;20;30;40;50;60
    2;20;40;60;80;100;120
    3;30;60;90;120;150;180
    4;40;80;120;160;200;240
    5;50;100;150;200;250;300
    6;60;120;180;240;300;360
    7;70;140;210;280;350;420
    8;80;160;240;320;400;480
    9;90;180;270;360;450;540
    ..
    

    Note:

    I performed no customisation to the "out of the box" Solr settings. This meant:

    • The default Solr configuration requires a unique id field
    • Appending "_s" to a column name indicates a string field (see the dynamic fields feature in Solr)
    • Column B was loaded as an integer (B_i) to enable numeric sorting

    See the Solr Wiki for details on how to write a custom schema.
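
    For reference, the "_s" and "_i" suffixes work because the example schema.xml that ships with Solr contains dynamic field rules roughly like the following (exact type names may vary between versions):

    <!-- any field ending in _i is indexed as an integer -->
    <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
    <!-- any field ending in _s is indexed as a string -->
    <dynamicField name="*_s" type="string" indexed="true" stored="true"/>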

    Load CSV file

    The curl command is used to load the data.csv file by performing an HTTP POST:

    $ curl 'http://localhost:8983/solr/update/csv?separator=;&commit=true' -H 'Content-type:text/plain; charset=utf-8' --data-binary @data.csv
    

    Note:

    • 500,000 rows of data loaded in approximately 90 seconds
    • The separator parameter indicates the data is not separated by the default ","
    • The commit parameter commits the data to the index at the end of the load
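
    To verify the load, a quick sanity check is to ask the select handler for zero rows and read the numFound attribute in the response:

    curl 'http://localhost:8983/solr/select/?q=*:*&rows=0'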

    Query the data

    The following query returns the top 50 rows, sorted on column B_i in descending order:

    http://localhost:8983/solr/select/?q=*:*&rows=50&sort=B_i+desc&fl=id,A_s,B_i,C_s,D_s,E_s,F_s&wt=csv
    

    Output is CSV formatted:

    id,A_s,B_i,C_s,D_s,E_s,F_s
    500000,5000000,10000000,15000000,20000000,25000000,30000000
    499999,4999990,9999980,14999970,19999960,24999950,29999940
    499998,4999980,9999960,14999940,19999920,24999900,29999880
    499997,4999970,9999940,14999910,19999880,24999850,29999820
    499996,4999960,9999920,14999880,19999840,24999800,29999760
    ..
    

    REST parameters:

      |-------------------------------|---------------------------------|
      | REST parameter                | Description                     |
      |-------------------------------|---------------------------------|
      | q=*:*                         | Match all documents             |
      | rows=50                       | Number of rows to return        |
      | sort=B_i+desc                 | Sort on B_i in descending order |
      | fl=id,A_s,B_i,C_s,D_s,E_s,F_s | Columns to include              |
      | wt=csv                        | Output in CSV format            |
      |-------------------------------|---------------------------------|
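
    Because the result is already CSV, the same query doubles as a downloadable report; for example, saving it to a file with curl (report.csv is an arbitrary name):

    curl 'http://localhost:8983/solr/select/?q=*:*&rows=50&sort=B_i+desc&fl=id,A_s,B_i,C_s,D_s,E_s,F_s&wt=csv' -o report.csv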
    
  • 2020-12-11 14:04

    I'm writing one-liners today, and Solr seems like overkill. Here's one that solves your problem (assuming no header row):

    sort -t\; -k2 < test.csv | head -50
    

    Add -n if you need the second field sorted numerically instead of alphabetically.
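
    To mirror the Solr query above (top 50 rows by the second field, numeric and descending), one sketch combining -n with -r and writing the result to a file (report.csv is an arbitrary name):

    sort -t\; -k2,2 -rn test.csv | head -50 > report.csv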
