How can I import bulk data from a CSV file into DynamoDB?

我在风中等你 2021-01-31 15:08

I am trying to import a CSV file data into AWS DynamoDB.

Here's what my CSV file looks like:

first_name  last_name
sri         ram
Rahul       Dravid
JetPay      Underw


        
14 Answers
  • 2021-01-31 15:59

    Before getting to my code, some notes on testing this locally

    I recommend using a local version of DynamoDB, in case you want to sanity-check this before you start incurring charges. I made some small modifications before posting this, so be sure to test with whatever means make sense to you. There is a fake batch-upload job I commented out, which you could use in lieu of any DynamoDB service, remote or local, to verify via stdout that this works the way you need.

    dynamodb-local

    See dynamodb-local on npmjs or manual install

    If you went the manual install route, you can start dynamodb-local with something like this:

    java -Djava.library.path=<PATH_TO_DYNAMODB_LOCAL>/DynamoDBLocal_lib/\
         -jar <PATH_TO_DYNAMODB_LOCAL>/DynamoDBLocal.jar\
         -inMemory\
         -sharedDb
    

    The npm route may be simpler.

    dynamodb-admin

    Along with that, see dynamodb-admin.

    I installed dynamodb-admin with npm i -g dynamodb-admin. It can then be run with:

    dynamodb-admin
    

    Using them:

    dynamodb-local defaults to localhost:8000.

    dynamodb-admin is a web page that defaults to localhost:8001. Once you launch these two services, open localhost:8001 in your browser to view and manipulate the database.

    The script below doesn't create the database. Use dynamodb-admin for this.
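
    If you'd rather create the table from code instead of through dynamodb-admin, a minimal sketch against the local endpoint might look like the following. The "Provider" table name and the `id` string hash key are assumptions taken from the script below; adjust them to your own schema:

    const AWS = require("aws-sdk")

    AWS.config.update({
        region: "us-west-1",
        endpoint: "http://localhost:8000"    // dynamodb-local default port
    })

    const db = new AWS.DynamoDB()

    db.createTable({
        TableName: "Provider",    // assumed: must match the table the upload script writes to
        AttributeDefinitions: [{ AttributeName: "id", AttributeType: "S" }],
        KeySchema: [{ AttributeName: "id", KeyType: "HASH" }],
        ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 }
    }, (err, data) => {
        if (err) console.error("Could not create table:", err)
        else console.log("Created table:", data.TableDescription.TableName)
    })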

    Credit goes to...

    • Ben Nadel.

    The code

    • I'm not as experienced with JS & Node.js as I am with other languages, so please forgive any JS faux pas.
    • You'll notice each group of concurrent batches is purposely slowed down by 900ms. This was a hacky solution, and I'm leaving it here to serve as an example (and because of laziness, and because you're not paying me).
    • If you increase MAX_CONCURRENT_BATCHES, you will want to calculate the appropriate delay amount based on your WCU, item size, batch size, and the new concurrency level (a rough calculation sketch follows the listing below).
    • Another approach would be to turn on Auto Scaling and implement exponential backoff for each failed batch. Like I mention below in one of the comments, this really shouldn't be necessary with some back-of-the-envelope calculations to figure out how many writes you can actually do, given your WCU limit and data size, and just let your code run at a predictable rate the entire time.
    • You might wonder why I didn't just let AWS SDK handle concurrency. Good question. Probably would have made this slightly simpler. You could experiment by applying the MAX_CONCURRENT_BATCHES to the maxSockets config option, and modifying the code that creates arrays of batches so that it only passes individual batches forward.
    /**
     * Uploads CSV data to DynamoDB.
     *
     * 1. Streams a CSV file line-by-line.
     * 2. Parses each line to a JSON object.
     * 3. Collects batches of JSON objects.
     * 4. Converts batches into the PutRequest format needed by AWS.DynamoDB.batchWriteItem
     *    and runs 1 or more batches at a time.
     */
    
    const AWS = require("aws-sdk")
    const chalk = require('chalk')
    const fs = require('fs')
    const split = require('split2')
    const uuid = require('uuid')
    const through2 = require('through2')
    const { Writable } = require('stream');
    const { Transform } = require('stream');
    
    const CSV_FILE_PATH = __dirname + "/../assets/whatever.csv"
    
    // A whitelist of the CSV columns to ingest.
    const CSV_KEYS = [
        "id",
        "name", 
        "city"
    ]
    
    // Inadequate WCU will cause "insufficient throughput" exceptions, which in this script are not currently  
    // handled with retry attempts. Retries are not necessary as long as you consistently
    // stay under the WCU, which isn't that hard to predict.
    
    // The number of records to pass to AWS.DynamoDB.DocumentClient.batchWrite
    // See https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
    const MAX_RECORDS_PER_BATCH = 25
    
    // The number of batches to upload concurrently.  
    // https://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/node-configuring-maxsockets.html
    const MAX_CONCURRENT_BATCHES = 1
    
    // MAKE SURE TO LAUNCH `dynamodb-local` EXTERNALLY FIRST IF USING LOCALHOST!
    AWS.config.update({
        region: "us-west-1"
        ,endpoint: "http://localhost:8000"     // Comment out to hit live DynamoDB service.
    });
    const db = new AWS.DynamoDB()
    
    // Create a file line reader.
    var fileReaderStream = fs.createReadStream(CSV_FILE_PATH)
    var lineReaderStream = fileReaderStream.pipe(split())
    
    var linesRead = 0
    
    // Attach a stream that transforms text lines into JSON objects.
    var skipHeader = true
    var csvParserStream = lineReaderStream.pipe(
        through2(
            {
                objectMode: true,
                highWaterMark: 1
            },
            function handleWrite(chunk, encoding, callback) {
    
                // ignore CSV header
                if (skipHeader) {
                    skipHeader = false
                    callback()
                    return
                }
    
                linesRead++
    
                // transform line into stringified JSON
                const values = chunk.toString().split(',')
                const ret = {}
                CSV_KEYS.forEach((keyName, index) => {
                    ret[keyName] = values[index]
                })
                ret.line = linesRead
    
                console.log(chalk.cyan.bold("csvParserStream:", 
                    "line:", linesRead + ".", 
                    chunk.length, "bytes.", 
                    ret.id
                ))
    
                callback(null, ret)
            }
        )
    )
    
    // Attach a stream that collects incoming json lines to create batches. 
    // Outputs an array (<= MAX_CONCURRENT_BATCHES) of arrays (<= MAX_RECORDS_PER_BATCH).
    var batchingStream = (function batchObjectsIntoGroups(source) {
        var batchBuffer = []
        var idx = 0
    
        var batchingStream = source.pipe(
            through2.obj(
                {
                    objectMode: true,
                    writableObjectMode: true,
                    highWaterMark: 1
                },
                function handleWrite(item, encoding, callback) {
                    var batchIdx = Math.floor(idx / MAX_RECORDS_PER_BATCH)
    
                    if (idx % MAX_RECORDS_PER_BATCH == 0 && batchIdx < MAX_CONCURRENT_BATCHES) {
                        batchBuffer.push([])
                    }
    
                    batchBuffer[batchIdx].push(item)
    
                    if (MAX_CONCURRENT_BATCHES == batchBuffer.length &&
                        MAX_RECORDS_PER_BATCH == batchBuffer[MAX_CONCURRENT_BATCHES-1].length) 
                    {
                        this.push(batchBuffer)
                        batchBuffer = []
                        idx = 0
                    } else {
                        idx++
                    }
    
                    callback()
                },
                function handleFlush(callback) {
                    if (batchBuffer.length) {
                        this.push(batchBuffer)
                    }
    
                    callback()
                }
            )
        )
    
        return (batchingStream);
    })(csvParserStream)
    
    // Attach a stream that transforms batch buffers to collections of DynamoDB batchWrite jobs.
    var databaseStream = new Writable({
    
        objectMode: true,
        highWaterMark: 1,
    
        write(batchBuffer, encoding, callback) {
            console.log(chalk.yellow(`Batch being processed.`))
    
            // Create `batchBuffer.length` batchWrite jobs.
            var jobs = batchBuffer.map(batch => 
                buildBatchWriteJob(batch)
            )
    
            // Run multiple batch-write jobs concurrently.
            Promise
                .all(jobs)
                .then(results => {
                    console.log(chalk.bold.red(`${batchBuffer.length} batches completed.`))
                })
                .catch(error => {
                    console.log( chalk.red( "ERROR" ), error )
                    callback(error)
                })
                .then( () => {
                    console.log( chalk.bold.red("Resuming file input.") )
    
                    setTimeout(callback, 900) // slow down the uploads. calculate this based on WCU, item size, batch size, and concurrency level.
                })
    
            // return false
        }
    })
    batchingStream.pipe(databaseStream)
    
    // Builds a batch-write job that runs as an async promise.
    function buildBatchWriteJob(batch) {
        let params = buildRequestParams(batch)
    
        // This was being used temporarily prior to hooking up the script to any dynamo service.
    
        // let fakeJob = new Promise( (resolve, reject) => {
    
        //     console.log(chalk.green.bold( "Would upload batch:", 
        //         pluckValues(batch, "line")
        //     ))
    
        //     let t0 = new Date().getTime()
    
        //     // fake timing
        //     setTimeout(function() {
        //         console.log(chalk.dim.yellow.italic(`Batch upload time: ${new Date().getTime() - t0}ms`))
        //         resolve()
        //     }, 300)
        // })
        // return fakeJob
    
        let promise = new Promise(
            function(resolve, reject) {
                let t0 = new Date().getTime()
    
                let printItems = function(msg, items) {
                    console.log(chalk.green.bold(msg, pluckValues(batch, "id")))
                }
    
                let processItemsCallback = function (err, data) {
                  if (err) { 
                     console.error(`Failed at batch: ${pluckValues(batch, "line")}, ${pluckValues(batch, "id")}`)
                     console.error("Error:", err)
                     reject()
                  } else {
                    var params = {}
                    params.RequestItems = data.UnprocessedItems
    
                    var numUnprocessed = Object.keys(params.RequestItems).length
                    if (numUnprocessed != 0) {
                        console.log(`Encountered unprocessed items for ${numUnprocessed} table(s).`)
                        printItems("Retrying unprocessed items:", params)
                        db.batchWriteItem(params, processItemsCallback)
                    } else {
                        console.log(chalk.dim.yellow.italic(`Batch upload time: ${new Date().getTime() - t0}ms`))
    
                        resolve()
                    }
                  }
                }
                db.batchWriteItem(params, processItemsCallback)
            }
        )
        return (promise)
    }
    
    // Build request payload for the batchWrite
    function buildRequestParams(batch) {
    
        var params = {
            RequestItems: {}
        }
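    // "Provider" is the destination table name; change it to match your table.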
        params.RequestItems.Provider = batch.map(obj => {
    
            let item = {}
    
            CSV_KEYS.forEach((keyName, index) => {
                if (obj[keyName] && obj[keyName].length > 0) {
                    item[keyName] = { "S": obj[keyName] }
                }
            })
    
            return {
                PutRequest: {
                    Item: item
                }
            }
        })
        return params
    }
    
    function pluckValues(batch, fieldName) {
        var values = batch.map(item => {
            return (item[fieldName])
        })
        return (values)
    }
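
    As mentioned in the notes above, here's a rough, back-of-the-envelope way to size the delay between batch groups. This is a standalone sketch, not part of the script, and every figure in it is an assumption you would replace with your own numbers:

    // Back-of-the-envelope delay sizing (all figures below are assumptions).
    const PROVISIONED_WCU = 25       // your table's provisioned write capacity units
    const AVG_ITEM_SIZE_KB = 1       // one WCU covers a write of up to 1 KB
    const ITEMS_PER_BATCH = 25       // mirrors MAX_RECORDS_PER_BATCH above
    const CONCURRENT_BATCHES = 1     // mirrors MAX_CONCURRENT_BATCHES above

    // WCU consumed each time a group of concurrent batches is written.
    const wcuPerGroup = ITEMS_PER_BATCH * Math.ceil(AVG_ITEM_SIZE_KB) * CONCURRENT_BATCHES

    // Spread that consumption out so the sustained rate stays under the provisioned WCU.
    const delayMs = Math.ceil((wcuPerGroup / PROVISIONED_WCU) * 1000)

    console.log(`Suggested delay between batch groups: ~${delayMs}ms`)   // 1000ms with these numbers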
    
  • 2021-01-31 15:59

    I've created a gem for this.

    Now you can install it by running gem install dynamocli, then you can use the command:

    dynamocli import your_data.csv --to your_table
    

    Here is the link to the source code: https://github.com/matheussilvasantos/dynamocli

  • 2021-01-31 16:03

    You can use AWS Data Pipeline, which is designed for tasks like this. Upload your CSV file to S3, then use Data Pipeline to retrieve it and populate a DynamoDB table. AWS has a step-by-step tutorial.
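
    As a minimal sketch of the first step only (uploading the CSV to S3 with the JS SDK; the bucket, key, region, and file path here are placeholders):

    const fs = require('fs');
    const AWS = require('aws-sdk');

    const s3 = new AWS.S3({ region: 'us-west-1' });   // placeholder region

    s3.upload({
        Bucket: 'my-import-bucket',              // placeholder bucket
        Key: 'imports/data.csv',                 // placeholder key
        Body: fs.createReadStream('./data.csv')  // placeholder local path
    }, (err, data) => {
        if (err) console.error('Upload failed:', err);
        else console.log('Uploaded to', data.Location);
    });

    The pipeline itself (the part that reads from S3 and writes to DynamoDB) is easiest to set up by following the AWS tutorial or the console templates rather than by hand.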

  • 2021-01-31 16:07

    In which language do you want to import the data? I just wrote a function in Node.js that can import a CSV file into a DynamoDB table. It first parses the whole CSV into an array, splits the array into chunks of 25, and then writes each chunk to the table with a batch write.

    Note: DynamoDB only allows writing up to 25 records at a time in a single batch write, so we have to split our array into chunks.

        var fs = require('fs');
        var parse = require('csv-parse');
        var async = require('async');
        var AWS = require('aws-sdk');

        var csv_filename = "YOUR_CSV_FILENAME_WITH_ABSOLUTE_PATH";
        var table_name = "YOUR_TABLE_NAME";

        // The DocumentClient accepts plain JS objects, which is what csv-parse
        // produces when `columns : true` is set.
        var docClient = new AWS.DynamoDB.DocumentClient();

        rs = fs.createReadStream(csv_filename);
        parser = parse({
            columns : true,
            delimiter : ','
        }, function(err, data) {

            var split_arrays = [], size = 25;

            while (data.length > 0) {
                split_arrays.push(data.splice(0, size));
            }
            data_imported = false;
            chunk_no = 1;

            async.each(split_arrays, function(item_data, callback) {
                // Wrap each parsed row in the PutRequest format batchWrite expects.
                var params = { RequestItems: {} };
                params.RequestItems[table_name] = item_data.map(function(item) {
                    return { PutRequest: { Item: item } };
                });

                docClient.batchWrite(params, function(err, res) {
                    console.log('done going next');
                    if (err == null) {
                        console.log('Success chunk #' + chunk_no);
                        data_imported = true;
                    } else {
                        console.log(err);
                        console.log('Fail chunk #' + chunk_no);
                        data_imported = false;
                    }
                    chunk_no++;
                    callback();
                });

            }, function() {
                // run after loops
                console.log('all data imported....');

            });

        });
        rs.pipe(parser);
    
  • Updated 2019 Javascript code

    I didn't have much luck with any of the Javascript code samples above. Starting with Hassan Siddique's answer above, I've updated it to the latest API, included sample credential code, moved all user config to the top, added uuid()s where missing, and stripped out blank strings.

    const fs = require('fs');
    const parse = require('csv-parse');
    const async = require('async');
    const uuid = require('uuid/v4');
    const AWS = require('aws-sdk');
    
    // --- start user config ---
    
    const AWS_CREDENTIALS_PROFILE = 'serverless-admin';
    const CSV_FILENAME = "./majou.csv";
    const DYNAMODB_REGION = 'eu-central-1';
    const DYNAMODB_TABLENAME = 'entriesTable';
    
    // --- end user config ---
    
    const credentials = new AWS.SharedIniFileCredentials({
      profile: AWS_CREDENTIALS_PROFILE
    });
    AWS.config.credentials = credentials;
    const docClient = new AWS.DynamoDB.DocumentClient({
      region: DYNAMODB_REGION
    });
    
    const rs = fs.createReadStream(CSV_FILENAME);
    const parser = parse({
      columns: true,
      delimiter: ','
    }, function(err, data) {
    
      var split_arrays = [],
        size = 25;
    
      while (data.length > 0) {
        split_arrays.push(data.splice(0, size));
      }
      data_imported = false;
      chunk_no = 1;
    
      async.each(split_arrays, function(item_data, callback) {
        const params = {
          RequestItems: {}
        };
        params.RequestItems[DYNAMODB_TABLENAME] = [];
        item_data.forEach(item => {
          for (const key of Object.keys(item)) {
            // An AttributeValue may not contain an empty string
            if (item[key] === '')
              delete item[key];
          }
    
          params.RequestItems[DYNAMODB_TABLENAME].push({
            PutRequest: {
              Item: {
                id: uuid(),
                ...item
              }
            }
          });
        });
    
        docClient.batchWrite(params, function(err, res, cap) {
          console.log('done going next');
          if (err == null) {
            console.log('Success chunk #' + chunk_no);
            data_imported = true;
          } else {
            console.log(err);
            console.log('Fail chunk #' + chunk_no);
            data_imported = false;
          }
          chunk_no++;
          callback();
        });
    
      }, function() {
        // run after loops
        console.log('all data imported....');
    
      });
    
    });
    rs.pipe(parser);
    
  • 2021-01-31 16:07

    Here's a simpler solution, and with it you don't have to remove empty-string attributes.

    require('./env'); //contains aws secret/access key
    const parse = require('csvtojson');
    const AWS = require('aws-sdk');
    
    // --- start user config ---
    
    const CSV_FILENAME = __dirname + "/002_subscribers_copy_from_db.csv";
    const DYNAMODB_TABLENAME = '002-Subscribers';
    
    // --- end user config ---
    
    //You could add your credentials here or you could
    //store it in process.env like I have done aws-sdk
    //would detect the keys in the environment
    
    AWS.config.update({
        region: process.env.AWS_REGION
    });
    
    const db = new AWS.DynamoDB.DocumentClient({
        convertEmptyValues: true
    });
    
    (async ()=>{
        const json = await parse().fromFile(CSV_FILENAME);
    
        //this is efficient enough if you're processing small
        //amounts of data. If your data set is large then I
        //suggest using dynamodb method .batchWrite() and send 
        //in data in chunks of 25 (the limit) and find yourself
        //a more efficient loop if there is one
        for(var i=0; i<json.length; i++){
            console.log(`processing item number ${i+1}`);
            let query = {
                TableName: DYNAMODB_TABLENAME,
                Item: json[i]
            };
    
            await db.put(query).promise();
    
            /**
             * Note: If "json" contains other nested objects, you would have to
             *       loop through the json and parse all child objects.
             *       likewise, you would have to convert all children into their
             *       native primitive types because everything would be represented
             *       as a string.
             */
        }
        console.log('\nDone.');
    })();
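
    For larger files, as the comment in the script suggests, you could swap the per-item put() for batchWrite() in chunks of 25. A rough sketch of just that helper, assuming the same DocumentClient (`db`) and DYNAMODB_TABLENAME as above:

    // Sketch: write parsed rows in chunks of 25 via DocumentClient.batchWrite().
    async function batchImport(db, tableName, rows) {
        for (let i = 0; i < rows.length; i += 25) {
            const chunk = rows.slice(i, i + 25);
            const params = {
                RequestItems: {
                    [tableName]: chunk.map(item => ({ PutRequest: { Item: item } }))
                }
            };
            const res = await db.batchWrite(params).promise();

            // batchWrite() can return UnprocessedItems; a real importer should retry these.
            const unprocessed = (res.UnprocessedItems && res.UnprocessedItems[tableName]) || [];
            if (unprocessed.length > 0) {
                console.warn(`chunk ${i / 25 + 1}: ${unprocessed.length} unprocessed item(s)`);
            }
            console.log(`chunk ${i / 25 + 1} written`);
        }
    }

    // Inside the async IIFE above, you would call it instead of the per-item loop:
    // await batchImport(db, DYNAMODB_TABLENAME, json);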
    