How to upload data in bulk to the appengine datastore? Older methods do not work

青春惊慌失措 2020-12-09 10:20

This should be a fairly common requirement, and a simple process: upload data in bulk to the appengine datastore.

However, none of the older solutions mentioned on

4 answers
  • 2020-12-09 10:51

    The remote API method, as demonstrated in your link [1], still works fine, although it is very slow if you have more than a few hundred rows.

    I have successfully used GCS in conjunction with the MapReduce framework to download, rather than upload, the contents of the datastore, but the principles should be the same. See the mapreduce documentation: in fact you only need the mapper step, so you can define a simple function which accepts a row from your CSV and creates a datastore entity from that data.

  • 2020-12-09 11:08

    As of 2018, the best way to go about this is to use the new managed import/export capability of Cloud Datastore.
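
    A minimal sketch of that flow, assuming the Cloud SDK is installed; the bucket name and export folder below are placeholders:

    gcloud datastore export gs://your-bucket
    # the export writes an .overall_export_metadata file; pass its path to import
    gcloud datastore import gs://your-bucket/<export-folder>/<export-folder>.overall_export_metadata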

  • 2020-12-09 11:09

    Method 1: Use remote_api

    How to: write a bulkloader.yaml file and run it directly from the terminal with the "appcfg.py upload_data" command. I don't recommend this method for a couple of reasons: 1. huge latency, 2. no support for NDB.
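
    For reference, a typical invocation looks roughly like this (the app id, CSV file name, and kind are placeholders):

    appcfg.py upload_data --config_file=bulkloader.yaml --filename=data.csv --kind=DataStoreModel --url=http://your-app-id.appspot.com/_ah/remote_api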

    Method 2: GCS and use mapreduce

    Uploading Data File to GCS:

    Use the “storage-file-transfer-json-python” GitHub project (chunked_transfer.py) to upload files to GCS from your local system. Make sure to generate a proper “client-secrets.json” file from the App Engine admin console.
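
    If you only need to get the file into the bucket and have the Cloud SDK installed, gsutil is a simpler alternative (the bucket name below is a placeholder):

    gsutil cp tempfile.csv gs://your-bucket/tempfile.csv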

    Mapreduce:

    Use the "appengine-mapreduce" github project. Copy the "mapreduce" folder to your project top-level folder.

    Add the following to your app.yaml file:

    includes:
      - mapreduce/include.yaml
    

    Below is your main.py file:

    import os, csv
    import StringIO
    import webapp2
    from models import DataStoreModel
    from google.appengine.api import app_identity
    from mapreduce import base_handler
    from mapreduce import mapreduce_pipeline
    from mapreduce import operation as op
    
    def testmapperFunc(newRequest):
        # Each input record is one line of the CSV file; parse it and
        # yield a datastore Put operation for the resulting entity.
        f = StringIO.StringIO(newRequest)
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            newEntry = DataStoreModel(attr1=row[0], link=row[1])
            yield op.db.Put(newEntry)
    
    class TestGCSReaderPipeline(base_handler.PipelineBase):
        def run(self, filename):
            # The mapper is referenced by its dotted path; since this code
            # lives in main.py, the handler spec is "main.testmapperFunc".
            yield mapreduce_pipeline.MapreducePipeline(
                    "test_gcs",
                    "main.testmapperFunc",
                    "mapreduce.input_readers.FileInputReader",
                    mapper_params={
                        "files": [filename],
                        "format": 'lines'
                    },
                    shards=1)
    
    class tempTestRequestGCSUpload(webapp2.RequestHandler):
        def get(self):
            # Fall back to the app's default GCS bucket if BUCKET_NAME is not set.
            bucket_name = os.environ.get('BUCKET_NAME',
                                         app_identity.get_default_gcs_bucket_name())

            # FileInputReader expects the legacy /gs/<bucket>/<object> path format.
            bucket = '/gs/' + bucket_name
            filename = bucket + '/' + 'tempfile.csv'

            # Start the pipeline on the module named "mapreducetestmodtest".
            pipeline = TestGCSReaderPipeline(filename)
            pipeline.with_params(target="mapreducetestmodtest")
            pipeline.start()
            self.response.out.write('done')
    
    application = webapp2.WSGIApplication([
        ('/gcsupload', tempTestRequestGCSUpload),
    ], debug=True)
    

    To remember:

    1. The mapreduce project uses the now-deprecated “Google Cloud Storage Files API”, so future support is not guaranteed.
    2. MapReduce adds a small overhead to datastore reads and writes.

    Method 3: GCS and GCS Client Library

    1. Upload the csv/text file to GCS using the above file-transfer method.
    2. Use the GCS client library (copy the 'cloudstorage' folder to your application's top-level folder).

    Add the below code to the application main.py file.

    import os, csv
    import webapp2
    import cloudstorage as gcs
    from google.appengine.ext import ndb
    from google.appengine.api import app_identity
    from models import DataStoreModel
    
    class UploadGCSData(webapp2.RequestHandler):
        def get(self):
            bucket_name = os.environ.get('BUCKET_NAME',
                                         app_identity.get_default_gcs_bucket_name())
            bucket = '/' + bucket_name
            filename = bucket + '/tempfile.csv'
            self.upload_file(filename)
    
        def upload_file(self, filename):
            # Stream the CSV straight from GCS and batch the writes so that
            # ndb.put_multi is called once per 50 rows instead of once per row.
            gcs_file = gcs.open(filename)
            datareader = csv.reader(gcs_file)
            count = 0
            entities = []
            for row in datareader:
                count += 1
                newProd = DataStoreModel(attr1=row[0], link=row[1])
                entities.append(newProd)

                if count % 50 == 0 and entities:
                    ndb.put_multi(entities)
                    entities = []

            # Write any remaining entities from the last partial batch.
            if entities:
                ndb.put_multi(entities)
            gcs_file.close()
    
    application = webapp2.WSGIApplication([
        ('/gcsupload', UploadGCSData),
    ], debug=True)
    
  • 2020-12-09 11:10

    Some of you might be in my situation: I cannot use the import/export utility of datastore, because my data needs to be transformed before getting into the datastore.

    I ended up using apache-beam (google cloud dataflow).

    You only need to write a few lines of "beam" code (sketched after this list) to

    • read your data (for example, hosted on cloud storage) - you get a PCollection of strings,
    • do whatever transform you want (so you get a PCollection of datastore Entities),
    • dump them to datastore sink.
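
    A minimal sketch of such a pipeline, assuming the v1new Datastore connector, the two-column CSV used in the other answers, and placeholder names for the project, bucket and kind:

    import csv

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore
    from apache_beam.io.gcp.datastore.v1new.types import Entity, Key

    PROJECT = 'your-gcp-project'  # placeholder project id

    def to_entity(line):
        # Parse one CSV line ("attr1,link") and build a Datastore entity;
        # attr1 doubles as the key name so the key is complete before writing.
        attr1, link = next(csv.reader([line]))
        entity = Entity(Key(['DataStoreModel', attr1], project=PROJECT))
        entity.set_properties({'attr1': attr1, 'link': link})
        return entity

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | 'ReadCSV' >> beam.io.ReadFromText('gs://your-bucket/tempfile.csv')
         | 'ToEntity' >> beam.Map(to_entity)
         | 'WriteToDatastore' >> WriteToDatastore(PROJECT))

    Run it with the Dataflow runner and several workers to get the kind of throughput mentioned below.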

    See "How to speedup bulk importing into google cloud datastore with multiple workers?" for a concrete use case.

    I was able to write with a speed of 800 entities per second into my datastore with 5 workers. This enabled me to finish the importing task (with 16 million rows) in about 5 hours. If you want to make it faster, use more workers :D
