Question
There are many ETL tools out there, but not many free ones, and the free choices don't appear to have any knowledge of, or support for, ArangoDB. If anyone has migrated their data over to ArangoDB and automated the process, I would love to hear how you accomplished it. Below I have listed several of the ETL tool choices we have; I took these from Bas Geerdink's 2016 Spark Europe presentation.
* IBM InfoSphere DataStage
* Oracle Warehouse Builder
* Pervasive Data Integrator
* PowerCenter Informatica
* SAS Data Management
* Talend Open Studio
* SAP Data Services
* Microsoft SSIS
* Syncsort DMX
* CloverETL
* Jaspersoft
* Pentaho
* NiFi
Answer 1:
I was able to use Apache NiFi to accomplish this goal. Below is a very basic overview of what I did to get data out of a source database and into ArangoDB.
Using NiFi, you can extract data from many of the standard databases out there; JDBC drivers already exist for databases such as MySQL, SQLite, Oracle, and so on.
I was able to pull data out of a source database using either of two processors:
* QueryDatabaseTable
* ExecuteSQL
The output of these processors is in NiFi's Avro format, which I then converted to JSON using the ConvertAvroToJSON processor. This produces a JSON list.
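For illustration, the JSON list is simply a JSON array of documents; the rows below are hypothetical:

```json
[
  {"name": "Berlin", "population": 3645000},
  {"name": "Paris", "population": 2161000}
]
```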
While there really isn't anything within NiFi specifically built for ArangoDB, ArangoDB has one built-in feature that fills the gap: its HTTP API.
I was able to bulk-insert data into ArangoDB using NiFi's InvokeHTTP processor with a POST method, loading into a collection named cities.
The value I used as the Remote URL:
http://localhost:8529/_api/import?collection=cities&type=list&details=true
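As a concrete sketch of what that InvokeHTTP call does, here is the equivalent request in Python's requests library; the host, the sample documents, and the absence of authentication are assumptions:

```python
# Minimal sketch of the bulk-import POST that InvokeHTTP performs.
# Host, documents, and missing auth are assumptions for this example.
import requests

docs = [
    {"name": "Berlin", "population": 3645000},  # hypothetical rows
    {"name": "Paris", "population": 2161000},
]

resp = requests.post(
    "http://localhost:8529/_api/import",
    params={"collection": "cities", "type": "list", "details": "true"},
    json=docs,  # type=list expects one JSON array of documents
)
resp.raise_for_status()
print(resp.json())  # counts of created documents and errors
```

The response reports how many documents were created and how many failed; details=true adds per-document error messages.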
Below is a screenshot of the NiFi flow. Something like this could definitely have kick-started my own research, so I hope it helps someone else. Ignore some of the extra processors; I had them in there for testing purposes and was experimenting with JOLT to see whether I could use it to 'transform' my JSON (the 'T' in ETL).
Answer 2:
I wanted to add a comment above, but was unable to do so.
Based on Code Novice's response, I too used NiFi to move data into ArangoDB. In my case, I moved data from SQL Server on a Windows desktop machine to ArangoDB on a Linux desktop machine, with both machines on the same network. For 9.7M records (5.4 GB of uncompressed JSON data), the flow took approximately 12 minutes, which is reasonable performance.
I made a minor change to the above flow by using the ExecuteSQLRecord processor. This removes the need to convert from Avro to JSON. In total, you can move the data with just two processors: ExecuteSQLRecord and InvokeHTTP.
For ExecuteSQLRecord, based on my testing I recommend submitting in many small batches (~10,000 per batch) versus a few large batches (~500,000 per batch) to avoid ArangoDB bottlenecks.
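As a sketch of that batching advice (same assumed endpoint as above; the batch size and helper function are hypothetical):

```python
# Batched bulk import: POST ~10,000 documents per request instead of
# one huge request, mirroring the small-batch recommendation above.
import requests

BATCH_SIZE = 10_000

def import_in_batches(docs, collection, base_url="http://localhost:8529"):
    for start in range(0, len(docs), BATCH_SIZE):
        batch = docs[start:start + BATCH_SIZE]
        resp = requests.post(
            f"{base_url}/_api/import",
            params={"collection": collection, "type": "list"},
            json=batch,
        )
        resp.raise_for_status()
```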
For InvokeHTTP, if you run NiFi on a different machine than the ArangoDB machine, you need to (1) make sure your ArangoDB machine firewall port is open and (2) change the server address in the .conf files from 127.0.0.1 to your actual ArangoDB machine IP address. The .conf files can be found in the /etc/arangodb3 folder.
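For reference, the endpoint setting lives in /etc/arangodb3/arangod.conf and looks roughly like the following; the IP address below is an example, so check your ArangoDB version's documentation for the exact syntax:

```ini
[server]
# was: endpoint = tcp://127.0.0.1:8529
endpoint = tcp://192.168.1.50:8529
```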
For the 'T' (the processors on the side in the screenshot above), I would generally let SQL perform the transformations rather than JOLT, unless JSON-specific formatting changes are needed.
Last, you could accomplish the above with the following three processors: ExecuteSQLRecord, PutFile, and ExecuteProcess.
For ExecuteSQLRecord, you need to change the Output Grouping property to One Line Per Object (i.e., jsonl). For ExecuteProcess, you have NiFi invoke arangoimport with the appropriate options, as sketched below. I did not build this flow out entirely in NiFi, but some testing suggests its runtime is comparable to the ExecuteSQLRecord-and-InvokeHTTP flow.
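As an illustration of that ExecuteProcess step, here is a Python sketch of the equivalent arangoimport invocation; the file path, collection name, and endpoint are placeholders, and the flags follow the arangoimport documentation:

```python
# Sketch of the command ExecuteProcess would run, wrapped in Python.
# File path, collection, and endpoint are placeholders for this example.
import subprocess

subprocess.run(
    [
        "arangoimport",
        "--file", "/tmp/cities.jsonl",   # jsonl file written by PutFile
        "--type", "jsonl",               # matches One Line Per Object output
        "--collection", "cities",
        "--create-collection", "true",
        "--server.endpoint", "tcp://192.168.1.50:8529",
    ],
    check=True,  # raise if arangoimport exits non-zero
)
```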
I concur that NiFi is an excellent way to move data into ArangoDB.
Source: https://stackoverflow.com/questions/49436345/etl-tools-that-function-well-with-arangodb-what-are-they