Question
I have a large CSV dataset (>5 TB) in multiple files (stored in a storage bucket) that I need to import into Google Bigtable. The files are in the format:
rowkey,s1,s2,s3,s4
text,int,int,int,int
...
There is an importtsv tool with HBase that would be perfect, but it does not seem to be available when using the Google HBase shell on Windows. Is it possible to use this tool? If not, what is the fastest way of achieving this? I have little experience with HBase and Google Cloud, so a simple example would be great. I have seen some similar examples using Dataflow, but I would prefer not to learn how to do that unless necessary.
Thanks
Answer 1:
The ideal way to import something this large into Cloud Bigtable is to put your TSV on Google Cloud Storage.
gsutil mb gs://<your-bucket-name>
gsutil -m cp -r <source dir> gs://<your-bucket-name>/
Then use Cloud Dataflow.
Use the HBase shell to create the table, Column Family, and the output columns.
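For the schema in the question, a minimal HBase shell sketch might look like the following (the table name mytable and the column family s are placeholders; individual columns such as s:s1 do not need to be declared up front, they are created when rows are written):
# run inside hbase shell, connected to your Cloud Bigtable cluster
create 'mytable', 's'      # table 'mytable' with a single column family 's'
describe 'mytable'         # confirm the table exists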
Write a small Dataflow job to read all the files, create a row key for each record, and then write to the table. (See this example to get started.)
A somewhat easier way would be to (note: untested; a rough command sketch follows the list):
- Copy your files to Google Cloud Storage.
- Use Google Cloud Dataproc; the example shows how to create a cluster and hook up Cloud Bigtable.
- ssh to your cluster master (the script in the wordcount-mapreduce example accepts ./cluster ssh).
- Use the HBase TSV importer to start a MapReduce job:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> gs://<your-bucket-name>/<dir>/**
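A rough, untested end-to-end sketch of this route, reusing the mytable/s names from the sketch above; the cluster name, zone, and worker count are placeholders, and the exact cluster-creation flags needed to wire in the Bigtable connector depend on the linked example's setup:
# create a Dataproc cluster (illustrative flags; the linked example's scripts
# take care of hooking the cluster up to Cloud Bigtable)
gcloud dataproc clusters create my-cluster --zone us-central1-b --num-workers 4
# ssh to the master node (Dataproc names it <cluster-name>-m)
gcloud compute ssh my-cluster-m --zone us-central1-b
# on the master: run ImportTsv; the data is comma-separated, so override the
# default tab separator and map the first field to the row key
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=, \
  -Dimporttsv.columns=HBASE_ROW_KEY,s:s1,s:s2,s:s3,s:s4 \
  mytable gs://<your-bucket-name>/<dir>/**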
Answer 2:
I created a bug on the Cloud Bigtable Client project to implement a method of doing importtsv.
Even if we can get importtsv to work, setting up Bigtable on your own machine may take some doing. Importing files this big is a bit involved for a single machine, so a distributed job (Hadoop or Dataflow) is usually needed, and I'm not sure how well running the job from your own machine is going to work.
Source: https://stackoverflow.com/questions/34104427/bigtable-csv-import