How to read a huge CSV file from Google Cloud Storage line by line using Java?

Submitted by 这一生的挚爱 on 2020-07-18 18:54:11

Question


I'm new to Google Cloud Platform. I'm trying to read a roughly 1 GB CSV file stored in Google Cloud Storage (a non-public bucket accessed via a service account key) line by line.

I couldn't find any option to read a file in Google Cloud Storage (GCS) line by line; I only see options to read by chunk/byte size. Since I'm reading a CSV, I don't want chunked reads, because a chunk boundary may split a record in the middle.

Solutions tried so far: I copied the contents of the CSV file in GCS to a temporary local file and read the temp file with the code below. The code works as expected, but I don't want to copy a huge file to my local instance. Instead, I want to read line by line directly from GCS.

    // Build a Storage client using the service account credentials
    StorageOptions options = StorageOptions.newBuilder()
            .setProjectId(GCP_PROJECT_ID)
            .setCredentials(gcsConfig.getCredentials())
            .build();
    Storage storage = options.getService();
    Blob blob = storage.get(BUCKET_NAME, FILE_NAME);
    ReadChannel readChannel = blob.reader();
    // Copy the blob's contents into a temporary local file
    FileOutputStream fileOutputStream = new FileOutputStream(TEMP_FILE_NAME);
    fileOutputStream.getChannel().transferFrom(readChannel, 0, Long.MAX_VALUE);
    fileOutputStream.close();

Please suggest an approach.


Answer 1:


Since I'm doing batch processing, I use the code below in my ItemReader's init() method, which is annotated with @PostConstruct. In my ItemReader's read() method, I build a List whose size equals the chunk size. This way I can read chunkSize lines at a time instead of reading all the lines at once.

StorageOptions options = StorageOptions.newBuilder()
        .setProjectId(GCP_PROJECT_ID)
        .setCredentials(gcsConfig.getCredentials())
        .build();
Storage storage = options.getService();
Blob blob = storage.get(BUCKET_NAME, FILE_NAME);
ReadChannel readChannel = blob.reader();
// Wrap the channel so the blob can be read line by line
BufferedReader br = new BufferedReader(Channels.newReader(readChannel, "UTF-8"));
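The read() method described above is not shown in the answer, but it could be sketched roughly as follows. The class and field names here are hypothetical, as is the convention that an empty list signals end of data; the example uses a StringReader so it runs without GCS, but the same BufferedReader produced from the ReadChannel above would drop in unchanged:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ChunkedLineReader {
    private final BufferedReader br;
    private final int chunkSize;

    public ChunkedLineReader(BufferedReader br, int chunkSize) {
        this.br = br;
        this.chunkSize = chunkSize;
    }

    /** Returns up to chunkSize lines; an empty list means end of stream. */
    public List<String> read() throws IOException {
        List<String> chunk = new ArrayList<>(chunkSize);
        String line;
        while (chunk.size() < chunkSize && (line = br.readLine()) != null) {
            chunk.add(line);
        }
        return chunk;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the BufferedReader built from the GCS ReadChannel.
        BufferedReader br = new BufferedReader(new StringReader("a\nb\nc\nd\ne"));
        ChunkedLineReader reader = new ChunkedLineReader(br, 2);
        List<String> chunk;
        while (!(chunk = reader.read()).isEmpty()) {
            System.out.println(chunk); // prints [a, b] then [c, d] then [e]
        }
    }
}
```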



Answer 2:


One of the easiest ways might be to use the google-cloud-nio package, part of the google-cloud-java library you're already using: https://github.com/googleapis/google-cloud-java/tree/v0.30.0/google-cloud-contrib/google-cloud-nio

It plugs Google Cloud Storage into Java's NIO, so once it's set up you can refer to GCS resources just as you would a file or URI. For example:

Path path = Paths.get(URI.create("gs://bucket/lolcat.csv"));
try (Stream<String> lines = Files.lines(path)) {
    lines.forEach(System.out::println);
} catch (IOException ex) {
    // do something or re-throw...
}



Answer 3:


Brandon Yarbrough is right, and to add to his answer:

If you log in with your credentials via gcloud, Brandon's code will work as-is: google-cloud-nio will use your login to access the files (even if they are not public).

If you prefer to do it all in code, you can read credentials from a local key file and then access your file in Google Cloud Storage:

    String myCredentials = "/path/to/my/key.json";
    CloudStorageFileSystem fs =
        CloudStorageFileSystem.forBucket(
            "bucket",
            CloudStorageConfiguration.DEFAULT,
            StorageOptions.newBuilder()
                .setCredentials(ServiceAccountCredentials.fromStream(
                    new FileInputStream(myCredentials)))
                .build());
    Path path = fs.getPath("/lolcat.csv");
    List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);

Edit: since you don't want to read all the lines at once, don't use readAllLines. Once you have the Path, you can use any of the other techniques discussed above to read just the part of the file you need: one line at a time, or through a Channel object.
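A line-by-line loop over any NIO Path could look like the sketch below. It uses a local temp file so it runs anywhere, but the same try-with-resources loop works unchanged on a CloudStorageFileSystem path such as fs.getPath("/lolcat.csv") from the snippet above; the file contents here are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LineByLine {
    public static void main(String[] args) throws IOException {
        // Stand-in for a GCS path obtained via google-cloud-nio.
        Path path = Files.createTempFile("lolcat", ".csv");
        Files.write(path, "id,name\n1,cat\n2,dog\n".getBytes(StandardCharsets.UTF_8));
        try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line); // process one CSV line at a time
            }
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```

Unlike readAllLines, this keeps only one line in memory at a time, which is what you want for a ~1 GB file.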



Source: https://stackoverflow.com/questions/55225297/how-to-read-a-huge-csv-file-from-google-cloud-storage-line-by-line-using-java
