问题
I'm new to Google Cloud Platform. I'm trying to read a CSV file present in Google Cloud Storage (non-public bucket accessed via Service Account key) line by line which is around 1GB.
I couldn't find any option to read the file present in the Google Cloud Storage (GCS) line by line. I only see the read by chunksize/byte size options. Since I'm trying to read a CSV, I don't want to use read by chunksize since it may split a record while reading.
Solutions tried so far: Tried copying the contents from CSV file present in GCS to temporary local file and read the temp file by using the below code. The below code is working as expected but I don't want to copy huge file to my local instance. Instead, I want to read line by line from GCS.
StorageOptions options =
StorageOptions.newBuilder().setProjectId(GCP_PROJECT_ID)
.setCredentials(gcsConfig.getCredentials()).build();
Storage storage = options.getService();
Blob blob = storage.get(BUCKET_NAME, FILE_NAME);
ReadChannel readChannel = blob.reader();
FileOutputStream fileOuputStream = new FileOutputStream(TEMP_FILE_NAME);
fileOuputStream.getChannel().transferFrom(readChannel, 0, Long.MAX_VALUE);
fileOuputStream.close();
Please suggest the approach.
回答1:
Since, I'm doing batch processing, I'm using the below code in my ItemReader's init() method which is annotated with @PostConstruct. And In my ItemReader's read(), I'm building a List. Size of list is same as chunk size. In this way I can read lines based on my chunkSize instead of reading all the lines at once.
StorageOptions options =
StorageOptions.newBuilder().setProjectId(GCP_PROJECT_ID)
.setCredentials(gcsConfig.getCredentials()).build();
Storage storage = options.getService();
Blob blob = storage.get(BUCKET_NAME, FILE_NAME);
ReadChannel readChannel = blob.reader();
BufferedReader br = new BufferedReader(Channels.newReader(readChannel, "UTF-8"));
回答2:
One of the easiest ways might be to use the google-cloud-nio
package, part of the google-cloud-java library that you're already using: https://github.com/googleapis/google-cloud-java/tree/v0.30.0/google-cloud-contrib/google-cloud-nio
It incorporates Google Cloud Storage into Java's NIO, and so once it's up and running, you can refer to GCS resources just like you'd do for a file or URI. For example:
Path path = Paths.get(URI.create("gs://bucket/lolcat.csv"));
try (Stream<String> lines = Files.lines(path)) {
lines.forEach(s -> System.out.println(s));
} catch (IOException ex) {
// do something or re-throw...
}
回答3:
Brandon Yarbrough is right, and to add to his answer:
if you use gcloud to login with your credentials then Brandon's code will work: google-cloud-nio
will use your login to access the files (and that'll work even if they are not public).
If you prefer to do it all in software, you can use this code to read credentials from a local file and then access your file from Google Cloud:
String myCredentials = "/path/to/my/key.json";
CloudStorageFileSystem fs =
CloudStorageFileSystem.forBucket(
"bucket",
CloudStorageConfiguration.DEFAULT,
StorageOptions.newBuilder()
.setCredentials(ServiceAccountCredentials.fromStream(
new FileInputStream(myCredentials)))
.build());
Path path = fs.getPath("/lolcat.csv");
List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
edit: you don't want to read all the lines at once so don't use realAllLines
, but once you have the Path
you can use any of the other techniques discussed above to read just the part of the file you need: you can read one line at a time or get a Channel
object.
来源:https://stackoverflow.com/questions/55225297/how-to-read-a-huge-csv-file-from-google-cloud-storage-line-by-line-using-java