How do you deal with empty or missing input files in Apache Pig?

星月不相逢 2021-01-12 14:40

Our workflow uses an AWS Elastic MapReduce cluster to run a series of Pig jobs that turn a large amount of data into aggregated reports. Unfortunately, the input data is

2 answers
  •  孤街浪徒
    2021-01-12 15:20

    (For posterity, a sub-par solution we've come up with:)

    To deal with the 0-byte problem, we've found that we can detect the situation and instead insert a file with a single newline. This causes a message like:

    Encountered Warning ACCESSING_NON_EXISTENT_FIELD 13 time(s).
    

    but at least Pig doesn't crash with an exception.
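A minimal sketch of the detect-and-replace step described above, assuming the input files sit in a local staging directory before being handed to Pig (for S3 or HDFS the same check would have to go through the corresponding client). The function name `pad_empty_file` is hypothetical and only used for illustration.

```python
import os


def pad_empty_file(path):
    """Overwrite a 0-byte file with a single newline.

    Returns True if the file was empty and has been padded,
    False if it already had content and was left untouched.
    """
    if os.path.getsize(path) == 0:
        # newline="" keeps "\n" from being translated on any platform
        with open(path, "w", newline="") as f:
            f.write("\n")
        return True
    return False
```

Running this over every input file before starting the job leaves non-empty files alone and turns each empty one into a one-line file, which is the case that triggers the warning rather than the exception.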

    Alternatively, we could produce a line with the appropriate number of '\t' characters for that file which would avoid the warning, but it would insert garbage into the data that we would then have to filter out.
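The tab-padded variant could look roughly like this. It assumes tab-delimited input where a record with N fields contains N-1 tab characters; `placeholder_line` and `is_placeholder` are hypothetical helper names, and the field count has to be supplied per file.

```python
def placeholder_line(num_fields):
    """Build a line of empty tab-separated fields, e.g. 3 fields -> "\t\t\n"."""
    if num_fields < 1:
        raise ValueError("num_fields must be at least 1")
    return "\t" * (num_fields - 1) + "\n"


def is_placeholder(line):
    """Return True if every tab-separated field on the line is empty."""
    fields = line.rstrip("\r\n").split("\t")
    return all(field == "" for field in fields)
```

Note the downside mentioned above: a check like `is_placeholder` cannot tell the dummy row apart from a genuine record whose fields are all empty, so whatever filter removes the garbage row would drop those too.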

    The same ideas could be used to handle the case where there are no input files at all, by creating a dummy file, but this has the same downsides as listed above.
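For the no-input-files case, a sketch along the same lines, again assuming a local staging directory as a stand-in for the real input location. The function name `ensure_input_file` and the file name `part-placeholder` are made up for this example.

```python
import os


def ensure_input_file(input_dir, name="part-placeholder"):
    """Create a one-newline dummy file if input_dir holds no regular files.

    Returns the path of the dummy file if one was created, otherwise None.
    """
    os.makedirs(input_dir, exist_ok=True)
    with os.scandir(input_dir) as entries:
        if any(entry.is_file() for entry in entries):
            return None  # real input exists, nothing to do
    path = os.path.join(input_dir, name)
    # newline="" keeps "\n" from being translated on any platform
    with open(path, "w", newline="") as f:
        f.write("\n")
    return path
```

The dummy file contains only a newline, so it behaves like the padded 0-byte file: the job runs, at the cost of the same warning.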
