Get input file name in streaming Hadoop program

让人想犯罪 · Submitted on 2019-12-03 18:53:39

Question


I am able to find the name of the input file in a mapper class using FileSplit when writing the program in Java.

Is there a corresponding way to do this when I write a program in Python (using Hadoop Streaming)?

I found the following in the Hadoop Streaming documentation on the Apache site:

See Configured Parameters. During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. In your code, use the parameter names with the underscores.

But I still can't understand how to make use of this inside my mapper.

Any help is highly appreciated.

Thanks


Answer 1:


According to "Hadoop: The Definitive Guide":

Hadoop sets job configuration parameters as environment variables for Streaming programs. However, it replaces non-alphanumeric characters with underscores to make sure they are valid names. The following Python expression illustrates how you can retrieve the value of the mapred.job.id property from within a Python Streaming script:

os.environ["mapred_job_id"]
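For example, a minimal sketch of a complete streaming mapper that reads this property (assuming the job was launched with Hadoop Streaming, which exports the "mapred" parameters as environment variables as described above):

#!/usr/bin/env python
# Minimal sketch: tag every input line with the job id taken from the
# environment. mapred.job.id is exported as mapred_job_id.
import os
import sys

job_id = os.environ.get("mapred_job_id", "unknown-job")

for line in sys.stdin:
    # Emit tab-separated key/value pairs, as streaming expects.
    sys.stdout.write("%s\t%s" % (job_id, line))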

You can also set environment variables for the Streaming process launched by MapReduce by applying the -cmdenv option to the Streaming launcher program (once for each variable you wish to set). For example, the following sets the MAGIC_PARAMETER environment variable:

-cmdenv MAGIC_PARAMETER=abracadabra
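For context, a hedged example of what a full launch command might look like (the streaming jar path varies by Hadoop version and distribution, and the input/output paths and script names here are placeholders):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input /user/me/input \
    -output /user/me/output \
    -mapper mapper.py \
    -file mapper.py \
    -cmdenv MAGIC_PARAMETER=abracadabra

Inside mapper.py, the value would then be available as os.environ["MAGIC_PARAMETER"].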




Answer 2:


By parsing the mapreduce_map_input_file (new) or map_input_file (deprecated) environment variable, you can get the map input file name (see the sketch after the note below).

Note: both environment variable names are case-sensitive; all letters are lower-case.
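A minimal sketch of a mapper that uses this; the fallback order is an assumption (try the new name first, then the deprecated one):

#!/usr/bin/env python
# Sketch: prefix every record with the name of the file it came from.
# Tries the new variable first, then falls back to the deprecated one.
import os
import sys

input_file = os.environ.get("mapreduce_map_input_file",
                            os.environ.get("map_input_file", "unknown"))

for line in sys.stdin:
    sys.stdout.write("%s\t%s" % (input_file, line))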




Answer 3:


The new environment variable for Hadoop 2.x is MAPREDUCE_MAP_INPUT_FILE.
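Note that this spelling disagrees with the lower-case names in Answer 2; in streaming scripts the exported name is usually lower-case. A defensive, hypothetical helper that scans the environment case-insensitively would cover both spellings:

import os

def get_map_input_file():
    # Hypothetical helper: accept either MAPREDUCE_MAP_INPUT_FILE or
    # mapreduce_map_input_file, whichever the Hadoop version exports.
    for key, value in os.environ.items():
        if key.lower() == "mapreduce_map_input_file":
            return value
    return None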



Source: https://stackoverflow.com/questions/7449756/get-input-file-name-in-streaming-hadoop-program
