subprocess popen to run commands (HDFS/hadoop)

Posted by 跟風遠走 on 2019-12-24 00:40:07

Question


I am trying to use subprocess.Popen to run commands on my machine.

This is what I have so far:

cmdvec = ['/usr/bin/hdfs', 'dfs', '-text', '/data/ds_abc/clickstream/{d_20151221-2300}/*', '|', 'wc', '-l']

subproc = subprocess.Popen(cmdvec, stdout=subprocess.PIPE, stdin=None, stderr=subprocess.STDOUT)

If I run the command in my terminal, I get this output:

15/12/21 16:09:31 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
15/12/21 16:09:31 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 9cd4009fb896ac12418449e4678e16eaaa3d5e0a]
15/12/21 16:09:31 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
15305

The number 15305 is the value I want.

After splitting the command into a list and running it through Popen, I try to read the output lines like this:

for i in subproc.stdout:
    print(i)

However, this prints all the data from the files, as if only the following command had been run:

/usr/bin/hdfs dfs -text /data/ds_abc/clickstream/{d_20151221-2300}/*

It doesn't seem like the pipe | was used to count the number of lines in all the files.


Answer 1:


In your example, passing the pipe character | as an argument to subprocess.Popen does not create a pipeline of processes the way it would in a shell such as Bash. Instead, the | character is passed as a literal argument to a single process (here, hdfs), along with 'wc' and '-l'.
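To see this behavior in isolation, here is a minimal sketch that does not need Hadoop. It assumes Python 3.5 or later, and it swaps hdfs for a small stand-in child process (an assumption for illustration only) that prints the arguments it receives:

```python
import subprocess
import sys

# Stand-in child program (hypothetical, used instead of hdfs):
# it simply prints the arguments it was given.
child = [sys.executable, "-c", "import sys; print(sys.argv[1:])"]

# No shell is involved here, so '|', 'wc' and '-l' are handed to the
# child as ordinary arguments; no second process is started.
result = subprocess.run(child + ["|", "wc", "-l"], stdout=subprocess.PIPE)
received = result.stdout.decode().strip()
print(received)
```

The child reports '|', 'wc' and '-l' among its own arguments, which is the same thing that happens to hdfs in the question.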

To simulate a Bash-style pipeline, you need to chain two separate subprocess.Popen calls, connecting the stdout of the first process to the stdin of the second. The documentation for the subprocess module contains more details:

https://docs.python.org/2/library/subprocess.html#replacing-shell-pipeline
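Following the pattern in that documentation section, the chaining might look roughly like the sketch below. It assumes Python 3 and that wc is on the PATH; count_lines is a hypothetical helper name, not something from the question. Unlike the original snippet, stderr is not merged into stdout, on the assumption that the Hadoop INFO log lines should not be counted.

```python
import subprocess


def count_lines(producer_cmd):
    """Run producer_cmd and pipe its stdout into `wc -l`.

    Hypothetical helper, roughly equivalent to `producer_cmd | wc -l`.
    """
    # First process: its stdout goes into a pipe. stderr is left alone,
    # so log lines written to stderr are not counted by wc.
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    # Second process: reads the first process's stdout as its stdin.
    counter = subprocess.Popen(
        ["wc", "-l"], stdin=producer.stdout, stdout=subprocess.PIPE
    )
    # Close our copy of the pipe so the producer can receive SIGPIPE
    # if wc exits before it has finished writing.
    producer.stdout.close()
    output, _ = counter.communicate()
    producer.wait()
    return int(output.decode().strip())


if __name__ == "__main__":
    # Command from the question, without the '|', 'wc', '-l' entries.
    cmd = ["/usr/bin/hdfs", "dfs", "-text",
           "/data/ds_abc/clickstream/{d_20151221-2300}/*"]
    print(count_lines(cmd))
```

Because no shell is involved, the path pattern reaches hdfs exactly as written in the list.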



Source: https://stackoverflow.com/questions/34406547/subprocess-popen-to-run-commands-hdfs-hadoop
