Python or bash script to pass all files in a folder to a java command line

jfs

To pass all .txt files in the current directory at once to the java subprocess:

#!/usr/bin/env python
from glob import glob
from subprocess import check_call

cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
# Unbuffered binary output; check_call raises CalledProcessError if java fails.
with open('output.txt', 'wb', 0) as file:
    check_call(cmd + glob('*.txt'), stdout=file)

It is similar to running the following shell command, but without involving the shell:

$ java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer *.txt > output.txt

To run no more than 10 subprocesses at a time, passing no more than 100 files at a time, you could use multiprocessing.pool.ThreadPool:

#!/usr/bin/env python
from glob import glob
from multiprocessing.pool import ThreadPool
from subprocess import call
try:
    from threading import get_ident # Python 3.3+
except ImportError: # Python 2
    from thread import get_ident

cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
def run_command(files):
    # Each pool thread appends to its own output file (named after the
    # thread id), so parallel writes never interleave within one file.
    with open('output%d.txt' % get_ident(), 'ab', 0) as file:
        return files, call(cmd + files, stdout=file)

all_files = glob('*.txt')
file_groups = (all_files[i:i+100] for i in range(0, len(all_files), 100))
for _ in ThreadPool(10).imap_unordered(run_command, file_groups):
    pass
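
Since run_command returns the (files, exit status) pair, a small variant of that loop can report failed batches instead of discarding the results; a sketch:

# Variant: report every batch whose java invocation exited nonzero.
for files, returncode in ThreadPool(10).imap_unordered(run_command, file_groups):
    if returncode != 0:
        print('java failed with status %d for batch: %r' % (returncode, files))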

It is similar to this xargs command (suggested by @abarnert):

$ ls *.txt | xargs --max-procs=10 --max-args=100 java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer >>output.txt

except that each thread in the Python script writes to its own output file to avoid corrupting the output due to parallel writes.
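
If you ultimately want one combined file, the per-thread outputs can be concatenated once every subprocess has finished. A minimal sketch, assuming the batch order doesn't matter and that the combined filename (combined_output.txt, a name chosen here) doesn't match the output*.txt pattern:

import glob
import shutil

# Concatenate each per-thread output file into a single combined file.
with open('combined_output.txt', 'wb') as combined:
    for name in sorted(glob.glob('output*.txt')):
        with open(name, 'rb') as part:
            shutil.copyfileobj(part, combined)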

If you want to do this from the shell instead of Python, the xargs tool can almost do everything you want.

You give it a command with a fixed list of arguments, and feed it input with a bunch of filenames, and it'll run the command multiple times, using the same fixed list plus a different batch of filenames from its input. The --max-args option sets the size of the biggest group. If you want to run things in parallel, the --max-procs option lets you do that.

That doesn't quite get you there, because xargs doesn't do the output redirection. But… do you really need 10 separate files instead of 1 big one? Because if 1 big one is OK, you can just redirect all of them to it:

ls | xargs --max-args=10 --max-procs=10 java -cp stanford-ner.jar\
    edu.stanford.nlp.process.PTBTokenizer >> output.txt

If you have 100 files, and you want to kick off 10 processes, each handling 10 files, all in parallel, that's easy.

First, you want to group them into chunks of 10. You can do this with slicing or with zipping iterators; in this case, since we definitely have a list, let's just use slicing:

import os

files = os.listdir(filedir)  # filedir: the directory holding the .txt files
groups = [files[i:i+10] for i in range(0, len(files), 10)]
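
(The "zipping iterators" alternative mentioned above is the classic grouper idiom from the itertools documentation; a sketch, with the None padding of the last chunk filtered back out:)

from itertools import zip_longest  # izip_longest on Python 2

def grouper(iterable, n, fillvalue=None):
    # Collect data into fixed-length chunks, padding the last one.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

groups = [[name for name in chunk if name is not None]
          for chunk in grouper(files, 10)]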

Now, you want to kick off a process for each group, and then wait for all of the processes, instead of waiting for each one to finish before kicking off the next. This is impossible with os.system, which is one of the many reasons the os.system documentation says "The subprocess module provides more powerful facilities for spawning new processes…"

procs = [subprocess.Popen(…) for group in groups]
for proc in procs:
    proc.wait()

So, what do you pass on the command line to give it 10 filenames instead of 1? If none of the names have spaces or other special characters, you can just ' '.join them. But otherwise, it's a nightmare. Another reason subprocess is better: you can just pass a list of arguments:

procs = [subprocess.Popen(['java', '-cp', 'stanford-ner.jar',
                           'edu.stanford.nlp.process.PTBTokenizer'] + group)
         for group in groups]
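
(As an aside: if you ever do need a single shell string, quoting each filename is what makes ' '.join safe. A sketch; shlex.quote is Python 3.3+, and pipes.quote is the Python 2 spelling:)

try:
    from shlex import quote  # Python 3.3+
except ImportError:
    from pipes import quote  # Python 2

# Build a shell-safe command line for one group of filenames.
cmdline = ('java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer '
           + ' '.join(quote(name) for name in group))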

But now how do you get all of the results?

One way is to go back to using a shell command line with the > redirection. But a better way is to do it in Python:

procs = []
files = []
for i, group in enumerate(groups):
    file = open('output_{}'.format(i), 'w')
    files.append(file)
    procs.append(subprocess.Popen([…same as before…], stdout=file))
for proc in procs:
    proc.wait()
for file in files:
    file.close()

(You might want to use a with statement with ExitStack, but I wanted to make sure this didn't require Python 2.7/3.3+, so I used explicit close.)
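
(For completeness, a sketch of that ExitStack variant on Python 3.3+, assuming cmd holds the fixed java arguments as a list:)

import subprocess
from contextlib import ExitStack  # Python 3.3+

with ExitStack() as stack:
    procs = []
    for i, group in enumerate(groups):
        # Files registered with the stack are closed on exit,
        # even if a later Popen call raises.
        file = stack.enter_context(open('output_{}'.format(i), 'w'))
        procs.append(subprocess.Popen(cmd + group, stdout=file))
    for proc in procs:
        proc.wait()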

Inside your input file directory, you can do the following in bash:

#!/bin/bash
# Collect the filenames in an array so names with spaces survive intact
# (embedding literal quotes in a string variable does not work in bash).
files=()
for file in *.txt
do
    files+=("$file")
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer "${files[@]}" > output.txt

If you want to run it as a script, save the file under some name, say my_exec.bash:

#!/bin/bash
if [ $# -ne 2 ]; then
    echo "Invalid input. Enter a directory and an output file"
    exit 1
fi
if [ ! -d "$1" ]; then
    echo "Please pass a valid directory"
    exit 1
fi
files=()
for file in "$1"/*.txt
do
    files+=("$file")
done
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer "${files[@]}" > "$2"

Make it executable:

chmod +x my_exec.bash

USAGE:

 ./my_exec.bash <folder> <output_file>