Question
Update based on Anthony Sottile's Answer
I re-implemented his solution to simplify the problem. Let's take Docker and Django out of the equation. The goal is to use pandas to read an Excel file with both of the following methods:
1. python example.py - < /path/to/file.xlsx
2. cat /path/to/file.xlsx | python example.py -
where example.py is reproduced below:
import argparse
import contextlib
from typing import IO
import sys

import pandas as pd


@contextlib.contextmanager
def file_ctx(filename: str) -> IO[bytes]:
    if filename == '-':
        yield sys.stdin.buffer
    else:
        with open(filename, 'rb') as f:
            yield f


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('FILE')
    args = parser.parse_args()
    with file_ctx(args.FILE) as input_file:
        print(input_file.read())
        df = pd.read_excel(input_file)
        print(df)


if __name__ == "__main__":
    main()
The problem is that pandas (see the traceback below) fails with method 2 (the pipe), although it works fine with method 1 (the redirect).
Simply printing the raw contents of the Excel file, by contrast, works with both methods.
In case you want to easily reproduce the Docker environment:
First build Docker image named pandas:
docker build --pull -t pandas - <<EOF
FROM python:latest
RUN pip install pandas xlrd
EOF
Then use pandas Docker image to run:
docker run --rm -i -v /path/to/example.py:/example.py pandas python example.py - < /path/to/file.xlsx
Note that it correctly prints the raw contents of the Excel file, but pandas is unable to read it.
This produces a more concise traceback, similar to the one further below:
Traceback (most recent call last):
File "example.py", line 29, in <module>
main()
File "example.py", line 24, in main
df = pd.read_excel(input_file)
File "/usr/local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 208, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 310, in read_excel
io = ExcelFile(io, engine=engine)
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 819, in __init__
self._reader = self._engines[engine](self._io)
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_xlrd.py", line 21, in __init__
super().__init__(filepath_or_buffer)
File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 356, in __init__
filepath_or_buffer.seek(0)
io.UnsupportedOperation: File or stream is not seekable.
To show the code working when the Excel file is mounted into the container (i.e. not passed via stdin):
docker run --rm -i -v /path/to/example.py:/example.py -v /path/to/file.xlsx:/file.xlsx pandas python example.py file.xlsx
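The `seek(0)` failure at the bottom of the traceback can be reproduced without Docker or pandas. The following is a minimal sketch, assuming `os.pipe()` behaves as it does on a POSIX system; the helper name `try_rewind` and the sample bytes are hypothetical. It shows that the read end of a pipe is not seekable, while an in-memory copy of the same bytes is.

```python
import io
import os


def try_rewind(stream) -> str:
    """Attempt the same seek(0) call that appears in the traceback."""
    try:
        stream.seek(0)
    except io.UnsupportedOperation as exc:
        return f"failed: {exc}"
    return "ok"


# Build a pipe and write a few bytes into it, standing in for `cat file | python ...`.
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, "wb") as writer:
    writer.write(b"PK\x03\x04 fake xlsx bytes")

with os.fdopen(read_fd, "rb") as piped:
    pipe_seekable = piped.seekable()      # a pipe has no file position to move
    pipe_result = try_rewind(piped)       # seek(0) raises io.UnsupportedOperation
    buffered = io.BytesIO(piped.read())   # copy the pipe contents into memory

buffer_seekable = buffered.seekable()     # BytesIO supports random access
buffer_result = try_rewind(buffered)

print(pipe_seekable, pipe_result)
print(buffer_seekable, buffer_result)
```

The in-memory copy is what makes the difference: once the bytes live in a `BytesIO`, any consumer that needs to rewind the stream can do so.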
Original problem description (for additional context)
Take the scenario where, on the host system, you have a file at /tmp/test.txt and you want to use head on it, but within a Docker container (echo 'Hello World!' > /tmp/test.txt to reproduce the example data I have):
You can run:
docker run -i busybox head -1 - < /tmp/test.txt
to print the first line out to screen:
OR
cat /tmp/test.txt | docker run -i busybox head -1 -
and the output is:
Hello World!
Even with a binary format like .xlsx instead of plaintext, the above can be done and you would get some weird output similar to:
�Oxl/_rels/workbook.xml.rels���j�0
��}
The point above is that head works with both binary and text formats even through the abstraction of Docker.
But in my own argparse-based CLI (actually a custom Django management command, which I believe makes use of argparse), I get the following error when attempting to use pandas' read_excel within a Docker context.
The error that is printed is as follows:
Traceback (most recent call last):
File "./manage.py", line 15, in <module>
execute_from_command_line(sys.argv)
File "/opt/conda/lib/python3.7/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
utility.execute()
File "/opt/conda/lib/python3.7/site-packages/django/core/management/__init__.py", line 375, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/opt/conda/lib/python3.7/site-packages/django/core/management/base.py", line 323, in run_from_argv
self.execute(*args, **cmd_options)
File "/opt/conda/lib/python3.7/site-packages/django/core/management/base.py", line 364, in execute
output = self.handle(*args, **options)
File "/home/jovyan/sequence_databaseApp/management/commands/seq_db.py", line 54, in handle
df_snapshot = pd.read_excel(options['FILE'].buffer, sheet_name='Snapshot', header=0, dtype=dtype)
File "/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py", line 208, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 310, in read_excel
io = ExcelFile(io, engine=engine)
File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 819, in __init__
self._reader = self._engines[engine](self._io)
File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_xlrd.py", line 21, in __init__
super().__init__(filepath_or_buffer)
File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 356, in __init__
filepath_or_buffer.seek(0)
io.UnsupportedOperation: File or stream is not seekable.
Concretely,
docker run -i <IMAGE> ./manage.py my_cli import - < /path/to/file.xlsx
does not work,
but ./manage.py my_cli import - < /path/to/file.xlsx
does work!
Somehow there is a difference within the Docker context.
However I also note, even taking Docker out of the equation:
cat /path/to/file.xlsx | ./manage.py my_cli import -
does not work
though:
./manage.py my_cli import - < /path/to/file.xlsx
does work (as mentioned before)
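The difference between the redirect and the pipe comes down to what kind of file descriptor the process receives as stdin. With `< file` the shell opens the file and hands it over as a regular, seekable file; with `cat file |` stdin is a pipe. Under `docker run -i` the container's stdin is presumably a pipe or similar stream as well, since the Docker client forwards the data rather than passing the host file itself, which would match the traceback above. The sketch below, assuming a POSIX system, classifies a descriptor with `os.fstat`; the helper name `describe_fd` is hypothetical.

```python
import os
import stat
import tempfile


def describe_fd(fd: int) -> str:
    """Classify a file descriptor as a regular file, a pipe, a terminal, or other."""
    mode = os.fstat(fd).st_mode
    if stat.S_ISREG(mode):
        return "regular file (seekable)"
    if stat.S_ISFIFO(mode):
        return "pipe (not seekable)"
    if os.isatty(fd):
        return "terminal (not seekable)"
    return "other"


# A temporary file stands in for `< /path/to/file.xlsx`.
with tempfile.TemporaryFile() as regular:
    file_kind = describe_fd(regular.fileno())

# A pipe stands in for `cat /path/to/file.xlsx | ...`.
read_fd, write_fd = os.pipe()
try:
    pipe_kind = describe_fd(read_fd)
finally:
    os.close(read_fd)
    os.close(write_fd)

print(file_kind)
print(pipe_kind)
```

Calling `describe_fd(sys.stdin.fileno())` at the start of a script is a quick way to check which case applies in a given environment.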
Finally, the code I am using (you should be able to save it as my_cli.py under management/commands to get it working within a Django project):
import argparse
import sys

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = 'my_cli help'

    def add_arguments(self, parser):
        subparsers = parser.add_subparsers(
            title='commands', dest='command', help='command help')
        subparsers.required = True
        parser_import = subparsers.add_parser('import', help='import help')
        parser_import.add_argument('FILE', type=argparse.FileType('r'), default=sys.stdin)

    def handle(self, *args, **options):
        import pandas as pd
        df = pd.read_excel(options['FILE'].buffer, header=0)
        print(df)
Answer 1:
It looks as though you're reading the file in text mode (FileType('r') / sys.stdin).
According to this bpo issue, argparse does not support opening binary files directly.
I'd suggest handling the file type yourself with code similar to this (I'm not familiar with the Django / pandas way, so I've simplified it down to plain Python):
import argparse
import contextlib
import io
import sys
from typing import IO, Iterator


@contextlib.contextmanager
def file_ctx(filename: str) -> Iterator[IO[bytes]]:
    if filename == '-':
        # stdin may be a pipe, which cannot seek; buffer it in memory instead
        yield io.BytesIO(sys.stdin.buffer.read())
    else:
        with open(filename, 'rb') as f:
            yield f


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument('FILE')
    args = parser.parse_args()
    with file_ctx(args.FILE) as input_file:
        # do whatever you need with that input file
        ...
    return 0


if __name__ == '__main__':
    raise SystemExit(main())
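To illustrate why text mode is a problem for a binary format, the sketch below wraps some made-up bytes in an `io.TextIOWrapper`, which is the same kind of object as `sys.stdin`. The sample bytes and the strict UTF-8 decoding are assumptions for illustration; the encoding of the real stdin depends on the environment. Decoding fails, while the underlying `.buffer` yields the raw bytes unchanged.

```python
import io

# Made-up payload: the ZIP magic number followed by bytes that are not valid UTF-8.
payload = b"PK\x03\x04\xff\xfe\x9c\x00binary"

# A text-mode stream layered over the bytes, comparable to sys.stdin.
text_stream = io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8")

try:
    text_stream.read()
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Rewind the underlying binary stream and read the raw bytes instead,
# which is what going through sys.stdin.buffer amounts to.
text_stream.buffer.seek(0)
raw = text_stream.buffer.read()

print(type(decode_error).__name__, raw == payload)
```

This is why the code above reads from `sys.stdin.buffer` and opens named files with `'rb'` rather than relying on a text-mode handle.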
Answer 2:
Based very heavily on Anthony Sottile's Answer but with a slight edit that completely solves the problem:
import argparse
import contextlib
import io
import sys
from typing import IO, Iterator

import pandas as pd


@contextlib.contextmanager
def file_ctx(filename: str) -> Iterator[IO[bytes]]:
    if filename == '-':
        # Copy stdin into memory so that pandas gets a seekable stream
        yield io.BytesIO(sys.stdin.buffer.read())
    else:
        with open(filename, 'rb') as f:
            yield f


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('FILE')
    args = parser.parse_args()
    with file_ctx(args.FILE) as input_file:
        print(input_file.read())  # debug output; consumes the stream
        input_file.seek(0)        # rewind before handing the stream to pandas
        df = pd.read_excel(input_file)
        print(df)


if __name__ == "__main__":
    main()
I got the idea after reading this answer to "Pandas 0.25.0 and xlsx from response content stream".
How this looks in the original question's Django-based context:
import contextlib
import io
import sys
from typing import IO, Iterator

import pandas as pd
from django.core.management.base import BaseCommand


@contextlib.contextmanager
def file_ctx(filename: str) -> Iterator[IO[bytes]]:
    if filename == '-':
        # Copy stdin into memory so that pandas gets a seekable stream
        yield io.BytesIO(sys.stdin.buffer.read())
    else:
        with open(filename, 'rb') as f:
            yield f


class Command(BaseCommand):
    help = 'my_cli help'

    def add_arguments(self, parser):
        subparsers = parser.add_subparsers(
            title='commands', dest='command', help='command help')
        subparsers.required = True
        parser_import = subparsers.add_parser('import', help='import help')
        parser_import.add_argument('FILE')

    def handle(self, *args, **options):
        with file_ctx(options['FILE']) as input_file:
            df = pd.read_excel(input_file)
            print(df)
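One possible refinement, not part of the answers above: copying all of stdin into memory is only needed when stdin cannot seek. When it is already a regular file (the `< /path/to/file.xlsx` case), the stream could be passed through as-is. The sketch below shows that idea; the helper name `ensure_seekable` is hypothetical, and a pipe created with `os.pipe()` stands in for piped stdin.

```python
import io
import os
from typing import IO


def ensure_seekable(stream: IO[bytes]) -> IO[bytes]:
    """Return `stream` unchanged if it supports seek(), else an in-memory copy."""
    if stream.seekable():
        return stream
    return io.BytesIO(stream.read())


# Demonstration with a pipe standing in for piped stdin.
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, "wb") as writer:
    writer.write(b"example bytes")
with os.fdopen(read_fd, "rb") as piped:
    from_pipe = ensure_seekable(piped)

# A stream that can already seek is returned as the very same object.
already_seekable = io.BytesIO(b"example bytes")
unchanged = ensure_seekable(already_seekable)

print(from_pipe.seekable(), from_pipe.read(), unchanged is already_seekable)
```

Inside `file_ctx`, the `'-'` branch could then yield `ensure_seekable(sys.stdin.buffer)` instead of always building a `BytesIO`; whether the saved copy matters depends on how large the workbooks are.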
Source: https://stackoverflow.com/questions/59468669/how-to-pass-a-binary-file-as-stdin-to-a-docker-containerized-python-script-using