Question:
What is the best way to create/write/update a file in remote HDFS from a local Python script?
I am able to list files and directories, but writing seems to be a problem.
I have looked at hdfs and snakebite, but neither of them gives a clean way to do this.
Answer 1:
Try the hdfs library; it's really good. You can use write(): https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.write
Example:
To create a connection:
from hdfs import InsecureClient

# Connect over WebHDFS; host and port are the namenode's HTTP endpoint.
client = InsecureClient('http://host:port', user='ann')
from json import dump, dumps
records = [
{'name': 'foo', 'weight': 1},
{'name': 'bar', 'weight': 2},
]
# As a context manager:
with client.write('data/records.jsonl', encoding='utf-8') as writer:
dump(records, writer)
# Or, passing the serialized data in directly:
client.write('data/records.jsonl', data=dumps(records), encoding='utf-8')
For CSV you can do:
import pandas as pd

df = pd.read_csv('file.csv')

with client.write('path/output.csv', encoding='utf-8') as writer:
    df.to_csv(writer)
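To read the file back into a DataFrame, the same client's read() context manager works; a minimal sketch, reusing the path written above:

# Read the CSV back from HDFS into a pandas DataFrame.
with client.read('path/output.csv', encoding='utf-8') as reader:
    df = pd.read_csv(reader)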
Answer 2:
What's wrong with the other answers
They use WebHDFS, which is not enabled by default and is insecure without Kerberos or Apache Knox. This is what the upload function of the hdfs library linked above uses.
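(For what it's worth, that same hdfs library also ships a Kerberos-enabled client for secured clusters; a minimal sketch, assuming the requests-kerberos package is installed and a valid ticket exists, e.g. from kinit:)

from hdfs.ext.kerberos import KerberosClient

# Authenticates using the current Kerberos ticket instead of a plain user name.
client = KerberosClient('http://host:port')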
Native (more secure) ways to write to HDFS using Python
You can use pyspark. Example: How to write pyspark dataframe to HDFS and then how to read it back into dataframe? (a minimal sketch follows)
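A minimal pyspark sketch, assuming a Spark installation configured for your cluster; the HDFS URI, paths, and column names are placeholders:

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; on a configured cluster this picks up
# HDFS settings from the Hadoop configuration.
spark = SparkSession.builder.appName('write-to-hdfs').getOrCreate()

df = spark.createDataFrame([('foo', 1), ('bar', 2)], ['name', 'weight'])

# Write to HDFS, then read it back into a DataFrame.
df.write.mode('overwrite').csv('hdfs://namenode:8020/user/ann/output', header=True)
df2 = spark.read.csv('hdfs://namenode:8020/user/ann/output', header=True)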
snakebite has been mentioned, but it doesn't write files.
pyarrow has a FileSystem.open() function that should be able to write to HDFS as well, though I've not tried it (an untested sketch follows).
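An untested sketch along those lines. Recent pyarrow versions expose this as pyarrow.fs.HadoopFileSystem, which needs the libhdfs native library; host, port, user, and path here are placeholders:

from pyarrow import fs

# Connects through libhdfs; requires the Hadoop native libraries to be
# discoverable (e.g. via HADOOP_HOME / ARROW_LIBHDFS_DIR).
hdfs = fs.HadoopFileSystem(host='namenode', port=8020, user='ann')

with hdfs.open_output_stream('/user/ann/hello.txt') as stream:
    stream.write(b'hello from pyarrow\n')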
Answer 3:
Without using a complicated library built for HDFS, you can also simply use the requests package in Python to talk to WebHDFS:
import requests
from json import dumps

# WebHDFS endpoints have the form http://<namenode>:<port>/webhdfs/v1/<path>
params = {'op': 'CREATE'}

data = dumps(file)  # some file or object - also tested with the pickle library

response = requests.put('http://host:port/webhdfs/v1/path', params=params, data=data)
If the response status is 2xx (a successful CREATE returns 201 Created), your connection is working! This technique lets you use all the utilities exposed by Hadoop's RESTful API: listing files, making directories, reading and appending data, and so on.
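For example, listing a directory through the same API (host, port, and path are placeholders):

# GET http://host:port/webhdfs/v1/user/ann?op=LISTSTATUS
response = requests.get('http://host:port/webhdfs/v1/user/ann',
                        params={'op': 'LISTSTATUS'})

# A successful call returns a JSON FileStatuses object.
for status in response.json()['FileStatuses']['FileStatus']:
    print(status['pathSuffix'], status['type'])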
You can also convert curl commands to Python with these:
- Get Command for HDFS: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
- Convert to python: https://curl.trillworks.com/
Hope this helps!
Source: https://stackoverflow.com/questions/47926758/python-write-to-hdfs-file