Question
We're just getting started evaluating the datalake service at Azure. We created our lake, and via the portal we can see the two public URLs for the service. (One is an https:// scheme, the other an adl:// scheme)
The datalake documentation states that there are indeed two interfaces: the WebHDFS REST API, and ADL. So, I am assuming the https:// scheme gets me the WebHDFS interface. However, I can find no more information at Azure about using this interface.
I tried poking at the given https:// URL, with web browser and curl. The service is responding. Replies are JSON, which is as expected, since a datalake is an instance of Hadoop. However, I cannot seem to get access to my files [which I uploaded into our lake via the portal].
If I do a GET to "/foo.txt", for example, the reply is an error, ResourceNotFound.
If I do a GET using the typical Hadoop HDFS syntax, "/webhdfs/v1/foo.txt", the reply is an error, AuthenticationFailed. Additional text indicates a missing access token. This seems more promising. However, I can't find anything about generating such an access token.
There is some documentation on using the ADL interface, and .NET and Visual Studio, but this is not what I want, initially.
Any help much appreciated!
Answer 1:
I am indebted to this forum post by Matthew Hicks, which outlined how to do this with curl. I took it and wrapped it in PowerShell. I'm sure there are many ways to accomplish this, but here's one that works.
First set up an AAD application so that you can fill in the client_id and client_secret mentioned below. (That assumes you want to automate this rather than having an interactive login. If you want an interactive login, then there's a link to that approach in the forum post above.)
Then fill in the settings in the first 5 lines and run the following PowerShell script:
$client_id = "<client id>";
$client_secret = "<secret>";
$tenant = "<tenant>";
$adlsAccount = "<account>";
cd D:\path\to\curl
#authenticate
$cmd = { .\curl.exe -X POST https://login.microsoftonline.com/$tenant/oauth2/token -F grant_type=client_credentials -F resource=https://management.core.windows.net/ -F client_id=$client_id -F client_secret=$client_secret };
$responseToken = Invoke-Command -scriptblock $cmd;
$accessToken = (ConvertFrom-Json $responseToken).access_token;
#list root folders
$cmd = {.\curl.exe -X GET -H "Authorization: Bearer $accessToken" https://$adlsAccount.azuredatalakestore.net/webhdfs/v1/?op=LISTSTATUS };
$foldersResponse = Invoke-Command -scriptblock $cmd;
#loop through directories
(ConvertFrom-Json $foldersResponse).FileStatuses.FileStatus | ForEach-Object { $_.pathSuffix }
#list files in one folder
$cmd = {.\curl.exe -X GET -H "Authorization: Bearer $accessToken" https://$adlsAccount.azuredatalakestore.net/webhdfs/v1/weather/?op=LISTSTATUS };
$weatherResponse = Invoke-Command -scriptblock $cmd;
(ConvertFrom-Json $weatherResponse).FileStatuses.FileStatus | ForEach-Object { $_.pathSuffix }
#download one file
$cmd = {.\curl.exe -L "https://$adlsAccount.azuredatalakestore.net/webhdfs/v1/weather/2007small.csv?op=OPEN" -H "Authorization: Bearer $accessToken" -o d:\temp\curl\2007small.csv };
Invoke-Command -scriptblock $cmd;
#upload one file
$cmd = {.\curl.exe -i -X PUT -L "https://$adlsAccount.azuredatalakestore.net/webhdfs/v1/weather/new2007small.csv?op=CREATE" -T "D:\temp\weather\smallcsv\new2007small.csv" -H "Authorization: Bearer $accessToken" };
Invoke-Command -scriptblock $cmd;
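If you're not on Windows, the same two steps (get a token, then call WebHDFS with it as a Bearer header) can be sketched in Python with only the standard library. This is a rough sketch rather than a tested client: the function names are my own, and it sends the token request as a URL-encoded form instead of the multipart form that curl's -F produces, on the assumption that the token endpoint accepts the standard OAuth2 encoding. The resource and URL patterns are copied from the script above.

```python
import json
import urllib.parse
import urllib.request

# Endpoint and resource taken from the PowerShell script above.
TOKEN_URL = "https://login.microsoftonline.com/{tenant}/oauth2/token"
RESOURCE = "https://management.core.windows.net/"


def build_token_request(tenant, client_id, client_secret):
    """Build (but do not send) the client_credentials token request."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "resource": RESOURCE,
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL.format(tenant=tenant), data=body, method="POST")


def webhdfs_url(account, path, op):
    """Return the WebHDFS URL for `path` in the given account."""
    quoted = urllib.parse.quote(path.strip("/"))
    return f"https://{account}.azuredatalakestore.net/webhdfs/v1/{quoted}?op={op}"


def build_webhdfs_request(account, path, op, access_token, method="GET"):
    """Build a WebHDFS request carrying the Bearer token."""
    req = urllib.request.Request(webhdfs_url(account, path, op), method=method)
    req.add_header("Authorization", f"Bearer {access_token}")
    return req


def fetch_json(request):
    """Send a request and decode the JSON reply."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


# Usage (needs real credentials, so it is left commented out):
# token = fetch_json(build_token_request("<tenant>", "<client id>", "<secret>"))["access_token"]
# listing = fetch_json(build_webhdfs_request("<account>", "/", "LISTSTATUS", token))
# for entry in listing["FileStatuses"]["FileStatus"]:
#     print(entry["pathSuffix"])
```

Keeping the request builders separate from fetch_json means you can inspect the URL and headers before anything goes over the wire, which helps when you're chasing an AuthenticationFailed reply like the one in the question.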
Source: https://stackoverflow.com/questions/36410042/how-to-access-azure-datalake-using-the-webhdfs-api