Question
We have an HDInsight cluster with some tables in Hive. I want to query these tables with Python 3.6 from a client machine (outside Azure).
I have tried using PyHive, pyhs2, and also impyla, but I am running into various problems with all of them.
Does anybody have a working example of accessing an HDInsight Hive cluster from Python?
I have very little experience with this, and don't know how to configure PyHive (which seems the most promising option), especially regarding authorization.
With impyla:
from impala.dbapi import connect
conn = connect(host='redacted.azurehdinsight.net',port=443)
cursor = conn.cursor()
cursor.execute('SELECT * FROM cs_test LIMIT 100')
print(cursor.description) # prints the result set's schema
results = cursor.fetchall()
This gives:
Traceback (most recent call last):
  File "C:/git/ml-notebooks/impyla.py", line 3, in <module>
    cursor = conn.cursor()
  File "C:\Users\chris\Anaconda3\lib\site-packages\impala\hiveserver2.py", line 125, in cursor
    session = self.service.open_session(user, configuration)
  File "C:\Users\chris\Anaconda3\lib\site-packages\impala\hiveserver2.py", line 995, in open_session
    resp = self._rpc('OpenSession', req)
  File "C:\Users\chris\Anaconda3\lib\site-packages\impala\hiveserver2.py", line 923, in _rpc
    response = self._execute(func_name, request)
  File "C:\Users\chris\Anaconda3\lib\site-packages\impala\hiveserver2.py", line 954, in _execute
    .format(self.retries))
impala.error.HiveServer2Error: Failed after retrying 3 times
With PyHive:
from pyhive import hive
conn = hive.connect(host="redacted.azurehdinsight.net",port=443,auth="NOSASL")
# also tried other auth types, but as I said, I have no clue here
This gives:
Traceback (most recent call last):
  File "C:/git/ml-notebooks/PythonToHive.py", line 3, in <module>
    conn = hive.connect(host="redacted.azurehdinsight.net",port=443,auth="NOSASL")
  File "C:\Users\chris\Anaconda3\lib\site-packages\pyhive\hive.py", line 64, in connect
    return Connection(*args, **kwargs)
  File "C:\Users\chris\Anaconda3\lib\site-packages\pyhive\hive.py", line 164, in __init__
    response = self._client.OpenSession(open_session_req)
  File "C:\Users\chris\Anaconda3\lib\site-packages\TCLIService\TCLIService.py", line 187, in OpenSession
    return self.recv_OpenSession()
  File "C:\Users\chris\Anaconda3\lib\site-packages\TCLIService\TCLIService.py", line 199, in recv_OpenSession
    (fname, mtype, rseqid) = iprot.readMessageBegin()
  File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\protocol\TBinaryProtocol.py", line 134, in readMessageBegin
    sz = self.readI32()
  File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\protocol\TBinaryProtocol.py", line 217, in readI32
    buff = self.trans.readAll(4)
  File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\transport\TTransport.py", line 60, in readAll
    chunk = self.read(sz - have)
  File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\transport\TTransport.py", line 161, in read
    self.__rbuf = BufferIO(self.__trans.read(max(sz, self.__rbuf_size)))
  File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\transport\TSocket.py", line 117, in read
    buff = self.handle.recv(sz)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
Answer 1:
According to the official document Understand and resolve errors received from WebHCat on HDInsight, it says the following.
What is WebHCat
WebHCat is a REST API for HCatalog, a table and storage management layer for Hadoop. WebHCat is enabled by default on HDInsight clusters, and is used by various tools to submit jobs, get job status, etc. without logging in to the cluster.
So a workaround is to use WebHCat to run your Hive QL from Python; please refer to the Hive documentation to learn how to use it. For reference, there is a similar MSDN thread that discusses this.
Hope it helps.
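For what it's worth, here is a minimal sketch of that workaround using the requests library. The cluster name, admin password, and statusdir path are placeholders, and the query result is written to statusdir on the cluster's default storage account (e.g. blob storage) rather than streamed back to the client.
import time
import requests

cluster = "redacted"                        # HDInsight cluster name (placeholder)
base = "https://{}.azurehdinsight.net/templeton/v1".format(cluster)
auth = ("admin", "cluster-login-password")  # cluster login (HTTPS) credentials

# Submit the Hive query as a WebHCat job. The output lands under `statusdir`
# in the cluster's default storage, it is not returned in this response.
resp = requests.post(
    base + "/hive",
    auth=auth,
    data={
        "user.name": "admin",
        "execute": "SELECT * FROM cs_test LIMIT 100;",
        "statusdir": "/example/pyquery",
    },
)
resp.raise_for_status()
job_id = resp.json()["id"]

# Poll the job until it reaches a terminal state.
while True:
    status = requests.get(base + "/jobs/" + job_id,
                          auth=auth,
                          params={"user.name": "admin"}).json()
    state = status.get("status", {}).get("state")
    if state in ("SUCCEEDED", "FAILED", "KILLED"):
        print("Job finished with state:", state)
        break
    time.sleep(5)
After the job succeeds, read the stdout file under the statusdir from the cluster's default storage account to get the actual rows.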
Answer 2:
Technically you should be able to use the Thrift connector and PyHive, but I haven't had any success with this. However, I have successfully used the JDBC connector via JayDeBeAPI.
First you need to download the JDBC driver and its dependencies:
- http://central.maven.org/maven2/org/apache/hive/hive-jdbc/1.2.1/hive-jdbc-1.2.1-standalone.jar
- http://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.4/httpclient-4.4.jar
- http://central.maven.org/maven2/org/apache/httpcomponents/httpcore/4.4.4/httpcore-4.4.4.jar
I put mine in /jdbc and used JayDeBeAPI with the following connection string.
Edit: you also need to add /jdbc/* to your CLASSPATH environment variable.
import jaydebeapi

# username / password are the cluster login (HTTPS) credentials for the gateway
conn = jaydebeapi.connect("org.apache.hive.jdbc.HiveDriver",
                          "jdbc:hive2://my_ip_or_url:443/;ssl=true;transportMode=http;httpPath=/hive2",
                          [username, password],
                          "/jdbc/hive-jdbc-1.2.1.jar")
Source: https://stackoverflow.com/questions/46369427/accessing-hdinsight-hive-with-python