Pandas dataframe in pyspark to hive

时光说笑 2021-01-04 21:13

How do I write a pandas DataFrame to a Hive table?

I know if I have a spark dataframe, I can register it to a temporary table using

df.registerTempTabl         


        
3 Answers
  • 2021-01-04 21:36

    It looks like you are calling this on a pandas DataFrame rather than a Spark DataFrame.

    A pandas DataFrame has no registerTempTable method.

    You can create a Spark DataFrame from the pandas DataFrame instead.

    UPDATE:

    I tested this on Cloudera with the Anaconda parcel installed, which includes the pandas module.

    Make sure that PYSPARK_PYTHON points to your Anaconda Python installation (or another one that includes the pandas module) on all of your Spark workers. This is usually set in spark-conf/spark-env.sh.

    Here is the result of my test:

    >>> import pandas as pd
    >>> import numpy as np
    >>> df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
    >>> sdf = sqlContext.createDataFrame(df)
    >>> sdf.show()
    +---+---+---+
    |  A|  B|  C|
    +---+---+---+
    | 98| 33| 75|
    | 91| 57| 80|
    | 20| 87| 85|
    | 20| 61| 37|
    | 96| 64| 60|
    | 79| 45| 82|
    | 82| 16| 22|
    | 77| 34| 65|
    | 74| 18| 17|
    | 71| 57| 60|
    +---+---+---+
    
    >>> sdf.printSchema()
    root
     |-- A: long (nullable = true)
     |-- B: long (nullable = true)
     |-- C: long (nullable = true)
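On Spark 2.x and later, the same conversion can go through a SparkSession, and the resulting Spark DataFrame can be written to Hive directly with saveAsTable. The following is a minimal sketch rather than the tested session above: the function name pandas_to_hive and the table name demo_db.abc are hypothetical, and it assumes a session created with enableHiveSupport().

```python
def pandas_to_hive(spark, pandas_df, table_name, mode="overwrite"):
    """Convert a pandas DataFrame to a Spark DataFrame and save it as a table.

    Assumptions: `spark` is a SparkSession built with enableHiveSupport(),
    and `table_name` is a hypothetical target such as "demo_db.abc".
    """
    sdf = spark.createDataFrame(pandas_df)        # pandas -> Spark DataFrame
    sdf.write.mode(mode).saveAsTable(table_name)  # persist via the metastore
    return sdf


# Usage sketch (needs pyspark, pandas and a reachable Hive metastore):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.enableHiveSupport().getOrCreate()
#   pandas_to_hive(spark, df, "demo_db.abc")
```

With mode="overwrite" an existing table of the same name is replaced, so "append" may be the safer choice when the target already holds data.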
    
  • 2021-01-04 21:38

    First, convert the pandas DataFrame to a Spark DataFrame:

    from pyspark.sql import HiveContext
    hive_context = HiveContext(sc)
    df = hive_context.createDataFrame(pd_df)
    

    Then register it as a temp table, which lives only in memory:

    df.registerTempTable('tmp')
    

    Now you can use HiveQL to save the data into Hive:

    hive_context.sql("""insert overwrite table target partition(p='p') select a,b from tmp""")
    

    Note: keep using the same hive_context throughout, because a temp table is only visible to the context that registered it.
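The temp-table route above can be wrapped in two small helpers. This is a sketch under stated assumptions: the helper names are hypothetical, the table, partition, and column names come from the answer, and identifiers are interpolated without escaping, so they must come from trusted input.

```python
def build_insert_overwrite(target, partition_col, partition_val, columns, source):
    """Return a HiveQL INSERT OVERWRITE statement for one static partition."""
    cols = ", ".join(columns)
    return (
        "INSERT OVERWRITE TABLE {t} PARTITION ({pc}='{pv}') "
        "SELECT {c} FROM {s}"
    ).format(t=target, pc=partition_col, pv=partition_val, c=cols, s=source)


def save_via_temp_table(hive_context, spark_df, temp_name, statement):
    """Register the temp table, then run the statement on the same context.

    Assumption: `spark_df` was created by `hive_context`, so the temp table
    is visible to the query.
    """
    spark_df.registerTempTable(temp_name)
    return hive_context.sql(statement)


# Usage sketch with the names from the answer:
#   stmt = build_insert_overwrite("target", "p", "p", ["a", "b"], "tmp")
#   save_via_temp_table(hive_context, df, "tmp", stmt)
```

Passing the context explicitly makes it harder to register the table on one context and query it from another by accident.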

  • 2021-01-04 21:53

    I converted my pandas DataFrame to a temp table as follows:

    1) Convert the pandas DataFrame to a Spark DataFrame:

    spark_df=sqlContext.createDataFrame(Pandas_df)
    

    2) Check that the data was migrated properly:

    spark_df.select("*").show()
    

    3) Register the Spark DataFrame as a temp table for querying:

    spark_df.registerTempTable("table_name")
    

    Cheers..
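Steps 2 and 3 can also be done programmatically instead of by eye. The helpers below are a rough sketch with hypothetical names: check_migration compares only column names and row counts, not values, and register_for_sql prefers createOrReplaceTempView, which replaced the deprecated registerTempTable in Spark 2.0.

```python
def check_migration(pandas_df, spark_df):
    """Return a list of mismatches between the pandas and Spark DataFrames."""
    problems = []
    if list(pandas_df.columns) != list(spark_df.columns):
        problems.append("column names differ")
    if len(pandas_df) != spark_df.count():  # count() triggers a Spark job
        problems.append("row counts differ")
    return problems


def register_for_sql(spark_df, name):
    """Make the DataFrame queryable by name on whichever API is available."""
    if hasattr(spark_df, "createOrReplaceTempView"):  # Spark 2.0 and later
        spark_df.createOrReplaceTempView(name)
    else:                                             # Spark 1.x
        spark_df.registerTempTable(name)


# Usage sketch with the names from the answer:
#   assert not check_migration(Pandas_df, spark_df)
#   register_for_sql(spark_df, "table_name")
```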
