PySpark SQL count returns a different number of rows than pure SQL


Question


I've started using pyspark in one of my projects. While testing different commands to explore the library's functionality, I found something I don't understand.

Take this code:

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.dataframe import DataFrame

sc = SparkContext()
hc = HiveContext(sc)

hc.sql("use test_schema")
hc.table("diamonds").count()

The last count() operation returns 53941 records. If I instead run a select count(*) from diamonds in Hive, I get 53940.

Does that pyspark count include the header?
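One way to rule out the count() call itself would be to run the same SQL through pyspark (my own check, using plain hc.sql):

# If this also returns 53941, the extra row is really in the data
# Spark reads, not an artifact of DataFrame.count().
hc.sql("select count(*) from diamonds").show()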

To check whether the header was included, I ran:

df = hc.sql("select * from diamonds").collect()
df[0]
df[1]

The first two elements are:

df[0] --> Row(carat=None, cut='cut', color='color', clarity='clarity', depth=None, table=None, price=None, x=None, y=None, z=None)
df[1] --> Row(carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55, price=326, x=3.95, y=3.98, z=2.43)

The 0th element doesn't look like the header.
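(As an aside, a lighter way to peek at the first rows, without pulling the whole table to the driver, would be:)

# take(2) ships only the first two rows to the driver,
# instead of materializing the entire table with collect().
hc.sql("select * from diamonds").take(2)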

Does anyone have an explanation for this?

Thanks! Ale


Answer 1:


Hive can give incorrect counts when stale statistics are used to speed up calculations. To see if this is the problem, in Hive try:

SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM diamonds;

Alternatively, refresh the statistics. If your table is not partitioned:

ANALYZE TABLE diamonds COMPUTE STATISTICS;
SELECT COUNT(*) FROM diamonds;

If it is partitioned:

ANALYZE TABLE diamonds PARTITION(partition_column) COMPUTE STATISTICS;
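Either statement can also be issued without leaving pyspark, through the same HiveContext (assuming your deployment lets Spark execute these Hive commands, which is typical for Hive-enabled builds):

# Refresh the statistics for the non-partitioned case, then recount.
hc.sql("ANALYZE TABLE diamonds COMPUTE STATISTICS")
hc.sql("SELECT COUNT(*) FROM diamonds").show()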

Also take another look at your first row (df[0] in your question). It does look like an improperly formatted header row.
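If that header row really was ingested as data, a quick workaround is to exclude it when counting (a sketch, assuming carat is numeric, so the header line parses to NULL; the durable fix is to strip the header from the source file and reload the table):

# Count only real data rows: the header line, parsed against the
# numeric schema, shows up with carat=None.
hc.table("diamonds").filter("carat IS NOT NULL").count()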



Source: https://stackoverflow.com/questions/48639592/pyspark-sql-count-returns-different-number-of-rows-than-pure-sql
