get datatype of column using pyspark

野的像风 2021-01-31 15:58

We are reading data from a MongoDB collection. The collection column has values of two different types (e.g. (bson.Int64, int), (int, float)).

I am trying to get the datatype of the column using pyspark.

6 Answers
  • 2021-01-31 16:23

    Looks like your actual data and your metadata have different types. The actual data is of type string while the metadata is double.

    As a solution, I would recommend recreating the table with the correct datatypes.

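    A minimal sketch of the same idea via an explicit cast instead of recreating the table (the column name value and the target type double are assumptions for illustration):

    from pyspark.sql import functions as F

    # Hypothetical: cast the mismatched column to the type the metadata expects.
    df_fixed = df.withColumn("value", F.col("value").cast("double"))
    df_fixed.dtypes   # confirm the new type
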
  • I don't know how you are reading from MongoDB, but if you are using the MongoDB Spark connector, the datatypes are automatically converted to Spark types. To get the Spark SQL types, just use the schema attribute like this:

    df.schema
    
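    As a quick sketch of what that gives you (assuming df has already been loaded through the connector and has a hypothetical column named age):

    df.printSchema()              # tree view of column names and Spark SQL types
    df.schema["age"].dataType     # the type object for one column, e.g. LongType()
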
  • 2021-01-31 16:28

    Your question is broad, thus my answer will also be broad.

    To get the data types of your DataFrame columns, you can use dtypes, i.e.:

    >>> df.dtypes
    [('age', 'int'), ('name', 'string')]
    

    This means your column age is of type int and name is of type string.

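    A small, self-contained sketch that reproduces that output (the sample data and column names are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample data; the DDL schema string pins the exact Spark types.
    df = spark.createDataFrame([(25, "Alice"), (30, "Bob")], "age int, name string")
    df.dtypes   # [('age', 'int'), ('name', 'string')]
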
  • 2021-01-31 16:33

    I am assuming you are looking to get the data type of the data you read.

    input_data = [Read from Mongo DB operation]

    You can use

    type(input_data) 
    

    to inspect the data type.

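    For instance (a sketch, assuming input_data is the DataFrame returned by the read and that it has a column named age):

    type(input_data)          # e.g. <class 'pyspark.sql.dataframe.DataFrame'>

    # To see the Python type of an actual value, pull one row first:
    row = input_data.first()
    type(row["age"])          # e.g. <class 'int'>
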
  • 2021-01-31 16:40

    For anyone else who came here looking for an answer to the exact question in the post title (i.e. the data type of a single column, not multiple columns), I have been unable to find a simple way to do so.

    Luckily it's trivial to get the type using dtypes:

    def get_dtype(df, colname):
        # Return the dtype string (e.g. 'int', 'string') for the given column name.
        return [dtype for name, dtype in df.dtypes if name == colname][0]
    
    get_dtype(my_df, 'column_name')
    

    (note that this will only return the first column's type if there are multiple columns with the same name)

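    Equivalent one-liners for the same lookup, sketched with the same hypothetical names (dict() keeps the last entry for duplicate column names, and schema[...] returns the first match):

    dict(my_df.dtypes)['column_name']        # dtype string, e.g. 'int'
    my_df.schema['column_name'].dataType     # Spark type object, e.g. IntegerType()
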
  • 2021-01-31 16:42
    import pandas as pd
    pd.set_option('display.max_colwidth', None)  # prevent truncating of columns in Jupyter
    
    def count_column_types(spark_df):
        """Count number of columns per Spark SQL type."""
        return (pd.DataFrame(spark_df.dtypes, columns=['name', 'type'])
                  .groupby('type', as_index=False)
                  .agg(count=('name', 'count'),
                       names=('name', lambda x: " | ".join(set(x)))))
    

    Example output in a Jupyter notebook for a Spark DataFrame with 4 columns:

    count_column_types(my_spark_df)
    

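    For illustration only, a sketch with a hypothetical 4-column DataFrame (column names invented, assuming an active SparkSession named spark); the commented lines show the kind of grouped result the function returns:

    demo_df = spark.createDataFrame(
        [(1, 2.0, "a", "b")],
        "id int, score double, first_name string, last_name string")
    
    count_column_types(demo_df)
    #      type  count                   names
    # 0  double      1                   score
    # 1     int      1                      id
    # 2  string      2  first_name | last_name   (order inside 'names' may vary)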