Read a JSON file as a DataFrame using PySpark?


How can I read the following JSON structure into a Spark DataFrame using PySpark?

My JSON structure:

    {"results":[{"a":1,"b":2,"c":"name"},{"a"

2 Answers
  • 2020-12-19 19:19

    JSON strings as variables

    If you have JSON strings as variables, you can do the following:

    # parallelize the JSON string into an RDD and let Spark infer the schema
    simple_json = '{"results":[{"a":1,"b":2,"c":"name"},{"a":2,"b":5,"c":"foo"}]}'
    rddjson = sc.parallelize([simple_json])
    df = sqlContext.read.json(rddjson)

    from pyspark.sql import functions as F
    # explode the 'results' array into one row per element, then flatten the struct
    df.select(F.explode(df.results).alias('results')).select('results.*').show(truncate=False)
    

    which will give you

    +---+---+----+
    |a  |b  |c   |
    +---+---+----+
    |1  |2  |name|
    |2  |5  |foo |
    +---+---+----+
    

    JSON strings as separate lines in a file (sparkContext and sqlContext)

    If you have JSON strings as separate lines in a file (one complete JSON document per line), you can read it with sparkContext into an RDD of strings, and the rest of the process is the same as above:

    rddjson = sc.textFile('/home/anahcolus/IdeaProjects/pythonSpark/test.csv')
    df = sqlContext.read.json(rddjson)
    df.select(F.explode(df['results']).alias('results')).select('results.*').show(truncate=False)
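
    For reference, the file above is assumed to contain one complete JSON document per line; a hypothetical snippet that would produce such a file (the path simply mirrors the example above):

    # write two JSON documents, one per line, matching the structure in the question
    with open('/home/anahcolus/IdeaProjects/pythonSpark/test.csv', 'w') as f:
        f.write('{"results":[{"a":1,"b":2,"c":"name"}]}\n')
        f.write('{"results":[{"a":2,"b":5,"c":"foo"}]}\n')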
    

    JSON strings as separate lines in a file (sqlContext only)

    If you have JSON strings as separate lines in a file, you can also use sqlContext alone, but the process is more involved because you have to define the schema yourself:

    df = sqlContext.read.text('path to the file')

    from pyspark.sql import functions as F
    from pyspark.sql import types as T
    # schema of the JSON document held in the text file's 'value' column
    schema = T.StructType([T.StructField('results', T.ArrayType(T.StructType([
        T.StructField('a', T.IntegerType()),
        T.StructField('b', T.IntegerType()),
        T.StructField('c', T.StringType())])))])
    df = df.select(F.from_json(df.value, schema).alias('results'))
    df.select(F.explode(df['results.results']).alias('results')).select('results.*').show(truncate=False)
    

    which should give you the same result as above.
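
    As a side note, on Spark 2.x and later the same file can likely be read through the SparkSession entry point instead of sqlContext; a minimal sketch, under the same assumption that the file holds one JSON document per line:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    # spark.read.json reads JSON-lines files directly; pass multiLine=True instead
    # if the whole document is a single pretty-printed JSON object spanning several lines
    df = spark.read.json('/home/anahcolus/IdeaProjects/pythonSpark/test.csv')
    df.select(F.explode(df['results']).alias('results')).select('results.*').show(truncate=False)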

    I hope the answer is helpful

  • 2020-12-19 19:27
    !pip install findspark
    !pip install pyspark
    import findspark
    findspark.init()  # locate the Spark installation before importing pyspark modules
    import pyspark
    from pyspark.sql import SparkSession

    sc = pyspark.SparkContext.getOrCreate()
    spark = SparkSession.builder.appName('abc').getOrCreate()
    

    Let's generate our own JSON data; this way we don't have to access the file system yet.

    stringJSONRDD = sc.parallelize((""" 
      { "id": "123",
        "name": "Katie",
        "age": 19,
        "eyeColor": "brown"
      }""",
       """{
        "id": "234",
        "name": "Michael",
        "age": 22,
        "eyeColor": "green"
      }""", 
      """{
        "id": "345",
        "name": "Simone",
        "age": 23,
        "eyeColor": "blue"
      }""")
    )
    

    Then create the DataFrame:

    swimmersJSON = spark.read.json(stringJSONRDD)
    

    Create a temporary view:

    swimmersJSON.createOrReplaceTempView("swimmersJSON")
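
    You can then query the view with SQL; a minimal usage sketch:

    # the temporary view is only visible within this SparkSession
    spark.sql("SELECT id, name, age, eyeColor FROM swimmersJSON").show()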
    

    Hope this helps you. For the complete code, you can refer to this GitHub repository.
