Spark Structured Streaming using sockets, set SCHEMA, Display DATAFRAME in console

情歌与酒 2020-12-07 00:10

How can I set a schema for a streaming DataFrame in PySpark?

from pyspark.sql import SparkSession
from p         


        
1 answer
  • 2020-12-07 00:28

    TextSocketSource doesn't provide any integrated parsing options. It supports only two fixed formats:

    • timestamp and text if includeTimestamp is set to true with the following schema:

      StructType([
          StructField("value", StringType()),
          StructField("timestamp", TimestampType())
      ])
      
    • text only if includeTimestamp is set to false with the schema as shown below:

      StructType([StructField("value", StringType())])
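
A minimal sketch of how one of the two formats might be selected when the stream is created. The function name `read_socket_lines` and the default host and port are illustrative assumptions, and `spark` is assumed to be an existing SparkSession; only the `socket` format and its `host`, `port` and `includeTimestamp` options come from Spark itself.

```python
def read_socket_lines(spark, host="localhost", port=9999, include_timestamp=False):
    """Return a streaming DataFrame backed by the socket source.

    `spark` is an existing SparkSession. The host and port defaults are
    placeholders (assumptions), not values taken from the question.
    """
    return (
        spark.readStream
        .format("socket")
        .option("host", host)
        .option("port", port)
        # False -> single `value` column; True -> `value` plus `timestamp`.
        .option("includeTimestamp", include_timestamp)
        .load()
    )
```

With a live session, `read_socket_lines(spark, include_timestamp=True).printSchema()` should list the `value` and `timestamp` columns described above.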
      

    If you want a different format, you'll have to transform the stream to extract the fields of interest, for example with regular expressions:

    from functools import partial

    from pyspark.sql.functions import regexp_extract

    # Raw string, so that \w and \s reach the regex engine unescaped.
    fields = partial(
        regexp_extract, str="value", pattern=r"^(\w*)\s*,\s*(\w*)\s*,\s*([0-9]*)$"
    )

    # `lines` is the streaming DataFrame read from the socket source.
    lines.select(
        fields(idx=1).alias("name"),
        fields(idx=2).alias("last_name"),
        fields(idx=3).alias("phone_number")
    )
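
To check what the pattern captures before wiring it into a stream, it can be tried on plain strings. The sketch below uses Python's `re` module as a stand-in for Spark's Java regex engine (the two should agree on this pattern for ASCII input); the helper `extract_field` and the sample lines are made up for illustration. It mirrors `regexp_extract` in returning an empty string when the line does not match.

```python
import re

# Same pattern as in the answer, written as a raw string.
PATTERN = r"^(\w*)\s*,\s*(\w*)\s*,\s*([0-9]*)$"

def extract_field(line, idx):
    """Return capture group `idx`, or "" when the line does not match."""
    match = re.search(PATTERN, line)
    return match.group(idx) if match else ""

# Hypothetical input lines: one well-formed record and one that is not.
for line in ["John, Doe, 5551234", "not a record"]:
    print([extract_field(line, i) for i in (1, 2, 3)])
```

A line that does not fit the `name, last_name, phone_number` layout therefore produces three empty strings rather than an error, so malformed rows may need to be filtered out downstream.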
    