How can I set a schema for a streaming DataFrame in PySpark?
from pyspark.sql import SparkSession
from p
TextSocketSource doesn't provide any integrated parsing options. It supports only two formats:

timestamp and text, if includeTimestamp is set to true, with the following schema:
StructType([
    StructField("value", StringType()),
    StructField("timestamp", TimestampType())
])
text only, if includeTimestamp is set to false, with the following schema:
StructType([StructField("value", StringType())])
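To see which of the two schemas you get, you can toggle includeTimestamp when building the reader. This is a minimal sketch: the helper name read_socket_lines and the default host and port are assumptions for illustration, and spark is assumed to be an existing SparkSession.

```python
def read_socket_lines(spark, host="localhost", port=9999, include_timestamp=True):
    """Build a streaming DataFrame from the socket source.

    `spark` is assumed to be an existing SparkSession; the host and port
    defaults are placeholders, not values taken from the question.
    """
    return (
        spark.readStream.format("socket")
        .option("host", host)
        .option("port", port)
        # Options are passed as strings: "true" adds the timestamp column,
        # "false" leaves only the value column.
        .option("includeTimestamp", str(include_timestamp).lower())
        .load()
    )
```

Calling read_socket_lines(spark).printSchema() should then list the value and timestamp columns, while include_timestamp=False should leave only value.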
If you want to change this format, you'll have to transform the stream to extract the fields of interest, for example with regular expressions:
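from functools import partial

from pyspark.sql.functions import regexp_extract

# Pre-bind the source column and the pattern; only the group index varies.
# The raw string keeps the backslashes in the regex intact.
fields = partial(
    regexp_extract, str="value", pattern=r"^(\w*)\s*,\s*(\w*)\s*,\s*([0-9]*)$"
)

# lines is assumed to be the streaming DataFrame from the question.
lines.select(
    fields(idx=1).alias("name"),
    fields(idx=2).alias("last_name"),
    fields(idx=3).alias("phone_number"),
)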
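Before running the pattern on a stream, it can help to check what it captures in plain Python. The sketch below uses the standard re module rather than Spark, so it only approximates the behaviour (Spark uses Java regular expressions, where \w is ASCII-only by default). The helper split_line and the sample input are hypothetical. It returns empty strings for a non-matching line, mirroring how regexp_extract yields an empty string when the pattern does not match.

```python
import re

# Same pattern as in the regexp_extract call: three comma-separated fields,
# the last one digits only, with optional whitespace around the commas.
PATTERN = re.compile(r"^(\w*)\s*,\s*(\w*)\s*,\s*([0-9]*)$")


def split_line(line):
    """Return (name, last_name, phone_number) for one input line.

    Non-matching lines give three empty strings.
    """
    match = PATTERN.match(line)
    if match is None:
        return ("", "", "")
    return match.groups()


print(split_line("John , Doe, 5551234"))  # ('John', 'Doe', '5551234')
print(split_line("not a record"))         # ('', '', '')
```

Because the whole line must match, a malformed line produces empty strings in all three columns instead of a partial result, so you may want to filter such rows out downstream.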