I have a text file on HDFS and I want to convert it to a DataFrame in Spark.
I am using the Spark Context to load the file and then try to generate individual columns f
You can read the file into an RDD and then assign a schema to it. Two common ways of creating the schema are a case class or a schema object (my preferred one). Quick snippets of both follow.
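Case Class Approach
case class Test(id: String, name: String)
import sparkSess.implicits._ // needed for toDF() outside spark-shell; sparkSess is your SparkSession
val myFile = sc.textFile("file.txt")
// split each line on ";" and map the first two fields onto Test
val df = myFile.map(x => x.split(";")).map(x => Test(x(0), x(1))).toDF()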
Schema Approach
import org.apache.spark.sql.types._
val schemaString = "id name"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
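// the CSV reader splits on "," by default, so set the ";" delimiter used by the file
val dfWithSchema = sparkSess.read.option("header", "false").option("delimiter", ";").schema(schema).csv("file.txt")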
dfWithSchema.show()
The second is my preferred approach because, in Scala 2.10, a case class is limited to 22 fields, which becomes a problem if your file has more than 22 columns. (Scala 2.11 and later lift this limit for case classes.)
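If you want to stay on the "read an RDD, then assign a schema" route instead of using the CSV reader, the sketch below may help. It assumes a ";"-delimited file with two string columns, as in the snippets above. The object name RddToDataFrame, the helpers buildSchema and parseLine, and the sample rows are made up for illustration, and a local SparkSession stands in for your cluster session.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical names throughout; adapt to your own job.
object RddToDataFrame {

  // Build a schema from space-separated column names; every column is a nullable string.
  def buildSchema(schemaString: String): StructType =
    StructType(schemaString.split(" ").map(name => StructField(name, StringType, nullable = true)))

  // Split one line on the delimiter; the -1 limit keeps trailing empty fields.
  def parseLine(line: String, delimiter: String = ";"): Row =
    Row.fromSeq(line.split(delimiter, -1).toSeq)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode for a quick try-out.
    val spark = SparkSession.builder().appName("rdd-to-df").master("local[*]").getOrCreate()

    val schema = buildSchema("id name")

    // Sample rows standing in for spark.sparkContext.textFile("file.txt").
    val lines = spark.sparkContext.parallelize(Seq("1;alice", "2;bob"))

    // Turn each line into a Row, then attach the schema to the RDD[Row].
    val rowRDD = lines.map(line => parseLine(line))
    val df = spark.createDataFrame(rowRDD, schema)

    df.printSchema()
    df.show()

    spark.stop()
  }
}
```

Because the schema is built at runtime from a string, this also works when the column list is only known when the job starts. Note that createDataFrame does not check up front that each Row has as many fields as the schema, so malformed lines tend to fail later, when the DataFrame is actually evaluated.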