Read Parquet files from Scala without using Spark

终归单人心 2020-12-10 11:21

Is it possible to read Parquet files from Scala without using Apache Spark?

I found a project which allows us to read and write Avro files using plain Scala.

3 answers
  • 2020-12-10 11:27

    There is also a relatively new project called eel. It is a lightweight (non-distributed) toolkit for using some of the 'big data' technology in the small.

  • 2020-12-10 11:38

    Yes, you don't have to use Spark to read/write Parquet. Just use the Parquet library directly from your Scala code (that's what Spark is doing anyway): http://search.maven.org/#search%7Cga%7C1%7Cparquet
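
    To call the Parquet library directly, it has to be on the classpath. A minimal sbt sketch, assuming the Avro bindings (`parquet-avro`) are the ones you want. The version numbers are illustrative, not recommendations, and the explicit Hadoop dependency is an assumption based on the reader API taking a Hadoop `Path`:

    ```scala
    // build.sbt (versions are examples only; check Maven Central for current ones)
    libraryDependencies ++= Seq(
      "org.apache.parquet" % "parquet-avro"  % "1.12.3",
      "org.apache.hadoop"  % "hadoop-client" % "3.3.4"
    )
    ```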

  • 2020-12-10 11:51

    It's straightforward enough to do using the parquet-mr project, which is the project Alexey Raga is referring to in his answer.

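    Some sample code:

    import org.apache.avro.generic.GenericRecord
    import org.apache.parquet.avro.AvroParquetReader
    import org.apache.parquet.hadoop.ParquetReader
    // path is an org.apache.hadoop.fs.Path pointing at the Parquet file
    val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
    // iter is of type Iterator[GenericRecord]; read returns null once the file is exhausted
    val iter = Iterator.continually(reader.read).takeWhile(_ != null)
    // if you want a list then...
    val list = iter.toList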
    

    This returns standard Avro GenericRecords. If you want to turn them into a Scala case class, you can use my Avro4s library, which you linked to in your question, to do the marshalling for you. Assuming you are using version 1.30 or higher:

    import com.sksamuel.avro4s.RecordFormat
    case class Bibble(name: String, location: String)
    val format = RecordFormat[Bibble]
    // then for a given record (a GenericRecord read as above)
    val bibble = format.from(record)
    

    We can obviously combine that with the original iterator in one step:

    val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
    val format = RecordFormat[Bibble]
    // iter is now an Iterator[Bibble]
    val iter = Iterator.continually(reader.read).takeWhile(_ != null).map(format.from)
    // and list is now a List[Bibble]
    val list = iter.toList
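
    The snippets above never close the reader, and the `Iterator.continually(reader.read).takeWhile(_ != null)` idiom only works because `read` returns null at end of input. Here is a minimal sketch of the same pattern with cleanup. It uses a hypothetical `StubReader` in place of `ParquetReader` so it runs without any Parquet dependency. `StubReader`, `NullTerminated` and `readAll` are illustrative names, not part of parquet-mr or Avro4s:

    ```scala
    import java.io.Closeable

    object NullTerminated {
      // Hypothetical stand-in for ParquetReader[T]: read() hands back the next
      // record, or null once the input is exhausted.
      final class StubReader[T >: Null](records: List[T]) extends Closeable {
        private var remaining = records
        var closed = false
        def read(): T = remaining match {
          case head :: tail => remaining = tail; head
          case Nil          => null
        }
        override def close(): Unit = closed = true
      }

      // Drain the reader into a List, closing it even if read() throws.
      // toList forces the lazy iterator before the finally block runs.
      def readAll[T >: Null](reader: StubReader[T]): List[T] =
        try Iterator.continually(reader.read()).takeWhile(_ != null).toList
        finally reader.close()
    }
    ```

    With a real `ParquetReader` the shape should be the same: build the reader, materialise the iterator inside the `try`, and call `close()` in the `finally`. Returning the lazy iterator itself from such a method would not work, because the reader would already be closed by the time the caller consumed it.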
    