Export large amount of data from Cassandra to CSV

陌清茗 2021-02-05 05:46

I'm using Cassandra 2.0.9 to store quite large amounts of data, say 100 GB, in one column family. I would like to export this data to CSV quickly. I tried:

    <
3 Answers
  • 2021-02-05 05:58

    Update for 2020: DataStax provides a dedicated tool called DSBulk for loading and unloading data from Cassandra (starting with Cassandra 2.1) and DSE (starting with DSE 4.7/4.8). In the simplest case, the command line looks like this:

    dsbulk unload -k keyspace -t table -url path_to_unload
    

    DSBulk is heavily optimized for loading and unloading, and it has many options, including import from and export to compressed files, custom queries, and so on.

    A series of blog posts about DSBulk provides more information and examples: 1, 2, 3, 4, 5, 6

  • 2021-02-05 06:05

    Using COPY becomes quite challenging when you try to export a table with millions of rows from Cassandra, so I created a simple tool that reads the data from the Cassandra table chunk by chunk (paginated) and exports it to CSV.

    See my example solution, which uses the Java library from DataStax.
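
    The chunk-by-chunk idea can be sketched without any driver dependency. The block below is a minimal illustration, not the tool linked above: `Page`, `writeCsv`, `escape`, and `demoFetch` are hypothetical names, and the in-memory `sample` table stands in for a real Cassandra query. With a real driver, `fetchPage` would presumably wrap the driver's own result paging, and the token would be its paging state.

```scala
import java.io.{StringWriter, Writer}

object PagedCsvExport {
  /** One page of rows plus an opaque token for the next page (None when exhausted). */
  final case class Page(rows: Seq[Seq[Any]], next: Option[String])

  /** Quote a field when it contains a comma, a double quote, or a line break. */
  def escape(value: Any): String = {
    val s = if (value == null) "" else value.toString
    val needsQuotes = s.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r')
    if (needsQuotes) "\"" + s.replace("\"", "\"\"") + "\"" else s
  }

  /** Fetch page after page until no token is returned; returns the number of rows written. */
  def writeCsv(fetchPage: Option[String] => Page, out: Writer): Long = {
    var token: Option[String] = None
    var written = 0L
    var more = true
    while (more) {
      val page = fetchPage(token) // only this page is held in memory
      page.rows.foreach { row =>
        out.write(row.map(escape).mkString(",") + "\n")
        written += 1
      }
      token = page.next
      more = token.isDefined
    }
    out.flush()
    written
  }

  // Hypothetical in-memory "table" standing in for a Cassandra query, two rows per page.
  private val sample: Vector[Seq[Any]] =
    Vector(Seq(1, "plain"), Seq(2, "has,comma"), Seq(3, "has \"quote\""), Seq(4, null), Seq(5, "last"))

  def demoFetch(token: Option[String]): Page = {
    val start = token.map(_.toInt).getOrElse(0)
    val end = math.min(start + 2, sample.length)
    Page(sample.slice(start, end), if (end < sample.length) Some(end.toString) else None)
  }

  def main(args: Array[String]): Unit = {
    val out = new StringWriter()
    val rows = writeCsv(demoFetch, out)
    print(out.toString)
    println(s"rows written: $rows")
  }
}
```

    Passing a `Writer` keeps the loop independent of the destination, so for a real export the `StringWriter` can be swapped for a buffered file writer; memory use then stays bounded by the page size rather than the table size.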

  • 2021-02-05 06:13

    Inspired by @user1859675's answer, here is how to export data from Cassandra using Spark:

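    // Comma-separated contact points of the Cassandra cluster (addresses redacted)
    val cassandraHostNode = "10.xxx.xxx.x5,10.xxx.xxx.x6,10.xxx.xxx.x7"
    val spark = org.apache.spark.sql.SparkSession
      .builder
      .config("spark.cassandra.connection.host", cassandraHostNode)
      .appName("Awesome Spark App")
      .master("local[*]") // run Spark locally, using all available cores
      .getOrCreate()
    
    // Read the Cassandra table as a DataFrame through the connector
    val dataSet = spark.read.format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "xxxxxxx", "keyspace" -> "xxxxxxx"))
      .load()
    
    // Spark writes a directory of part files here, not a single CSV file
    val targetfilepath = "/opt/report_values/"
    dataSet.write.format("csv").save(targetfilepath)  // Spark 2.x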
    

    You will need "spark-cassandra-connector" on your classpath for this to work.
    The version I am using is shown below:

        <dependency>
            <groupId>com.datastax.spark</groupId>
            <artifactId>spark-cassandra-connector_2.11</artifactId>
            <version>2.3.2</version>
        </dependency>
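
    For sbt builds, the same connector coordinates could be declared roughly as follows. This is a sketch: the `scalaVersion` and the `spark-sql` version are assumptions that are not stated in the answer and should be matched to your own Spark installation.

```scala
// build.sbt (sketch)
// %% appends the Scala binary version, yielding spark-cassandra-connector_2.11
scalaVersion := "2.11.12" // assumed; any 2.11.x release gives the _2.11 suffix

libraryDependencies ++= Seq(
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.3.2",
  // Assumed Spark version; pick the one that matches your cluster
  "org.apache.spark" %% "spark-sql" % "2.3.2"
)
```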
    