Is there a way to limit the number of records fetched from the JDBC source using Spark SQL 2.2.0?
I am dealing with a task of moving (and transforming) a large number of
To limit the number of rows fetched, a SQL query can be used instead of a table name in the "dbtable" option; this is described in the Spark JDBC data source documentation.
The query's WHERE clause can then use server-specific features to cap the row count (for example, ROWNUM in Oracle).
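For example, Spark's JDBC reader accepts a parenthesized, aliased subquery in place of the table name. A minimal sketch, assuming an Oracle source; jdbcUrl and source_table are placeholders:

// Push the row cap into the database by reading from a subquery
// instead of the whole table. The subquery must be aliased.
val limited = spark
  .read
  .format("jdbc")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("url", jdbcUrl)
  .option("dbtable", "(SELECT * FROM source_table WHERE ROWNUM <= 1000) t")
  .load()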
This approach is not ideal for relational databases: Spark's load will request your full table, store it in memory/on disk, and only then run the RDD transformations and actions.
If you are doing exploratory work, I suggest persisting the data on your first load. There are a few ways to do that. Take your code and do something like this:
// Read the full source table over JDBC.
val sourceData = spark
  .read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", jdbcSqlConnStr)
  .option("dbtable", sourceTableName)
  .load()

// Save a local copy as CSV for later exploration.
sourceData.write
  .option("header", "true")
  .option("delimiter", ",")
  .format("csv")
  .save("your_path")
This lets you save the data on your local machine as CSV, the most common format, which you can explore from any language. Every time you want to work with it, load it from this file instead. If you need real-time analysis or anything similar, I suggest building a pipeline that applies the transformations and updates another store; reloading from the database every time you process the data is not a good approach.
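To pick the work back up later, read the saved CSV instead of hitting the database again. A minimal sketch, assuming the same your_path used above:

// Reload the local exploration copy; no JDBC round trip needed.
val exploreData = spark
  .read
  .option("header", "true")
  .option("delimiter", ",")
  .csv("your_path")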
I have not tested this, but you should try using limit instead of take. take calls head under the covers, which carries the following note:

this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

whereas limit is lazily evaluated, so the LIMIT can be pushed into the SQL query:

The difference between this function and head is that head is an action and returns an array (by triggering query execution), while limit returns a new Dataset.
If you do want the rows on the driver without pulling the whole table in first, you could even do something like:
...load.limit(limitNum).take(limitNum)
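As a sanity check, you can look at the physical plan to see where the limit is applied. A minimal sketch, reusing sourceData from the answer above; limitNum is a placeholder:

val limitNum = 1000
// limit is lazy: it returns a new Dataset (a transformation),
// unlike take, which is an action that collects rows to the driver.
val limited = sourceData.limit(limitNum)
// Print the physical plan to see how the limit is executed.
limited.explain()
// take then materializes at most limitNum rows on the driver.
val firstRows = limited.take(limitNum)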