Question
Below is working code that connects to a SQL Server database and saves one table to a CSV file.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("test").setMaster("local").set("spark.driver.allowMultipleContexts", "true")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("jdbc").option("url", "jdbc:sqlserver://DBServer:PORT").option("databaseName", "xxx").option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable", "xxx").option("user", "xxx").option("password", "xxxx").load()
df.registerTempTable("test")
df.write.format("com.databricks.spark.csv").save("poc/amitesh/csv")
exit()
I have a scenario where I have to save 4 tables from the same database as CSV files, into 4 different files at the same time, through PySpark code. Is there any way we can achieve the objective? Or are these splits done at the HDFS block size level, so that if you have a file of 300 MB and the HDFS block size is set to 128 MB, you get 3 blocks of 128 MB, 128 MB and 44 MB respectively?
Answer 1:
where I have to save 4 tables from the same database as CSV files, into 4 different files at the same time, through PySpark code.
You have to code a transformation (reading and writing) for every table in the database (using sqlContext.read.format).
The only difference between the table-specific ETL pipelines is a different dbtable option per table. Once you have a DataFrame, save it to its own CSV file.
The code could look as follows (in Scala so I leave converting it to Python as a home exercise):
val datasetFromTABLE_ONE: DataFrame = sqlContext.
read.
format("jdbc").
option("url","jdbc:sqlserver://DBServer:PORT").
option("databaseName","xxx").
option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").
option("dbtable","TABLE_ONE").
option("user","xxx").
option("password","xxxx").
load()
// save the dataset from TABLE_ONE into its own CSV file
datasetFromTABLE_ONE.write.csv("table_one.csv")
Repeat the same code for every table you want to save to CSV.
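A minimal sketch of that repetition, wrapping the read-and-write in a small helper and looping over placeholder table names (saveTable, the table names and the output paths are illustrative assumptions, not part of the original answer):
// Hypothetical helper wrapping the read-then-write shown above.
def saveTable(table: String): Unit =
  sqlContext.read.
    format("jdbc").
    option("url", "jdbc:sqlserver://DBServer:PORT").
    option("databaseName", "xxx").
    option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver").
    option("dbtable", table).
    option("user", "xxx").
    option("password", "xxxx").
    load().
    write.csv(s"${table.toLowerCase}.csv")

// Placeholder table names -- replace with the four tables you actually need.
Seq("TABLE_ONE", "TABLE_TWO", "TABLE_THREE", "TABLE_FOUR").foreach(saveTable)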
Done!
100-table Case — Fair Scheduling
The solution begs another question:
What if I have 100 or more tables? How to optimize the code for that? How to do it effectively in Spark? Any parallelization?
The SparkContext that sits behind the SparkSession we use for the ETL pipeline is thread-safe, which means that you can use it from multiple threads. If you think of a thread per table, that's the right approach.
You could spawn as many threads as you have tables, say 100, and start them. Spark could then decide what and when to execute.
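For illustration only (not from the original answer), one way to spawn those threads is with Scala Futures, reusing the hypothetical saveTable helper from the sketch above. Note that the default global execution context caps parallelism at the number of CPU cores, so swap in a larger thread pool if you really want one thread per table:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// One Future per table; each thread submits its own Spark job.
val tables = Seq("TABLE_ONE", "TABLE_TWO", "TABLE_THREE", "TABLE_FOUR")
val jobs = tables.map(table => Future(saveTable(table)))

// Block until every table has been saved.
jobs.foreach(Await.result(_, Duration.Inf))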
That's something Spark does using Fair Scheduler Pools, a not very widely known feature of Spark that's worth considering for this case:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
Use it and your loading and saving pipelines may get faster.
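A minimal sketch of wiring that in, using the documented spark.scheduler.mode setting and sc.setLocalProperty (the pool name "tables" is just an illustrative choice):
// Turn on the fair scheduler when building the Spark configuration.
val conf = new SparkConf().
  setAppName("test").
  setMaster("local[*]").
  set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// Inside each table-saving thread: put that thread's jobs into a pool
// ("tables" is just an illustrative pool name).
sc.setLocalProperty("spark.scheduler.pool", "tables")
// ... run saveTable(table) here ...
// Detach the thread from the pool once it is done.
sc.setLocalProperty("spark.scheduler.pool", null)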
Source: https://stackoverflow.com/questions/44178294/how-to-read-many-tables-from-the-same-database-and-save-them-to-their-own-csv-fi