Reading CSV files in Zeppelin using spark-csv


Question:

I want to read CSV files in Zeppelin and would like to use Databricks' spark-csv package: https://github.com/databricks/spark-csv

In the spark-shell, I can use spark-csv with

spark-shell --packages com.databricks:spark-csv_2.11:1.2.0 

But how do I tell Zeppelin to use that package?

Thanks in advance!

Answer 1:

You need to add the Spark Packages repository to Zeppelin before you can load Spark packages with %dep:

%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.10:1.2.0")

Alternatively, if you want the package available in all your notebooks, you can add the --packages option to the spark-submit settings in Zeppelin's interpreter config and then restart the interpreter. This starts the context with the package already loaded, just as the spark-shell method does.
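As a minimal sketch of that approach, assuming Zeppelin picks up SPARK_SUBMIT_OPTIONS from conf/zeppelin-env.sh (this is the same mechanism Answer 6 below uses; the coordinate is the spark-csv package from the question):

export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.2.0"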



Answer 2:

  1. Go to the Interpreter tab, click Repository Information, add a repo and set the URL to http://dl.bintray.com/spark-packages/maven
  2. Scroll down to the spark interpreter section and click edit, then scroll to the artifact field and add "com.databricks:spark-csv_2.10:1.2.0" (or a newer version). Restart the interpreter when asked.
  3. In the notebook, use something like:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")      // Use first line of all files as header
        .option("inferSchema", "true") // Automatically infer data types
        .load("my_data.txt")
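Once that paragraph has run, a quick way to verify the load worked is to inspect the resulting DataFrame (printSchema and show are standard Spark DataFrame methods; my_data.txt is the file from the example above):

    df.printSchema() // confirm the columns and inferred data types
    df.show(5)       // preview the first five rows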

Update:

On the Zeppelin user mailing list, it is now (Nov. 2016) stated by Moon Soo Lee (creator of Apache Zeppelin) that users prefer to keep %dep, as it allows:

  • self-documenting library requirements in the notebook;
  • per-note (and possibly per-user) library loading.

The tendency is now to keep %dep, so it should not be considered deprecated at this time.



Answer 3:

BEGIN-EDIT

%dep is deprecated in Zeppelin 0.6.0. Please refer to Paul-Armand Verhaegen's answer.

Read on in this answer only if you are using a Zeppelin version older than 0.6.0.

END-EDIT

You can load the spark-csv package using the %dep interpreter.

For example:

%dep
z.reset()
// Add spark-csv package
z.load("com.databricks:spark-csv_2.10:1.2.0")

See the Dependency Loading section in https://zeppelin.incubator.apache.org/docs/interpreter/spark.html

If you've already initialized the Spark context, a quick solution is to restart Zeppelin, execute a paragraph with the above code first, and then execute your Spark code to read the CSV file, as sketched below.
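A minimal sketch of that ordering in a freshly restarted Zeppelin, as two separate notebook paragraphs (my_data.csv is a hypothetical path, and sqlContext is the SQLContext Zeppelin provides):

    %dep
    z.reset()                                      // paragraph 1: run before the Spark context starts
    z.load("com.databricks:spark-csv_2.10:1.2.0")

    // paragraph 2: the package is now on the classpath
    val df = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load("my_data.csv")
    df.show()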



Answer 4:

You can add jar files under Spark Interpreter dependencies:

  1. Click 'Interpreter' menu in navigation bar.
  2. Click 'edit' button for Spark interpreter.
  3. Fill the artifact and exclude fields (see the example after this list).
  4. Press 'Save'
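For the package discussed here, the artifact field would simply contain the Maven coordinate used in the other answers, for example:

    com.databricks:spark-csv_2.10:1.2.0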


Answer 5:

If you define the following in conf/zeppelin-env.sh:

export SPARK_HOME=<PATH_TO_SPARK_DIST> 

Zeppelin will then look in $SPARK_HOME/conf/spark-defaults.conf and you can define jars there:

spark.jars.packages                com.databricks:spark-csv_2.10:1.4.0,org.postgresql:postgresql:9.3-1102-jdbc41 

Then check http://zeppelin_url:4040/environment/ for the following:

spark.jars file:/root/.ivy2/jars/com.databricks_spark-csv_2.10-1.4.0.jar,file:/root/.ivy2/jars/org.postgresql_postgresql-9.3-1102-jdbc41.jar

spark.jars.packages com.databricks:spark-csv_2.10:1.4.0,org.postgresql:postgresql:9.3-1102-jdbc41

For further reference: https://zeppelin.incubator.apache.org/docs/0.5.6-incubating/interpreter/spark.html



Answer 6:

Another solution:

In conf/zeppelin-env.sh (located in /etc/zeppelin for me), add the line:

export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.2.0" 

Then restart the service.
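After the restart, one way to confirm the option took effect is to read the corresponding Spark property from a notebook paragraph (a sketch; sc is the SparkContext Zeppelin provides, and spark.jars.packages is the property that --packages populates):

    sc.getConf.get("spark.jars.packages") // should return com.databricks:spark-csv_2.10:1.2.0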


