apache-spark

Getting “org.apache.spark.sql.AnalysisException: Path does not exist” from SparkSession.read() [duplicate]

Submitted by 人走茶凉 on 2021-02-18 17:44:26
Question: This question already has an answer here: How to get path to the uploaded file (1 answer). Closed 2 years ago.

I am trying to read a file submitted via spark-submit to a YARN cluster in client mode. Putting the file in HDFS is not an option. Here is what I have done:

    def main(args: Array[String]) {
      if (args != null && args.length > 0) {
        val inputfile: String = args(0) // get filename: train.csv
        val input_filename = inputfile.split("/").toList.last
        val d = SparkSession.read
          .option("header", "true")
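
The snippet above is cut off, but for a file shipped with spark-submit --files, one commonly suggested route (a sketch under that assumption, not the thread's accepted answer) is to resolve the driver-local copy with SparkFiles.get, read it on the driver, and feed the lines to the CSV reader. Names like ReadSubmittedFile are placeholders, and this assumes the file is small enough to read on the driver (Spark 2.2+ for csv over a Dataset of strings):

    import scala.io.Source
    import org.apache.spark.SparkFiles
    import org.apache.spark.sql.SparkSession

    object ReadSubmittedFile {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ReadSubmittedFile").getOrCreate()
        import spark.implicits._

        // e.g. args(0) = "/some/local/path/train.csv", also passed as --files /some/local/path/train.csv
        val inputFilename = args(0).split("/").last

        // SparkFiles.get resolves the local path that spark-submit copied the file to.
        val localPath = SparkFiles.get(inputFilename)

        // Read the (small) file on the driver and let Spark parse the lines as CSV.
        val lines = Source.fromFile(localPath).getLines().toList
        val df = spark.read.option("header", "true").csv(lines.toDS())

        df.show(5)
      }
    }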

Spark Dataframe: Select distinct rows

Submitted by 狂风中的少年 on 2021-02-18 17:00:20
Question: I tried two ways to find distinct rows from a Parquet file, but neither seems to work.

Attempt 1:

    Dataset<Row> df = sqlContext.read().parquet("location.parquet").distinct();

But it throws:

    Cannot have map type columns in DataFrame which calls set operations (intersect, except, etc.),
    but the type of column canvasHashes is map<string,string>;;

Attempt 2: Tried running SQL queries:

    Dataset<Row> df = sqlContext.read().parquet("location.parquet");
    rawLandingDS.createOrReplaceTempView("df");
    Dataset<Row>
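
The error says Spark cannot run set operations over map-typed columns, so the usual workarounds exclude the map column from the comparison. A minimal sketch in Scala (the question uses the Java API, where the same methods exist); the column name canvasHashes is taken from the error message:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DistinctWithMapColumn").getOrCreate()
    val df = spark.read.parquet("location.parquet")

    // Option 1: deduplicate on every column except the map-typed one. Note that
    // dropDuplicates keeps the first row per key, so rows differing only in the
    // map column collapse into one.
    val nonMapCols = df.columns.filterNot(_ == "canvasHashes")
    val deduped = df.dropDuplicates(nonMapCols)

    // Option 2: drop the map column before calling distinct, if it is not needed downstream.
    val distinctWithoutMap = df.drop("canvasHashes").distinct()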

Pyspark RDD collect first 163 Rows

Submitted by 不打扰是莪最后的温柔 on 2021-02-18 13:51:54
Question: Is there a way to get the first 163 rows of an RDD without converting it to a DataFrame? I've tried something like newrdd = rdd.take(163), but that returns a list, and rdd.collect() returns the whole RDD. Is there a way to do this? Or, if not, is there a way to convert a list into an RDD?

Answer 1: It is not very efficient, but you can zipWithIndex and filter:

    rdd.zipWithIndex().filter(lambda vi: vi[1] < 163).keys()

In practice it makes more sense to simply take and parallelize:

    sc.parallelize(rdd.take(163))

Spark BlockManager running on localhost

Submitted by 醉酒当歌 on 2021-02-18 13:37:13
Question: I have a simple script file I am trying to execute in the spark-shell that mimics the tutorial here:

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext

    sc.stop();

    val conf = new SparkConf()
      .setAppName("MyApp")
      .setMaster("mesos://zk://172.24.51.171:2181/mesos")
      .set("spark.executor.uri", "hdfs://172.24.51.171:8020/spark-1.3.0-bin-hadoop2.4.tgz")
      .set("spark.driver.host", "172.24.51.142")

    val sc2 = new SparkContext(conf)
    val file = sc2.textFile("hdfs://172.24.51.171:8020/input

repartition() is not affecting RDD partition size

Submitted by 陌路散爱 on 2021-02-18 12:17:07
Question: I am trying to change the number of partitions of an RDD using the repartition() method. The method call on the RDD succeeds, but when I explicitly check the partition count using the partitions.size property of the RDD, I get back the same number of partitions it originally had:

    scala> rdd.partitions.size
    res56: Int = 50

    scala> rdd.repartition(10)
    res57: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at repartition at <console>:27

At this stage I perform some action like rdd.take(1) just to force
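
The thread is truncated here, but the usual explanation is that RDDs are immutable: repartition() returns a new RDD instead of modifying the one it is called on, so the result has to be assigned. A minimal sketch with an assumed 50-partition RDD:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("RepartitionDemo").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000, 50)

    // repartition() returns a NEW RDD; the original keeps its partitioning.
    val repartitioned = rdd.repartition(10)

    println(rdd.partitions.size)           // 50 -- unchanged
    println(repartitioned.partitions.size) // 10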

How to know when to repartition/coalesce RDD with unbalanced partitions (without shuffling possibly)?

Submitted by 余生颓废 on 2021-02-18 11:46:08
Question: I'm loading tens of thousands of gzipped files from S3 for my Spark job. This results in some partitions being very small (tens of records) and some very large (tens of thousands of records). The sizes of the partitions are pretty well distributed among nodes, so each executor seems to be working on the same amount of data in aggregate. So I'm not really sure I even have a problem. How would I know whether it's worth repartitioning or coalescing the RDD? Will either of these be able to balance the
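
The question is cut off above; one practical way to decide (a suggestion, not taken from the thread) is to measure the per-partition record counts and only act when the skew is large. A sketch, where the S3 path is a placeholder:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("PartitionSkewCheck").getOrCreate()
    val rdd = spark.sparkContext.textFile("s3a://some-bucket/some-prefix/*.gz") // hypothetical path

    // Count records per partition without collecting the records themselves.
    val counts = rdd.mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }.collect()
    val sizes = counts.map(_._2)
    println(s"partitions=${sizes.length} min=${sizes.min} max=${sizes.max} avg=${sizes.sum.toDouble / sizes.length}")

    // coalesce(n) avoids a shuffle but can only merge partitions, so it cannot split
    // the oversized ones; repartition(n) shuffles and is what actually evens them out.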

Spark compression when writing to external Hive table

Submitted by 情到浓时终转凉″ on 2021-02-18 11:28:28
Question: I'm inserting into an external Hive Parquet table from Spark 2.1 (using df.write.insertInto(...)). By setting e.g.

    spark.sql("SET spark.sql.parquet.compression.codec=GZIP")

I can switch between SNAPPY, GZIP and uncompressed. I can verify that the file size (and file name ending) is influenced by these settings; I get a file named e.g.

    part-00000-5efbfc08-66fe-4fd1-bebb-944b34689e70.gz.parquet

However, if I work with a partitioned Hive table, this setting does not have any effect; the file size is
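
The question is truncated here. One workaround that is often suggested for the partitioned case (an assumption on my part, not quoted from an answer in this thread) is to set the codec on the Hive/Parquet side as well, either as a session setting or as a table property, since a partitioned insert may go through Hive's Parquet serde rather than Spark's native writer; my_db.my_partitioned_table is a placeholder:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("HiveParquetCompression")
      .enableHiveSupport()
      .getOrCreate()

    // Controls Spark's native Parquet writer (works for the non-partitioned case in the question).
    spark.sql("SET spark.sql.parquet.compression.codec=GZIP")

    // Candidate settings for writes that go through Hive's Parquet serde instead:
    spark.sql("SET parquet.compression=GZIP")
    spark.sql("ALTER TABLE my_db.my_partitioned_table SET TBLPROPERTIES ('parquet.compression'='GZIP')")

    // then insert as before:
    // df.write.insertInto("my_db.my_partitioned_table")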

Implement SCD Type 2 in Spark

Submitted by 自闭症网瘾萝莉.ら on 2021-02-18 08:47:47
Question: I am trying to implement SCD Type 2 logic in Spark 2.4.4. I have two DataFrames: one containing the existing data and the other containing the new incoming data. Input and expected output are given below. What needs to happen is:

- All incoming rows should get appended to the existing data.
- Only the following 3 rows, which were previously 'active', should become inactive, with the appropriate 'endDate' populated as follows:
  - pk=1, amount = 20 => row should become 'inactive' and 'endDate' is the 'startDate' of
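
The expected output is cut off above, so the following is only a generic sketch of the close-out-and-append pattern for SCD Type 2, with column names (pk, amount, startDate, endDate) taken from the question, an assumed boolean active flag, and made-up sample values; the question's exact matching rules (for example, closing only rows whose amount changed) would need extra conditions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("ScdType2Sketch").getOrCreate()
    import spark.implicits._

    val existing = Seq(
      (1, 10, "2019-01-01", "2019-02-01", false),
      (1, 20, "2019-02-01", null.asInstanceOf[String], true)
    ).toDF("pk", "amount", "startDate", "endDate", "active")

    val incoming = Seq(
      (1, 30, "2019-03-01")
    ).toDF("pk", "amount", "startDate")

    // 1. Close out currently-active rows whose pk appears in the incoming batch:
    //    endDate becomes the incoming startDate and the row is flagged inactive.
    val changes = incoming.select($"pk".as("in_pk"), $"startDate".as("in_startDate"))
    val closedOut = existing
      .join(changes, existing("pk") === changes("in_pk"), "left")
      .withColumn("endDate", when($"active" && $"in_startDate".isNotNull, $"in_startDate").otherwise($"endDate"))
      .withColumn("active", when($"in_startDate".isNotNull, lit(false)).otherwise($"active"))
      .select("pk", "amount", "startDate", "endDate", "active")

    // 2. Append the incoming rows as the new active versions.
    val newRows = incoming
      .withColumn("endDate", lit(null).cast("string"))
      .withColumn("active", lit(true))
      .select("pk", "amount", "startDate", "endDate", "active")

    val result = closedOut.unionByName(newRows)
    result.show()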