apache-spark

Getting “org.apache.spark.sql.AnalysisException: Path does not exist” from SparkSession.read() [duplicate]

Submitted by 人走茶凉 on 2021-02-18 17:44:26
Question: This question already has an answer here: How to get path to the uploaded file (1 answer). Closed 2 years ago.

I am trying to read a file submitted via spark-submit to a YARN cluster in client mode. Putting the file in HDFS is not an option. Here is what I have done:

    def main(args: Array[String]) {
      if (args != null && args.length > 0) {
        val inputfile: String = args(0) // get filename: train.csv
        val input_filename = inputfile.split("/").toList.last
        val d = SparkSession.read
          .option("header", "true")
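
The snippet above is cut off, but for a file shipped with spark-submit --files, one commonly suggested route (a sketch under that assumption, not the thread's accepted answer) is to resolve the driver-local copy with SparkFiles.get, read it on the driver, and feed the lines to the CSV reader. Names like ReadSubmittedFile are placeholders, and this assumes the file is small enough to read on the driver (Spark 2.2+ for csv over a Dataset of strings):

    import scala.io.Source
    import org.apache.spark.SparkFiles
    import org.apache.spark.sql.SparkSession

    object ReadSubmittedFile {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ReadSubmittedFile").getOrCreate()
        import spark.implicits._

        // e.g. args(0) = "/some/local/path/train.csv", also passed as --files /some/local/path/train.csv
        val inputFilename = args(0).split("/").last

        // SparkFiles.get resolves the local path that spark-submit copied the file to.
        val localPath = SparkFiles.get(inputFilename)

        // Read the (small) file on the driver and let Spark parse the lines as CSV.
        val lines = Source.fromFile(localPath).getLines().toList
        val df = spark.read.option("header", "true").csv(lines.toDS())

        df.show(5)
      }
    }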

Spark Dataframe: Select distinct rows

Submitted by 狂风中的少年 on 2021-02-18 17:00:20
Question: I tried two ways to find distinct rows from a Parquet file, but neither seems to work.

Attempt 1:

    Dataset<Row> df = sqlContext.read().parquet("location.parquet").distinct();

But it throws:

    Cannot have map type columns in DataFrame which calls set operations (intersect, except, etc.),
    but the type of column canvasHashes is map<string,string>;;

Attempt 2: Tried running SQL queries:

    Dataset<Row> df = sqlContext.read().parquet("location.parquet");
    rawLandingDS.createOrReplaceTempView("df");
    Dataset<Row>
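
The error says Spark cannot run set operations over map-typed columns, so the usual workarounds exclude the map column from the comparison. A minimal sketch in Scala (the question uses the Java API, where the same methods exist); the column name canvasHashes is taken from the error message:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DistinctWithMapColumn").getOrCreate()
    val df = spark.read.parquet("location.parquet")

    // Option 1: deduplicate on every column except the map-typed one. Note that
    // dropDuplicates keeps the first row per key, so rows differing only in the
    // map column collapse into one.
    val nonMapCols = df.columns.filterNot(_ == "canvasHashes")
    val deduped = df.dropDuplicates(nonMapCols)

    // Option 2: drop the map column before calling distinct, if it is not needed downstream.
    val distinctWithoutMap = df.drop("canvasHashes").distinct()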

Pyspark RDD collect first 163 Rows

Submitted by 不打扰是莪最后的温柔 on 2021-02-18 13:51:54
Question: Is there a way to get the first 163 rows of an RDD without converting it to a DataFrame? I've tried something like newrdd = rdd.take(163), but that returns a list, and rdd.collect() returns the whole RDD. Is there a way to do this? Or, if not, is there a way to convert a list into an RDD?

Answer 1: It is not very efficient, but you can zipWithIndex and filter:

    rdd.zipWithIndex().filter(lambda vi: vi[1] < 163).keys()

In practice it makes more sense to simply take and parallelize:

    sc.parallelize(rdd.take(163))

Spark BlockManager running on localhost

Submitted by 醉酒当歌 on 2021-02-18 13:37:13
Question: I have a simple script file I am trying to execute in the spark-shell that mimics the tutorial here:

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext

    sc.stop();

    val conf = new SparkConf()
      .setAppName("MyApp")
      .setMaster("mesos://zk://172.24.51.171:2181/mesos")
      .set("spark.executor.uri", "hdfs://172.24.51.171:8020/spark-1.3.0-bin-hadoop2.4.tgz")
      .set("spark.driver.host", "172.24.51.142")

    val sc2 = new SparkContext(conf)
    val file = sc2.textFile("hdfs://172.24.51.171:8020/input

repartition() is not affecting RDD partition size

Submitted by 陌路散爱 on 2021-02-18 12:17:07
Question: I am trying to change the number of partitions of an RDD using the repartition() method. The method call on the RDD succeeds, but when I explicitly check the partition count using the partitions.size property of the RDD, I get back the same number of partitions it originally had:

    scala> rdd.partitions.size
    res56: Int = 50

    scala> rdd.repartition(10)
    res57: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at repartition at <console>:27

At this stage I perform some action like rdd.take(1) just to force
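
The thread is truncated here, but the usual explanation is that RDDs are immutable: repartition() returns a new RDD instead of modifying the one it is called on, so the result has to be assigned. A minimal sketch with an assumed 50-partition RDD:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("RepartitionDemo").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000, 50)

    // repartition() returns a NEW RDD; the original keeps its partitioning.
    val repartitioned = rdd.repartition(10)

    println(rdd.partitions.size)           // 50 -- unchanged
    println(repartitioned.partitions.size) // 10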

How to know when to repartition/coalesce RDD with unbalanced partitions (without shuffling possibly)?

Submitted by 余生颓废 on 2021-02-18 11:46:08
Question: I'm loading tens of thousands of gzipped files from S3 for my Spark job. This results in some partitions being very small (tens of records) and some very large (tens of thousands of records). The sizes of the partitions are pretty well distributed among nodes, so each executor seems to be working on the same amount of data in aggregate. So I'm not really sure I even have a problem. How would I know whether it's worth repartitioning or coalescing the RDD? Will either of these be able to balance the
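
The question is cut off above; one practical way to decide (a suggestion, not taken from the thread) is to measure the per-partition record counts and only act when the skew is large. A sketch, where the S3 path is a placeholder:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("PartitionSkewCheck").getOrCreate()
    val rdd = spark.sparkContext.textFile("s3a://some-bucket/some-prefix/*.gz") // hypothetical path

    // Count records per partition without collecting the records themselves.
    val counts = rdd.mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }.collect()
    val sizes = counts.map(_._2)
    println(s"partitions=${sizes.length} min=${sizes.min} max=${sizes.max} avg=${sizes.sum.toDouble / sizes.length}")

    // coalesce(n) avoids a shuffle but can only merge partitions, so it cannot split
    // the oversized ones; repartition(n) shuffles and is what actually evens them out.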

Spark compression when writing to external Hive table

Submitted by 情到浓时终转凉″ on 2021-02-18 11:28:28
Question: I'm inserting into an external Hive Parquet table from Spark 2.1 (using df.write.insertInto(...)). By setting e.g.

    spark.sql("SET spark.sql.parquet.compression.codec=GZIP")

I can switch between SNAPPY, GZIP and uncompressed. I can verify that the file size (and file name ending) is influenced by these settings; I get a file named e.g.

    part-00000-5efbfc08-66fe-4fd1-bebb-944b34689e70.gz.parquet

However, if I work with a partitioned Hive table, this setting does not have any effect; the file size is
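
The question is truncated here. One workaround that is often suggested for the partitioned case (an assumption on my part, not quoted from an answer in this thread) is to set the codec on the Hive/Parquet side as well, either as a session setting or as a table property, since a partitioned insert may go through Hive's Parquet serde rather than Spark's native writer; my_db.my_partitioned_table is a placeholder:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("HiveParquetCompression")
      .enableHiveSupport()
      .getOrCreate()

    // Controls Spark's native Parquet writer (works for the non-partitioned case in the question).
    spark.sql("SET spark.sql.parquet.compression.codec=GZIP")

    // Candidate settings for writes that go through Hive's Parquet serde instead:
    spark.sql("SET parquet.compression=GZIP")
    spark.sql("ALTER TABLE my_db.my_partitioned_table SET TBLPROPERTIES ('parquet.compression'='GZIP')")

    // then insert as before:
    // df.write.insertInto("my_db.my_partitioned_table")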

Implement SCD Type 2 in Spark

Submitted by 自闭症网瘾萝莉.ら on 2021-02-18 08:47:47
Question: I am trying to implement SCD Type 2 logic in Spark 2.4.4. I have two DataFrames: one containing the existing data and the other containing the new incoming data. Input and expected output are given below. What needs to happen is:

- All incoming rows should get appended to the existing data.
- Only the following 3 rows, which were previously 'active', should become inactive, with the appropriate 'endDate' populated as follows:
  - pk=1, amount = 20 => row should become 'inactive' and 'endDate' is the 'startDate' of
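
The expected output is cut off above, so the following is only a generic sketch of the close-out-and-append pattern for SCD Type 2, with column names (pk, amount, startDate, endDate) taken from the question, an assumed boolean active flag, and made-up sample values; the question's exact matching rules (for example, closing only rows whose amount changed) would need extra conditions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("ScdType2Sketch").getOrCreate()
    import spark.implicits._

    val existing = Seq(
      (1, 10, "2019-01-01", "2019-02-01", false),
      (1, 20, "2019-02-01", null.asInstanceOf[String], true)
    ).toDF("pk", "amount", "startDate", "endDate", "active")

    val incoming = Seq(
      (1, 30, "2019-03-01")
    ).toDF("pk", "amount", "startDate")

    // 1. Close out currently-active rows whose pk appears in the incoming batch:
    //    endDate becomes the incoming startDate and the row is flagged inactive.
    val changes = incoming.select($"pk".as("in_pk"), $"startDate".as("in_startDate"))
    val closedOut = existing
      .join(changes, existing("pk") === changes("in_pk"), "left")
      .withColumn("endDate", when($"active" && $"in_startDate".isNotNull, $"in_startDate").otherwise($"endDate"))
      .withColumn("active", when($"in_startDate".isNotNull, lit(false)).otherwise($"active"))
      .select("pk", "amount", "startDate", "endDate", "active")

    // 2. Append the incoming rows as the new active versions.
    val newRows = incoming
      .withColumn("endDate", lit(null).cast("string"))
      .withColumn("active", lit(true))
      .select("pk", "amount", "startDate", "endDate", "active")

    val result = closedOut.unionByName(newRows)
    result.show()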