spark-dataframe

Creating a Pyspark Schema involving an ArrayType

Submitted by 两盒软妹~` on 2020-02-19 10:33:12
Question: I'm trying to create a schema for my new DataFrame and have tried various combinations of brackets and keywords, but have been unable to figure out how to make this work. My current attempt:

from pyspark.sql.types import *
schema = StructType([
    StructField("User", IntegerType()),
    ArrayType(StructType([
        StructField("user", StringType()),
        StructField("product", StringType()),
        StructField("rating", DoubleType())]))
])

Comes back with the error:

elementType should be DataType
Traceback (most
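
The error arises because every element passed to StructType must be a StructField; an ArrayType cannot sit directly in the field list. A minimal sketch of the usual fix, wrapping the array in a named StructField (the field name "Ratings" below is a hypothetical placeholder, not taken from the question):

from pyspark.sql.types import (StructType, StructField, ArrayType,
                               IntegerType, StringType, DoubleType)

# The array must be the data type of a named field, not a bare entry in the list.
schema = StructType([
    StructField("User", IntegerType()),
    StructField("Ratings", ArrayType(          # "Ratings" is a placeholder name
        StructType([
            StructField("user", StringType()),
            StructField("product", StringType()),
            StructField("rating", DoubleType())
        ])
    ))
])

With the array wrapped this way, the schema can be passed to spark.createDataFrame or spark.read.schema(...) as usual.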

Dataframes reading json files with changing schema

Submitted by ℡╲_俬逩灬. on 2020-01-25 06:24:36
Question: I am currently reading JSON files whose schema varies from file to file. We are using the following logic to read the JSON: first we read a base schema that has all the fields, and then we read the actual data. We use this approach because the schema is inferred from the first file read, and that first file does not itself contain all the fields, so we trick the code into understanding the full schema first and only then start reading the actual data. rdd=sc.textFile(baseSchemaWithAllColumns
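
A minimal sketch of the general pattern being described, assuming the DataFrame JSON reader is acceptable instead of sc.textFile (the file paths are placeholders): infer the schema once from the file that contains every column, then reuse it when reading the rest of the data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths: a reference file holding all columns, and the real data.
base_schema = spark.read.json("path/to/baseSchemaWithAllColumns.json").schema
df = spark.read.schema(base_schema).json("path/to/actual/data/*.json")

# Files that are missing some of the columns simply get nulls for those fields.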

Apache Spark update a row in an RDD or Dataset based on another row

Submitted by 寵の児 on 2020-01-24 21:06:57
Question: I'm trying to figure out how I can update some rows based on another row. For example, I have some data like:

Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
2, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...

I want to update the users in the same city to the same groupId (either 1 or 2):

Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
1, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...
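
A minimal PySpark sketch of one way to do this, assuming "same groupId" means taking the lowest Id within each city (column names follow the example above, with the spelling corrected to username):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "philip", 2.0, "montreal"),
     (2, "john", 4.0, "montreal"),
     (3, "charles", 2.0, "texas")],
    ["Id", "username", "ratings", "city"])

# Replace each Id with the smallest Id found in the same city.
by_city = Window.partitionBy("city")
result = df.withColumn("Id", F.min("Id").over(by_city))
result.show()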

How to calculate TF-IDF on grouped spark dataframe in scala?

Submitted by 拈花ヽ惹草 on 2020-01-24 17:32:09
Question: I have used the Spark API (https://spark.apache.org/docs/latest/ml-features.html#tf-idf) to calculate TF-IDF on a DataFrame. What I am unable to do is run it on grouped data: using DataFrame groupBy, calculating TF-IDF for each group, and getting a single DataFrame as the result. For example, for the input:

id | category       | texts
0  | smallLetters   | Array("a", "b", "c")
1  | smallLetters   | Array("a", "b", "b", "c", "a")
2  | capitalLetters | Array("A", "B", "C")
3  | capitalLetters | Array("A", "B", "B",
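
The ml.feature TF-IDF estimators do not take a grouping column, so one common approach is to fit a separate model per group and union the results. A rough sketch under that assumption (written in PySpark here for brevity; the same HashingTF and IDF classes exist in the Scala API):

from functools import reduce
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, IDF

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, "smallLetters", ["a", "b", "c"]),
     (1, "smallLetters", ["a", "b", "b", "c", "a"]),
     (2, "capitalLetters", ["A", "B", "C"])],
    ["id", "category", "texts"])

parts = []
for cat in [r["category"] for r in df.select("category").distinct().collect()]:
    subset = df.filter(df.category == cat)
    tf = HashingTF(inputCol="texts", outputCol="rawFeatures").transform(subset)
    idf = IDF(inputCol="rawFeatures", outputCol="tfidf").fit(tf)   # IDF is fit per group
    parts.append(idf.transform(tf))

result = reduce(lambda a, b: a.union(b), parts)   # single DataFrame covering all groups
result.show(truncate=False)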

Can't access Spark 2.0 Temporary Table from beeline

Submitted by 此生再无相见时 on 2020-01-24 17:18:44
Question: With Spark 1.5.1, I was able to access spark-shell temporary tables from Beeline via the Thrift Server, by following answers to related questions on Stack Overflow. However, after upgrading to Spark 2.0, I can no longer see temporary tables from Beeline. Here are the steps I'm following. I launch spark-shell with the following command:

./bin/spark-shell --master=myHost.local:7077 --conf spark.sql.hive.thriftServer.singleSession=true

Once the spark shell is
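
One detail worth noting: in Spark 2.x an ordinary temporary view is scoped to the SparkSession that created it, so a Thrift Server running as a separate application will not see it. A hedged sketch of a related workaround, available in later 2.x releases: a global temporary view is at least shared across all sessions of the same application and is queried through the global_temp database, although Beeline will still only see it if the Thrift Server runs inside that same application.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.range(10)                       # placeholder data
df.createGlobalTempView("my_temp_table")   # visible to every session of this application

# Another session in the same application can read it via the global_temp database:
spark.newSession().sql("SELECT * FROM global_temp.my_temp_table").show()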

Write data to Redshift using Spark 2.0.1

Submitted by 冷暖自知 on 2020-01-24 15:45:09
Question: I am doing a POC where I want to write a simple data set to Redshift. I have the following sbt file:

name := "Spark_POC"
version := "1.0"
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "2.0.1"
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "2.0.1"
resolvers += "jitpack" at "https://jitpack.io"
libraryDependencies += "com.databricks" %% "spark-redshift" % "3.0.0-preview1"

and the following code:

object Main extends App{ val conf = new
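
For reference, a hedged sketch of what a spark-redshift write typically looks like once the session is up, shown in PySpark; the JDBC URL, table name, S3 temp directory, and credential option below are placeholders, not values from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])  # toy data

(df.write
   .format("com.databricks.spark.redshift")
   .option("url", "jdbc:redshift://host:5439/db?user=USER&password=PASS")  # placeholder
   .option("dbtable", "my_table")                                          # placeholder
   .option("tempdir", "s3n://my-bucket/tmp/")                              # placeholder
   .option("forward_spark_s3_credentials", "true")
   .mode("error")
   .save())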

Spark get top N highest score results for each (item1, item2, score)

Submitted by 一世执手 on 2020-01-24 00:34:04
Question: I have a DataFrame of the following format:

item_id1: Long, item_id2: Long, similarity_score: Double

What I'm trying to do is get the top N highest similarity_score records for each item_id1. So, for example:

1 2 0.5
1 3 0.4
1 4 0.3
2 1 0.5
2 3 0.4
2 4 0.3

With the top 2 similar items, this would give:

1 2 0.5
1 3 0.4
2 1 0.5
2 3 0.4

I vaguely guess that it can be done by first grouping the records by item_id1, then sorting in reverse by score, and then limiting the results. But I'm stuck with how to
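
A minimal sketch of the usual window-function approach, shown in PySpark and assuming N = 2: number the rows within each item_id1 by descending score, then keep the first N.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 2, 0.5), (1, 3, 0.4), (1, 4, 0.3),
     (2, 1, 0.5), (2, 3, 0.4), (2, 4, 0.3)],
    ["item_id1", "item_id2", "similarity_score"])

N = 2
w = Window.partitionBy("item_id1").orderBy(F.desc("similarity_score"))
top_n = (df.withColumn("rank", F.row_number().over(w))
           .filter(F.col("rank") <= N)
           .drop("rank"))
top_n.show()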