elasticsearch-hadoop

What does load() do in Spark?

Submitted by 喜你入骨 on 2020-01-16 09:10:49
Question: Spark is lazy, right? So what does load() actually do?

    start = timeit.default_timer()
    df = sqlContext.read.option("es.resource", indexes).format("org.elasticsearch.spark.sql")
    end = timeit.default_timer()
    print('without load: ', end - start)  # almost instant

    start = timeit.default_timer()
    df = df.load()
    end = timeit.default_timer()
    print('load: ', end - start)  # takes 1 sec

    start = timeit.default_timer()
    df.show()
    end = timeit.default_timer()
    print('show: ', end - start)  # takes 4 sec

If show() is the
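
A short sketch of what each step above typically triggers, based on how the Spark DataFrameReader and the elasticsearch-hadoop SQL source generally behave; the comments are my reading of the question's timings, not part of it:

    # Chained read/option/format calls only configure a DataFrameReader;
    # nothing talks to Elasticsearch yet.
    reader = sqlContext.read \
        .format("org.elasticsearch.spark.sql") \
        .option("es.resource", indexes)

    # load() materializes the DataFrame's schema. For this data source that
    # typically means asking Elasticsearch for the index mapping, which would
    # explain the ~1 s delay measured in the question.
    df = reader.load()

    # show() runs an actual Spark job that scans documents, hence the ~4 s.
    df.show()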

Python spark Dataframe to Elasticsearch

Submitted by 一个人想着一个人 on 2019-12-22 08:35:48
Question: I can't figure out how to write a DataFrame to Elasticsearch from Spark using Python. I followed the steps from here. Here is my code:

    # Read file
    df = sqlContext.read \
        .format('com.databricks.spark.csv') \
        .options(header='true') \
        .load('/vagrant/data/input/input.csv', schema=customSchema)
    df.registerTempTable("data")

    # KPIs
    kpi1 = sqlContext.sql("SELECT * FROM data")

    es_conf = {"es.nodes": "10.10.10.10", "es.port": "9200", "es.resource": "kpi"}
    kpi1.rdd.saveAsNewAPIHadoopFile(
        path='-'
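
The snippet is cut off in the middle of the saveAsNewAPIHadoopFile call. For reference, here is a sketch of how that call is usually completed with elasticsearch-hadoop's EsOutputFormat; the key/value classes, the es.input.json setting and the row-to-JSON mapping are my assumption of the intended usage, not taken from the question:

    import json

    es_conf = {
        "es.nodes": "10.10.10.10",
        "es.port": "9200",
        "es.resource": "kpi",
        "es.input.json": "yes",   # the values below are pre-serialized JSON strings
    }

    # EsOutputFormat consumes (key, value) pairs; the key is ignored here.
    (kpi1.rdd
         .map(lambda row: (None, json.dumps(row.asDict())))
         .saveAsNewAPIHadoopFile(
             path='-',
             outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
             keyClass="org.apache.hadoop.io.NullWritable",
             valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
             conf=es_conf))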

Save Spark Dataframe into Elasticsearch - Can’t handle type exception

Submitted by 爱⌒轻易说出口 on 2019-12-17 07:41:51
Question: I have designed a simple job to read data from MySQL and save it in Elasticsearch with Spark. Here is the code:

    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("MySQLtoEs")
            .set("es.index.auto.create", "true")
            .set("es.nodes", "127.0.0.1:9200")
            .set("es.mapping.id", "id")
            .set("spark.serializer", KryoSerializer.class.getName()));
    SQLContext sqlContext = new SQLContext(sc);

    // Data source options
    Map<String, String> options = new HashMap<>();
    options.put("driver", MYSQL
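
Not part of the question, but as a point of reference, a rough PySpark version of the same flow (JDBC read from MySQL, then write through the org.elasticsearch.spark.sql data source); the JDBC URL, table name and index name below are placeholders of mine:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("MySQLtoEs")
             .config("es.nodes", "127.0.0.1:9200")
             .config("es.index.auto.create", "true")
             .getOrCreate())

    # Read from MySQL over JDBC (placeholder connection details).
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/mydb")
          .option("dbtable", "mytable")
          .option("driver", "com.mysql.jdbc.Driver")
          .load())

    # Write to Elasticsearch through the DataFrame-level connector.
    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.mapping.id", "id")
       .mode("append")
       .save("myindex/mytype"))   # placeholder index/type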

Pyspark - Retain null values when using collect_list

Submitted by 此生再无相见时 on 2019-12-12 10:58:16
Question: According to the accepted answer in "pyspark collect_set or collect_list with groupby", when you do a collect_list on a certain column, the null values in that column are removed. I have checked and this is true. But in my case I need to keep the nulls -- how can I achieve this? I did not find any info on this kind of variant of the collect_list function. Background context to explain why I want the nulls: I have a dataframe df as below:

    cId | eId | amount | city
    1   | 2   | 20.0   | Paris
    1   | 2   |
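
The excerpt is cut off mid-table, so for reference here is one commonly suggested workaround (my own sketch, not quoted from the thread): collect_list drops NULL elements, but a struct wrapping the column is never NULL itself, so collecting structs and unwrapping them afterwards preserves the null slots. Column names follow the table above; which column actually holds the nulls is not shown, so city is assumed:

    from pyspark.sql import functions as F

    result = (df.groupBy("cId", "eId")
                .agg(F.collect_list(F.struct(F.col("city"))).alias("city_structs"))
                # pull the field back out of each struct; nulls survive this step
                .withColumn("cities", F.col("city_structs.city"))
                .drop("city_structs"))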

How to read only a few columns of Elasticsearch with Spark?

Submitted by 社会主义新天地 on 2019-12-12 06:44:01
Question: Our ES cluster holds a large amount of data, and we use Spark to process it via elasticsearch-hadoop, following https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html. At the moment we have to read every column of an index. Is there anything that can help?

Answer 1: Yes, you can set the config parameter "es.read.field.include" or "es.read.field.exclude" respectively. Full details here. Example assuming Spark 2 or higher:

    val sparkSession: SparkSession = SparkSession
      .builder()
      .appName(
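
The answer's Scala example is cut off above. As a minimal sketch of the same options in PySpark (the node address, index name and field names are placeholders of mine, and an existing SparkSession named spark is assumed):

    df = (spark.read
            .format("org.elasticsearch.spark.sql")
            .option("es.nodes", "localhost:9200")                 # placeholder
            .option("es.read.field.include", "field_a,field_b")   # limit which fields are read
            .load("myindex/mytype"))                              # placeholder resource
    df.printSchema()   # inspect the resulting schema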

Elasticsearch + Spark: write json with custom document _id

Submitted by 做~自己de王妃 on 2019-12-12 05:04:55
Question: I am trying to write a collection of objects to Elasticsearch from Spark. I have to meet two requirements: the document is already serialized as JSON and should be written as-is, and the Elasticsearch document _id should be provided. Here's what I tried so far.

saveJsonToEs()

I tried to use saveJsonToEs() like this (the serialized document contains a field _id with the desired Elasticsearch ID):

    val rdd: RDD[String] = job.map{ r => r.toJson() }
    val cfg = Map(
      ("es.resource", "myindex/mytype"),
      ("es.mapping.id",
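
For comparison only, not from the question: in PySpark the same two requirements are usually expressed with the es.input.json and es.mapping.id settings on the MapReduce-style writer. Everything below, including the doc_id field name and the rdd of JSON strings, is an illustrative assumption:

    es_conf = {
        "es.nodes": "localhost:9200",        # placeholder
        "es.resource": "myindex/mytype",
        "es.input.json": "true",             # values are pre-serialized JSON strings
        "es.mapping.id": "doc_id",           # field inside the JSON to use as _id
    }

    json_rdd = rdd.map(lambda json_doc: (None, json_doc))   # (ignored key, JSON string)
    json_rdd.saveAsNewAPIHadoopFile(
        path="-",
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_conf)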

java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror

Submitted by 不羁岁月 on 2019-12-12 04:45:16
Question:

    java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
        at org.elasticsearch.spark.serialization.ReflectionUtils$.org$elasticsearch$spark$serialization$ReflectionUtils$$checkCaseClass(ReflectionUtils.scala:42)
        at org.elasticsearch.spark.serialization.ReflectionUtils$$anonfun$checkCaseClassCache$1.apply(ReflectionUtils.scala:84)

It seems to be a Scala version incompatibility, but in the Spark documentation I see that spark 2.10
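
This particular NoSuchMethodError on scala.reflect.api.JavaUniverse.runtimeMirror is, in my experience, the classic symptom of mixing Scala major versions: an elasticsearch-spark artifact built for Scala 2.10 running against a Scala 2.11 Spark, or the other way around. A likely fix, assuming a standard dependency setup, is to pick the connector artifact whose suffix matches the cluster's Scala version (for example elasticsearch-spark-20_2.11 rather than elasticsearch-spark-20_2.10) and rebuild.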

Python spark Dataframe to Elasticsearch

Submitted by 百般思念 on 2019-12-05 18:09:27
I can't figure out how to write a DataFrame to Elasticsearch from Spark using Python. I followed the steps from here. Here is my code:

    # Read file
    df = sqlContext.read \
        .format('com.databricks.spark.csv') \
        .options(header='true') \
        .load('/vagrant/data/input/input.csv', schema=customSchema)
    df.registerTempTable("data")

    # KPIs
    kpi1 = sqlContext.sql("SELECT * FROM data")

    es_conf = {"es.nodes": "10.10.10.10", "es.port": "9200", "es.resource": "kpi"}
    kpi1.rdd.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io
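
Since the code above goes through the low-level RDD route, it may be worth noting (my addition, not from the question) that elasticsearch-hadoop also ships a DataFrame-level writer that avoids the EsOutputFormat ceremony entirely; a minimal sketch, assuming the connector jar is on the classpath and reusing the question's nodes and index name:

    (kpi1.write
         .format("org.elasticsearch.spark.sql")
         .option("es.nodes", "10.10.10.10")
         .option("es.port", "9200")
         .mode("append")
         .save("kpi"))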

ElasticSearch to Spark RDD

Submitted by 我的梦境 on 2019-12-04 11:25:56
Question: I was testing the ElasticSearch and Spark integration on my local machine, using some test data loaded into Elasticsearch.

    val sparkConf = new SparkConf().setAppName("Test").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val conf = new JobConf()
    conf.set("spark.serializer", classOf[KryoSerializer].getName)
    conf.set("es.nodes", "localhost:9200")
    conf.set("es.resource", "bank/account")
    conf.set("es.query", "?q=firstname:Daniel")
    val esRDD = sc.hadoopRDD(conf, classOf[EsInputFormat[Text,
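
One detail worth flagging (my observation, not part of the question): "spark.serializer" is a Spark setting read from the SparkConf, so putting it on the Hadoop JobConf as above most likely has no effect; if Kryo is wanted, it would normally be set on sparkConf before the SparkContext is created.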

ElasticSearch to Spark RDD

Submitted by 旧时模样 on 2019-12-03 07:16:42
I was testing the ElasticSearch and Spark integration on my local machine, using some test data loaded into Elasticsearch.

    val sparkConf = new SparkConf().setAppName("Test").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val conf = new JobConf()
    conf.set("spark.serializer", classOf[KryoSerializer].getName)
    conf.set("es.nodes", "localhost:9200")
    conf.set("es.resource", "bank/account")
    conf.set("es.query", "?q=firstname:Daniel")
    val esRDD = sc.hadoopRDD(conf, classOf[EsInputFormat[Text, MapWritable]],
      classOf[Text], classOf[MapWritable])
    esRDD.first()
    esRDD.collect()

The code runs fine and
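
The excerpt ends just as the question reaches its actual problem, so no attempt is made to reconstruct that part. For readers following along in PySpark rather than Scala, here is a rough equivalent of the read above (my sketch; it assumes the elasticsearch-hadoop jar is on the classpath and an existing SparkContext named sc):

    es_conf = {
        "es.nodes": "localhost:9200",
        "es.resource": "bank/account",
        "es.query": "?q=firstname:Daniel",
    }

    # Each element comes back as a (document id, document fields) pair.
    es_rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_conf)

    print(es_rdd.first())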