HDInsight

HDInsight query console Job History

Submitted by 安稳与你 on 2019-12-11 04:40:47
Question: I am new to Microsoft Azure. I created a trial account on Azure, installed Azure PowerShell, and submitted the default word-count MapReduce program; it works fine and I am able to see the results in PowerShell. But when I open the Query Console of my cluster in the HDInsight tab, the Job History is empty. What am I missing here? Where can I view the job results in Azure?

Answer 1: The Query Console does not display MapReduce jobs, only Hive jobs. You can see the history of all jobs by using

reading a csv file from azure blob storage with PySpark

Submitted by 你。 on 2019-12-10 17:56:31
Question: I'm trying to do a machine learning project using a PySpark HDInsight cluster on Microsoft Azure. To operate on my cluster I use a Jupyter notebook. I also have my data (a CSV file) stored in Azure Blob storage. According to the documentation, the syntax of the path to my file is: path = 'wasb[s]://springboard@6zpbt6muaorgs.blob.core.windows.net/movies_plus_genre_info_2.csv' However, when I try to read the CSV file with the following command: csvFile = spark.read.csv(path, header=True,
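
A common stumbling block here is that the `[s]` in `wasb[s]` is documentation shorthand for an optional suffix, not a literal part of the URI: the scheme must be either `wasbs` (TLS) or `wasb`. A minimal sketch of building the path (container, account, and file names are the ones from the question; the helper function is hypothetical):

```python
# Build a WASB URI for a blob in Azure storage. The scheme is "wasbs"
# (encrypted) or "wasb" -- never the literal "wasb[s]" shown in the docs.
def wasb_path(container, account, blob, secure=True):
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{blob}"

path = wasb_path("springboard", "6zpbt6muaorgs",
                 "movies_plus_genre_info_2.csv")
print(path)

# Reading it then looks like this (requires a live Spark session on the
# cluster, so it is commented out here):
# csvFile = spark.read.csv(path, header=True, inferSchema=True)
```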

jupyter pyspark outputs: No module name sknn.mlp

Submitted by ℡╲_俬逩灬. on 2019-12-10 11:55:30
Question: I have a 1-worker-node Spark HDInsight cluster. I need to use the scikit-neuralnetwork and vaderSentiment modules in PySpark Jupyter. I installed the libraries using the commands below: cd /usr/bin/anaconda/bin/ export PATH=/usr/bin/anaconda/bin:$PATH conda update matplotlib conda install Theano pip install scikit-neuralnetwork pip install vaderSentiment Next I open a pyspark terminal and I am able to successfully import the module. Screenshot below. Now, I open a Jupyter PySpark notebook: Just to add, I am able
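
The usual cause of this symptom is that packages installed from an SSH session land in one Python interpreter while the Jupyter kernel runs a different one. A quick diagnostic, run in both the pyspark shell and a Jupyter cell, makes the mismatch visible (plain Python, no HDInsight specifics assumed):

```python
# If the interpreter paths printed in the pyspark shell and in Jupyter
# differ, packages installed for one are invisible to the other.
import sys

print("interpreter:", sys.executable)
print("search path:")
for p in sys.path:
    print("  ", p)
```

If they do differ, a common Jupyter idiom is to install into the kernel's own interpreter with `!{sys.executable} -m pip install scikit-neuralnetwork vaderSentiment` from a notebook cell.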

Azure Storm vs Azure Stream Analytics

Submitted by 我是研究僧i on 2019-12-08 18:15:59
Question: I'm looking to do real-time metric calculations on event streams; what is a good choice in Azure, Stream Analytics or Storm? I am comfortable with either SQL or Java, so I'm wondering what the other differences are.

Answer 1: It depends on your needs and requirements. I'll try to lay out the strengths and benefits of both. In terms of setup, Stream Analytics has Storm beat. Stream Analytics is great if you need to ask a lot of different questions often. Stream Analytics can also only handle CSV or JSON

Spark SQL slow execution with resource idle

Submitted by 一笑奈何 on 2019-12-08 02:29:09
Question: I have a Spark SQL job that used to execute in under 10 minutes but now runs for 3 hours after a cluster migration, and I need to deep-dive into what it is actually doing. I'm new to Spark, so please don't mind if I'm asking something unrelated. I increased spark.executor.memory, but no luck. Env: Azure HDInsight Spark 2.4 on Azure Storage. SQL: read and join some data, then write the result to a Hive metastore. The spark.sql script ends with the code below: .write.mode("overwrite").saveAsTable("default.mikemiketable")

How to setup custom Spark parameter in HDInsights cluster with Data Factory

Submitted by 馋奶兔 on 2019-12-06 11:50:54
I am creating an HDInsight cluster on Azure according to this description. Now I would like to set custom Spark parameters, for example spark.yarn.appMasterEnv.PYSPARK3_PYTHON or spark_daemon_memory, at cluster provisioning time. Is it possible to set this up using Data Factory / Automation Account? I cannot find any example of doing this. Thanks.

You can use SparkConfig in Data Factory to pass these configurations to Spark. For example: "typeProperties": { ... "sparkConfig": { "spark.submit.pyFiles": "/dist/package_name-1.0.0-py3.5.egg", "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/usr/bin/anaconda/envs
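
For context, a fuller `sparkConfig` fragment inside a Data Factory HDInsight Spark activity might look like the sketch below. The egg path is the answer's own example; the `rootPath`, `entryFilePath`, and the Anaconda environment path are hypothetical placeholders that would need to match your cluster:

```json
{
  "typeProperties": {
    "rootPath": "adfspark",
    "entryFilePath": "main.py",
    "sparkConfig": {
      "spark.submit.pyFiles": "/dist/package_name-1.0.0-py3.5.egg",
      "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/usr/bin/anaconda/envs/py35/bin/python"
    }
  }
}
```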

Creating hive partitions for multiple months using one script

Submitted by 你离开我真会死。 on 2019-12-06 10:53:15
Question: I have data for 4 years, like '2011 2012 2013 2014'. I have to run queries based on one month's data, so I am creating partitions as below: ALTER TABLE table1_2010Jan ADD PARTITION(year='2010', month='01', day='01') LOCATION 'path'; ALTER TABLE table1_2010Jan ADD PARTITION(year='2010', month='01', day='02') LOCATION 'path'; ALTER TABLE table1_2010Jan ADD PARTITION(year='2010', month='01', day='03') LOCATION 'path'; I am creating individual partitions like the above for every day of every month. I want to know whether we can write a script (in any language) and run it once to create these partitions for
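
The per-day ALTER TABLE statements can be generated rather than hand-written. A minimal Python sketch (the table name matches the question; the LOCATION pattern is a placeholder that depends on your storage layout):

```python
# Emit one ADD PARTITION statement per day of a month. The output can be
# saved to a .hql file and run once, e.g. with `hive -f partitions.hql`.
import calendar

def month_partitions(table, year, month, location_fmt):
    stmts = []
    days = calendar.monthrange(year, month)[1]   # days in this month
    for day in range(1, days + 1):
        loc = location_fmt.format(year=year, month=month, day=day)
        stmts.append(
            "ALTER TABLE {t} ADD IF NOT EXISTS PARTITION"
            "(year='{y}', month='{m:02d}', day='{d:02d}') "
            "LOCATION '{loc}';".format(t=table, y=year, m=month, d=day, loc=loc)
        )
    return stmts

stmts = month_partitions("table1_2010Jan", 2010, 1,
                         "/data/{year}/{month:02d}/{day:02d}")
print(stmts[0])
print(len(stmts))   # 31 statements, one per day of January
```

Looping the call over all months of 2011-2014 then produces the full partition script in one pass.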

ConcurrentModificationException when using Spark collectionAccumulator

Submitted by 南楼画角 on 2019-12-05 20:31:08
Question: I'm trying to run a Spark-based application on an Azure HDInsight on-demand cluster, and I am seeing lots of SparkExceptions (caused by ConcurrentModificationExceptions) being logged. The application runs without these errors when I start a local Spark instance. I've seen reports of similar errors when using accumulators, and my code does indeed use a CollectionAccumulator; however, I have placed synchronized blocks everywhere I use it, and it makes no difference. The accumulator-related code
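
ConcurrentModificationException is thrown by Java's fail-fast iterators when a collection is mutated while something is iterating it, so synchronized blocks around the writers do not help if a reader (for example, Spark serializing the accumulator during a heartbeat) iterates outside those blocks. A language-neutral sketch of the hazard and the usual fix, iterating over a snapshot, in plain Python rather than the Spark API:

```python
# Java's fail-fast iterators raise ConcurrentModificationException when a
# collection changes mid-iteration; Python's dict raises RuntimeError in
# the analogous situation, which makes the pattern easy to demonstrate.
data = {"a": 1, "b": 2}

failed = False
try:
    for key in data:            # iterate while mutating: fails fast
        data[key + "_copy"] = 0
except RuntimeError:
    failed = True

# The fix: iterate over a snapshot, so writes cannot race the reader.
# In Spark terms: copy the accumulator's value before walking it.
data = {"a": 1, "b": 2}
for key in list(data):          # list() snapshots the keys up front
    data[key + "_copy"] = 0

print(failed)                   # True: the unsnapshotted loop failed
print(sorted(data))             # ['a', 'a_copy', 'b', 'b_copy']
```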

How to connect to HBase / Hadoop Database using C#

Submitted by 半城伤御伤魂 on 2019-12-04 14:22:03
Question: Recently I have been exploring Microsoft HDInsight Hadoop for Windows, but I don't know where to begin with Apache Hadoop and C# / ASP.NET MVC. I know http://hadoopsdk.codeplex.com/ is the best available resource to start with, but I can't find documentation that starts from scratch: creating a cluster and a database, then connecting them to a C# app.

Answer 1: The easiest way to get started is to use the HDInsight service on Azure (which is still in preview, but works well). That way you can just log into your azure