azure-databricks | 易学教程

How can I resolve “SparkException: Exception thrown in Future.get” issue?

Question: I'm working on two PySpark DataFrames and doing a left-anti join on them to track daily changes and then send an email. The first time I tried: diff = Table_a.join(Table_b, [Table_a.col1 == Table_b.col1, Table_a.col2 == Table_b.col2], how='left_anti') The expected output is a PySpark DataFrame with some or no data. This diff DataFrame gets its schema from Table_a. The first time I ran it, it showed no data, as expected, along with the schema. From the next run onwards it just throws
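For reference, a left-anti join keeps only the rows of the left table that have no match in the right table on the join keys, so the result always has the left table's columns. The following is a minimal plain-Python sketch of that behaviour, not the asker's code: the key names col1 and col2 come from the question, while the sample rows and the helper left_anti are made up for illustration.

```python
# Hypothetical sample rows; only the key column names col1/col2 come from the question.
table_a = [
    {"col1": 1, "col2": "x", "val": 10},
    {"col1": 2, "col2": "y", "val": 20},
    {"col1": 3, "col2": "z", "val": 30},
]
table_b = [
    {"col1": 1, "col2": "x"},
    {"col1": 3, "col2": "q"},
]


def left_anti(left, right, keys):
    """Return the rows of `left` whose key tuple has no match in `right`."""
    right_keys = {tuple(row[k] for k in keys) for row in right}
    return [row for row in left if tuple(row[k] for k in keys) not in right_keys]


# Rows with col1 == 2 and col1 == 3 survive: neither (2, "y") nor (3, "z") is in table_b.
diff = left_anti(table_a, table_b, ["col1", "col2"])
```

One difference to keep in mind: this sketch treats None keys as equal, whereas an equality join in Spark never matches null keys.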

How to properly access dbutils in Scala when using Databricks Connect

Question: I'm using Databricks Connect to run code on my Azure Databricks cluster from IntelliJ IDEA (Scala) on my local machine. Everything works fine: I can connect, debug, and inspect locally in the IDE. I created a Databricks Job to run my custom app JAR, but it fails with the following exception: 19/08/17 19:20:26 ERROR Uncaught throwable from user code: java.lang.NoClassDefFoundError: com/databricks/service/DBUtils$ at Main$.<init>(Main.scala:30) at Main$.<clinit>(Main.scala) Line 30 of my Main.scala class is

creating dataframe specific schema : StructField starting with capital letter

Question: Apologies for the lengthy post for a seemingly simple curiosity, but I wanted to give full context... In Databricks, I am creating a "row" of data based on a specific schema definition, and then inserting that row into an empty DataFrame (also based on the same schema). The schema definition looks like this: myschema_xb = StructType([StructField("_xmlns", StringType(), True), StructField("_Version", DoubleType(), True), StructField("MyIds", ArrayType(StructType([StructField("_ID

Installing Maven library on Databricks via Python commands and dbutils

Question: On Databricks I would like to install a Maven library through commands in a Python notebook if it's not already installed. If it were a Python PyPI library I would do something like the following: # Get a list of all available libraries library_name_list = dbutils.library.list() # Suppose the library of interest was "scikit-learn" if "scikit-learn" not in library_name_list: # Install the library dbutils.library.installPyPI("scikit-learn") How can I do the same for a Maven library "com.microsoft

Databricks dbutils throwing NullPointerException

Question: I'm trying to read a secret from Azure Key Vault using Databricks dbutils, but I'm getting the following exception: OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0 Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds Exception in thread "main" java.lang.NullPointerException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect

Efficient way of reading parquet files between a date range in Azure Databricks

Question: I would like to know whether the pseudocode below is an efficient way to read multiple parquet files between a date range stored in Azure Data Lake from PySpark (Azure Databricks). Note: the parquet files are not partitioned by date. I'm using the uat/EntityName/2019/01/01/EntityName_2019_01_01_HHMMSS.parquet convention for storing data in ADL, as suggested in the book Big Data by Nathan Marz, with a slight modification (using 2019 instead of year=2019). Read all data using * wildcard: df = spark.read.parquet
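One alternative to reading everything with a wildcard is to build one path per day in the range and read only those. The sketch below is a hedged illustration, not the asker's pseudocode: the helper daily_paths and the example dates are made up, and the layout base/EntityName/YYYY/MM/DD/ is taken from the convention described in the question.

```python
from datetime import date, timedelta


def daily_paths(base, entity, start, end):
    """Return one glob per day in [start, end] for the base/entity/YYYY/MM/DD/ layout."""
    num_days = (end - start).days
    return [
        f"{base}/{entity}/{day:%Y/%m/%d}/*.parquet"
        for day in (start + timedelta(days=n) for n in range(num_days + 1))
    ]


# Example dates are arbitrary; the range crosses a month boundary on purpose.
paths = daily_paths("uat", "EntityName", date(2019, 1, 30), date(2019, 2, 2))
```

The resulting list could then be passed to spark.read.parquet(*paths), which accepts several paths. Days with no files may need to be filtered out first, since a path that matches nothing can make the read fail.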

Appending column name to column value using Spark

Question: I have data in a comma-separated file and have loaded it into a Spark DataFrame. The data looks like: A B C 1 2 3 4 5 6 7 8 9 I want to transform the above DataFrame using PySpark into: A B C A_1 B_2 C_3 A_4 B_5 C_6 Then convert it to a list of lists using PySpark: [[A_1, B_2, C_3], [A_4, B_5, C_6]] And then run the FP-Growth algorithm using PySpark on the above data set. The code that I have tried is below: from pyspark.sql.functions import col, size from pyspark.sql
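The target transformation can be shown in plain Python before worrying about the Spark API. This is only a sketch of the desired output, assuming the three columns and three rows shown in the question; the variable names are illustrative.

```python
# Column names and values as shown in the question's sample data.
columns = ["A", "B", "C"]
rows = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# Prefix every value with its column name, producing one list of items per row.
transactions = [
    [f"{name}_{value}" for name, value in zip(columns, row)]
    for row in rows
]
```

In PySpark the same per-column prefixing could be expressed with functions such as concat_ws, lit, and col from pyspark.sql.functions, applied to each column in turn.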

Databricks : Equivalent code for SQL query

Question: I'm looking for the equivalent Databricks code for the query below. I added some sample data and the expected output as well, but in particular I'm looking for the equivalent code in Databricks for the query. For the moment I'm stuck on the CROSS APPLY STRING_SPLIT part. Sample SQL data: CREATE TABLE FactTurnover ( ID INT, SalesPriceExcl NUMERIC (9,4), Discount VARCHAR(100) ) INSERT INTO FactTurnover VALUES (1, 100, '10'), (2, 39.5877, '58, 12'), (3, 100, '50, 10, 15'), (4, 100, 'B') Query: ;WITH CTE AS
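The CROSS APPLY STRING_SPLIT step turns each row into one row per comma-separated Discount value. Since the query itself is cut off above, the sketch below covers only that expansion, in plain Python, using the sample rows from the INSERT statement; the helper split_discounts is a made-up name.

```python
# Sample rows copied from the question's INSERT statement: (ID, SalesPriceExcl, Discount).
fact_turnover = [
    (1, 100, "10"),
    (2, 39.5877, "58, 12"),
    (3, 100, "50, 10, 15"),
    (4, 100, "B"),
]


def split_discounts(rows):
    """Yield one (ID, SalesPriceExcl, discount) tuple per comma-separated value."""
    for row_id, price, discount in rows:
        for part in discount.split(","):
            # Stripping whitespace is a choice made here; STRING_SPLIT itself keeps it.
            yield row_id, price, part.strip()


exploded = list(split_discounts(fact_turnover))
```

In Spark, this kind of expansion is commonly written with split and explode from pyspark.sql.functions.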

What is the cluster manager used in Databricks? How do I change the number of executors in Databricks clusters?

Question: What is the cluster manager used in Databricks? How do I change the number of executors in Databricks clusters? Answer 1: What is the cluster manager used in Databricks? Azure Databricks builds on the capabilities of Spark by providing a zero-management cloud platform that includes: fully managed Spark clusters; an interactive workspace for exploration and visualization; a platform for powering your favorite Spark-based applications. The Databricks Runtime is built on top of Apache Spark and is

Azure Databricks to Azure SQL DW: Long text columns

Question: I would like to populate an Azure SQL DW from an Azure Databricks notebook environment. I am using the built-in connector with PySpark: sdf.write \ .format("com.databricks.spark.sqldw") \ .option("forwardSparkAzureStorageCredentials", "true") \ .option("dbTable", "test_table") \ .option("url", url) \ .option("tempDir", temp_dir) \ .save() This works fine, but I get an error when I include a string column with sufficiently long content. I get the following error: Py4JJavaError: An error