Question
I'm using Databricks Connect to run code in my Azure Databricks cluster locally from IntelliJ IDEA (Scala).
Everything works fine. I can connect, debug, inspect locally in the IDE.
I created a Databricks Job to run my custom app JAR, but it fails with the following exception:
19/08/17 19:20:26 ERROR Uncaught throwable from user code: java.lang.NoClassDefFoundError: com/databricks/service/DBUtils$
at Main$.<init>(Main.scala:30)
at Main$.<clinit>(Main.scala)
Line 30 of my Main.scala is:
val dbutils: DBUtils.type = com.databricks.service.DBUtils
This is just how it's described on this documentation page. That page shows a way to access DBUtils that works both locally and in the cluster, but the example only covers Python, and I'm using Scala.
What's the proper way to access it in a way that works both locally using databricks-connect and in a Databricks Job running a JAR?
UPDATE
It seems there are two ways of using DBUtils.
1) The DbUtils class described here. Quoting the docs, this library allows you to build and compile the project, but not run it. This doesn't let you run your local code on the cluster.
2) The Databricks Connect described here. This one allows you to run your local Spark code in a Databricks cluster.
The problem is that the two methods have different setups and package names. There doesn't seem to be a way to use Databricks Connect locally (its package isn't available in the cluster) while also having the JAR application use the DbUtils class added via sbt/maven, so that the cluster has access to it.
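For reference, this is roughly how the two variants look in Scala (the sbt coordinate for dbutils-api is the one from its docs; the comments reflect my understanding of where each package is available):
// 1) dbutils-api, added via sbt, e.g.
//    libraryDependencies += "com.databricks" % "dbutils-api_2.11" % "0.0.3"
//    This compiles everywhere, but the implementation is only provided by the cluster at runtime.
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

// 2) Databricks Connect: the jars come from the local databricks-connect installation,
//    and the com.databricks.service package is not present on the cluster.
val dbutilsConnect = com.databricks.service.DBUtils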
Answer 1:
I don't know why the docs you mentioned don't work. Maybe you're using a different dependency?
These docs have an example application you can download. It's a project with a very minimal test, so it doesn't create jobs or try to run them on the cluster -- but it's a start. Also, please note that it uses the older 0.0.1 version of dbutils-api.
So to fix your current issue, instead of using com.databricks.service.DBUtils, try importing dbutils from a different place:
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils
Or, if you prefer:
import com.databricks.dbutils_v1.{DBUtilsV1, DBUtilsHolder}
type DBUtils = DBUtilsV1
val dbutils: DBUtils = DBUtilsHolder.dbutils
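Either form can then be used from a job entry point just like before; a minimal sketch (the dbfs:/ path is only an example):
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

object Main {
  def main(args: Array[String]): Unit = {
    // DBUtilsHolder resolves to the real implementation when running on the cluster
    dbutils.fs.ls("dbfs:/").foreach(f => println(f.path))
  }
}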
Also, make sure that you have the following dependency in SBT (maybe try to play with versions if 0.0.3 doesn't work -- the latest one is 0.0.4):
libraryDependencies += "com.databricks" % "dbutils-api_2.11" % "0.0.3"
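If you build a fat JAR for the job, it may also help to mark this dependency as provided, since the cluster ships its own implementation at runtime -- an assumption, so adjust it to your build:
libraryDependencies += "com.databricks" % "dbutils-api_2.11" % "0.0.3" % Provided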
This question and answer pointed me in the right direction. The answer contains a link to a working GitHub repo which uses dbutils: waimak. I hope that repo can help you with further questions about Databricks config and dependencies.
Good luck!
UPDATE
I see, so we have two similar but not identical APIs, and no good way to switch between the local and the backend version (though Databricks Connect promises that it should work anyhow). Please let me propose a workaround.
Scala is convenient for writing adapters, so here's a code snippet that should work as a bridge: the DBUtils object defined here provides a sufficient abstraction over the two versions of the API, the Databricks Connect one at com.databricks.service.DBUtils and the backend com.databricks.dbutils_v1.DBUtilsHolder.dbutils API. We achieve that by loading and using com.databricks.service.DBUtils through reflection -- there are no hard-coded imports of it.
package com.example.my.proxy.adapter
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.sql.catalyst.DefinedByConstructorParams
import scala.util.Try
import scala.language.implicitConversions
import scala.language.reflectiveCalls
trait DBUtilsApi {
  type FSUtils
  type FileInfo
  type SecretUtils
  type SecretMetadata
  type SecretScope

  val fs: FSUtils
  val secrets: SecretUtils
}

trait DBUtils extends DBUtilsApi {
  trait FSUtils {
    def dbfs: org.apache.hadoop.fs.FileSystem
    def ls(dir: String): Seq[FileInfo]
    def rm(dir: String, recurse: Boolean = false): Boolean
    def mkdirs(dir: String): Boolean
    def cp(from: String, to: String, recurse: Boolean = false): Boolean
    def mv(from: String, to: String, recurse: Boolean = false): Boolean
    def head(file: String, maxBytes: Int = 65536): String
    def put(file: String, contents: String, overwrite: Boolean = false): Boolean
  }

  case class FileInfo(path: String, name: String, size: Long)

  trait SecretUtils {
    def get(scope: String, key: String): String
    def getBytes(scope: String, key: String): Array[Byte]
    def list(scope: String): Seq[SecretMetadata]
    def listScopes(): Seq[SecretScope]
  }

  case class SecretMetadata(key: String) extends DefinedByConstructorParams
  case class SecretScope(name: String) extends DefinedByConstructorParams
}
object DBUtils extends DBUtils {
  import Adapters._

  override lazy val (fs, secrets): (FSUtils, SecretUtils) = Try[(FSUtils, SecretUtils)](
    (ReflectiveDBUtils.fs, ReflectiveDBUtils.secrets) // try to use the Databricks Connect API
  ).getOrElse(
    (BackendDBUtils.fs, BackendDBUtils.secrets) // if it's not available, use com.databricks.dbutils_v1.DBUtilsHolder
  )

  private object Adapters {
    // The apparent code copying here is for performance -- the ones for `ReflectiveDBUtils` use reflection, while
    // the `BackendDBUtils` call the functions directly.

    implicit class FSUtilsFromBackend(underlying: BackendDBUtils.FSUtils) extends FSUtils {
      override def dbfs: FileSystem = underlying.dbfs
      override def ls(dir: String): Seq[FileInfo] = underlying.ls(dir).map(fi => FileInfo(fi.path, fi.name, fi.size))
      override def rm(dir: String, recurse: Boolean = false): Boolean = underlying.rm(dir, recurse)
      override def mkdirs(dir: String): Boolean = underlying.mkdirs(dir)
      override def cp(from: String, to: String, recurse: Boolean = false): Boolean = underlying.cp(from, to, recurse)
      override def mv(from: String, to: String, recurse: Boolean = false): Boolean = underlying.mv(from, to, recurse)
      override def head(file: String, maxBytes: Int = 65536): String = underlying.head(file, maxBytes)
      override def put(file: String, contents: String, overwrite: Boolean = false): Boolean = underlying.put(file, contents, overwrite)
    }

    implicit class FSUtilsFromReflective(underlying: ReflectiveDBUtils.FSUtils) extends FSUtils {
      override def dbfs: FileSystem = underlying.dbfs
      override def ls(dir: String): Seq[FileInfo] = underlying.ls(dir).map(fi => FileInfo(fi.path, fi.name, fi.size))
      override def rm(dir: String, recurse: Boolean = false): Boolean = underlying.rm(dir, recurse)
      override def mkdirs(dir: String): Boolean = underlying.mkdirs(dir)
      override def cp(from: String, to: String, recurse: Boolean = false): Boolean = underlying.cp(from, to, recurse)
      override def mv(from: String, to: String, recurse: Boolean = false): Boolean = underlying.mv(from, to, recurse)
      override def head(file: String, maxBytes: Int = 65536): String = underlying.head(file, maxBytes)
      override def put(file: String, contents: String, overwrite: Boolean = false): Boolean = underlying.put(file, contents, overwrite)
    }

    implicit class SecretUtilsFromBackend(underlying: BackendDBUtils.SecretUtils) extends SecretUtils {
      override def get(scope: String, key: String): String = underlying.get(scope, key)
      override def getBytes(scope: String, key: String): Array[Byte] = underlying.getBytes(scope, key)
      override def list(scope: String): Seq[SecretMetadata] = underlying.list(scope).map(sm => SecretMetadata(sm.key))
      override def listScopes(): Seq[SecretScope] = underlying.listScopes().map(ss => SecretScope(ss.name))
    }

    implicit class SecretUtilsFromReflective(underlying: ReflectiveDBUtils.SecretUtils) extends SecretUtils {
      override def get(scope: String, key: String): String = underlying.get(scope, key)
      override def getBytes(scope: String, key: String): Array[Byte] = underlying.getBytes(scope, key)
      override def list(scope: String): Seq[SecretMetadata] = underlying.list(scope).map(sm => SecretMetadata(sm.key))
      override def listScopes(): Seq[SecretScope] = underlying.listScopes().map(ss => SecretScope(ss.name))
    }
  }
}
object BackendDBUtils extends DBUtilsApi {
  import com.databricks.dbutils_v1

  private lazy val dbutils: DBUtils = dbutils_v1.DBUtilsHolder.dbutils
  override lazy val fs: FSUtils = dbutils.fs
  override lazy val secrets: SecretUtils = dbutils.secrets

  type DBUtils = dbutils_v1.DBUtilsV1
  type FSUtils = dbutils_v1.DbfsUtils
  type FileInfo = com.databricks.backend.daemon.dbutils.FileInfo
  type SecretUtils = dbutils_v1.SecretUtils
  type SecretMetadata = dbutils_v1.SecretMetadata
  type SecretScope = dbutils_v1.SecretScope
}
object ReflectiveDBUtils extends DBUtilsApi {
  // This throws a ClassNotFoundException when the Databricks Connect API isn't available -- it's much better than
  // the NoClassDefFoundError, which we would get if we had a hard-coded import of com.databricks.service.DBUtils.
  // As we're just using reflection, we're able to recover if it's not found.
  private lazy val dbutils: DBUtils =
    Class.forName("com.databricks.service.DBUtils$").getField("MODULE$").get(null).asInstanceOf[DBUtils]

  override lazy val fs: FSUtils = dbutils.fs
  override lazy val secrets: SecretUtils = dbutils.secrets

  type DBUtils = AnyRef {
    val fs: FSUtils
    val secrets: SecretUtils
  }

  type FSUtils = AnyRef {
    def dbfs: org.apache.hadoop.fs.FileSystem
    def ls(dir: String): Seq[FileInfo]
    def rm(dir: String, recurse: Boolean): Boolean
    def mkdirs(dir: String): Boolean
    def cp(from: String, to: String, recurse: Boolean): Boolean
    def mv(from: String, to: String, recurse: Boolean): Boolean
    def head(file: String, maxBytes: Int): String
    def put(file: String, contents: String, overwrite: Boolean): Boolean
  }

  type FileInfo = AnyRef {
    val path: String
    val name: String
    val size: Long
  }

  type SecretUtils = AnyRef {
    def get(scope: String, key: String): String
    def getBytes(scope: String, key: String): Array[Byte]
    def list(scope: String): Seq[SecretMetadata]
    def listScopes(): Seq[SecretScope]
  }

  type SecretMetadata = DefinedByConstructorParams { val key: String }
  type SecretScope = DefinedByConstructorParams { val name: String }
}
If you replace the val dbutils: DBUtils.type = com.databricks.service.DBUtils which you mentioned in your Main with val dbutils: DBUtils.type = com.example.my.proxy.adapter.DBUtils, everything should work as a drop-in replacement, both locally and remotely.
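For illustration, a hypothetical Main using the adapter (the path and the printed fields are just placeholders):
import com.example.my.proxy.adapter.DBUtils

object Main {
  // resolves to Databricks Connect locally and to the backend dbutils inside a JAR job
  val dbutils: DBUtils.type = DBUtils

  def main(args: Array[String]): Unit = {
    dbutils.fs.ls("dbfs:/").foreach(f => println(f.path))
    dbutils.secrets.listScopes().foreach(s => println(s.name))
  }
}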
If you have some new NoClassDefFoundErrors, try adding specific dependencies to the JAR job, or maybe try rearranging them, changing the versions, or marking the dependencies as provided.
This adapter isn't pretty, and it uses reflection, but it should be good enough as a workaround, I hope. Good luck :)
Answer 2:
To access the dbutils.fs and dbutils.secrets Databricks Utilities, you use the DBUtils module.
Example: Accessing DBUtils in Scala programming looks like this:
val dbutils = com.databricks.service.DBUtils
println(dbutils.fs.ls("dbfs:/"))
println(dbutils.secrets.listScopes())
Reference: Databricks - Accessing DBUtils.
Hope this helps.
Source: https://stackoverflow.com/questions/58941808/how-to-properly-access-dbutils-in-scala-when-using-databricks-connect