Question
I read here that the storage limit on AWS Databricks is 5 TB for an individual file, and that we can store as many files as we want. Does the same limit apply to Azure Databricks, or is there some other limit applied on Azure Databricks?
Update:
@CHEEKATLAPRADEEP Thanks for the explanation, but can someone please share the reason behind: "we recommend that you store data in mounted object storage rather than in the DBFS root"?
I need to use DirectQuery in Power BI (because of the huge data size), and ADLS doesn't support that as of now.
Answer 1:
From Azure Databricks Best Practices: Do not Store any Production Data in Default DBFS Folders
Important Note: Even though the DBFS root is writeable, we recommend that you store data in mounted object storage rather than in the DBFS root.
The reasons for recommending that you store data in a mounted storage account, rather than in the storage account located inside the ADB workspace, are listed below (a minimal mount sketch follows the list):
Reason 1: You don't have write permission when you use the same storage account externally via Storage Explorer.
Reason 2: You cannot use the same storage account for another ADB workspace, or use the same storage account linked service for Azure Data Factory or an Azure Synapse workspace.
Reason 3: In the future, you may decide to use Azure Synapse workspaces instead of ADB.
Reason 4: What if you want to delete the existing workspace?
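For illustration, here is a minimal sketch of mounting an ADLS Gen2 container with dbutils.fs.mount so that data lands in external object storage instead of the DBFS root. The storage account, container, tenant ID, secret scope, and key names are placeholders, not values from the original answer.

```python
# Minimal sketch: mount an ADLS Gen2 container into DBFS so data is stored in
# an external storage account rather than the DBFS root.
# NOTE: the storage account, container, tenant ID, and secret scope/key names
# below are placeholders for illustration only.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("my-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("my-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/mydata",
    extra_configs=configs,
)

# Anything written under /mnt/mydata now lands in the external storage account,
# which can also be reached from Storage Explorer, ADF, or another workspace.
spark.range(10).write.format("delta").mode("overwrite").save("/mnt/mydata/example_table")
```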
Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters. DBFS is an abstraction on top of scalable object storage, i.e. ADLS Gen2.
There is no restriction on the amount of data you can store in Azure Data Lake Storage Gen2.
Note: Azure Data Lake Storage Gen2 is able to store and serve many exabytes of data.
For the Azure Databricks File System (DBFS) – only files less than 2 GB in size are supported.
Note: If you use local file I/O APIs to read or write files larger than 2 GB, you might see corrupted files. Instead, access files larger than 2 GB using the DBFS CLI, dbutils.fs, or Spark APIs, or use the /dbfs/ml folder.
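As a rough illustration of that note (the paths below are placeholders I introduced, not paths from the answer), this sketch contrasts the local-file route that is subject to the 2 GB limit with the Spark and dbutils.fs routes that are not:

```python
# Placeholder path for illustration; assumes a mounted location /mnt/mydata exists.
big_file = "dbfs:/mnt/mydata/events.parquet"

# Risky for files > 2 GB: local file I/O through the /dbfs/ FUSE mount.
# with open("/dbfs/mnt/mydata/events.parquet", "rb") as f:
#     data = f.read()

# Preferred: Spark APIs read the file in a distributed way, with no 2 GB limit.
df = spark.read.parquet(big_file)
print(df.count())

# dbutils.fs also operates on DBFS paths, e.g. for listing or copying large files.
display(dbutils.fs.ls("dbfs:/mnt/mydata/"))
```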
For Azure Storage – the maximum storage account capacity is 5 PiB.
Azure documents default limits for general-purpose v1, v2, Blob storage, and block blob storage accounts. The ingress limit refers to all data that is sent to a storage account; the egress limit refers to all data that is received from a storage account.
Note: The limit on a single block blob is 4.75 TB.
Answer 2:
Databricks documentation states:
Support only files less than 2 GB in size. If you use local file I/O APIs to read or write files larger than 2 GB, you might see corrupted files. Instead, access files larger than 2 GB using the DBFS CLI, dbutils.fs, or Spark APIs.
You can read more here: https://docs.microsoft.com/en-us/azure/databricks/data/databricks-file-system
Source: https://stackoverflow.com/questions/62028296/what-is-the-data-size-limit-of-dbfs-in-azure-databricks