Partitioning Data in SQL On-Demand with Blob Storage as Data Source

Submitted by 独自空忆成欢 on 2020-12-15 07:16:07

Question


In Amazon Redshift there is a way to create a partition key when using an S3 bucket as a data source.

I am attempting to do something similar in Azure Synapse using the SQL On-Demand service.

Currently I have a storage account that is partitioned such that it follows this scheme:

-Sales (folder)
  - 2020-10-01 (folder)
    - File 1
    - File 2
  - 2020-10-02 (folder)
    - File 3
    - File 4

To create a view and pull in all 4 files I ran the command:

CREATE VIEW testview3 AS
SELECT *
FROM OPENROWSET (
        BULK 'Sales/*/*.csv',
        FORMAT = 'CSV', PARSER_VERSION = '2.0',
        DATA_SOURCE = 'AzureBlob',
        FIELDTERMINATOR = ',',
        FIRSTROW = 2
        ) AS tv1;

If I run SELECT * FROM testview3, I receive data from all 4 files.

How can I go about creating a partition key so that I could run a query such as

SELECT * FROM testview3 WHERE folderdate > '2020-10-01'

so that I can only analyze data from Files 3 and 4?

I know I can edit my OPENROWSET BULK statement but I want to be able to get all the data from my container at first and then constrain searches as needed.


Answer 1:


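Serverless SQL can work with a partitioned folder structure through two functions available on OPENROWSET results: filename(), which returns the name of the file a row came from (useful when you want a specific file or files), and filepath(), which returns the file's path or, when given a number n, the part of the path matched by the nth wildcard in the BULK pattern (useful when you want all files under certain folders). More information on syntax and usage is available in the online documentation.

In your case, you can read only the files in folders after '2020-10-01' with a filter such as filepath(1) > '2020-10-01'. Here filepath(1) is the text matched by the first * in 'Sales/*/*.csv', i.e. the date folder name; use >= if 2020-10-01 itself should be included.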
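As a rough sketch of that filter applied directly to the OPENROWSET call from the question (the column aliases folderdate and source_file are made up for illustration, and the comparison is a plain string comparison, which works here only because the folder names are in yyyy-MM-dd form):

```sql
-- Sketch: read only the folders whose name sorts after '2020-10-01'.
SELECT *,
    r.filepath(1) AS folderdate,   -- text matched by the first * (the date folder)
    r.filename()  AS source_file   -- name of the file each row was read from
FROM OPENROWSET (
        BULK 'Sales/*/*.csv',
        FORMAT = 'CSV', PARSER_VERSION = '2.0',
        DATA_SOURCE = 'AzureBlob',
        FIELDTERMINATOR = ',',
        FIRSTROW = 2
        ) AS [r]
WHERE r.filepath(1) > '2020-10-01';
```

With the folder layout from the question, this should return rows from Files 3 and 4 only.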




Answer 2:


To expand on the answer from Raunak, I ended up with the following syntax for my query:

DROP VIEW IF EXISTS testview6
GO

CREATE VIEW testview6 AS
SELECT *,
    r.filepath(1) AS [date]
FROM OPENROWSET (
        BULK 'Sales/*/*.csv',
        FORMAT = 'CSV', PARSER_VERSION = '2.0',
        DATA_SOURCE = 'AzureBlob',
        FIELDTERMINATOR = ',',
        FIRSTROW = 2
        ) AS [r]
WHERE r.filepath(1) IN ('2020-10-02');
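The view above hardcodes the date filter in its definition. If the goal is to expose all folders and constrain each query separately, as the question asks, one possible variation is to leave the WHERE clause out of the view and filter on the exposed column instead. This is a sketch, not a tested setup: the view name sales_by_folder and the column name folderdate are hypothetical, and whether the engine skips the non-matching folders should be confirmed against the documentation for your workspace.

```sql
DROP VIEW IF EXISTS sales_by_folder
GO

-- The view exposes the folder name as a column and applies no filter itself.
CREATE VIEW sales_by_folder AS
SELECT *,
    r.filepath(1) AS folderdate   -- text matched by the first * in the BULK path
FROM OPENROWSET (
        BULK 'Sales/*/*.csv',
        FORMAT = 'CSV', PARSER_VERSION = '2.0',
        DATA_SOURCE = 'AzureBlob',
        FIELDTERMINATOR = ',',
        FIRSTROW = 2
        ) AS [r];
GO

-- Each query then decides which folders it needs.
SELECT * FROM sales_by_folder WHERE folderdate > '2020-10-01';
```

The folder name is returned as text, so the date literal has to be quoted.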

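You can adjust the granularity of your partitioning by adding extra wildcards (*) to the BULK path and matching r.filepath(x) columns, where x is the position of the wildcard in the path.

For instance, the following view splits the folder name into year and month. Note that the pattern 'Sales/*-*-01/*.csv' only matches folders whose day part is 01: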

DROP VIEW IF EXISTS testview6
GO

CREATE VIEW testview6 AS
SELECT *,
    r.filepath(1) AS [year],
    r.filepath(2) AS [month]
FROM OPENROWSET (
        BULK 'Sales/*-*-01/*.csv',
        FORMAT = 'CSV', PARSER_VERSION = '2.0',
        DATA_SOURCE = 'AzureBlob',
        FIELDTERMINATOR = ',',
        FIRSTROW = 2
        ) AS [r]
WHERE r.filepath(1) IN ('2020')
AND r.filepath(2) IN ('10');
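Following the same idea, a third wildcard could replace the fixed 01 so that every day folder is covered and the day becomes its own column. This is only a sketch under the assumption that the folder names always have the yyyy-MM-dd shape; the view name sales_by_ymd is hypothetical.

```sql
DROP VIEW IF EXISTS sales_by_ymd
GO

CREATE VIEW sales_by_ymd AS
SELECT *,
    r.filepath(1) AS [year],    -- first *  in the folder name
    r.filepath(2) AS [month],   -- second * in the folder name
    r.filepath(3) AS [day]      -- third *  in the folder name
FROM OPENROWSET (
        BULK 'Sales/*-*-*/*.csv',
        FORMAT = 'CSV', PARSER_VERSION = '2.0',
        DATA_SOURCE = 'AzureBlob',
        FIELDTERMINATOR = ',',
        FIRSTROW = 2
        ) AS [r];
GO

-- Values are text, so zero-padded strings are compared.
SELECT *
FROM sales_by_ymd
WHERE [year] = '2020'
  AND [month] = '10'
  AND [day] >= '02';
```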


Source: https://stackoverflow.com/questions/64745782/partitioning-data-in-sql-on-demand-with-blob-storage-as-data-source
