Partitioning Data in SQL On-Demand with Blob Storage as Data Source

Question

In Amazon Redshift, you can create a partition key when using an S3 bucket as a data source.

I am attempting to do something similar in Azure Synapse using the SQL On-Demand service.

Currently I have a storage account that is partitioned such that it follows this scheme:

- Sales (folder)
  - 2020-10-01 (folder)
    - File 1
    - File 2
  - 2020-10-02 (folder)
    - File 3
    - File 4

To create a view and pull in all 4 files I ran the command:

CREATE VIEW testview3 AS
SELECT *
FROM OPENROWSET (
        BULK 'Sales/*/*.csv',
        FORMAT = 'CSV', PARSER_VERSION = '2.0',
        DATA_SOURCE = 'AzureBlob',
        FIELDTERMINATOR = ',',
        FIRSTROW = 2
        ) AS tv1;

If I run a query of SELECT * FROM [myview] I receive data from all 4 files.

How can I go about creating a partition key so that I could run a query such as

SELECT * FROM [myview] WHERE folderdate > '2020-10-01'

so that I can only analyze data from Files 3 and 4?

I know I can edit the BULK path in my OPENROWSET statement, but I want the view to expose all the data in my container first and then constrain queries as needed.

Answer 1:

Serverless SQL can parse partitioned folder structures using the filename function (when you want to load a specific file or files) and the filepath function (when you want to load all files under a given path). More information on syntax and usage is available in the online documentation.

In your case, you can select all files in folders dated after '2020-10-01' using the filepath syntax, such as filepath(1) > '2020-10-01'. Use >= instead if the 2020-10-01 folder itself should be included.
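As a rough illustration of that filter, the ad hoc query below counts rows per folder and file for folders that sort after '2020-10-01'. It reuses the BULK path and the 'AzureBlob' data source from the question; the column aliases are made up for this sketch. The comparison is a plain string comparison, which only orders correctly because the folder names are in yyyy-MM-dd form.

```sql
-- Sketch only: assumes the 'AzureBlob' data source and folder layout from the question.
SELECT
    r.filepath(1) AS [folderdate],   -- text matched by the first * in the BULK path
    r.filename()  AS [file_name],    -- name of the file each row came from
    COUNT_BIG(*)  AS [row_count]
FROM OPENROWSET (
        BULK 'Sales/*/*.csv',
        FORMAT = 'CSV', PARSER_VERSION = '2.0',
        DATA_SOURCE = 'AzureBlob',
        FIELDTERMINATOR = ',',
        FIRSTROW = 2
        ) AS [r]
WHERE r.filepath(1) > '2020-10-01'   -- string comparison on the folder name
GROUP BY r.filepath(1), r.filename();
```

With the layout in the question, this should touch only the 2020-10-02 folder, that is, Files 3 and 4.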

Answer 2:

To expand on the answer from Raunak, I ended up with the following syntax for my query.

DROP VIEW IF EXISTS testview6
GO

CREATE VIEW testview6 AS
SELECT *,
    r.filepath(1) AS [date]
FROM OPENROWSET (
        BULK 'Sales/*/*.csv',
        FORMAT = 'CSV', PARSER_VERSION = '2.0',
        DATA_SOURCE = 'AzureBlob',
        FIELDTERMINATOR = ',',
        FIRSTROW = 2
        ) AS [r]
WHERE r.filepath(1) IN ('2020-10-02');
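If the goal is a view that exposes every folder and is filtered only at query time, as the question asks, one possible variant is to drop the WHERE clause from the view and filter on the exposed column instead. This is a sketch under assumptions: the view name testview_all and the column alias folderdate are hypothetical, and it relies on serverless SQL pushing a filter on a filepath-derived column down to the folder level, which is worth confirming against the current documentation.

```sql
-- Sketch only: testview_all and [folderdate] are hypothetical names.
DROP VIEW IF EXISTS testview_all
GO

CREATE VIEW testview_all AS
SELECT *,
    r.filepath(1) AS [folderdate]    -- folder name matched by the first *
FROM OPENROWSET (
        BULK 'Sales/*/*.csv',
        FORMAT = 'CSV', PARSER_VERSION = '2.0',
        DATA_SOURCE = 'AzureBlob',
        FIELDTERMINATOR = ',',
        FIRSTROW = 2
        ) AS [r];
GO

-- Constrain the search when querying, not when defining the view.
SELECT *
FROM testview_all
WHERE [folderdate] > '2020-10-01';
```

Selecting from testview_all without a WHERE clause still returns the rows from all four files.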

You can adjust the granularity of your partitioning by adding extra wildcards (*) and matching r.filepath(n) calls, where n is the position of the wildcard in the BULK path.

For instance you can create your query such as:

DROP VIEW IF EXISTS testview6
GO

CREATE VIEW testview6 AS
SELECT *,
    r.filepath(1) AS [year],
    r.filepath(2) AS [month]
FROM OPENROWSET (
        BULK 'Sales/*-*-01/*.csv',
        FORMAT = 'CSV', PARSER_VERSION = '2.0',
        DATA_SOURCE = 'AzureBlob',
        FIELDTERMINATOR = ',',
        FIRSTROW = 2
        ) AS [r]
WHERE r.filepath(1) IN ('2020')
AND r.filepath(2) IN ('10');
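Taking the same idea one step further, a third wildcard can expose the day as well, so that year, month, and day each become a filterable column. The sketch below assumes every folder under Sales is named yyyy-MM-dd, as in the question; the view name testview_ymd is hypothetical, and the view itself carries no WHERE clause so the filtering happens in the query that uses it.

```sql
-- Sketch only: testview_ymd is a hypothetical name; folders are assumed to be yyyy-MM-dd.
DROP VIEW IF EXISTS testview_ymd
GO

CREATE VIEW testview_ymd AS
SELECT *,
    r.filepath(1) AS [year],         -- first *  in 'Sales/*-*-*/*.csv'
    r.filepath(2) AS [month],        -- second *
    r.filepath(3) AS [day]           -- third *
FROM OPENROWSET (
        BULK 'Sales/*-*-*/*.csv',
        FORMAT = 'CSV', PARSER_VERSION = '2.0',
        DATA_SOURCE = 'AzureBlob',
        FIELDTERMINATOR = ',',
        FIRSTROW = 2
        ) AS [r];
GO

-- Example: everything in October 2020 after the 1st.
SELECT *
FROM testview_ymd
WHERE [year] = '2020'
  AND [month] = '10'
  AND [day] > '01';
```

The values are strings, so comparisons such as [day] > '01' depend on the zero-padded folder names.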

Source: https://stackoverflow.com/questions/64745782/partitioning-data-in-sql-on-demand-with-blob-storage-as-data-source

Tags: sql, sql-server, amazon-redshift, azure-synapse, amazon-redshift-spectrum