How to access data in subdirectories for partitioned Athena table

Submitted by 試著忘記壹切 on 2020-12-12 18:49:15

Question


I have an Athena table with a partition for each day, where the actual files are in "sub-directories" by hour, as follows:

s3://my-bucket/data/2019/06/27/00/00001.json
s3://my-bucket/data/2019/06/27/00/00002.json
s3://my-bucket/data/2019/06/27/01/00001.json
s3://my-bucket/data/2019/06/27/01/00002.json

Athena is able to query this table without issue and find my data, but when using AWS Glue, it does not appear to be able to find this data.

ALTER TABLE mytable ADD 
PARTITION (year=2019, month=06, day=27) LOCATION 's3://my-bucket/data/2019/06/27/01';

select day, count(*)
from mytable
group by day;

| day | count  |
|-----|--------|
| 27  | 145431 |

I've already tried changing the location of the partition to end with a trailing slash (s3://my-bucket/data/2019/06/27/01/), but this didn't help.

Below are the partition properties in Glue. I was hoping that the storedAsSubDirectories setting would tell it to iterate the sub-directories, but this does not appear to be the case:

{
    "StorageDescriptor": {
        "cols": {
            "FieldSchema": [
                {
                    "name": "userid",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "labels",
                    "type": "array<string>",
                    "comment": ""
                }
            ]
        },
        "location": "s3://my-bucket/data/2019/06/27/01/",
        "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "compressed": "false",
        "numBuckets": "0",
        "SerDeInfo": {
            "name": "JsonSerDe",
            "serializationLib": "org.openx.data.jsonserde.JsonSerDe",
            "parameters": {
                "serialization.format": "1"
            }
        },
        "bucketCols": [],
        "sortCols": [],
        "parameters": {},
        "SkewedInfo": {
            "skewedColNames": [],
            "skewedColValues": [],
            "skewedColValueLocationMaps": {}
        },
        "storedAsSubDirectories": "true"
    },
    "parameters": {}
}

When Glue runs against this same partition/table, it finds 0 rows.

However, if all the data files appear in the root "directory" of the partition (i.e. s3://my-bucket/data/2019/06/27/00001.json), then both Athena and Glue can find the data.

Is there some reason why Glue is unable to find the data files? I'd prefer not to create a partition for each hour, since that will mean 8700 partitions per year (and Athena has a limit of 20,000 partitions per table).


Answer 1:


Apparently there's an undocumented additional option on create_dynamic_frame for "recurse": additional_options = {"recurse": True}

Example:

athena_datasource = glueContext.create_dynamic_frame.from_catalog(
    database = target_database,
    table_name = target_table,
    push_down_predicate = "(year=='2019' and month=='06' and day=='27')",
    transformation_ctx = "athena_datasource",
    additional_options = {"recurse": True}
)

I have just tested my Glue job with this option and can confirm that it now finds all the S3 files.




Answer 2:


The AWS Glue Data Catalog is supposed to hold metadata about the actual data, e.g. the table schema, the locations of partitions, etc. Partitions are a way of restricting Athena to scan only certain prefixes in your S3 bucket, for speed and cost efficiency. When you query data located in an S3 bucket with Athena, it uses the table definitions specified in the Glue Data Catalog. This also means that when you execute DDL statements in Athena, the corresponding table is created in the Glue Data Catalog. So I am not sure what you mean by "Glue finds 0 rows".

If you created your table using Athena like this:

CREATE EXTERNAL TABLE `mytable`(
  `labels` array<string>, 
  `userid` string)
PARTITIONED BY ( 
  `year` string, 
  `month` string, 
  `day` string, 
  `hour` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='labels,userid') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://my-bucket/data/'

Note that LOCATION points to the place where your partitioning starts. Adding a single partition should then look like:

ALTER TABLE mytable 
ADD PARTITION (year=2019, month=06, day=27, hour=00) 
LOCATION 's3://my-bucket/data/2019/06/27/00/';
ALTER TABLE mytable 
ADD PARTITION (year=2019, month=06, day=28, hour=00) 
LOCATION 's3://my-bucket/data/2019/06/28/00/';

After these two DDL statements you should be able to see mytable in the Glue Data Catalog with two partitions under the View partitions tab. Now, if you run a query without a WHERE clause:

SELECT 
    "day", COUNT(*)
FROM 
    mytable
GROUP BY "day";

Then all the data covered by your partitions will be scanned and you should get:

| day | count          |
|-----|----------------|
| 27  | some number    |
| 28  | another number |

Now, if you want to count records within a specific day, you need to include a WHERE clause:

SELECT 
    "day", COUNT(*)
FROM 
    mytable
WHERE(
    "day" = '27'
)
GROUP BY "day";

Then only the data under s3://my-bucket/data/2019/06/27/ will be scanned, and you should get something like:

| day | count          |
|-----|----------------|
| 27  | some number    |

Additional notes

  • According to AWS, a table in the Glue catalog can have up to 10 million partitions, so 8700 partitions per year would hardly be an issue.
  • AWS does not charge you for DDL statements executed by Athena.
  • If your S3 paths adhere to the Hive convention, i.e. s3://my-bucket/data/year=2019/month=06/day=27/hour=00/, then after you define the table you can simply run MSCK REPAIR TABLE mytable and all partitions will be added to the table in the Glue Data Catalog.
  • For a large number of partitions it is not feasible to run ALTER TABLE mytable ADD PARTITION ... by hand. Instead, you could use:

    1. A Glue Crawler. From my experience, this is only useful when you don't know much about your data and have huge amounts of it. Here is the AWS pricing.
    2. An AWS SDK such as boto3 for Python. It provides APIs for both the Athena and Glue clients.

    For the Athena client, you could generate ALTER TABLE mytable ADD PARTITION ... statements as strings and then send them for execution. Here is a post on Medium that can help you get started.

    You can also use the Glue client to do the same thing with the batch_create_partition or create_partition methods, but these require different inputs than the Athena client.
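As a rough sketch of the boto3 approach, the DDL statements can be generated per day and submitted through the Athena client. The table, bucket, database, and result-location names below are placeholders taken from this question, and the start_query_execution calls are commented out because they need real AWS credentials:

```python
# Sketch: generate one ALTER TABLE ... ADD PARTITION statement per day
# (stopping the LOCATION at the day level), then optionally submit each
# statement through the boto3 Athena client.
from datetime import date, timedelta

def add_partition_ddl(table, bucket, start, days):
    """Yield one ALTER TABLE statement per day, starting at `start`."""
    for i in range(days):
        d = start + timedelta(days=i)
        yield (
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (year='{d:%Y}', month='{d:%m}', day='{d:%d}') "
            f"LOCATION 's3://{bucket}/data/{d:%Y}/{d:%m}/{d:%d}/'"
        )

statements = list(add_partition_ddl("mytable", "my-bucket", date(2019, 6, 27), 2))

# To actually run them (database and OutputLocation are placeholders):
# import boto3
# athena = boto3.client("athena")
# for ddl in statements:
#     athena.start_query_execution(
#         QueryString=ddl,
#         QueryExecutionContext={"Database": "mydatabase"},
#         ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
#     )
```

ADD IF NOT EXISTS makes the script safe to re-run when some partitions were already registered.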

Update 2019-07-03

If your data has structure like

s3://my-bucket/data/2019/06/27/00/00001.json
s3://my-bucket/data/2019/06/27/00/00002.json
s3://my-bucket/data/2019/06/27/01/00001.json
s3://my-bucket/data/2019/06/27/01/00002.json
...
s3://my-bucket/data/2019/06/28/00/00001.json
s3://my-bucket/data/2019/06/28/00/00002.json
s3://my-bucket/data/2019/06/28/01/00001.json
s3://my-bucket/data/2019/06/28/01/00002.json

but you want only three partition columns, i.e. year, month and day, then the definition of your table should take that into account:

CREATE EXTERNAL TABLE `mytable`(
  `labels` array<string>, 
  `userid` string)
PARTITIONED BY (  -- Here we specify only three columns 
  `year` string, 
  `month` string, 
  `day` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='labels,userid') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://my-bucket/data/'

then the DDL statements for adding the partitions should be:

ALTER TABLE mytable
ADD PARTITION (year=2019, month=06, day=27)
LOCATION 's3://my-bucket/data/2019/06/27/';  -- Stop at day level

ALTER TABLE mytable
ADD PARTITION (year=2019, month=06, day=28)
LOCATION 's3://my-bucket/data/2019/06/28/';  -- Stop at day level

Remember that in S3 there is no such thing as folders or directories. This is how I see partitions and locations in the context of Athena, Glue and S3: a partition is an abstraction over a group of S3 objects, where the grouping is defined by filtering all objects against a certain "prefix", i.e. the Location. Thus, when you specify LOCATION, stop at the "day level". You could stop at the "hour level" instead, e.g. s3://my-bucket/data/2019/06/28/01/, but then you would need to create partitions for all the other hours if you want Athena to be able to scan them, which is equivalent to defining hour as a fourth partition key. On top of that, each combination of partition values must be unique, otherwise AWS won't allow you to create the partition.
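The "partition is just a prefix" idea can be illustrated without AWS at all: a partition's members are simply the object keys that start with its LOCATION prefix. A minimal sketch over the hypothetical keys from this question:

```python
# A partition is the set of S3 object keys that share its LOCATION prefix.
keys = [
    "data/2019/06/27/00/00001.json",
    "data/2019/06/27/01/00001.json",
    "data/2019/06/28/00/00001.json",
]

def partition_members(keys, location_prefix):
    """Return the keys that belong to the partition at location_prefix."""
    return [k for k in keys if k.startswith(location_prefix)]

# Stopping the LOCATION at the day level covers every hour beneath it.
day_27 = partition_members(keys, "data/2019/06/27/")

# Stopping at the hour level covers only that single hour.
hour_00 = partition_members(keys, "data/2019/06/27/00/")
```

Here day_27 picks up both hourly files for June 27, while hour_00 matches only the file under hour 00, which is why a day-level LOCATION is enough to cover the hourly subdirectories.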

I just tested this in my AWS account with data that resembles your S3 paths, and I was able to see partitions in the Glue Data Catalog pointing to the correct destinations.




Answer 3:


I have faced the same situation.

I created a Glue Data Catalog table manually for an S3 bucket. The directory has some subdirectories that are not assigned to any partition key. Through the catalog table, an Athena query handles all files, even those in the subdirectories, but a Glue job using create_dynamic_frame.from_catalog does not. After adding additional_options = {"recurse": True} to from_catalog, the Glue job finds the files in the subdirectories.

In my case the catalog table has the partition property "storedAsSubDirectories" = "false", because that property is assigned automatically when I create a catalog table with the Glue console or an Athena DDL query, and I could not change the value in the console. Despite that property, it worked with the additional option recurse=True. I doubt that storedAsSubDirectories works in the literal meaning of the word.

As @3nochroot says, this does not seem to be stated in the official documentation even today.



Source: https://stackoverflow.com/questions/56836447/how-to-access-data-in-subdirectories-for-partitioned-athena-table
