I'm coming from a relational database background and trying to work with Amazon's DynamoDB.
I have a table with a hash key "DataID" and a range "CreatedAt" and
You can have multiple identical hash keys, but only if you have a range key that varies. Think of it like file formats: you can have two files with the same name in the same folder as long as their formats are different. If their format is the same, their names must be different. The same concept applies to DynamoDB's hash/range keys; just think of the hash as the name and the range as the format.
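To make the file-name analogy concrete, here is a minimal in-memory sketch in plain JavaScript (no AWS calls). The `compositeKey` and `putItem` helpers are hypothetical and only mimic how a hash key plus range key identifies an item; they are not part of any SDK.

```javascript
// In-memory stand-in for a table with a hash key (DataID) and a range key (CreatedAt).
// Hypothetical helpers for illustration only; these are not DynamoDB SDK calls.
const table = new Map();

function compositeKey(item) {
  // Serialising the pair avoids collisions such as ("a", "b|c") vs ("a|b", "c").
  return JSON.stringify([item.DataID, item.CreatedAt]);
}

function putItem(item) {
  // Same hash + same range replaces the existing item; a different range adds a new one.
  table.set(compositeKey(item), item);
}

putItem({ DataID: "abc", CreatedAt: 100, Data: "first" });
putItem({ DataID: "abc", CreatedAt: 200, Data: "second" }); // same hash, new range: stored alongside
putItem({ DataID: "abc", CreatedAt: 200, Data: "third" });  // same hash and range: overwrites "second"

console.log(table.size); // 2
```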
Also, I don't recall if they had these at the time of the OP (I don't believe they did), but they now offer Local Secondary Indexes.
My understanding is that these should now let you perform the desired queries without having to do a full scan. The downside is that local secondary indexes have to be specified at table creation. An item does not have to carry the index's range attribute, but an item without it simply does not appear in that index. The indexes also require additional throughput (though typically not as much as a scan) and storage, so this is not a perfect solution, but it is a viable alternative for some.
I do still recommend Mike Brant's answer as the preferred method of using DynamoDB, though, and I use that method myself. In my case, I have a central table with only a hash key as my ID, plus secondary tables that have a hash and a range key and can be queried; each item in a secondary table then points the code directly to the central table's "item of interest".
Additional information on secondary indexes can be found in Amazon's DynamoDB documentation, for those interested.
Anyway, hopefully this will help anyone else that happens upon this thread.
The approach I followed to solve this problem was to create a Global Secondary Index as shown below. I'm not sure if this is the best approach, but hopefully it is useful to someone.
Hash Key | Range Key
------------------------------------
Date value of CreatedAt | CreatedAt
The HTTP API user has to specify the number of days of data to retrieve; this defaults to 24 hours.
This way, I can always specify the hash key as the current date's day, and the range key can use the > and < operators when retrieving. The data is also spread across multiple shards.
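A rough sketch of how such a request might be built. The attribute name `CreatedDay`, the index name `CreatedDay-CreatedAt-index`, the table name, and the ISO-8601 string format for `CreatedAt` are all assumptions for illustration, since the answer does not give them. The function only builds the parameter object for the low-level `query` call and makes no network request.

```javascript
// Builds parameters for a Query against a hypothetical GSI whose hash key is the
// calendar day (CreatedDay) and whose range key is the full timestamp (CreatedAt).
function buildDayQuery(day, fromIso, toIso) {
  return {
    TableName: "Table",                       // assumed table name
    IndexName: "CreatedDay-CreatedAt-index",  // assumed index name
    // A key condition accepts only one condition on the range key, so a
    // lower and upper bound are expressed as a single (inclusive) BETWEEN.
    KeyConditionExpression: "CreatedDay = :day AND CreatedAt BETWEEN :from AND :to",
    ExpressionAttributeValues: {
      ":day": { S: day },
      ":from": { S: fromIso },
      ":to": { S: toIso }
    }
  };
}

// Derives the day bucket in UTC, e.g. "2015-03-01".
function dayBucket(date) {
  return date.toISOString().slice(0, 10);
}

const from = new Date("2015-03-01T06:00:00Z");
const to = new Date("2015-03-01T18:00:00Z");
const params = buildDayQuery(dayBucket(from), from.toISOString(), to.toISOString());

console.log(params.ExpressionAttributeValues[":day"].S); // "2015-03-01"
```

A window that spans several days would need one such query per day bucket, with the results concatenated by the caller.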
Your hash key (the primary key, of sorts) has to be unique (unless you also have a range key, as stated by others).
In your case, to query your table you should have a secondary index.
| ID | DataID | Created | Data |
|------+--------+---------+------|
| hash | xxxxx | 1234567 | blah |
Your hash key is ID. Your secondary index is defined as DataID-Created-index (that's the name that DynamoDB will use).
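For reference, a table with this layout could be declared roughly as follows. This is a sketch of the `CreateTable` parameter shape only; the table name, the attribute types, and the throughput numbers are placeholder assumptions, not values taken from the question.

```javascript
// Sketch of CreateTable parameters: hash key ID on the table, plus a global
// secondary index keyed on DataID (hash) and Created (range).
// Table name, attribute types and throughput numbers are placeholder assumptions.
const createTableParams = {
  TableName: "Table",
  AttributeDefinitions: [
    // Only attributes that appear in a key schema are declared here.
    { AttributeName: "ID", AttributeType: "S" },
    { AttributeName: "DataID", AttributeType: "S" },
    { AttributeName: "Created", AttributeType: "N" }
  ],
  KeySchema: [{ AttributeName: "ID", KeyType: "HASH" }],
  ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 },
  GlobalSecondaryIndexes: [
    {
      IndexName: "DataID-Created-index",
      KeySchema: [
        { AttributeName: "DataID", KeyType: "HASH" },
        { AttributeName: "Created", KeyType: "RANGE" }
      ],
      Projection: { ProjectionType: "ALL" }, // copy every attribute into the index
      ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 }
    }
  ]
};

console.log(createTableParams.GlobalSecondaryIndexes[0].IndexName);
```

With the AWS SDK for JavaScript (v2), an object like this would be passed to `createTable` on a `DynamoDB` client.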
Then, you can make a query like this:
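var AWS = require("aws-sdk"); // AWS SDK for JavaScript (v2)
var ddb = new AWS.DynamoDB();

var createdAfter = 1234567; // example cutoff, in the same units as Created

var params = {
    TableName: "Table",
    IndexName: "DataID-Created-index",
    KeyConditionExpression: "DataID = :v_ID AND Created > :v_created",
    ExpressionAttributeValues: {
        ":v_ID": {S: "some_id"},
        // Number values are passed to the low-level API as strings
        ":v_created": {N: String(createdAfter)}
    },
    // "Data" is a DynamoDB reserved word, so it is referenced through an alias
    ExpressionAttributeNames: {"#data": "Data"},
    ProjectionExpression: "ID, DataID, Created, #data"
};

ddb.query(params, function(err, data) {
    if (err) {
        console.log(err);
    } else {
        // Query already returns items ordered by the index range key (ascending
        // by default); the explicit sort is kept only as a safeguard
        data.Items.sort(function(a, b) {
            return parseFloat(a.Created.N) - parseFloat(b.Created.N);
        });
        // More code here
    }
});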
Essentially your query looks like:
SELECT * FROM TABLE WHERE DataID = "some_id" AND Created > timestamp;
The secondary index will increase the read/write capacity units required, so you need to take that into account. It is still a lot better than doing a scan, which is costly in reads and in time (a single Scan call also returns at most 1 MB of data, so a large table needs several paginated calls).
This may not be the best way of doing it, but for someone used to relational databases (I'm also used to SQL) it's the fastest way to get productive. Since there are no constraints with regard to schema, you can whip up something that works, and once you have the bandwidth to work on the most efficient way, you can change things around.
Updated Answer:
DynamoDB allows for specification of secondary indexes to aid in this sort of query. Secondary indexes can either be global, meaning that the index spans the whole table across hash keys, or local meaning that the index would exist within each hash key partition, thus requiring the hash key to also be specified when making the query.
For the use case in this question, you would want to use a global secondary index on the "CreatedAt" field.
For more on DynamoDB secondary indexes, see the secondary index documentation.
Original Answer:
DynamoDB does not allow indexed lookups on the range key only. The hash key is required such that the service knows which partition to look in to find the data.
You can of course perform a scan operation to filter by the date value, however this would require a full table scan, so it is not ideal.
If you need to perform an indexed lookup of records by time across multiple primary keys, DynamoDB might not be the ideal service for you to use, or you might need to utilize a separate table (either in DynamoDB or a relational store) to store item metadata that you can perform an indexed lookup against.
Updated Answer: There is no convenient way to do this using DynamoDB queries with predictable throughput. One (suboptimal) option is to use a GSI with an artificial hash key and CreatedAt as the range key. Then query by the hash key alone and set ScanIndexForward to order the results. If you can come up with a natural hash key (say, the category of the item), then this method is a winner. On the other hand, if you keep the same hash key for all items, it will affect throughput, mostly when your data set grows beyond 10 GB (one partition).
Original Answer: You can do this now in DynamoDB by using a GSI. Put a GSI on the "CreatedAt" field and issue queries like (GT some_date). Store the date as a number (milliseconds since epoch) for this kind of query.
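A minimal sketch of the artificial hash key idea, under stated assumptions: the attribute `GsiHash`, its constant value `"ALL"`, the index name `GsiHash-CreatedAt-index`, and the table name are all hypothetical, and `CreatedAt` is assumed to be stored as a number of milliseconds since the epoch. The code only builds the parameter object for a `query` call.

```javascript
// Every item carries the same artificial hash value, so a single Query on the
// GSI can range over CreatedAt across the whole table.
const ARTIFICIAL_HASH = "ALL"; // hypothetical constant written to every item

function buildRecentQuery(sinceDate, newestFirst) {
  return {
    TableName: "Table",                    // assumed table name
    IndexName: "GsiHash-CreatedAt-index",  // assumed index name
    KeyConditionExpression: "GsiHash = :h AND CreatedAt > :since",
    ExpressionAttributeValues: {
      ":h": { S: ARTIFICIAL_HASH },
      // Number attributes are sent as strings in the low-level API.
      ":since": { N: String(sinceDate.getTime()) }
    },
    // false returns items in descending range-key order (newest first).
    ScanIndexForward: !newestFirst
  };
}

const recent = buildRecentQuery(new Date("2015-01-01T00:00:00Z"), true);
console.log(recent.ExpressionAttributeValues[":since"].N); // "1420070400000"
```

Because every item shares one hash value, all index traffic lands on a single partition, which is the throughput concern raised above.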
Details are available here: Global Secondary Indexes - Amazon DynamoDB : http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.Using
This is a very powerful feature. Be aware that the query condition is limited to the operators EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN; see Condition - Amazon DynamoDB : http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Condition.html
You could make the Hash key something along the lines of a 'product category' id, then the range key as a combination of a timestamp with a unique id appended on the end. That way you know the hash key and can still query the date with greater than.
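One way this could look, as a sketch only: the attribute names `CategoryID` and `CreatedKey`, the `#` separator, and the 15-digit padding width are assumptions chosen for illustration. The timestamp is zero-padded so that string comparison on the range key matches chronological order.

```javascript
// Range key = zero-padded epoch milliseconds + "#" + unique id.
// Padding keeps string (lexicographic) order identical to numeric time order.
function makeRangeKey(epochMs, uniqueId) {
  return String(epochMs).padStart(15, "0") + "#" + uniqueId;
}

function buildCategorySinceQuery(categoryId, sinceMs) {
  return {
    TableName: "Table", // assumed table name
    KeyConditionExpression: "CategoryID = :c AND CreatedKey > :since",
    ExpressionAttributeValues: {
      ":c": { S: categoryId },
      // The bare padded timestamp sorts before any key made of that timestamp
      // plus "#id", so this condition effectively means "at or after sinceMs".
      ":since": { S: String(sinceMs).padStart(15, "0") }
    }
  };
}

console.log(makeRangeKey(1420070400000, "a1")); // "001420070400000#a1"
```

The unique id suffix keeps two items created in the same millisecond within one category from overwriting each other.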