maximum size of attributes on AWS SimpleDB

Posted by ⅰ亾dé卋堺 on 2019-12-03 02:50:49

There are ways to store your 10k text data but whether it will be acceptable will depend on what else you need to store and how you plan to use it.

If you need to store arbitrarily large data (especially binary data) then the S3 file pointer can be attractive. The value that SimpleDB adds in this scenario is the ability to run queries against the file metadata that you store in SimpleDB.

For text data limited to 10k I would recommend storing it directly in SimpleDB. It will easily fit in a single item, but because each attribute value is limited to 1024 bytes you'll have to spread it across multiple attributes. There are basically two ways to do this, each with some drawbacks.

One way is more flexible and search friendly but requires you to modify your data. You split your data up into chunks of about 1000 bytes and you store each chunk as an attribute value in a multi-valued attribute. There is no ordering imposed on multi-valued attributes, so you have to prepend each chunk with a number for ordering (e.g. 01).

The fact that you have all the text stored in one attribute makes queries easy to do with a single attribute name in the predicate. You can add a different size text to each item, anywhere from 1k to 200+k, and it gets handled appropriately. But you do have to be aware that your prepended chunk numbers can produce false positives in your queries (e.g. if you are searching for 01, every item will match that query).
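A minimal sketch of the chunk-and-prefix approach described above, in Python. The function names, the two-digit prefix width and the 1000-byte chunk size are assumptions for illustration; nothing here calls SimpleDB, it only prepares the values you would store in the multi-valued attribute and reassembles them after a read.

```python
CHUNK_BYTES = 1000   # payload size; prefix included, this stays under 1024 bytes
PREFIX_WIDTH = 2     # "01".."99" (assumption: at most 99 chunks per item)


def split_text(text, chunk_bytes=CHUNK_BYTES):
    """Split text into numbered chunks of at most chunk_bytes UTF-8 bytes each."""
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        if size + n > chunk_bytes:
            # Close the current chunk before it would overflow.
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    if len(chunks) > 10 ** PREFIX_WIDTH - 1:
        raise ValueError("too many chunks for the chosen prefix width")
    # Prepend a zero-padded sequence number so the order can be restored.
    return [f"{i:0{PREFIX_WIDTH}d}{c}" for i, c in enumerate(chunks, start=1)]


def join_chunks(values):
    """Reassemble the text: sort by the numeric prefix, then strip it."""
    ordered = sorted(values, key=lambda v: int(v[:PREFIX_WIDTH]))
    return "".join(v[PREFIX_WIDTH:] for v in ordered)
```

Splitting on character boundaries rather than raw bytes keeps multi-byte UTF-8 characters intact, at the cost of chunks that may be a few bytes short of the target size.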

The second way to store the text within SimpleDB does not require you to place arbitrary ordering data within your text chunks. You do the ordering by placing each text chunk in a different named attribute. For example you could use attribute names: desc01 desc02 ... desc10. Then you place each chunk in the appropriate attribute. You can still do full text search with both methods but the searches will be slower with this method because you will need to specify many predicates and SimpleDB will end up searching through a separate index for each attribute.
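To make the "many predicates" cost concrete, here is a rough sketch that builds the select expression for a substring search across desc01 ... descNN. The helper names and the domain name are hypothetical; the quoting follows SimpleDB's select syntax as I understand it (backticks around names, single quotes around values, a single quote escaped by doubling it).

```python
def attribute_names(base, count):
    """Return names like desc01, desc02, ... for the numbered chunk attributes."""
    return [f"{base}{i:02d}" for i in range(1, count + 1)]


def quote_value(value):
    """Quote a string literal for a select expression, doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_search(domain, base, count, term):
    """Build one 'like' predicate per chunk attribute, joined with 'or'."""
    pattern = quote_value(f"%{term}%")
    predicates = [f"`{name}` like {pattern}" for name in attribute_names(base, count)]
    return f"select * from `{domain}` where " + " or ".join(predicates)


# Hypothetical usage: ten chunk attributes in a domain called "docs".
expression = build_search("docs", "desc", 10, "availability")
```

With the multi-valued approach the same search needs only one predicate. In both layouts a search term that happens to straddle a chunk boundary will not match, so this is not a true full-text index.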

It may be easy to think of this type of workaround as a hack, because with databases we are used to having this type of low-level detail handled for us within the database. SimpleDB is specifically designed to push this sort of thing out of the database and into the client as a means of providing availability as a first-class feature.

If you found out that a relational database was splitting your text into 1k chunks to store on disk as an implementation detail it wouldn't seem like a hack. The problem is that the current state of SimpleDB clients is such that you have to implement a lot of this type of data formatting yourself. This is the type of thing that ideally will be handled for you in a smart client. There just aren't any smart clients freely available yet.

If you are concerned about cost, you might find that it is cheaper to put the text in S3 and metadata with pointers in SimpleDB.

You could put the 10k text on S3, then create an attribute that has all the unique words of the 10k of text as multiple values. Then searches would be fast. No phrase searching, though.
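A small sketch of the word-index idea, assuming a simple lowercase tokenizer and a hypothetical attribute named `words`. The `max_values` guard is there because SimpleDB caps the number of attribute values per item; check the current limit in the documentation rather than trusting a number from me. The query builder uses the `intersection` operator so that every word must be present on the same item.

```python
import re


def unique_words(text, max_values=None):
    """Return the sorted set of lowercase words found in text."""
    words = sorted(set(re.findall(r"[a-z0-9']+", text.lower())))
    if max_values is not None and len(words) > max_values:
        raise ValueError(f"{len(words)} unique words exceed the limit of {max_values}")
    return words


def build_word_query(domain, attr, words):
    """Build a select that matches items containing all of the given words."""
    predicates = []
    for word in words:
        literal = "'" + word.lower().replace("'", "''") + "'"
        predicates.append(f"`{attr}` = {literal}")
    return f"select itemName() from `{domain}` where " + " intersection ".join(predicates)
```

The item would then carry the S3 key in one attribute and the output of `unique_words` as the values of the `words` attribute. As noted above, this finds words but not phrases.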

How many values can you store in one attribute in one 'row' (name)? I looked in the docs, no answer popped out at me.

--Tom

The upcoming release of Simple Savant (a C# persistence library for SimpleDB which I created) will support both attribute spanning as described by Mocky and full-text searches of SimpleDB data using Lucene.NET.

I realize you are probably not building your app in C#, but since your question is a top result when searching for SimpleDB and full-text indexing it seemed worth mentioning.

UPDATE: The Simple Savant release I mentioned above is now available.

SimpleDB is, well, simple. Everything in it is a string. The documentation is very straightforward. And there are lots of usage restrictions, such as:

  • You can only do a SELECT * FROM ___ WHERE ItemName() IN (...) with 20 ItemNames in the IN.
  • You can only PUT (update) to 25 records at a time.
  • All reads are bounded by computation time. So if you do a SELECT with a LIMIT of 1000 it may return something like 800 rows (or even none) along with a nextToken, which you have to pass in an additional request to get the rest. That next SELECT may itself return up to the full limit, so the sum of the rows returned by the two SELECTs may be greater than your original limit. This is a concern if you are selecting a lot. A SELECT COUNT(*) hits a similar problem: it returns a partial count along with a nextToken, and you need to keep iterating over those nextTokens and sum the returned counts to get the true (total) count.
  • All of these computation times grow as the amount of data in the store grows.
  • If you end up having a large number of records you will likely have to shard your records across multiple domains.
  • Amazon will throttle your requests if you make too many on a single domain.
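The batching and nextToken handling described in the list above can be sketched like this. `run_select` is a hypothetical callable that you would wrap around whatever client library you use; the dictionary shape it returns (`items`, `count`, `next_token`) is an assumption made for this example, not a real client API.

```python
def batches(items, size=25):
    """Yield slices of at most `size` items, matching the 25-record write limit."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def total_count(run_select, expression):
    """Sum the partial counts of a COUNT(*) query across all nextToken pages."""
    total, token = 0, None
    while True:
        page = run_select(expression, token)
        total += page["count"]
        token = page.get("next_token")
        if not token:
            return total


def fetch_up_to(run_select, expression, limit):
    """Collect rows across pages and trim any overshoot past `limit`."""
    rows, token = [], None
    while len(rows) < limit:
        page = run_select(expression, token)
        rows.extend(page["items"])
        token = page.get("next_token")
        if not token:
            break
    return rows[:limit]
```

Trimming in `fetch_up_to` handles the case where the follow-up request returns more rows than were still needed. A real wrapper would also want retry with backoff for throttled requests, which is left out here.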

So, if you plan to use large amounts of string data, or have a lot of records, then you may want to look elsewhere. SimpleDB is very reliable and works as documented, but it can cause lots of headaches.

In your case I'd recommend something like MongoDB. It has its own share of problems as well, but may be better for this case. Though, if you have lots of records (millions and upward) and then try to add indexes to too many of them, you may break it if it's running on spinning disks and not SSDs.
