Java Caching frameworks for maintaining huge data

Question

Java Caching frameworks for storing huge data.

Context: We are developing a RESTful service using Jersey 2.6 and will deploy it on WAS 8.5. This service needs to serve more than 10 million requests per day.

We need to implement a cache to store more than 300k objects (the data will come from the DB), and we need some way to update the cache on a daily basis.

Is this approach of caching 300k objects and updating them on a daily basis recommended?
Are there any Java frameworks that support this kind of functionality?

Answer 1:

Your question is too general to get a clear answer. You need to describe the problem you are trying to solve.

Are you concerned about response times?
Are you trying to protect your DB from doing heavy lifting?
Are you expecting to have to scale out, and do you want to be sure that you can deal with future loads?

Additionally some more contextual information would be useful, especially:

How dynamic is your data compared to your requests?
What percentage of your data population will be requested on average per day? (How many of the 3 lakh objects will be enquired upon at least once per day? If you don't know, provide your best guess).

Your figures of 3 lakh (300k) data points and 10M requests mean that you expect to hit each object about 33 times a day on average, which suggests that you are more concerned about back-end DB load than about your responses being right up to date.

In my experience there are a lot of fairly primitive solutions which will work much better than going for a heavyweight distributed system such as Mongo, Cassandra or Coherence.

My first response would be: Keep it simple - 300k objects is not too much to store in an internal hash table which you flush once a day and populate on first request.
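A minimal sketch of that idea, using only the JDK: a `ConcurrentHashMap` that is filled on first request and cleared once a day. The class name `DailyFlushedCache` and the `loader` function (standing in for the DB lookup) are hypothetical, and inside a container such as WAS you would probably want a container-managed scheduler rather than a thread created by hand.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** In-process cache: entries are loaded on first request and dropped once a day. */
class DailyFlushedCache<K, V> implements AutoCloseable {
    private final ConcurrentMap<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // hypothetical DB lookup supplied by the caller
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "cache-flush");
                t.setDaemon(true); // do not keep the JVM alive just for flushing
                return t;
            });

    DailyFlushedCache(Function<K, V> loader) {
        this.loader = loader;
        // First flush after one day, then every day after that.
        scheduler.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.DAYS);
    }

    /** Returns the cached value, calling the loader only on a miss. */
    V get(K key) {
        return entries.computeIfAbsent(key, loader);
    }

    /** Drops every entry; later requests repopulate the cache lazily. */
    void flush() {
        entries.clear();
    }

    int size() {
        return entries.size();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With 300k entries the memory cost is dominated by the cached objects themselves, so it is worth measuring the heap footprint of one object before settling on this.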

If you need to scale horizontally, I would suggest Memcached (via the Spymemcached client) with a one-day cache time, populating an entry whenever you don't find an existing one.

I would NOT go for something like Cassandra or Mongo unless you have real compelling reasons to require a persistent store. Rationale: Purging can become really onerous, especially if your data is fast moving. For example: Cassandra does not really know how to delete, but instead "tombstones" deleted entries, which means that your data store will simply grow and grow until you create a strategy for purging.

Answer 2:

The question is whether the cache must be distributed. Remember the caching is something you have seen. And posting this around for the chance it might be of use... well why.

Distributed cache systems: Redis, Cassandra in memory, MongoDB in memory.

A local RocksDB (it lets you store byte[] -> byte[]) on SSDs makes a fine local cache layer. You might also add a distributed layer on top of it. That is usually better than something off the shelf, and it should also be easy to implement.

10 million requests per day isn't much. Even if they all arrive within 10 hours, that is 1 million per hour, and 1,000,000 / 60 / 60 => about 280 requests per second. Depending on the effort you put in, you can usually manage with an efficient frontend and an efficient backend. We can do 40k pages per second per core, and with 24 cores... you do the math. That is with the data in memory and no caching done.
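The back-of-the-envelope load estimate can be checked with a few lines of arithmetic. This sketch assumes the requests are spread evenly over the chosen window, which real traffic will not be; the class and method names are made up for illustration.

```java
class RequestRate {
    /** Average requests per second when `requests` arrive evenly over `hours` hours. */
    static double perSecond(long requests, double hours) {
        return requests / (hours * 3600.0);
    }

    public static void main(String[] args) {
        long perDay = 10_000_000L;
        // Spread over a full day: about 116 requests per second.
        System.out.printf("over 24 h: %.0f req/s%n", perSecond(perDay, 24));
        // Squeezed into a 10-hour window: about 278 requests per second.
        System.out.printf("over 10 h: %.0f req/s%n", perSecond(perDay, 10));
    }
}
```

Either way the average rate is in the low hundreds of requests per second, so peak behaviour matters more than the daily total.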

Answer 3:

For the caching provider I suggest Coherence. I am using Coherence at my company, and it is very robust and synchronized over multiple clusters.

How to handle the cache depends on the nature of your application. Based on my experience with caching, I decided to update the cache in place in the following scenarios:

Grid paging
Browsing

and to clear the cache and reload the data in these:

Edit item
Add new item
Delete item

I decided this because maintaining the cache entry by entry is an overkill headache that will blow up in your face once you have to handle statistics and nested hierarchies.

Hope this helped you.

Answer 4:

Yes, there are, for example Coherence and Hazelcast. Both are distributed caches. http://java.dzone.com/articles/sneak-peek-jcache-api-jsr-107

In general you should cache what you are using, and the cache should always be in sync, not just refreshed daily. You place the recently used objects in the cache, and you read and write through the cache to your DB.
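As a rough illustration of reading and writing through a cache of recently used objects, here is a JDK-only sketch built on `LinkedHashMap` in access order. The `Store` interface is a hypothetical stand-in for the DB layer, and `ThroughCache` is not the API of Coherence, Hazelcast or any other product; those provide the same pattern with distribution and much better concurrency.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the database access layer. */
interface Store<K, V> {
    V load(K key);
    void save(K key, V value);
}

/** Bounded LRU cache that reads through to, and writes through to, a backing store. */
class ThroughCache<K, V> {
    private final Store<K, V> store;
    private final Map<K, V> lru;

    ThroughCache(Store<K, V> store, int maxEntries) {
        this.store = store;
        // accessOrder = true keeps the least recently used entry first.
        this.lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries; // evict once the bound is exceeded
            }
        };
    }

    /** Read-through: on a miss, load from the store and remember the result. */
    synchronized V get(K key) {
        V value = lru.get(key);
        if (value == null) {
            value = store.load(key);
            if (value != null) {
                lru.put(key, value);
            }
        }
        return value;
    }

    /** Write-through: update the store first, then the cached copy. */
    synchronized void put(K key, V value) {
        store.save(key, value);
        lru.put(key, value);
    }
}
```

Because every write goes to the store before the cache, a cached entry is never newer than the DB; changes made to the DB by other applications would still need invalidation, which this sketch does not attempt.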

Answer 5:

If you have the money, the best one is Coherence (its reputation is proven by its use at big financial companies).

Hazelcast is another distributed in-memory cache you can use; based on performance metrics it is one level below Coherence.

Answer 6:

You could try Ehcache. It can be used as a query cache or even as the Hibernate second-level cache. You can configure how long entities should be stored in the cache before they are invalidated.
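To make the time-based invalidation concrete, here is a small JDK-only sketch of a cache whose entries expire a fixed time after being stored. It is not Ehcache's API or configuration; `TtlCache` and its injectable `clock` are assumptions made so the behaviour is easy to follow and test.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.LongSupplier;

/** Cache whose entries become invalid a fixed time after they were stored. */
class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // e.g. System::currentTimeMillis

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    void put(K key, V value) {
        entries.put(key, new Entry<>(value, clock.getAsLong() + ttlMillis));
    }

    /** Returns the cached value, or null if it is absent or has expired. */
    V get(K key) {
        Entry<V> e = entries.get(key);
        if (e == null) {
            return null;
        }
        if (clock.getAsLong() >= e.expiresAtMillis) {
            entries.remove(key, e); // remove only if nobody replaced it meanwhile
            return null;
        }
        return e.value;
    }
}
```

For the daily refresh in the question, a time to live of 24 hours (`TimeUnit.DAYS.toMillis(1)`) with `System::currentTimeMillis` as the clock would be the obvious starting point.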

Answer 7:

If you already have WebSphere ND 8.5.5, you may take a look at WebSphere eXtreme Scale, which is provided with it. It is a distributed, partitioned caching solution that integrates with WebSphere. See the WebSphere eXtreme Scale overview for more details.

Answer 8:

See the new JCache standard (JSR 107 in the Java Community Process). This API is implemented by Coherence and other caching implementations (ehcache etc.), and also has a small reference implementation that you can use for basic use cases.

Yes, any of the Java caching frameworks should be able to help you. Coherence (note: I work with Coherence at Oracle) for example can definitely handle 3,00,000 items easily (I assume you are from India if you use lakh!), but I suggest only using Coherence if you are deploying this on more than one server.

Source: https://stackoverflow.com/questions/28675644/java-caching-frameworks-for-maintaining-huge-data

Tags

java

caching

ibm-was