Design a system supporting massive data storage and query


I was asked by the interviewer to design a system to store gigabytes of data and the system also has to support some kind of query.


4 Answers

夕颜

2020-12-23 09:01

In my opinion, you should build a B+ tree keyed on time, so that you can quickly locate the on-disk range of records for a given time period (t1, t2). Then use the records within (t1, t2) to build an IP hash table and a URL hash table.
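A minimal in-memory sketch of this idea, assuming records are `(timestamp, ip, url)` tuples already sorted by time. A sorted list searched with `bisect` stands in for the on-disk B+ tree, the window bounds are treated as inclusive, and the sample data and function names are made up for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical sample log, sorted by timestamp: (timestamp, ip, url).
RECORDS = [
    (100, "10.0.0.1", "/home"),
    (105, "10.0.0.2", "/home"),
    (110, "10.0.0.1", "/about"),
    (120, "10.0.0.1", "/home"),
    (130, "10.0.0.3", "/contact"),
]
TIMESTAMPS = [r[0] for r in RECORDS]


def records_in_window(t1, t2):
    """Return records with t1 <= timestamp <= t2 using binary search.

    The sorted list stands in for a B+ tree range scan on disk.
    """
    lo = bisect_left(TIMESTAMPS, t1)
    hi = bisect_right(TIMESTAMPS, t2)
    return RECORDS[lo:hi]


def build_hash_tables(window):
    """Build per-IP and per-URL hash tables from the records in a window."""
    urls_by_ip = defaultdict(set)   # ip -> distinct URLs visited
    hits_by_url = defaultdict(int)  # url -> number of visits
    for _, ip, url in window:
        urls_by_ip[ip].add(url)
        hits_by_url[url] += 1
    return urls_by_ip, hits_by_url


urls_by_ip, hits_by_url = build_hash_tables(records_in_window(100, 120))
distinct_urls_for_ip = len(urls_by_ip["10.0.0.1"])
visits_for_url = hits_by_url["/home"]
```

Note that the hash tables are rebuilt per query here, so the cost still grows with the number of records inside the window.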

挽巷

2020-12-23 09:08

It should be an interval tree implemented as a B-tree: an interval tree because every query takes a time interval as its input, and a B-tree because of the size of the input (billions of records).

再見小時候

2020-12-23 09:12

I believe the interviewer was expecting a solution based on distributed computing, especially when "100 billion records" are involved. With my limited knowledge of distributed computing, I would suggest looking into distributed hash tables and MapReduce (for parallel query processing).
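To make the map-reduce suggestion concrete, here is a single-process sketch, assuming the log is split into partitions of `(timestamp, ip, url)` tuples. Each partition plays the role of one node; the partition data and function names are hypothetical, and a real deployment would run the map step on separate machines.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions, as if the log were sharded across three nodes.
PARTITIONS = [
    [(100, "10.0.0.1", "/home"), (105, "10.0.0.2", "/home")],
    [(110, "10.0.0.1", "/about"), (120, "10.0.0.1", "/home")],
    [(130, "10.0.0.3", "/contact")],
]


def map_partition(partition, t1, t2):
    """Map step: count URL hits inside the window for one partition."""
    return Counter(url for ts, _, url in partition if t1 <= ts <= t2)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def url_hits(t1, t2):
    """Run the map step on every partition, then fold the partial results."""
    partials = (map_partition(p, t1, t2) for p in PARTITIONS)
    return reduce(reduce_counts, partials, Counter())


total_hits = url_hits(100, 125)
```

Because each partial count depends only on its own partition, the map step can run in parallel and only the small per-URL counters need to cross the network.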

有刺的猬

2020-12-23 09:12

Old question, but it was recently bumped, so here are a few other things to think about:

You need to consider a few simple boundary conditions beyond your listed requirements, assuming you have no further indexes:

First, given a time period (t1, t2) and an IP, query how many URLs this IP has visited in the given period.

If you have 10k users, then at worst a scan of all records in a time window returns only 1 in every 10k records accessed (on average).

Second, given a time period (t1, t2) and a URL, query how many times this URL has been visited.

Depending on how many URLs you have in the system, say 1,000, this again means that a simple scan returns only 1 of every 1,000 records scanned; the other 999 are discarded.

Let's say you have only 100,000 unique URLs. You could greatly reduce the space consumed by the database by storing a GUID or integer foreign key instead of the URL itself. It also means the average URL is accessed 1M times across your 100Bn records.
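The foreign-key idea amounts to dictionary encoding. A small sketch, assuming an in-memory lookup table; the `UrlDictionary` class is a hypothetical helper, not part of any library:

```python
class UrlDictionary:
    """Assigns a small integer ID to each distinct URL (hypothetical helper)."""

    def __init__(self):
        self._id_by_url = {}
        self._url_by_id = []

    def encode(self, url):
        """Return the existing ID for url, or allocate the next free one."""
        url_id = self._id_by_url.get(url)
        if url_id is None:
            url_id = len(self._url_by_id)
            self._id_by_url[url] = url_id
            self._url_by_id.append(url)
        return url_id

    def decode(self, url_id):
        """Map an ID back to the original URL string."""
        return self._url_by_id[url_id]


urls = UrlDictionary()
encoded = [urls.encode(u) for u in ["/home", "/about", "/home"]]

# Average accesses per URL for the figures quoted above.
avg_hits_per_url = 100_000_000_000 // 100_000
```

Each log record then stores a fixed-width integer, and the URL text is kept once in the lookup table.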

Even all of this tells us nothing conclusive, because we don't have numbers or statistics on how clustered in time the records are for the given search windows. Are we getting 1,000 page requests every second and searching a 12-month time range, or are we getting 100 requests per second and searching a 1-hour time block (360k requests)?

Assuming the 100Bn records represent 12 months of data, that's about 3,170 requests per second. Does that sound reasonable?
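The back-of-the-envelope numbers can be checked with a few lines, assuming a 365-day year and an evenly spread load (both simplifications):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def requests_per_second(total_records, years):
    """Average request rate if total_records were spread evenly over years."""
    return total_records / (years * SECONDS_PER_YEAR)


def requests_in_window(rate_per_second, window_seconds):
    """Number of records a time-window scan would touch at a steady rate."""
    return rate_per_second * window_seconds


rate_one_year = requests_per_second(100_000_000_000, 1)
one_hour_at_100_rps = requests_in_window(100, 60 * 60)
```

Real traffic is rarely this uniform, so peak rates are likely to be well above the average.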

Why is this important? Because it highlights one key thing you overlooked in your answer.

With 100Bn records in the past 12 months, you'll have 200Bn records to deal with in another 12 months. If the 100Bn records cover 20 years, it's not such an issue: you can expect to grow by only another 25-30Bn in the next 5 years. But it's unlikely that your existing data spans such a long time frame.

Your solution only answers one side of the equation (reading data); you don't consider any complications with writing that much data. The vast majority of the time you will be inserting data into whatever data store you create. Will it be able to handle a constant 3k insert requests per second?

If you insert 3k records per second and each record is just three 64-bit integers representing the time (in ticks), the IP address, and a foreign key to the URL, then that is only ~75 KB/s of writes, which is fine to sustain. If every URL is assumed to be unique, you could easily run into performance issues due to I/O speeds (ignoring the space requirements).
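A sketch of that fixed-width record using Python's `struct` module. The field order, the little-endian layout, and the sample values are assumptions for illustration:

```python
import struct

# Three 64-bit values: time in ticks, IP address, URL foreign key.
# "<" means little-endian with no padding; q is signed, Q is unsigned 64-bit.
RECORD_FORMAT = "<qQQ"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 24 bytes


def pack_record(ticks, ip_as_int, url_id):
    """Serialise one log record into its fixed 24-byte form."""
    return struct.pack(RECORD_FORMAT, ticks, ip_as_int, url_id)


def write_rate_bytes_per_second(records_per_second):
    """Raw payload written per second, excluding index and storage overhead."""
    return records_per_second * RECORD_SIZE


record = pack_record(637_442_000_000_000_000, 0x0A000001, 42)
bytes_per_second = write_rate_bytes_per_second(3170)
```

At 3,170 records per second this gives 76,080 bytes per second of payload, in line with the ~75 KB/s estimate. Index maintenance would add to that.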

One other thing the interviewer would be interested in seeing is your thoughts on supporting IPv6.
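One way to think about IPv6: an IPv6 address is 128 bits, so it no longer fits in a single 64-bit integer field. The sketch below, using the standard `ipaddress` module, normalises both address families to a 16-byte key by storing IPv4 addresses in IPv4-mapped form (`::ffff:a.b.c.d`). The function name and the choice of a 16-byte key are assumptions, not something the question specifies.

```python
import ipaddress

# Prefix of an IPv4-mapped IPv6 address: 80 zero bits followed by 16 one bits.
IPV4_MAPPED_PREFIX = b"\x00" * 10 + b"\xff\xff"


def ip_to_16_bytes(text):
    """Return a 16-byte key for an IPv4 or IPv6 address string."""
    addr = ipaddress.ip_address(text)
    if addr.version == 4:
        return IPV4_MAPPED_PREFIX + addr.packed
    return addr.packed


v4_key = ip_to_16_bytes("10.0.0.1")
v6_key = ip_to_16_bytes("2001:db8::1")

# An IPv6 address generally does not fit in a 64-bit integer column.
fits_in_64_bits = int(ipaddress.ip_address("2001:db8::1")) < 2**64
```

The trade-off is that every record grows by 8 bytes compared with a 64-bit IP field, which matters at 100Bn rows.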

Lastly, if you provided a solution like yours, the interviewer should have asked a follow-up question: "How would your system perform if I now want to know when a specific IP address last accessed a specific URL?"
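With only a time-keyed index, that follow-up means scanning backwards from the newest record until a match is found. One possible answer, sketched here as an assumption rather than a recommendation, is to maintain a secondary index keyed on the (IP, URL) pair at write time; the names below are hypothetical.

```python
# Hypothetical secondary index: (ip, url) -> most recent access timestamp.
last_access = {}


def record_visit(ts, ip, url):
    """Update the index on every insert, keeping only the latest timestamp."""
    key = (ip, url)
    if ts > last_access.get(key, float("-inf")):
        last_access[key] = ts


def last_seen(ip, url):
    """Return the latest timestamp for the pair, or None if never seen."""
    return last_access.get((ip, url))


for ts, ip, url in [
    (100, "10.0.0.1", "/home"),
    (120, "10.0.0.1", "/home"),
    (110, "10.0.0.1", "/about"),
]:
    record_visit(ts, ip, url)
```

The lookup becomes a single hash probe, at the cost of extra work on every write and an index whose size grows with the number of distinct (IP, URL) pairs.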

So yes, if you don't know about MapReduce and other distributed query processing systems, then yours should be a reasonable answer.
