Fastest way to subset - data.table vs. MySQL

风格不统一 submitted on 2019-11-30 03:44:54
Matt Dowle

If the data fits in RAM, data.table is faster. If you provide an example it will probably become evident, quickly, that you're using data.table badly. Have you read the "do's and don'ts" on the data.table wiki?

SQL has a lower bound because it is a row store. If the data fits in RAM (and 64-bit gives you quite a lot of it) then data.table is faster, not just because it is in RAM but because columns are contiguous in memory (minimising page fetches from RAM to L2 cache for column operations). Use data.table correctly and it should be faster than SQL's lower bound. This is explained in FAQ 3.1. If you're seeing slower results with data.table, then chances are very high that you're using data.table incorrectly (or there's a performance bug that we need to fix). So please post some tests, after reading the data.table wiki.
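To make the row-store vs. column-store point concrete, here is a rough sketch in plain Python (not R, and not how data.table or MySQL are actually implemented). The toy table, the column names `id`/`grp`/`val`, and both helper functions are made up for illustration. It only shows the difference in layout: a subset on one column touches a single contiguous buffer in the column layout, while the row layout has to walk every whole row. It does not measure cache behaviour or timings.

```python
from array import array

# Hypothetical toy table with three columns and n rows.
n = 1000
rows = [(i, i % 7, float(i)) for i in range(n)]  # row layout: one tuple per row

# Column layout: each column lives in its own contiguous typed buffer.
cols = {
    "id": array("q", (r[0] for r in rows)),
    "grp": array("q", (r[1] for r in rows)),
    "val": array("d", (r[2] for r in rows)),
}

def subset_rows(rows, grp):
    """Row layout: every row object is visited just to test one field."""
    return [r[0] for r in rows if r[1] == grp]

def subset_cols(cols, grp):
    """Column layout: only the 'grp' buffer is scanned; 'id' is read for matches."""
    ids = cols["id"]
    return [ids[i] for i, g in enumerate(cols["grp"]) if g == grp]

matches_from_rows = subset_rows(rows, 3)
matches_from_cols = subset_cols(cols, 3)
same_result = matches_from_rows == matches_from_cols
```

Both functions return the same ids; the difference is only in which memory has to be read to get there, which is the part of the argument above that concerns column operations.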

I am not an R user, but I know a little about databases. I believe that MySQL (or any other reputable RDBMS) will actually perform your subsetting operations faster (by, like, an order of magnitude, usually), barring any additional computation involved in the subsetting process.

I suspect your performance lag on small data sets is related to the expense of the connection and the initial push of the data to MySQL. There is likely a point at which the connection overhead and data transfer time add more to the cost of your operation than MySQL is saving you.

However, for datasets larger than a certain minimum, it seems likely that this cost is compensated for by the sheer speed of the database.

My understanding is that SQL can achieve most fetching and sorting operations much, much more quickly than iterative operations in code. But one must factor in the cost of the connection and (in this case) the initial transfer of data over the network wire.
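The trade-off described in this answer can be written down as a small cost model. This is only a sketch: the function names and every number in it are hypothetical, not measurements of MySQL or data.table. It assumes the database route pays a fixed connection cost plus per-row transfer and query costs, while the in-memory route pays only a per-row cost. Note that a break-even point exists only if the in-memory per-row cost really is higher than the database's; if it is not, as the first answer argues, the database never catches up.

```python
def local_cost(n_rows, per_row_local):
    """Cost of subsetting in memory: no fixed overhead, only per-row work."""
    return n_rows * per_row_local

def db_cost(n_rows, fixed_overhead, per_row_transfer, per_row_db):
    """Cost via the database: connection overhead plus per-row transfer and query work."""
    return fixed_overhead + n_rows * (per_row_transfer + per_row_db)

def break_even_rows(fixed_overhead, per_row_local, per_row_transfer, per_row_db):
    """Row count above which the database route is cheaper, or None if it never is."""
    saving_per_row = per_row_local - (per_row_transfer + per_row_db)
    if saving_per_row <= 0:
        return None  # the in-memory route wins at every size
    return fixed_overhead / saving_per_row

# Hypothetical costs in microseconds, chosen only to illustrate the shape of the curve.
threshold = break_even_rows(50_000, 1.0, 0.2, 0.1)

# If the in-memory per-row cost is lower, there is no break-even point at all.
no_threshold = break_even_rows(50_000, 0.1, 0.2, 0.1)
```

With these made-up inputs the threshold lands at roughly 71,000 rows; plugging in measured costs for your own setup is the only way to find out which case you are actually in.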

I will be interested to hear what others have to say . . .
