I have a fairly large MySQL table: about 30M rows, 6 columns, and about 2 GB when loaded into memory.
I work with both Python and R. In R, I can load the table into memory considerably faster than I have been able to in Python.
Thanks to helpful comments, particularly from @roganjosh, it appears that the issue is that the default MySQL connector is written in pure Python rather than C, which makes it very slow. The solution is to use MySQLdb, which is a native C connector.
In my particular setup, running Python 3 with Anaconda, that wasn't possible because MySQLdb is only supported in Python 2. However, there is an implementation of MySQLdb for Python 3 under the name mysqlclient.
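For anyone setting this up, here is a minimal sketch of connecting through the MySQLdb interface that mysqlclient provides (the credentials and table name below are placeholders, not my actual setup):

```python
# pip install mysqlclient

import MySQLdb  # mysqlclient installs under the classic MySQLdb module name

# Placeholder connection details -- substitute your own
conn = MySQLdb.connect(
    host="localhost",
    user="user",
    passwd="password",
    db="mydb",
)

cur = conn.cursor()
cur.execute("SELECT * FROM my_table")  # my_table stands in for the real table
rows = cur.fetchall()

cur.close()
conn.close()
```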
Using this implementation, the time to read the whole table is now down to about 5 minutes: not as fast as R, but much better than the 40 minutes or so it was taking before.
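If you're reading the table into a pandas DataFrame, one way to make sure the C driver is actually used is to point SQLAlchemy at the mysqldb dialect. A sketch, again with placeholder credentials and table name:

```python
# pip install mysqlclient sqlalchemy pandas

import pandas as pd
from sqlalchemy import create_engine

# "mysql+mysqldb" selects the C-based mysqlclient driver;
# user, password, host, and database here are placeholders.
engine = create_engine("mysql+mysqldb://user:password@localhost/mydb")

df = pd.read_sql("SELECT * FROM my_table", engine)  # my_table is a placeholder
```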
I'm still open to suggestions that would make it faster, but my guess is that this is as good as it's going to get.