I have been testing how to use Dask (a cluster with 20 cores), and I am surprised by the speed I get when calling len versus slicing through loc.
import pandas as pd
import dask.dataframe as dd

log = pd.read_csv('800000test', sep='\t')
logd = dd.from_pandas(log, npartitions=20)
print(len(logd))   # much slower than slicing with loc?
Good question. This gets at a few points about when data is moving up to the cluster and back down to the client (your Python session). Let's look at a few stages of your computation.
This is a Pandas dataframe in your Python session, so it's obviously still in your local process.
log = pd.read_csv('800000test', sep='\t') # on client
This breaks up your Pandas dataframe into twenty Pandas dataframes, however these are still on the client. Dask dataframes don't eagerly send data up to the cluster.
logd = dd.from_pandas(log, npartitions=20) # still on client
Calling len actually causes computation here (normally you would use df.some_aggregation().compute()). So now Dask kicks in. First it moves your data out to the cluster (slow), then it calls len on all of the 20 partitions (fast), it aggregates those results (fast), and then it moves the result down to your client so that it can print.
print(len(logd)) # costly roundtrip client -> cluster -> client
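Under the hood, that len is roughly a per-partition count followed by a sum. A sketch of the same thing spelled out by hand (my decomposition for illustration, not Dask's literal implementation):

print(logd.map_partitions(len).compute().sum())  # count each partition on the cluster, sum the small results locally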
So the problem here is that our dask.dataframe still had all of its data in the local Python session.
It would have been much faster to use, say, the local threaded scheduler rather than the distributed scheduler. This should compute in milliseconds:

import dask

with dask.set_options(get=dask.threaded.get):  # no cluster, just local threads
    print(len(logd))                           # stays on client
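(Aside: dask.set_options(get=...) is the older interface; on newer Dask versions the same switch is, to my knowledge, spelled through dask.config.set:)

with dask.config.set(scheduler='threads'):  # newer equivalent of the above
    print(len(logd))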
But presumably you want to know how to scale out to larger datasets, so let's do this the right way.
Instead of loading with Pandas on your client/local session, let the Dask workers load bits of the CSV file. This way, no client-worker communication is necessary.
# log = pd.read_csv('800000test', sep='\t') # on client
log = dd.read_csv('800000test', sep='\t') # on cluster workers
However, unlike pd.read_csv, dd.read_csv is lazy, so this returns almost immediately. We can force Dask to actually do the computation with the persist method:
log = client.persist(log) # triggers computation asynchronously
Now the cluster kicks into action and loads your data directly in the workers. This is relatively fast. Note that persist returns immediately while work happens in the background. If you want to wait until it finishes, call wait:
from dask.distributed import wait
wait(log) # blocks until read is done
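If you'd rather watch the load than block on it, distributed also has a progress helper:

from dask.distributed import progress
progress(log)  # shows a progress bar while the background load runs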
If you're testing with a small dataset and want to get more partitions, try changing the blocksize.
log = dd.read_csv(..., blocksize=1000000) # 1 MB blocks
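You can check how many partitions that gives you:

print(log.npartitions)  # roughly file size divided by blocksize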
Regardless, operations on log should now be fast:

len(log)  # fast
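The same goes for other aggregations, which run on the workers and only ship a small result back. For example, assuming a hypothetical numeric column named value:

print(log['value'].mean().compute())  # computed on the workers; only a scalar comes back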
In response to a question on this blog post, here are the assumptions that we're making about where the file lives.
Generally when you provide a filename to dd.read_csv it assumes that the file is visible from all of the workers. This is true if you are using a network file system, or a global store like S3 or HDFS. If you are using a network file system then you will want to either use absolute paths (like /path/to/myfile.*.csv) or else ensure that your workers and client have the same working directory.
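For example (the paths and bucket name here are made up):

log = dd.read_csv('/shared/data/800000test', sep='\t')  # absolute path on a network file system
log = dd.read_csv('s3://my-bucket/logs/*.csv')          # global store; needs the s3fs package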
If this is not the case, and your data is only on your client machine, then you will have to load and scatter it out.
The simple way is just to do what you did originally, but persist your dask.dataframe:
log = pd.read_csv('800000test', sep='\t') # on client
logd = dd.from_pandas(log, npartitions=20) # still on client
logd = client.persist(logd) # moves to workers
This is fine, but results in slightly less-than-ideal communication.
Instead, you might scatter your data out to the cluster explicitly:
[future] = client.scatter([log])
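The returned future can then be passed to functions that run next to the data. A minimal sketch, using len just for illustration:

result = client.submit(len, future)  # runs on the worker that holds the scattered data
print(result.result())               # only the small result travels back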
This gets into a more complex API, though, so I'll just point you to the docs:
http://distributed.readthedocs.io/en/latest/manage-computation.html
http://distributed.readthedocs.io/en/latest/memory.html
http://dask.pydata.org/en/latest/delayed-collections.html