pydata blaze: does it allow parallel processing or not?

自作多情 submitted on 2019-12-05 05:15:09

Note: The example below requires the latest version of blaze, which you can install with:

conda install -c blaze blaze

You'll also need the latest version of the nascent into project, which has to be installed from master:

pip install git+git://github.com/ContinuumIO/into.git

You can't do "seamless" parallelization with an arbitrary backend, but the bcolz backend supports parallelization with very little extra code. Here's an example using the NYC Taxi trip/fare dataset.

Note: I've combined the trip and fare datasets into a single dataset with 173,179,759 rows.

In [28]: from blaze import Data, compute

In [29]: ls -d *.bcolz
all.bcolz/  fare.bcolz/ trip.bcolz/

In [30]: d = Data('all.bcolz')

In [31]: d.head(5)
Out[31]:
                          medallion                      hack_license  \
0  89D227B655E5C82AECF13C3F540D4CF4  BA96DE419E711691B9445D6A6307C170
1  0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472
2  0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472
3  DFD2202EE08F7A8DC9A57B02ACB81FE2  51EE87E3205C985EF8431D850C786310
4  DFD2202EE08F7A8DC9A57B02ACB81FE2  51EE87E3205C985EF8431D850C786310

  vendor_id  rate_code store_and_fwd_flag     pickup_datetime  \
0       CMT          1                  N 2013-01-01 15:11:48
1       CMT          1                  N 2013-01-06 00:18:35
2       CMT          1                  N 2013-01-05 18:49:41
3       CMT          1                  N 2013-01-07 23:54:15
4       CMT          1                  N 2013-01-07 23:25:03

     dropoff_datetime  passenger_count  trip_time_in_secs  trip_distance  \
0 2013-01-01 15:18:10                4                382            1.0
1 2013-01-06 00:22:54                1                259            1.5
2 2013-01-05 18:54:23                1                282            1.1
3 2013-01-07 23:58:20                2                244            0.7
4 2013-01-07 23:34:24                1                560            2.1

     ...     pickup_latitude  dropoff_longitude  dropoff_latitude  \
0    ...           40.757977         -73.989838         40.751171
1    ...           40.731781         -73.994499         40.750660
2    ...           40.737770         -74.009834         40.726002
3    ...           40.759945         -73.984734         40.759388
4    ...           40.748528         -74.002586         40.747868

   tolls_amount  tip_amount  total_amount  mta_tax  fare_amount  payment_type  \
0             0           0           7.0      0.5          6.5           CSH
1             0           0           7.0      0.5          6.0           CSH
2             0           0           7.0      0.5          5.5           CSH
3             0           0           6.0      0.5          5.0           CSH
4             0           0          10.5      0.5          9.5           CSH

  surcharge
0       0.0
1       0.5
2       1.0
3       0.5
4       0.5

[5 rows x 21 columns]

To add process-based parallelism, we bring in the Pool class from the multiprocessing stdlib module, and pass the Pool instance's map method as a keyword argument to compute:

In [32]: from multiprocessing import Pool

In [33]: p = Pool()

In [34]: %timeit -n 1 -r 1 values = compute(trip.medallion.distinct())
1 loops, best of 1: 1min per loop

In [35]: %timeit -n 1 -r 1 values = compute(trip.medallion.distinct(), map=p.map)
1 loops, best of 1: 16.2 s per loop

So, going from about 1 minute to 16.2 seconds is roughly a 3.7x speedup for an extra line of code. Note that this is a string column, and string columns tend to be much slower to process than other types. A distinct expression computed over an integer column finishes in about 1 second with multiple cores (vs. about 3.3 seconds on one), which is roughly the same relative improvement:

In [38]: %timeit -n 1 -r 1 values = compute(trip.passenger_count.distinct())
1 loops, best of 1: 3.33 s per loop

In [39]: %timeit -n 1 -r 1 values = compute(trip.passenger_count.distinct(), map=p.map)
1 loops, best of 1: 1.01 s per loop
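
To give some intuition for why handing compute a different map function helps, here is a minimal sketch of a chunked distinct: each chunk is reduced to a set independently, and the partial sets are merged at the end. This is only an illustration of the general idea, not blaze's actual implementation. The names chunk_distinct and chunked_distinct and the toy chunks are hypothetical and do not come from blaze or bcolz.

```python
from multiprocessing import Pool


def chunk_distinct(chunk):
    """Return the set of unique values in a single chunk."""
    return set(chunk)


def chunked_distinct(chunks, map=map):
    """Compute the distinct values across all chunks.

    `map` defaults to the built-in map (serial). Passing Pool.map
    instead spreads the per-chunk work over worker processes.
    """
    partial_sets = map(chunk_distinct, chunks)
    result = set()
    for partial in partial_sets:
        result |= partial  # merge the per-chunk results
    return result


if __name__ == "__main__":
    # Hypothetical toy data standing in for the chunks of an on-disk column.
    chunks = [[1, 2, 2, 3], [3, 4, 4], [1, 5]]

    serial = chunked_distinct(chunks)  # built-in map, one process

    with Pool(2) as pool:
        parallel = chunked_distinct(chunks, map=pool.map)  # worker processes

    assert serial == parallel
    print(sorted(parallel))  # prints [1, 2, 3, 4, 5]
```

The per-chunk function has to be defined at module level so that multiprocessing can pickle it and send it to the workers. The speedup you get in practice depends on how expensive each chunk is relative to the cost of shipping partial results back to the parent process.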