dask | 易学教程

Dask-distributed. How to get task key ID in the function being calculated?

Question: My computations with dask.distributed create intermediate files whose names include a UUID4 that identifies that chunk of work.

pairs = '{}\n{}\n{}\n{}'.format(list1, list2, list3, ...)
file_path = os.path.join(job_output_root, 'pairs', 'pairs-{}.txt'.format(str(uuid.uuid4()).replace('-', '')))
file(file_path, 'wt').writelines(pairs)

At the same time, all tasks in the dask distributed cluster have unique keys, so it would be natural to use that key ID for the file name. Is it

Specify how to partition dask dataframe?

Question: I have a pandas df that's indexed by id and date. I would like to run some regressions for each id in parallel using dask. I know dask splits the df into N partitions, but is there a way to force it to split by the id column? That way, when I call map_partitions, I can simply apply my rolling regression function to each partition.

Source: https://stackoverflow.com/questions/51698459/specify-how-to-partition-dask-dataframe

Find maximum value of each day from hourly data

Question: I have a problem getting the max value of each day from hourly data. The original file contains 24 data points per name per day (and there are many names). As an example, here are the data points for one name:

Start Time       Period  name         value
2/23/2019 0:00   60      MBTS_H2145X  100
2/23/2019 1:00   60      MBTS_H2145X  100
2/23/2019 2:00   60      MBTS_H2145X  1
2/23/2019 3:00   60      MBTS_H2145X  1
2/23/2019 4:00   60      MBTS_H2145X  1
2/23/2019 5:00   60      MBTS_H2145X  2324
2/23/2019 6:00   60      MBTS_H2145X  2323
2/23/2019 7:00   60      MBTS_H2145X  2323
2/23/2019 8:00
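The grouping the question asks for can be sketched with the standard library alone. This is only an illustration under stated assumptions: the rows are taken from the excerpt above, while the tuple layout, the `ROWS` name, and the `daily_max` helper are hypothetical, and the asker's real data most likely lives in a file or a dataframe rather than a Python list.

```python
from datetime import datetime

# Sample rows copied from the excerpt: (start time, period, name, value).
ROWS = [
    ("2/23/2019 0:00", 60, "MBTS_H2145X", 100),
    ("2/23/2019 1:00", 60, "MBTS_H2145X", 100),
    ("2/23/2019 2:00", 60, "MBTS_H2145X", 1),
    ("2/23/2019 3:00", 60, "MBTS_H2145X", 1),
    ("2/23/2019 4:00", 60, "MBTS_H2145X", 1),
    ("2/23/2019 5:00", 60, "MBTS_H2145X", 2324),
    ("2/23/2019 6:00", 60, "MBTS_H2145X", 2323),
    ("2/23/2019 7:00", 60, "MBTS_H2145X", 2323),
]


def daily_max(rows):
    """Return {(name, date): max value} for hourly rows (hypothetical helper)."""
    best = {}
    for start, _period, name, value in rows:
        # Drop the time of day so that all hours of one day share a key.
        day = datetime.strptime(start, "%m/%d/%Y %H:%M").date()
        key = (name, day)
        if key not in best or value > best[key]:
            best[key] = value
    return best


result = daily_max(ROWS)
```

With a dataframe library, the same idea would be a group-by on name and calendar day followed by a max.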

Python futures and tuple unpacking

Question: What is an elegant/idiomatic way to achieve something like tuple unpacking with futures? I have code like

a, b, c = f(x)
y = g(a, b)
z = h(y, c)

and I would like to convert it to use futures. Ideally I would like to write something like

a, b, c = ex.submit(f, x)
y = ex.submit(g, a, b)
z = ex.submit(h, y, c)

The first line of that throws TypeError: 'Future' object is not iterable, though. How can I get a, b, c without having to make 3 additional ex.submit calls? i.e. I would like to avoid having
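One possible approach, sketched here with the standard library's `concurrent.futures` (the excerpt does not say which executor `ex` is, so this is an assumption), is to fan a single future out into several futures through a done-callback instead of extra `submit` calls. The `unpack` helper and the toy `f`, `g`, `h` bodies are hypothetical. Note that creating `Future` objects by hand is normally reserved for executor implementations and tests.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def unpack(fut, n):
    """Return n futures, one per element of the tuple that `fut` produces."""
    parts = [Future() for _ in range(n)]

    def _distribute(done):
        try:
            values = tuple(done.result())
            if len(values) != n:
                raise ValueError(f"expected {n} values, got {len(values)}")
        except Exception as exc:
            # Propagate the failure to every derived future.
            for part in parts:
                part.set_exception(exc)
        else:
            for part, value in zip(parts, values):
                part.set_result(value)

    # Runs once `fut` finishes (immediately if it is already done).
    fut.add_done_callback(_distribute)
    return parts


# Toy stand-ins for the f, g, h of the question.
def f(x):
    return x, x + 1, x + 2


def g(a, b):
    return a + b


def h(y, c):
    return y * c


def run_pipeline(x):
    with ThreadPoolExecutor(max_workers=2) as ex:
        a, b, c = unpack(ex.submit(f, x), 3)
        # Plain executors do not resolve futures passed as arguments,
        # so the results are fetched explicitly here.
        y = ex.submit(g, a.result(), b.result())
        z = ex.submit(h, y.result(), c.result())
        return z.result()
```

Calling `run_pipeline(1)` computes `f(1) = (1, 2, 3)`, then `g(1, 2) = 3`, then `h(3, 3) = 9`. Schedulers that accept futures as task arguments would not need the explicit `.result()` calls.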

Load many feather files in a folder into dask

Question: Given a folder with many .feather files, I would like to load all of them into dask in Python. So far, I have tried the following, sourced from a similar question on GitHub (https://github.com/dask/dask/issues/1277):

files = [...]
dfs = [dask.delayed(feather.read_dataframe)(f) for f in files]
df = dd.concat(dfs)

Unfortunately, this gives me the error TypeError: Truth of Delayed objects is not supported, which is mentioned there, but a workaround is not clear. Is it possible to do the above in dask?

Keep indices in Pandas DataFrame with a certain number of non-NaN entries

Question: Let's say I have the following dataframe:

df1 = pd.DataFrame(data = [1,np.nan,np.nan,1,1,np.nan,1,1,1], columns = ['X'], index = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'])
print(df1)

     X
a  1.0
a  NaN
a  NaN
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

I want to keep only the indices which have 2 or more non-NaN entries. In this case, the 'a' entries have only one non-NaN value, so I want to drop them and have my result be:

     X
b  1.0
b  1.0
b  NaN
c  1.0
c  1.0
c  1.0

What is the best way to do this? Ideally I

24 Techniques to Speed Up Your Python

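I. Profiling code run time
  Technique 1: Measure code run time (plain method; shortcut method in Jupyter)
  Technique 2: Measure the average run time over multiple runs (plain method; shortcut method in Jupyter)
  Technique 3: Profile run time by function call (plain method; shortcut method in Jupyter)
  Technique 4: Profile run time line by line (plain method; shortcut method in Jupyter)
II. Speeding up lookups
  Technique 5: Use a set rather than a list for lookups (slow method; fast method)
  Technique 6: Use a dict rather than two lists for matched lookups (slow method; fast method)
III. Speeding up loops
  Technique 7: Prefer for loops over while loops (slow method; fast method)
  Technique 8: Avoid repeated computation in the loop body (slow method; fast method)
IV. Speeding up functions
  Technique 9: Replace recursive functions with loops (slow method; fast method)
  Technique 10: Speed up recursive functions with caching (slow method; fast method)
  Technique 11: Speed up Python functions with numba (slow method; fast method)
V. Speeding up with standard-library functions
  Technique 12: Use collections.Counter to speed up counting (slow method; fast method)
  Technique 13: Use collections.ChainMap to speed up dictionary merging (slow method; fast method)
VI. Speeding up with numpy vectorization
  Technique 14: Use np.array instead of list (slow method; fast method)
  Technique 15: Use np.ufunc instead of math.func (slow method; fast method)
  Technique 16: Use np.where instead of if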
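The outline gives only the technique names, so the following is a minimal sketch of what techniques 5, 6 and 10 might look like, not the article's own code. All function names are hypothetical, and the actual speedup depends on data size and Python version.

```python
from functools import lru_cache


# Technique 5: membership tests against a list scan every element,
# while a set uses hashing.
def find_common_slow(items, targets):
    target_list = list(targets)
    return [x for x in items if x in target_list]


def find_common_fast(items, targets):
    target_set = set(targets)
    return [x for x in items if x in target_set]


# Technique 6: two parallel lists need a linear search for the position,
# while a dict maps the key to its value directly.
def lookup_slow(keys, values, wanted):
    return values[keys.index(wanted)]


def lookup_fast(table, wanted):
    return table[wanted]


# Technique 10: caching turns the exponential naive recursion into a
# linear number of distinct calls.
@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)


keys = ["a", "b", "c"]
values = [1, 2, 3]
table = dict(zip(keys, values))
```

Both variants of each pair return the same result, so the fast one can usually be swapped in without changing behavior, provided the elements are hashable.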

24 Techniques to Speed Up Your Python

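I. Profiling code run time
  Technique 1: Measure code run time (plain method; shortcut method in Jupyter)
  Technique 2: Measure the average run time over multiple runs (plain method; shortcut method in Jupyter)
  Technique 3: Profile run time by function call (plain method; shortcut method in Jupyter)
  Technique 4: Profile run time line by line (plain method; shortcut method in Jupyter)
II. Speeding up lookups
  Technique 5: Use a set rather than a list for lookups (slow method; fast method)
  Technique 6: Use a dict rather than two lists for matched lookups (slow method; fast method)
III. Speeding up loops
  Technique 7: Prefer for loops over while loops (slow method; fast method)
  Technique 8: Avoid repeated computation in the loop body (slow method; fast method)
IV. Speeding up functions
  Technique 9: Replace recursive functions with loops (slow method; fast method)
  Technique 10: Speed up recursive functions with caching (slow method; fast method)
  Technique 11: Speed up Python functions with numba (slow method; fast method)
V. Speeding up with standard-library functions
  Technique 12: Use collections.Counter to speed up counting (slow method; fast method)
  Technique 13: Use collections.ChainMap to speed up dictionary merging (slow method; fast method)
VI. Speeding up with higher-order functions
  Technique 14: Use map instead of a comprehension (slow method; fast method)
  Technique 15: Use filter instead of a comprehension (slow method; fast method)
VII. Speeding up with numpy vectorization
  Technique 16: Use np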
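As an illustration of techniques 12 to 15 in this version of the outline, here is a short sketch using only the standard library. The sample data and variable names are made up, and the code is not taken from the article. Whether `map` or `filter` actually beats a comprehension depends on the callable and the Python version, so it is worth measuring before relying on it.

```python
from collections import ChainMap, Counter

words = ["a", "b", "a", "c", "a", "b"]


# Technique 12: manual counting loop versus collections.Counter.
def count_slow(items):
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts


def count_fast(items):
    return Counter(items)


# Technique 13: ChainMap builds a view over several dicts without copying.
# Unlike dict.update, the FIRST mapping that has a key wins.
dic_a = {"x": 1, "y": 2}
dic_b = {"y": 20, "z": 30}
merged = ChainMap(dic_a, dic_b)

# Techniques 14 and 15: map/filter equivalents of simple comprehensions.
squares_comprehension = [v * v for v in range(5)]
squares_map = list(map(lambda v: v * v, range(5)))
evens_comprehension = [v for v in range(10) if v % 2 == 0]
evens_filter = list(filter(lambda v: v % 2 == 0, range(10)))
```

Because `ChainMap` is a view, later changes to `dic_a` or `dic_b` show up in `merged`; call `dict(merged)` if an independent copy is needed.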