问题
I am implementing CUDF to speed up my python process. Firstly, I import CUDF and removed multiprocessing code, and initialize variables with CUDF. After changing into CUDF it gives a dictionary error.
How I can remove these loops to make effective implementation?
Code
import more_itertools
import pandas as pd
import numpy as np
import itertools
from os import cpu_count
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc
import matplotlib.pyplot as plt
import json
import os
import gc
from tqdm import tqdm
import cudf
gc.collect()
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
import logging
mpl_logger = logging.getLogger('matplotlib')
mpl_logger.setLevel(logging.WARNING)
with open(Path(__file__).parent / "ageDB.json", "r") as f:
identities = json.load(f)
positives = cudf.DataFrame()
for value in tqdm(identities.values(), desc="Positives"):
positives = positives.append(cudf.DataFrame(itertools.combinations(value, 2), columns=["file_x", "file_y"]),
ignore_index=True)
positives["decision"] = "Yes"
print(positives)
samples_list = list(identities.values())
negatives = cudf.DataFrame()
######################====================Functions=============##############
def compute_cross_samples(x):
return cudf.DataFrame(itertools.product(*x), columns=["file_x", "file_y"])
####################################
if Path("positives_negatives.csv").exists():
df = cudf.read_csv("positives_negatives.csv")
else:
for combos in tqdm(more_itertools.ichunked(itertools.combinations(identities.values(), 2), cpu_count())):
for cross_samples in (compute_cross_samples, combos):
negatives = negatives.append(cross_samples)
negatives["decision"] = "No"
negatives = negatives.sample(positives.shape[0])
df = cudf.concat([positives, negatives]).reset_index(drop=True)
df.to_csv("positives_negatives.csv", index=False)
df.file_x = "deepface/tests/dataset/" + df.file_x
df.file_y = "deepface/tests/dataset/" + df.file_y
Tracback
Traceback (most recent call last):
File "Ensemble-Face-Recognition.py", line 36, in <module>
positives = positives.append(cudf.DataFrame(itertools.combinations(value, 2), columns=["file_x", "file_y"]),
File "/home/khawar/anaconda3/envs/rapids-0.17/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/khawar/anaconda3/envs/rapids-0.17/lib/python3.7/site-packages/cudf/core/dataframe.py", line 289, in __init__
raise TypeError("data must be list or dict-like")
TypeError: data must be list or dict-like
回答1:
itertools.combinations returns a generator, so you need to explicitly call list to get a list-like value
cudf.DataFrame(list(itertools.combinations(value, 2)) . . .
As an aside, I'm not sure if this is the case in cudf, but in pandas, it is much faster to create a list of dataframes and concatenate them at the end then it is to create an empty dataframe and then to continually append to it. Your loop continuously sets positives to an appended dataframe.
回答2:
Formatting your last comment:
positives = positives.append(cudf.DataFrame(
list(itertools.combinations(value, 2), columns=["file_x", "file_y"]),
TypeError: list() takes no keyword arguments
Because of how you pair (), columns
is taken as an argument to list
, not to DataFrame
.
来源:https://stackoverflow.com/questions/66073491/typeerror-data-must-be-list-or-dict-like-in-cudf