Question
I have two CSV files, price and performance.
Here is the data layout of each:
Price:
Performance:
I import them into Python using:
import pandas as pd
price = pd.read_csv("cpu.csv")
performance = pd.read_csv("geekbench.csv")
This works as intended. However, I am unsure how to create a new CSV file from the rows where Price[brand + model] matches Performance[name].
I want to take:
- Cores, tdp and price from Price
- Score, multicore_score and name from Performance
and create a new CSV file from the columns above. The problem I've been having is finding a good way to match that ignores minor differences such as capitalization. I was looking into algorithms such as fuzzy string matching, but I'm not sure what the best option is.
This is my current attempt, which throws errors:
for i in range(len(price.index)):
    brand = (price.iloc[i, 0])
    model = (price.iloc[i, 1])
    print(model)
    print(performance)
    print(performance.query('name == brand+model'))
Thanks
Answer 1:
I suggest the following:
import nltk
import pandas as pd

tokenizer = nltk.RegexpTokenizer(r'\w+')

price = pd.DataFrame({"brand": ["AMD", "AMD", "AMD", "AMD"],
                      "model": ["2650", "3800", "5150", "4200"],
                      "cores": [2, 4, 4, 4],
                      "tdp": [25, 25, 25, 25]})
performance = pd.DataFrame({"name": ["AMD Athlon 64 3200+",
                                     "AMD Athlon 64 X2 3800+",
                                     "AMD Athlon 64 X2 4000+",
                                     "AMD Athlon 64 X2 4200+"],
                            "score": [6, 5, 6, 18]})

# Break the name in performance into lowercase tokens
performance["tokens"] = (performance["name"].str.lower()
                         .apply(tokenizer.tokenize))

# Do the same for price, using brand + " " + model
price["tokens"] = price["brand"] + " " + price["model"]
price["tokens"] = (price["tokens"].str.lower()
                   .apply(tokenizer.tokenize))

# Cartesian product: pair every price row with every performance row
price["key"] = 1
performance["key"] = 1
df = pd.merge(price, performance, on="key")

# Match criterion: minimum number of tokens the two names must share
n_match = 2
df["intersection"] = [len(set(a) & set(b))
                      for a, b in zip(df.tokens_x, df.tokens_y)]
df = df.loc[df["intersection"] >= n_match, :]
I redefined your datasets so that this example has some matches. Here is the result:
   brand model  cores  ...  score                     tokens_y  intersection
5    AMD  3800      4  ...      5  [amd, athlon, 64, x2, 3800]             2
15   AMD  4200      4  ...     18  [amd, athlon, 64, x2, 4200]             2

[2 rows x 10 columns]
You can adjust n_match to change the match criterion. I chose 2 because that seemed to be what this dataset requires.
Hope it helps.
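To get from the matched rows to the CSV file the question asks for, something like the following might work. It is only a sketch under assumptions: the sample values (including the price and multicore_score columns) are made up for illustration, the names tokenize, pairs, overlap and matched are hypothetical, and str.findall(r"\w+") stands in for the NLTK tokenizer so the example has no dependency beyond pandas. It also assumes pandas 1.2 or newer for how="cross".

```python
import pandas as pd

# Hypothetical sample data; the price and multicore_score values are invented.
price = pd.DataFrame({
    "brand": ["AMD", "AMD"],
    "model": ["3800", "5150"],
    "cores": [4, 4],
    "tdp": [25, 25],
    "price": [50.0, 60.0],
})
performance = pd.DataFrame({
    "name": ["AMD Athlon 64 X2 3800+", "AMD Athlon 64 X2 4000+"],
    "score": [5, 6],
    "multicore_score": [9, 11],
})


def tokenize(series):
    """Lowercase each string and return its set of \\w+ tokens."""
    return series.str.lower().str.findall(r"\w+").apply(set)


price["tokens"] = tokenize(price["brand"] + " " + price["model"])
performance["tokens"] = tokenize(performance["name"])

# Cartesian product without a helper key column (pandas >= 1.2).
pairs = price.merge(performance, how="cross", suffixes=("_price", "_perf"))

# Count shared tokens and keep pairs that share at least n_match of them.
n_match = 2
pairs["overlap"] = [len(a & b)
                    for a, b in zip(pairs["tokens_price"], pairs["tokens_perf"])]
columns = ["name", "cores", "tdp", "price", "score", "multicore_score"]
matched = pairs.loc[pairs["overlap"] >= n_match, columns]

# With no path, to_csv returns the CSV text; pass a filename to write a file.
csv_text = matched.to_csv(index=False)
print(csv_text)
```

Note that the cross join grows as len(price) * len(performance), so for large files it may be worth filtering by brand first.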
Answer 2:
You can create a 'name' column in price and then merge it with performance on that column:
price['name'] = price.brand + ' ' + price.model.astype(str)
merged = price.merge(performance, on='name')
However, the resulting frame will probably be empty, since at least the sample rows shown in your question won't match. This is not a linguistic problem such as capitalization, but simply missing information. Only after you define a matching rule in plain language will you be able to code it in Python.
Source: https://stackoverflow.com/questions/60943829/how-to-merge-two-csv-files-by-value-in-column-using-pandas-python
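One way to see how much of the mismatch is only formatting is to merge on a normalized key and ask pandas to report which rows found a partner. The sketch below is an assumption-laden illustration, not a fix: the sample rows are invented, and normalize and key are hypothetical names. It uses the indicator parameter of DataFrame.merge, which adds a _merge column.

```python
import pandas as pd

# Invented rows: the first pair differs only in case and spacing,
# the second lacks information that a plain rule could recover.
price = pd.DataFrame({"brand": ["AMD", "Intel"],
                      "model": ["Athlon 64 3200+", "2650"],
                      "cores": [1, 2]})
performance = pd.DataFrame({"name": ["amd  athlon 64 3200+", "Intel Core 2650"],
                            "score": [6, 7]})


def normalize(series):
    """Lowercase and collapse runs of whitespace into single spaces."""
    return series.str.lower().str.split().str.join(" ")


price["key"] = normalize(price["brand"] + " " + price["model"].astype(str))
performance["key"] = normalize(performance["name"])

# how="left" keeps every price row; _merge says whether it found a match.
merged = price.merge(performance, on="key", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", ["brand", "model"]]
print(unmatched)
```

Rows that end up in unmatched are the ones that need an explicit rule (or a fuzzier comparison) rather than just case folding.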