问题
I am looking to join two tables.. however the data I am looking to join on does not match exactly.. joining on NFL player names..
data sets below..
> dput(att75a)
structure(list(rusher_player_name = c("A.Ekeler", "A.Jones",
"A.Kamara", "A.Mattison", "A.Peterson", "B.Hill"), mean_epa = c(-0.110459963350783,
0.0334332018597805, -0.119488111742492, -0.155261835310445, -0.123485646124451,
-0.0689611296359916), success_rate = c(0.357664233576642, 0.40495867768595,
0.401129943502825, 0.283018867924528, 0.322727272727273, 0.35
), plays = c(137L, 242L, 177L, 106L, 220L, 80L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
> dput(rb2019capa)
structure(list(rusher_player_name = c("Aaron Jones", "Adrian Peterson",
"Alexander Mattison", "Alvin Kamara", "Austin Ekeler", "Brian Hill"
), Team = c("Packers", "Redskins", "Vikings", "Saints", "Chargers",
"Falcons"), `Salary Cap Value` = c(695487, 1780000, 700545, 1050693,
646668, 645000), `Cash Spent` = c(645000, 2530000, 1317180, 807500,
645000, 645000)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L))
for example I am trying to join A.Mattison on Alexander Mattison.. and so on..
i experimented with stringdist and fuzzyjoin but could not solve my problem..
please consider... took the head() of each dataset to condense per question asking guidelines.. original data sets have lengths of 51 obs. and 168 obs... will that affect how the join is performed?
What is the best way to go about cleaning these names?
thank you for your time..
回答1:
Use sub
to replace first name with first initial.
library(dplyr)
rb2019capa %>%
mutate(rusher_player_name=
sub("^([A-Z])\\S+\\s([A-Za-z].*)$", "\\1.\\2", rusher_player_name)) %>%
inner_join(att75a, by="rusher_player_name") # or left_join (up to you)
# A tibble: 6 x 7
rusher_player_name Team `Salary Cap Value` `Cash Spent` mean_epa success_rate plays
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
1 A.Jones Packers 695487 645000 0.0334 0.405 242
2 A.Peterson Redskins 1780000 2530000 -0.123 0.323 220
3 A.Mattison Vikings 700545 1317180 -0.155 0.283 106
4 A.Kamara Saints 1050693 807500 -0.119 0.401 177
5 A.Ekeler Chargers 646668 645000 -0.110 0.358 137
6 B.Hill Falcons 645000 645000 -0.0690 0.35 80
回答2:
Replace the dot with % making an SQL pattern and join based on a match to it.
library(sqldf)
sqldf("select *
from att75a a
left join rb2019capa r
on r.rusher_player_name like replace(a.rusher_player_name, '.', '%')")
giving:
rusher_player_name mean_epa success_rate plays rusher_player_name..5
1 A.Ekeler -0.11045996 0.3576642 137 Austin Ekeler
2 A.Jones 0.03343320 0.4049587 242 Aaron Jones
3 A.Kamara -0.11948811 0.4011299 177 Alvin Kamara
4 A.Mattison -0.15526184 0.2830189 106 Alexander Mattison
5 A.Peterson -0.12348565 0.3227273 220 Adrian Peterson
6 B.Hill -0.06896113 0.3500000 80 Brian Hill
Team Salary Cap Value Cash Spent
1 Chargers 646668 645000
2 Packers 695487 645000
3 Saints 1050693 807500
4 Vikings 700545 1317180
5 Redskins 1780000 2530000
6 Falcons 645000 645000
来源:https://stackoverflow.com/questions/60329945/joining-on-inexact-strings-in-r