Question
Could you please help me out with this?
I have a dataframe (`df1`) that holds an index of all articles published in the website's CMS. There's a column for the current URL (`URL`) and a column of original URLs in case they were changed after publication (column name `Origin`):
URL | Origin | ArticleID | Author | Category | Cost |
---|---|---|---|---|---|
https://example.com/article1 | https://example.com/article | 001 | AuthorName | Politics | 120 USD |
https://example.com/article2 | https://example.com/article2 | 002 | AuthorName | Finance | 68 USD |
Next I have a huge dataframe (`df2`) with a web analytics export for a timeframe. It has a date, just one column for URL, and the number of pageviews.
PageviewDate | URL | Pageviews |
---|---|---|
2019-01-01 | https://example.com/article | 224544 |
2019-01-01 | https://example.com/article1 | 656565 |
How do I left join this with the first dataframe, matching on either `URL` = `URL` OR `Origin` = `URL`, so that the end result would look like this:
PageviewDate | Pageviews | ArticleID | Author | Category |
---|---|---|---|---|
2019-01-01 | 881109 | 001 | AuthorName | Politics |
i.e. `881109` is the result of adding up `224544` and `656565`, which both relate to the same article.
I guess what I'm looking for is the equivalent of SQL syntax like:
```
LEFT JOIN ...
ON URL = URL
OR Origin = URL
```
Answer 1:
You could get dataframe 1 (`df1`) into long format so that both `Origin` and `URL` are in the same column, and then perform the join with the second dataframe (`df2`).
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = c(URL, Origin), values_to = 'URL') %>%
  inner_join(df2, by = 'URL') %>%
  select(-name)
#  ArticleID Author     Category URL                          PageviewDate Pageviews
#      <int> <chr>      <chr>    <chr>                        <chr>            <int>
#1         1 AuthorName Politics https://example.com/article1 2019-01-01      656565
#2         1 AuthorName Politics https://example.com/article  2019-01-01      224544
data
df1 <- structure(list(URL = c("https://example.com/article1", "https://example.com/article2"
), Origin = c("https://example.com/article", "https://example.com/article2"
), ArticleID = 1:2, Author = c("AuthorName", "AuthorName"),
Category = c("Politics", "Finance")), class = "data.frame",row.names =c(NA, -2L))
df2 <- structure(list(PageviewDate = c("2019-01-01", "2019-01-01"),
URL = c("https://example.com/article", "https://example.com/article1"),
Pageviews = c(224544L, 656565L)), class = "data.frame", row.names = c(NA, -2L))
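The join above returns one row per matched URL, whereas the question asks for pageviews summed per article. A minimal sketch of one way to get there, building on the same long-format idea and the sample data: the names `lookup` and `result` are hypothetical, and the `distinct()` step is an added assumption to avoid double counting when `URL` and `Origin` are identical (as for article 2).

```r
library(dplyr)
library(tidyr)

# Sample data, same values as in the answer's data section
df1 <- data.frame(
  URL = c("https://example.com/article1", "https://example.com/article2"),
  Origin = c("https://example.com/article", "https://example.com/article2"),
  ArticleID = 1:2,
  Author = c("AuthorName", "AuthorName"),
  Category = c("Politics", "Finance")
)
df2 <- data.frame(
  PageviewDate = c("2019-01-01", "2019-01-01"),
  URL = c("https://example.com/article", "https://example.com/article1"),
  Pageviews = c(224544L, 656565L)
)

# Hypothetical lookup table: one row per (article, known URL).
# distinct() drops the duplicate row produced when URL == Origin.
lookup <- df1 %>%
  pivot_longer(cols = c(URL, Origin), values_to = "URL") %>%
  select(-name) %>%
  distinct()

# Left join from the analytics side, then sum pageviews per date and article.
result <- df2 %>%
  left_join(lookup, by = "URL") %>%
  group_by(PageviewDate, ArticleID, Author, Category) %>%
  summarise(Pageviews = sum(Pageviews), .groups = "drop")

result
```

Because this is a left join from `df2`, analytics URLs that match no article are kept with `NA` in the article columns and end up grouped together per date; filter them out before summarising if that is not wanted.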
Source: https://stackoverflow.com/questions/65638384/how-to-left-join-on-any-of-the-matching-clauses-in-r