Question
Is there a good approach to scraping the website below? https://list.jd.com/list.html?cat=737,794,798&page=1&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main
Basically, I want to get the name and price of every product. However, the price info is stored in some jQuery scripts.
Is Selenium the only solution? I thought of using V8 / jsonlite, but they don't seem to be applicable. It'd be great if you could offer some alternatives in R. (Access to exe files is blocked on my computer, so I cannot use Selenium / PhantomJS.)
Answer 1:
Couldn't find any robots.txt or terms/conditions that bar scraping (if someone does find that please flag in a comment so I can delete the answer):
library(rvest)
library(V8)
library(tidyverse)
pg <- read_html("https://list.jd.com/list.html?cat=737,794,798&page=1&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main")
Tagging the question with V8 was a 👍🏼 idea.
ctx <- v8()
The page script refers to two globals, window and SEARCH, that do not exist in a bare V8 context, so we define them first and then evaluate the JavaScript:
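# Prepend stubs for the two missing globals, then run the page script in V8
paste0(
  c(
    "var window = {}, SEARCH = {};",
    # the first <script> element on the page holds the data we want
    html_nodes(pg, "script")[[1]] %>%
      html_text()
  ),
  collapse = "\n"
) %>%
  ctx$eval()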
## [1] "[object Object]"
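To see why the two stub globals matter, here is a minimal offline sketch of the same pattern. The JavaScript string is made up for illustration (it is not the script the site serves), and demoList is a hypothetical variable name; only v8(), ctx$eval() and ctx$get() come from the V8 package.

```r
library(V8)

ctx <- v8()

# Hypothetical stand-in for a page script: it writes to `window` and `SEARCH`,
# which exist in a browser (or are set by the site) but not in a bare V8 context.
page_js <- "
window.pageReady = true;
SEARCH.cat = '737,794,798';
var demoList = {'1429810': [{n: 'size', v: '244_110017'}]};
"

# Evaluating the script in a fresh context fails because `window` is undefined.
failed <- inherits(try(v8()$eval(page_js), silent = TRUE), "try-error")

# Declaring the missing globals first lets the script run to completion.
ctx$eval(paste0("var window = {}, SEARCH = {};\n", page_js))

# Objects defined by the script can then be pulled back into R as lists/data frames.
demo <- ctx$get("demoList")
```

Empty objects are enough here because the script only assigns properties on them; a script that calls browser functions (for example DOM methods) would need more elaborate stubs.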
Now get some data out:
ctx$get("aosList") %>%
  bind_rows(.id = "id") %>%
  as_tibble()  # tbl_df() is deprecated in current dplyr
## # A tibble: 175 x 3
## id n v
## <chr> <chr> <chr>
## 1 1429810 39-45英寸 244_110017
## 2 1429810 全高清(1920×1080) 3613_77848
## 3 1429810 3级 1200_1656
## 4 4286570 39-45英寸 244_110017
## 5 4286570 高清(1366×768) 3613_93579
## 6 4286570 3级 1200_1656
## 7 4609652 55英寸 244_1486
## 8 4609652 4k超高清(3840×2160) 3613_77847
## 9 4609652 3级 1200_1656
## 10 4609660 65英寸 244_58269
## # ... with 165 more rows
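Judging by the output above, aosList comes back as a named list with one small data frame per product id, and bind_rows(.id = "id") stacks them while copying each list name into the id column. A minimal sketch of that step, using a hand-built list (specs is a hypothetical stand-in for the real return value, with a few rows borrowed from the output above):

```r
library(dplyr)

# Hypothetical stand-in for what ctx$get("aosList") appears to return:
# a list named by product id, each element a data frame with columns n and v.
specs <- list(
  "1429810" = data.frame(
    n = c("39-45英寸", "3级"),
    v = c("244_110017", "1200_1656"),
    stringsAsFactors = FALSE
  ),
  "4609652" = data.frame(
    n = "55英寸",
    v = "244_1486",
    stringsAsFactors = FALSE
  )
)

# Stack the per-product frames; the list names become the `id` column.
stacked <- bind_rows(specs, .id = "id")
```

Keeping the product id as a column makes it possible to join these attributes to other per-product tables later.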
And, more data:
ctx$get("attrList") %>%
  bind_rows(.id = "id") %>%
  as_tibble()  # tbl_df() is deprecated in current dplyr
## # A tibble: 60 x 15
## id IsSam cw factoryShip isCanUseDQ isJDexpress isJX isOverseaPurchase mcat3Id soldOS tssp venderType xgzs
## <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <chr> <chr>
## 1 1429810 0 1 0 0 0 0 0 798 -1 0 0 7.3
## 2 4286570 0 1 NA 0 0 0 0 798 -1 0 0 6.2
## 3 4609652 0 1 NA 0 0 0 0 798 -1 0 0 7.5
## 4 4609660 0 1 NA 0 0 0 0 798 -1 0 0 8.8
## 5 4620979 0 1 NA 0 0 0 0 798 -1 0 0 6.4
## 6 4751739 0 1 NA 1 0 0 0 798 -1 0 0 8.9
## 7 4902977 0 1 NA NA 0 0 0 798 -1 0 0 9.5
## 8 5010925 0 1 NA 1 0 0 0 798 -1 0 0 8.6
## 9 5102214 0 1 NA 0 0 0 0 798 -1 0 0 7.8
## 10 5218185 0 1 NA 1 0 0 0 798 -1 0 0 <NA>
## # ... with 50 more rows, and 2 more variables: isFzxp <int>, shipFareTmplId <int>
Source: https://stackoverflow.com/questions/52534309/how-to-scrape-javascript-rendered-website-by-r