web-scraping

HTTP headers - Requests - Python

不羁岁月 submitted on 2021-02-10 19:51:55
Question: I am trying to scrape a website whose request headers include some attributes that are new to me, such as :authority, :method, :path, and :scheme. {':authority':'xxxx',':method':'GET',':path':'/xxxx',':scheme':'https','accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8','accept-encoding':'gzip, deflate, br','accept-language':'en-US,en;q=0.9','cache-control':'max-age=0',GOOGLE_ABUSE_EXEMPTION=ID=0d5af55f1ada3f1e:TM=1533116294:C=r:IP=182.71.238.62-:S
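The names that begin with a colon (:authority, :method, :path, :scheme) are HTTP/2 pseudo-headers, which browser developer tools list alongside the ordinary headers. They describe the request line rather than acting as regular header fields, so one common approach is to drop them before reusing the dictionary in a Python client. A minimal sketch follows; the helper name strip_pseudo_headers and the shortened sample dictionary are illustrative and not part of the original question.

```python
def strip_pseudo_headers(headers):
    """Return a copy of headers without HTTP/2 pseudo-headers (names starting with ':')."""
    return {name: value for name, value in headers.items() if not name.startswith(":")}


# Shortened sample based on the headers quoted in the question (placeholder values kept as posted).
browser_headers = {
    ":authority": "xxxx",
    ":method": "GET",
    ":path": "/xxxx",
    ":scheme": "https",
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "max-age=0",
}

clean_headers = strip_pseudo_headers(browser_headers)
print(clean_headers)
```

The cleaned dictionary could then be passed as requests.get(url, headers=clean_headers), with the URL itself assembled from the :scheme, :authority, and :path values.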

Paginate with network requests scraper

跟風遠走 submitted on 2021-02-10 19:05:03
Question: I am trying to scrape Naukri job postings. Web scraping was too time-consuming, so I switched to network requests. I believe I have found the request pattern for pagination, which is to change the URL directly (rather than clicking the next tab). Example URLs:
https://www.naukri.com/maintenance-jobs?xt=catsrch&qf%5B%5D=19
https://www.naukri.com/maintenance-jobs-2?xt=catsrch&qf%5B%5D=19
https://www.naukri.com/maintenance-jobs-3?xt=catsrch&qf%5B%5D=19
https://www.naukri.com/maintenance-jobs-4?xt=catsrch&qf%5B%5D=19
The
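The URL pattern in the examples above (no suffix for the first page, then -2, -3, -4, with an unchanged query string) can be generated rather than typed out. A minimal sketch, assuming the pattern continues the same way for later pages; the function name page_url and its parameters are hypothetical.

```python
def page_url(page, slug="maintenance-jobs", query="xt=catsrch&qf%5B%5D=19"):
    """Build the listing URL for a 1-based page number, following the pattern in the question."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Page 1 has no numeric suffix; later pages append "-<page>" to the slug.
    suffix = "" if page == 1 else f"-{page}"
    return f"https://www.naukri.com/{slug}{suffix}?{query}"


urls = [page_url(n) for n in range(1, 5)]
for url in urls:
    print(url)
```

Each generated URL could then be fetched in a loop, stopping when a page returns no postings.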

Find data within HTML tags using Python

自闭症网瘾萝莉.ら submitted on 2021-02-10 18:44:29
Question: I have the following HTML code that I am trying to scrape from a website:
<td>Net Taxes Due<td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
What I am trying to accomplish is to search the page for the text "Net Taxes Due" within the <td> tag, find the siblings of that tag, and load the results into a Pandas data frame. I have the following code:
soup = BeautifulSoup(url, "html.parser")
table = soup.select('#Net Taxes Due')
cells = table.find_next_siblings('td')
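The goal described here is to locate a cell by its text and then read the cells that follow it. A CSS selector beginning with # matches on an element id rather than on cell text, so the lookup has to compare text instead. The sketch below illustrates that idea using only Python's standard-library html.parser, not BeautifulSoup; the class name CellCollector and the function values_after are hypothetical, and the HTML is reproduced as posted, including the unclosed first cell.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # Each <td> opens a new cell, even if the previous one was never closed.
            self.cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def values_after(html, label):
    """Return the non-empty cell texts that follow the cell whose text equals label.

    Note: this returns all later cells in the fragment, not only those in the same row.
    """
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    cells = [text.strip() for text in parser.cells]
    if label not in cells:
        return []
    start = cells.index(label) + 1
    return [text for text in cells[start:] if text]


sample_html = (
    '<td>Net Taxes Due<td> '
    '<td class="value-column">$2,370.00</td> '
    '<td class="value-column">$2,408.00</td>'
)
values = values_after(sample_html, "Net Taxes Due")
print(values)
```

The resulting list of strings could then be placed in a Pandas data frame, for example as a single row keyed by the label.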

Finding number of pages using Python BeautifulSoup

强颜欢笑 submitted on 2021-02-10 18:25:33
Question: I want to extract the total number of pages (11 in this case) from a Steam page. I believe the following code should work (return 11), but it returns an empty list, as if it is not finding the paged_items_paging_pagelink class.
import requests
import re
from bs4 import BeautifulSoup
r = requests.get('http://store.steampowered.com/tags/en-us/RPG/')
c = r.content
soup = BeautifulSoup(c, 'html.parser')
total_pages = soup.find_all("span",{"class":"paged_items_paging_pagelink"})[-1].text
Answer 1:

R scraping with a dropdown menu

做~自己de王妃 submitted on 2021-02-10 18:25:05
Question: I am attempting to scrape the NBA daily ROS projections from the site https://hashtagbasketball.com/fantasy-basketball-projections. The problem is that the default number of players selected is 200, and I want 400 (or ALL would work too). This code retrieves the first 200 with no problem:
> url <- 'https://hashtagbasketball.com/fantasy-basketball-projections'
>
> page <- read_html(url)
>
> projs <- html_table(page)[[3]] %>% ### anything after this just cleans the df
+ rename_all(~gsub('3pm','threes',gsub

How to use (new) LinkedIn API from and with R?

蓝咒 submitted on 2021-02-10 15:53:26
Question: It seems that Rlinkedin is deprecated, that the LinkedIn API has changed, and that LinkedIn's developer documentation does not provide much information for R users. I don't understand why. For the moment, there are references only for Bash, NodeJS and Java... Could anyone provide a very basic, recent, and working example in R for getting started with the LinkedIn API? For instance, how do I get profiles? This kind of example doesn't work:
url <- 'https://www.linkedin.com/in/reidhoffman/'
library(httr)

Extract table from webpage using VBA

家住魔仙堡 submitted on 2021-02-10 15:02:13
Question: I would like to extract a table from HTML into Excel using VBA. I have tried the following code several times, changing parts of it, but I keep getting an error.
Sub GrabTable()
'dimension (set aside memory for) our variables
Dim objIE As InternetExplorer
Dim ele As Object
Dim y As Integer
'start a new browser instance
Set objIE = New InternetExplorer
'make browser visible
objIE.Visible = False
'navigate to page with needed data
objIE.navigate "http://www.bursamalaysia.com/market