web-scraping

CrawlSpider seems not to follow Rule

柔情痞子 submitted on 2021-02-11 14:32:22

Question: Here's my code. I followed the example in "Recursively Scraping Web Pages With Scrapy", but it seems I have made a mistake somewhere. Can someone help me find it, please? It's driving me crazy: I want the results from all of the result pages, but instead it gives me only the results from page 1. Here's my code: import scrapy from scrapy.selector import Selector from scrapy.spiders import CrawlSpider, Rule from scrapy.http.request import Request from scrapy.contrib.linkextractors.sgml
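
The usual culprits here are overriding parse() (a method name CrawlSpider reserves for its own rule machinery) and leaving follow=True off the pagination Rule; the sgml link extractor import is also long deprecated in favour of scrapy.linkextractors.LinkExtractor. A minimal sketch of the intended structure, with a placeholder start URL, allow pattern, and selectors:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # the sgml extractor is deprecated

class ResultsSpider(CrawlSpider):
    name = "results"
    start_urls = ["https://example.com/results?page=1"]  # placeholder

    rules = (
        # Follow pagination links and run the callback on every page reached.
        Rule(LinkExtractor(allow=r"page=\d+"),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):  # note: NOT named parse()
        for row in response.css("div.result"):  # placeholder selector
            yield {"title": row.css("a::text").get()}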

Scraping a specific website with a search box and JavaScript in Python

。_饼干妹妹 submitted on 2021-02-11 14:30:22

Question: The website https://sray.arabesque.com/dashboard has a search box (an input element in the HTML). I want to enter a company name in the search box, choose the first suggestion for that name in the dropdown menu (e.g., "Anglo American plc"), go to the URL with the information about that company, let the page's JavaScript render the full HTML, and then scrape the GC Score, ESG Score, and Temperature Score at the bottom. !apt install chromium-chromedriver !cp /usr/lib/chromium-browser
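
Because the page is rendered client-side, every step needs an explicit wait rather than a fixed sleep. A sketch of the flow with Selenium; all locators below are assumptions and must be checked against the live page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://sray.arabesque.com/dashboard")

wait = WebDriverWait(driver, 15)
box = wait.until(EC.element_to_be_clickable((By.TAG_NAME, "input")))
box.send_keys("Anglo American")

# Wait for the suggestion dropdown to render, then click the first entry.
first = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, "ul li")))  # placeholder selector
first.click()

# The scores only exist after the company page's scripts run, so wait for them.
score = wait.until(EC.visibility_of_element_located(
    (By.XPATH, "//*[contains(text(), 'ESG')]")))  # placeholder locator
print(score.text)
driver.quit()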

How can I identify the element containing the link to my LinkedIn profile after logging in with selenium.webdriver?

廉价感情. submitted on 2021-02-11 14:19:35

Question: I have written a Python script to log in to my LinkedIn page (feed), and then I want the script to take me to my profile page. But I cannot capture the element with the link: its id changes with every restart of the browser. Obviously I know the link, but I would like the script to be able to capture it. This is the code I have so far: import parameters from time import sleep from selenium import webdriver driver = webdriver.Chrome('/Users/uglyr/chromedriver') driver.get('https
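
When a site generates element ids per session, the fix is to match on an attribute that does not change between restarts. A sketch assuming the profile link's href contains "/in/" (worth verifying in the page source):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.linkedin.com/feed/")
# ... log in here as in the original script ...

# Match on the href pattern rather than the session-generated id.
wait = WebDriverWait(driver, 10)
profile_link = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, "a[href*='/in/']")))
profile_link.click()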

How can I solve this error when scraping Twitter with Python?

a 夏天 submitted on 2021-02-11 14:18:33

Question: I'm working on a personal project for my portfolio. I would like to scrape the tweets about President Macron, but I get this error with twitterscraper. from twitterscraper import query_tweets import datetime as dt import pandas as pd begin_date=dt.date(2020,11,18) end_date=dt.date(2020,11,19) limit=1000 lang='English' tweets=query_tweets("#macron",begindate=begin_date,enddate=end_date,limit=limit,lang=lang) Error: TypeError: query_tweets() got an unexpected keyword argument 'begindate'
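
That TypeError means the installed twitterscraper version's query_tweets simply does not accept a begindate keyword, so the first step is to check what the installed function actually takes. A defensive sketch; note also that the lang parameter expects an ISO code such as 'en', not 'English':

import inspect
import datetime as dt
from twitterscraper import query_tweets

# Print the parameter names your installed version actually accepts.
print(inspect.signature(query_tweets))

# If the signature does include begindate/enddate, a call like this works.
tweets = query_tweets("#macron",
                      begindate=dt.date(2020, 11, 18),
                      enddate=dt.date(2020, 11, 19),
                      limit=1000,
                      lang="en")  # ISO language code, not 'English'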

Exporting DataFrame to Excel using pandas without overwriting

耗尽温柔 submitted on 2021-02-11 14:18:03

Question: How can I export a DataFrame to Excel without overwriting earlier rows? For example: I'm web scraping a table with pagination, so I take page 1, save it in a DataFrame, export to Excel, and do it again for page 2. But every previous record is erased each time I save, leaving only the last page. Here is my code: import time import pandas as pd from bs4 import BeautifulSoup from selenium import webdriver i=1 url = "https://stats.nba.com/players/traditional/?PerMode=Totals&Season=2019-20
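
Calling to_excel inside the loop rewrites the whole file on every page, which is why only the last page survives. The usual fix is to collect each page's DataFrame in a list and write once at the end; a sketch with a placeholder for the scraping step:

import pandas as pd

pages = []
for i in range(1, 6):  # assumed number of pagination pages
    # ... scrape page i into a DataFrame called df_page ...
    df_page = pd.DataFrame({"page": [i]})  # placeholder for scraped data
    pages.append(df_page)

# Concatenate all pages and write the file once, so nothing is overwritten.
result = pd.concat(pages, ignore_index=True)
result.to_excel("nba_players.xlsx", index=False)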

Scraping multiple select options using Selenium

巧了我就是萌 submitted on 2021-02-11 14:02:29

Question: I need to scrape PDFs from the website https://secc.gov.in/lgdStateList . There are three drop-down menus: state, district, and block. There are several states, under each state there are districts, and under each district there are blocks. I tried to implement the following code. I was able to select the state, but there seems to be some error when I select the district. from selenium import webdriver from selenium.webdriver.support.ui import Select import requests from bs4 import
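
Cascading dropdowns like these are usually populated by JavaScript after the previous selection, so the district select element is still empty at the moment the script reaches it. A sketch of the common fix; the element ids and state name below are assumptions, so inspect the page for the real ones:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait

driver = webdriver.Chrome()
driver.get("https://secc.gov.in/lgdStateList")

state = Select(driver.find_element(By.ID, "state"))  # assumed id
state.select_by_visible_text("ASSAM")  # example state

# Block until the district dropdown has been filled in by the page's AJAX.
WebDriverWait(driver, 15).until(
    lambda d: len(Select(d.find_element(By.ID, "district")).options) > 1)

district = Select(driver.find_element(By.ID, "district"))  # assumed id
district.select_by_index(1)  # first real district after the placeholder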

Click “Download csv” button using Selenium and Beautiful Soup

为君一笑 submitted on 2021-02-11 12:57:58

Question: I'm trying to download the CSV file from this website: https://invasions.si.edu/nbicdb/arrivals?state=AL&submit=Search+database&begin=2000-01-01&end=2020-11-11&type=General+Cargo&bwms=any To do so, I need to click the CSV button, which downloads the CSV file. However, I need to do this for multiple links, which is why I want to use Selenium to automate the task of clicking on the link. The code I currently have runs, but it does not actually download the CSV file to the designated folder (or
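
If the click succeeds but nothing lands in the target folder, the browser is most likely saving to its default download directory. A sketch assuming Chrome: set the download directory through prefs before clicking (the button locator is an assumption; check the page for its actual text or selector):

import os
from selenium import webdriver
from selenium.webdriver.common.by import By

download_dir = os.path.abspath("csv_downloads")
os.makedirs(download_dir, exist_ok=True)  # Chrome won't create it for you

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": download_dir,
    "download.prompt_for_download": False,
})

driver = webdriver.Chrome(options=options)
driver.get("https://invasions.si.edu/nbicdb/arrivals?state=AL"
           "&submit=Search+database&begin=2000-01-01&end=2020-11-11"
           "&type=General+Cargo&bwms=any")

driver.find_element(By.LINK_TEXT, "CSV").click()  # assumed link text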

Accessing all elements from the main website page with Beautiful Soup

让人想犯罪 __ submitted on 2021-02-11 12:49:47

Question: I want to scrape news from this website: https://www.bbc.com/news You can see the website has categories such as Home, US Election, Coronavirus, etc. If I go to a specific news article such as https://www.bbc.com/news/election-us-2020-54912611 , I can write a scraper that gives me the headline; this is the code: from bs4 import BeautifulSoup response = requests.get("https://www.bbc.com/news/election-us-2020-54912611", headers=headers) soup = BeautifulSoup(response.content, 'html
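
To get every article rather than one, the front page itself has to be scraped for article links first, and each link then fetched with the single-page code. A sketch; the href filter below is a heuristic and the BBC markup changes, so verify the selectors in the browser:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # assumed; a default UA may be blocked
resp = requests.get("https://www.bbc.com/news", headers=headers)
soup = BeautifulSoup(resp.content, "html.parser")

# Collect candidate article URLs from the front page.
links = set()
for a in soup.select("a[href]"):
    href = a["href"]
    if href.startswith("/news/") and href.count("-") > 1:  # heuristic filter
        links.add("https://www.bbc.com" + href)

# Fetch each article and pull the headline, as in the single-page code.
for url in links:
    page = BeautifulSoup(requests.get(url, headers=headers).content,
                         "html.parser")
    h1 = page.find("h1")
    if h1:
        print(h1.get_text(strip=True), "->", url)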

Scraping multiple pages with R using rvest and purrr

空扰寡人 submitted on 2021-02-11 12:44:53

Question: I am trying to scrape a database containing information about previously sold houses in an area of Denmark. I want to retrieve information not only from page 1 but also from pages 2, 3, 4, etc. I am new to R, but from a tutorial I ended up with this. library(purrr) library(rvest) urlbase <- "https://www.boliga.dk/solgt/alle_boliger-4000ipostnr=4000&so=1&p=%d" map_df(1:5,function(i){ cat(".") page <- read_html(sprintf(urlbase,i)) data.frame(Address = html_text(html_nodes(page,".d-md-table-cell a")))