beautifulsoup | 易学教程

How to scrape project urls from indiegogo using BeautifulSoup?

Question: I am trying to scrape project URLs from Indiegogo, but have had no success after hours of trying. I cannot scrape them using either XPath or BeautifulSoup. The output of the following code does not contain the information I want:

    soup.find_all("div")

BeautifulSoup did not work either:

    import requests
    from bs4 import BeautifulSoup
    url = 'https://www.indiegogo.com/explore/all?project_type=campaign&project_timing=ending_soon&sort=trending'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser'

Get content from certain tags with certain attributes using BS4

Question: I need to get the content from the following tag with these attributes: <span class="h6 m-0">. An example of the HTML I'll encounter would be <span class="h6 m-0">Hello world</span>, and it needs to return Hello world. My current code is as follows:

    page = BeautifulSoup(text, 'html.parser')
    names = [item["class"] for item in page.find_all('span')]

This works fine and gets me all the spans in the page, but I don't know how to specify that I only want those with the specific class
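One way to restrict the search to that class combination, sketched on the sample markup from the question (the second, non-matching span is added here only for contrast):

```python
from bs4 import BeautifulSoup

# Sample markup from the question, plus one span that should NOT match.
text = '<span class="h6 m-0">Hello world</span><span class="h6">Other</span>'
page = BeautifulSoup(text, "html.parser")

# CSS selector: spans that carry both classes, in any order.
names = [item.get_text() for item in page.select("span.h6.m-0")]

# Alternative: match the class attribute as one exact string (order-sensitive).
names_exact = [item.get_text() for item in page.find_all("span", class_="h6 m-0")]

print(names)
```

The selector form is usually the more forgiving of the two, because it still matches if the page lists the classes in a different order or adds a third one.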

Requests-html: error while running on flask

Question: I prepared a script that uses requests-html, and it was working fine. After I deployed it in a Flask app, it started raising RuntimeError: There is no current event loop in thread 'Thread-3'. Here's the full error:

    Traceback (most recent call last):
      File "C:\Users\intel\AppData\Local\Programs\Python\Python38\Lib\site-packages\flask\app.py", line 2464, in __call__
        return self.wsgi_app(environ, start_response)
      . . .
      File "C:\Users\intel\Desktop\One page\main.py", line 18, in hello_world
        r
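The error message indicates that code running in a worker thread asked asyncio for an event loop that was never created for that thread. A minimal standard-library sketch of creating one is shown here; `fetch_title` is a hypothetical stand-in for the real asynchronous work, and whether this alone fixes the requests-html case depends on how that library obtains its loop, which the excerpt does not show.

```python
import asyncio
import threading


async def fetch_title():
    # Hypothetical stand-in for the asynchronous work the real script does.
    await asyncio.sleep(0)
    return "rendered"


results = []


def worker():
    # Threads other than the main one start without an event loop,
    # so create one and register it for this thread.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        results.append(loop.run_until_complete(fetch_title()))
    finally:
        loop.close()


thread = threading.Thread(target=worker)
thread.start()
thread.join()
print(results)
```

In a Flask view, the same two lines (`new_event_loop` followed by `set_event_loop`) would go at the start of the request handler, before any code that expects a current loop.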

Problems with data retrieving using Python web scraping

Question: I wrote some simple code to scrape data from a web page. I specified everything, such as each tag and its class, but my program does not scrape any data. In addition, there is an email address that I also want to scrape, but I do not know how to reference its id or class. Could you please guide me on how to fix this issue? Thanks! Here is my code:

    import requests
    from bs4 import BeautifulSoup
    import csv

    def get_page(url):
        response = requests.get(url)
        if not response.ok:
            print('server responded:', response

How to speed up parsing using BeautifulSoup?

Question: I want to make a list of music festivals in Korea, so I tried to crawl a website selling festival tickets:

    import requests
    from bs4 import BeautifulSoup

    INTERPARK_BASE_URL = 'http://ticket.interpark.com'

    # Festival List Page
    req = requests.get('http://ticket.interpark.com/TPGoodsList.asp?Ca=Liv&SubCa=Fes')
    html = req.text
    soup = BeautifulSoup(html, 'lxml')
    for title_raw in soup.find_all('span', class_='fw_bold'):
        title = str(title_raw.find('a').text)
        url_raw = str(title_raw.find('a').get(
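One possible lever on the parsing side, sketched on made-up markup (the HTML and link paths here are assumptions, not the real Interpark page), is to ask BeautifulSoup to build only the elements of interest with `SoupStrainer`. This can reduce parse work on large pages; it does nothing for time spent in network requests, which may well be the larger cost in a crawl like this.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up stand-in for the festival list page.
html = """
<div>
  <span class="fw_bold"><a href="/goods/1">Festival A</a></span>
  <p>unrelated content</p>
  <span class="fw_bold"><a href="/goods/2">Festival B</a></span>
</div>
"""

# Only <span class="fw_bold"> elements (and their children) are kept in the tree.
only_titles = SoupStrainer("span", class_="fw_bold")
soup = BeautifulSoup(html, "html.parser", parse_only=only_titles)

festivals = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]
print(festivals)
```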

Wrong accented characters using Beautiful Soup in Python on a local HTML file

Question: I'm quite familiar with Beautiful Soup in Python and have always used it to scrape live sites. Now I'm scraping a local HTML file (link, in case you want to test the code). The only problem is that accented characters are not represented correctly (this never happened to me when scraping live sites). This is a simplified version of the code:

    import requests, urllib.request, time, unicodedata, csv
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open('AH.html'), "html.parser")
    tables =
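A likely cause is that `open('AH.html')` decodes the file with the platform's default encoding rather than the file's own. A minimal sketch, assuming the file is UTF-8 (the excerpt does not say what encoding AH.html actually uses) and using a small temporary file in its place:

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small sample file; its UTF-8 encoding is an assumption of this sketch.
path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<table><tr><td>Perché</td></tr></table>")

# Name the encoding explicitly instead of relying on the platform default.
with open(path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

cell = soup.find("td").get_text()
print(cell)
```

Another option is to open the file in binary mode (`"rb"`) and let BeautifulSoup detect the encoding from the bytes, which is closer to what happens when a live response is parsed.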

BeautifulSoup parsing XML to table

Question: I'm back with another issue. I'm really new to parsing XML with BeautifulSoup and have had this problem for two weeks now. I would appreciate your help. I have this structure:

    <detail>
      <page number="01">
        <Bloc code="AF" A="000000000002550" B="000000000002550"/>
        <Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
        <Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/>
      </page>
      <page number="02">
        <Bloc code="DA" A="000000000038486"
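A minimal sketch of flattening this structure into table rows, using only the complete first page from the excerpt. The column order and the empty string for missing attributes are choices made for this illustration. It uses `html.parser`, which lowercases tag and attribute names (`Bloc` becomes `bloc`, `A` becomes `a`); with lxml installed, `features="xml"` would preserve the original case.

```python
from bs4 import BeautifulSoup

xml = """<detail>
<page number="01">
<Bloc code="AF" A="000000000002550" B="000000000002550"/>
<Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
<Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/>
</page>
</detail>"""

soup = BeautifulSoup(xml, "html.parser")

# html.parser lowercases names, so the columns are lowercase too.
columns = ["page", "code", "a", "b", "c", "d"]
rows = []
for page in soup.find_all("page"):
    for bloc in page.find_all("bloc"):
        record = {"page": page["number"], **bloc.attrs}
        # Attributes missing on a given Bloc become empty cells.
        rows.append([record.get(col, "") for col in columns])

for row in rows:
    print(row)
```

Each row can then be written out with `csv.writer` or handed to a table library.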

beautiful soup .find can't find anything

Question: I am trying to scrape posts in a Facebook group:

    URL = 'https://www.facebook.com/groups/110354088989367/'
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
    }

    def checkSubletGroup():
        page = requests.get(URL, headers=headers)
        soup = BeautifulSoup(page.content, 'html.parser')
        posts = soup.find_all("div", {"class_": "text_exposed_root"})
        print(soup.prettify())
        for post in posts:
            print(post)
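One thing visible in the code above is the `"class_"` key inside the attribute dictionary. The trailing underscore is only needed for the `class_` keyword argument; inside a dictionary the key is the real attribute name, `"class"`. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page content.
html = '<div class="text_exposed_root">A post</div><div class="other">Noise</div>'
soup = BeautifulSoup(html, "html.parser")

# Looks for an attribute literally named "class_", which no element has.
wrong = soup.find_all("div", {"class_": "text_exposed_root"})

# Either of these filters on the class attribute.
by_dict = soup.find_all("div", {"class": "text_exposed_root"})
by_kwarg = soup.find_all("div", class_="text_exposed_root")

print(len(wrong), len(by_dict), len(by_kwarg))
```

Even with the filter corrected, the raw HTML returned by `requests` for the live page may not contain that markup at all, for example if the posts are rendered by JavaScript or require a login; the excerpt gives no way to check this.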

Editing DOCTYPE tag with BeautifulSoup

Question: I need to add an ATTLIST declaration to the DOCTYPE tag in HTML documents. After reading the documentation and googling, this is what I've come up with:

    from bs4 import BeautifulSoup, Doctype

    # minimal html document
    doc = """<?xml version='1.0' encoding='UTF-8'?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
    <html/>"""
    soup = BeautifulSoup(doc, features='html.parser')

    # the modified doctype tag
    doctype = """<!DOCTYPE
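A minimal sketch of swapping the doctype node for a new one. The ATTLIST declaration used here is hypothetical, since the one in the question is cut off, and the input document is simplified to a bare `<!DOCTYPE html>`. Note that a `Doctype` is constructed from the text that follows `<!DOCTYPE `, not from the full tag.

```python
from bs4 import BeautifulSoup, Doctype

# Simplified input document (an assumption of this sketch).
doc = "<!DOCTYPE html><html></html>"
soup = BeautifulSoup(doc, "html.parser")

# Find the existing doctype node among the top-level nodes.
old_doctype = next(node for node in soup.contents if isinstance(node, Doctype))

# Hypothetical declaration, for illustration only.
new_doctype = Doctype("html [<!ATTLIST html example CDATA #IMPLIED>]")
old_doctype.replace_with(new_doctype)

result = str(soup)
print(result)
```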

Scraping a specific website with a search box and javascripts in Python

Question: On the website https://sray.arabesque.com/dashboard there is a search box (an "input" element in the HTML). I want to enter a company name in the search box, choose the first suggestion for that name in the drop-down menu (e.g., "Anglo American plc"), go to the URL with the information about that company, let the JavaScript load to get the full HTML of the resulting page, and then scrape it for the GC Score, ESG Score, and Temperature Score at the bottom.

    !apt install chromium-chromedriver
    !cp /usr/lib/chromium-browser