beautifulsoup

How to scrape project urls from indiegogo using BeautifulSoup?

江枫思渺然 submitted on 2021-02-11 15:40:46
Question: I am trying to scrape project URLs from Indiegogo, but I have had no success after hours of trying. I cannot scrape them using either XPath or BeautifulSoup. The output of the following code does not contain the information I want: soup.find_all("div") BeautifulSoup did not work either: import requests from bs4 import BeautifulSoup url = 'https://www.indiegogo.com/explore/all?project_type=campaign&project_timing=ending_soon&sort=trending' page = requests.get(url) soup = BeautifulSoup(page.text, 'html.parser'

Get content from certain tags with certain attributes using BS4

给你一囗甜甜゛ submitted on 2021-02-11 15:31:15
Question: I need to get the content from the following tag with these attributes: <span class="h6 m-0">. An example of the HTML I'll encounter would be <span class="h6 m-0">Hello world</span>, and it needs to return Hello world. My current code is as follows: page = BeautifulSoup(text, 'html.parser') names = [item["class"] for item in page.find_all('span')] This works fine and gets me all the spans in the page, but I don't know how to specify that I only want those with the specific class
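A minimal sketch of one way to restrict the search to that class and return the text rather than the class list. The sample HTML below extends the question's example with two extra spans that are assumptions added for contrast.

```python
from bs4 import BeautifulSoup

# Sample markup: the first span is from the question, the others are
# hypothetical additions to show what gets filtered out.
text = (
    '<span class="h6 m-0">Hello world</span>'
    '<span class="h6">Other</span>'
    '<span>Plain</span>'
)
page = BeautifulSoup(text, "html.parser")

# CSS selector: spans that carry both classes, in any order.
names = [item.get_text(strip=True) for item in page.select("span.h6.m-0")]

# Alternative: match the exact string value of the class attribute.
exact = [item.get_text(strip=True) for item in page.find_all("span", class_="h6 m-0")]

print(names)
print(exact)
```

The selector form is usually the more forgiving of the two, since the exact-string form stops matching if the classes appear in a different order.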

Requests-html: error while running on flask

本小妞迷上赌 submitted on 2021-02-11 15:01:28
Question: I wrote a script using requests-html that worked fine. After I deployed it in a Flask app, it now raises RuntimeError: There is no current event loop in thread 'Thread-3'. Here's the full error: Traceback (most recent call last): File "C:\Users\intel\AppData\Local\Programs\Python\Python38\Lib\site-packages\flask\app.py", line 2464, in __call__ return self.wsgi_app(environ, start_response) . . . File "C:\Users\intel\Desktop\One page\main.py", line 18, in hello_world r
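The error message itself points at the cause: asyncio only provides a default event loop in the main thread, while the request here is handled in a worker thread ('Thread-3'). A minimal standard-library sketch of the usual workaround, creating and registering a loop inside the worker thread, is shown below. The function name run_in_worker is hypothetical, and whether this alone is enough for requests-html rendering inside Flask is not confirmed by the excerpt.

```python
import asyncio
import threading


def run_in_worker():
    """Run a coroutine from a non-main thread by giving it its own loop."""
    # Worker threads have no event loop by default, so create one
    # and register it as the current loop for this thread.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        # Stand-in coroutine; real code would run its async work here.
        return loop.run_until_complete(asyncio.sleep(0, result="ok"))
    finally:
        loop.close()
        asyncio.set_event_loop(None)


results = []
worker = threading.Thread(target=lambda: results.append(run_in_worker()))
worker.start()
worker.join()
print(results)
```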

Problems with data retrieving using Python web scraping

拟墨画扇 submitted on 2021-02-11 14:53:04
Question: I wrote some simple code to scrape data from a web page. I specified everything, such as each tag and its class, but my program does not scrape any data. There is also an email address that I want to scrape, but I do not know how to specify its id or class. Could you please guide me on how to fix this issue? Thanks! Here is my code: import requests from bs4 import BeautifulSoup import csv def get_page(url): response = requests.get(url) if not response.ok: print('server responded:', response

How to speed up parsing using BeautifulSoup?

本秂侑毒 submitted on 2021-02-11 14:47:41
Question: I want to make a list of music festivals in Korea, so I tried to crawl a website that sells festival tickets: import requests from bs4 import BeautifulSoup INTERPARK_BASE_URL = 'http://ticket.interpark.com' # Festival List Page req = requests.get('http://ticket.interpark.com/TPGoodsList.asp?Ca=Liv&SubCa=Fes') html = req.text soup = BeautifulSoup(html, 'lxml') for title_raw in soup.find_all('span', class_='fw_bold'): title = str(title_raw.find('a').text) url_raw = str(title_raw.find('a').get(
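One option that may reduce parsing work is to have Beautiful Soup build a tree only for the elements of interest, using SoupStrainer with the parse_only argument. The sketch below runs on a small inline document that is an assumption standing in for the real listing page. It uses html.parser to avoid an extra dependency; the question's lxml parser accepts parse_only in the same way.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical stand-in for the festival list page.
html = """<html><body>
<div><span class="fw_bold"><a href="/a">Festival A</a></span></div>
<div><span class="other"><a href="/x">Not a title</a></span></div>
<div><span class="fw_bold"><a href="/b">Festival B</a></span></div>
</body></html>"""

# Keep only <span class="fw_bold"> elements (and their children) while parsing.
only_titles = SoupStrainer("span", attrs={"class": "fw_bold"})
soup = BeautifulSoup(html, "html.parser", parse_only=only_titles)

festivals = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]
print(festivals)
```

If most of the time is spent on network requests rather than parsing, reusing a single requests.Session for all page fetches is another thing worth measuring.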

Wrong accented characters using Beautiful Soup in Python on a local HTML file

瘦欲@ submitted on 2021-02-11 14:39:37
Question: I'm quite familiar with Beautiful Soup in Python and have always used it to scrape live sites. Now I'm scraping a local HTML file (link, in case you want to test the code). The only problem is that accented characters are not displayed correctly (this never happened to me when scraping live sites). This is a simplified version of the code: import requests, urllib.request, time, unicodedata, csv from bs4 import BeautifulSoup soup = BeautifulSoup(open('AH.html'), "html.parser") tables =
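A likely cause is that open() without an encoding argument decodes the file with the platform default, which may not match the encoding the file was saved in. The sketch below writes a small temporary file in cp1252 and reads it back with the encoding stated explicitly. Both the sample content and the choice of cp1252 are assumptions, since the real encoding of AH.html is not given in the excerpt.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a small local file in a known encoding (assumed: cp1252).
raw = "<html><body><td>Città è già</td></body></html>".encode("cp1252")
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

# Pass the encoding explicitly instead of relying on the platform default.
with open(path, encoding="cp1252") as fh:
    soup = BeautifulSoup(fh, "html.parser")

cell_text = soup.td.get_text()
os.remove(path)
print(cell_text)
```

Another route is to open the file in binary mode and let Beautiful Soup detect the encoding, or pass from_encoding when the encoding is known.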

BeautifulSoup parsing XML to table

时光怂恿深爱的人放手 submitted on 2021-02-11 14:34:03
Question: I'm back with another issue. I'm using BeautifulSoup and am really new to parsing XML, and I have been stuck on this problem for two weeks now. I would appreciate your help. I have this structure: <detail> <page number="01"> <Bloc code="AF" A="000000000002550" B="000000000002550"/> <Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/> <Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/> </page> <page number="02"> <Bloc code="DA" A="000000000038486"
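A minimal sketch of flattening this structure into table rows, using only the complete elements visible in the excerpt. The target layout (one row per Bloc with columns page, code, A to D) is an assumption, since the desired table is not shown. Note that html.parser lowercases tag and attribute names, so the lookups use "bloc" and "a" to "d"; with lxml installed, BeautifulSoup(xml, "xml") would preserve the original case instead.

```python
from bs4 import BeautifulSoup

xml = """<detail>
<page number="01">
<Bloc code="AF" A="000000000002550" B="000000000002550"/>
<Bloc code="AH" A="000000000035826" C="000000000035826" D="000000000035826"/>
<Bloc code="AR" A="000000000026935" B="000000000024503" C="000000000002431" D="000000000001669"/>
</page>
</detail>"""

soup = BeautifulSoup(xml, "html.parser")

rows = []
for page in soup.find_all("page"):
    for bloc in page.find_all("bloc"):  # lowercased by html.parser
        row = {"page": page["number"], "code": bloc.get("code")}
        for col in ("a", "b", "c", "d"):
            # Missing attributes become None so every row has the same columns.
            row[col.upper()] = bloc.get(col)
        rows.append(row)

for row in rows:
    print(row)
```

A list of dictionaries like this can be passed to csv.DictWriter or to a DataFrame constructor if a tabular file is the end goal.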

beautiful soup .find can't find anything

狂风中的少年 submitted on 2021-02-11 14:32:58
Question: I am trying to scrape posts in a Facebook group: URL = 'https://www.facebook.com/groups/110354088989367/' headers = { "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' } def checkSubletGroup(): page = requests.get(URL, headers=headers) soup = BeautifulSoup(page.content, 'html.parser') posts = soup.find_all("div", {"class_": "text_exposed_root"}) print(soup.prettify()) for post in posts: print(post)
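One detail in the code above is worth checking independently of Facebook itself: class_ is the keyword-argument spelling, while a dictionary filter needs the real attribute name "class". A dictionary key of "class_" searches for an attribute literally named class_, which never matches. The sketch below demonstrates this on a hypothetical inline snippet. Even with the filter corrected, the fetched page may not contain the posts if they are loaded by JavaScript or require login.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
html = (
    '<div class="text_exposed_root">First post</div>'
    '<div class="other">Unrelated</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Dict key "class_" looks for an attribute named class_, so nothing matches.
wrong = soup.find_all("div", {"class_": "text_exposed_root"})

# Either of these matches the class attribute.
by_dict = soup.find_all("div", {"class": "text_exposed_root"})
by_kwarg = soup.find_all("div", class_="text_exposed_root")

print(len(wrong), len(by_dict), len(by_kwarg))
```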

Editing DOCTYPE tag with BeautifulSoup

那年仲夏 submitted on 2021-02-11 14:32:53
Question: I need to add an ATTLIST declaration to the DOCTYPE tag in HTML documents. After reading the documentation and googling, this is what I've come up with: from bs4 import BeautifulSoup, Doctype # minimal html document doc = """<?xml version='1.0' encoding='UTF-8'?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" > <html/>""" soup = BeautifulSoup(doc, features='html.parser') # the modified doctype tag doctype = """<!DOCTYPE
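A minimal sketch of one possible approach, not necessarily the one the truncated code was heading toward: locate the existing Doctype node and replace it with a new Doctype whose text carries an internal subset. A Doctype object holds only the text between "<!DOCTYPE " and ">", so the prefix is not repeated. The attribute name example-attr is hypothetical, and the document here is simplified from the question's.

```python
from bs4 import BeautifulSoup, Doctype

doc = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    "<html></html>"
)
soup = BeautifulSoup(doc, "html.parser")

# Find the parsed doctype among the top-level nodes.
old = next(item for item in soup.contents if isinstance(item, Doctype))

# Append an internal subset with a hypothetical ATTLIST declaration.
new = Doctype(str(old) + " [<!ATTLIST html example-attr CDATA #IMPLIED>]")
old.replace_with(new)

result = str(soup)
print(result)
```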

Scraping a specific website with a search box and javascripts in Python

。_饼干妹妹 submitted on 2021-02-11 14:30:22
Question: On the website https://sray.arabesque.com/dashboard there is a search box ("input" in the HTML). I want to enter a company name in the search box, choose the first suggestion for that name in the dropdown menu (e.g., "Anglo American plc"), go to the URL with the information about that company, let the JavaScript load to get the full HTML of the resulting page, and then scrape it for the GC Score, ESG Score, and Temperature Score at the bottom. !apt install chromium-chromedriver !cp /usr/lib/chromium-browser