Get text inside a span class of a particular div

-上瘾入骨i 2021-01-29 11:14

I am scraping the T-Mobile website for reviews on Samsung Galaxy S9. I am able to create a Beautiful Soup object for the HTML code, but I cannot fetch the text of reviews which

3 Answers
  • 2021-01-29 11:42

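    You are not getting the data because the reviews are loaded dynamically by a script after the initial HTML arrives, so they are not in the response that Scrapy downloads. You can try Selenium together with Scrapy to render the page first.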

    import scrapy
    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        allowed_domains = ['t-mobile.com']
        start_urls = ['https://www.t-mobile.com/cell-phone/samsung-galaxy-s9']

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Selenium drives a real browser so the page's scripts can run.
            self.driver = webdriver.Firefox()

        def parse(self, response):
            # Load the same URL in the browser and wrap the rendered HTML
            # in a Scrapy response so the usual CSS selectors work on it.
            self.driver.get(response.url)
            body = self.driver.page_source.encode('utf-8')
            self.parse_response(HtmlResponse(self.driver.current_url, body=body, encoding='utf-8'))

        def parse_response(self, response):
            tmo_ratings_s9 = []
            for review in response.css('#reviews div.BVRRContentReview'):
                # get() returns None when nothing matches, so default to ''.
                text = review.css('.BVRRReviewText::text').get(default='').strip()
                if text:
                    tmo_ratings_s9.append(text)

            print(tmo_ratings_s9)

        def closed(self, reason):
            # Scrapy calls closed() automatically when the spider finishes.
            self.driver.quit()
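
Once the rendered HTML is available (for example from `driver.page_source`), the extraction step itself is just "text of a span with one class, inside a div with another class". The following is a minimal, dependency-free sketch of that step using only the standard library. The class names `BVRRContentReview` and `BVRRReviewText` come from the answer above; the sample markup, the `BVRRReviewTitle` class, and the names `ReviewTextParser` and `extract_reviews` are made up for illustration and are not taken from the real T-Mobile page.

```python
from html.parser import HTMLParser


class ReviewTextParser(HTMLParser):
    """Collect the text of each span.BVRRReviewText inside a div.BVRRContentReview."""

    def __init__(self):
        super().__init__()
        self.reviews = []      # finished review strings
        self._div_depth = 0    # > 0 while inside a review container div
        self._span_depth = 0   # > 0 while inside a review text span
        self._chunks = []      # text pieces of the span currently being read

    @staticmethod
    def _has_class(attrs, name):
        # The class attribute may be missing or empty.
        classes = dict(attrs).get("class") or ""
        return name in classes.split()

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self._div_depth or self._has_class(attrs, "BVRRContentReview"):
                self._div_depth += 1
        elif tag == "span" and self._div_depth:
            if self._span_depth or self._has_class(attrs, "BVRRReviewText"):
                self._span_depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._span_depth:
            self._span_depth -= 1
            if self._span_depth == 0:
                text = "".join(self._chunks).strip()
                if text:
                    self.reviews.append(text)
                self._chunks = []
        elif tag == "div" and self._div_depth:
            self._div_depth -= 1

    def handle_data(self, data):
        if self._span_depth:
            self._chunks.append(data)


def extract_reviews(html):
    parser = ReviewTextParser()
    parser.feed(html)
    parser.close()
    return parser.reviews


# Hypothetical markup, for illustration only.
SAMPLE_HTML = """
<div id="reviews">
  <div class="BVRRContentReview">
    <span class="BVRRReviewTitle">Title one</span>
    <span class="BVRRReviewText"> Great phone. </span>
  </div>
  <div class="BVRRContentReview">
    <span class="BVRRReviewText">Battery could be <b>better</b>.</span>
  </div>
  <span class="BVRRReviewText">Outside any review container</span>
</div>
"""

print(extract_reviews(SAMPLE_HTML))
# → ['Great phone.', 'Battery could be better.']
```

Unlike the `::text` selector, this sketch also keeps text from nested tags such as `<b>`. In a real project a CSS selector library (Scrapy's selectors or Beautiful Soup) would be shorter; the standard-library version is used here only so the example runs without installing anything.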
    
  • 2021-01-29 11:44

    First, if you are using Google Chrome or Mozilla Firefox, press Ctrl+U on the page to open its source. Search the source for a few keywords from a review to check whether the review content is present. If it is, write an XPath for that data. If it is not, check the Network tab of the browser's developer tools for any JSON requests sent while the page loads. If there are none either, you will have to use Selenium.

    In your case, send a request to this URL: https://t-mobile.ugc.bazaarvoice.com/9060redes2-en_us/E4F08F7E-8C29-4420-BE87-9226A6C0509D/reviews.djs?format=embeddedhtml

    This is a JSON request sent while the whole page loads.
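
The "is it in the page source?" check described in this answer can also be scripted. The sketch below is only an illustration under stated assumptions: `source_mentions` and `fetch_text` are hypothetical helper names, the sample page string is made up, and nothing is downloaded unless `fetch_text` is called explicitly. It uses the standard library's `urllib.request`; the format of whatever the real endpoint returns is not assumed here.

```python
import urllib.request


def source_mentions(page_source, keywords):
    """Return the keywords that occur in the raw page source (case-insensitive)."""
    haystack = page_source.lower()
    return [kw for kw in keywords if kw.lower() in haystack]


def fetch_text(url, timeout=10):
    """Download a URL and decode the body as text. Only runs when called."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Made-up page source standing in for what Ctrl+U would show:
# the review container exists, but no review text is in it.
static_page = (
    "<html><body><div id='reviews'></div>"
    "<script src='reviews.js'></script></body></html>"
)

print(source_mentions(static_page, ["Great phone", "reviews"]))
# → ['reviews']
```

If a phrase copied from a visible review is missing from the result, the review text is not in the static source, and the next step is the Network tab. The same `fetch_text` helper could then be pointed at the request URL found there, for example the Bazaarvoice URL quoted above, keeping in mind that a URL from 2021 may no longer respond.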

  • 2021-01-29 12:00

    Use Selenium or webscraper.io.

    https://www.webscraper.io/

    https://www.seleniumhq.org/docs/01_introducing_selenium.jsp
