Webscraping an IMDb page using BeautifulSoup

没有蜡笔的小新 2020-12-21 05:46

I am new to web scraping, Python, and BeautifulSoup, and I am having difficulty getting my code to work.

I would like to scrape the url: http://m.imdb.com/feature/bornond

2 Answers
  • 2020-12-21 06:06

    First of all, screen scraping is explicitly forbidden by the IMDb "Conditions of Use":

    Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

    Try exploring the IMDb JSON API instead of a web-scraping approach.


    Your current problem is that the list of people born on a specific date is loaded by a separate call to the IMDb API, driven by JavaScript, so it is not part of the HTML returned by the initial request.

    The easiest option right now would be to switch to Selenium, a browser automation tool. Working example using a headless PhantomJS browser:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    # PhantomJS support was removed in Selenium 4; on newer versions,
    # substitute a headless Chrome or Firefox driver here.
    driver = webdriver.PhantomJS()
    driver.get("http://m.imdb.com/feature/bornondate")
    
    # wait up to 10 seconds for the posters section to appear
    wait = WebDriverWait(driver, 10)
    posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))
    
    # extract the data poster by poster
    for a in posters.find_elements(By.CSS_SELECTOR, 'a.poster'):
        # replace the size suffix in the image URL
        img = a.find_element(By.TAG_NAME, 'img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    
        person = a.find_element(By.CSS_SELECTOR, 'div.detail').text
        title = a.find_element(By.CSS_SELECTOR, 'span.title').text
    
        print(img, person, title)
    
    # close the browser once scraping is done
    driver.quit()
    

    Prints:

    http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1_SX214_AL_.jpg Actor, "Ozymandias" Bryan Cranston
    http://ia.media-imdb.com/images/M/MV5BNjUxNjcxMjE4N15BMl5BanBnXkFtZTgwNDk4NjA2MzE@._V1_SX214_AL_.jpg Actress, "Karla" Laura Prepon
    http://ia.media-imdb.com/images/M/MV5BMTQ4MzM1MDAwMV5BMl5BanBnXkFtZTcwNTU4NzQwMw@@._V1_SX214_AL_.jpg Actress, "The Mummy" Rachel Weisz
    http://ia.media-imdb.com/images/M/MV5BMjE0Mjg0NzE2Nl5BMl5BanBnXkFtZTcwMDE1MTkxMw@@._V1_SX214_AL_.jpg Actor, "Jarhead" Peter Sarsgaard
    http://ia.media-imdb.com/images/M/MV5BMTMyOTYzODQ5MF5BMl5BanBnXkFtZTcwMjE3MDgzMQ@@._V1_SX214_AL_.jpg Actress, "Blades of Glory" Jenna Fischer
    http://ia.media-imdb.com/images/M/MV5BMzE2OTAwNzM0Ml5BMl5BanBnXkFtZTcwNzE1MDg0Mw@@._V1_SX214_AL_.jpg Actress, "Tangled" Donna Murphy
    http://ia.media-imdb.com/images/M/MV5BMTI0OTMzMzE0N15BMl5BanBnXkFtZTcwMjI1MzYyMQ@@._V1_SX214_AL_.jpg Actor, "How the Grinch Stole Christmas" T.J. Thyne
    http://ia.media-imdb.com/images/M/MV5BNzczODkyNzY4OV5BMl5BanBnXkFtZTcwNTU0NjQzMQ@@._V1_SX214_AL_.jpg Actor, "Home Alone" John Heard
    http://ia.media-imdb.com/images/M/MV5BMTg4MjU2MzA2OV5BMl5BanBnXkFtZTgwOTIxMjc4MjE@._V1_SX214_AL_.jpg Actress, "Beerfest" Audrey Marie Anderson
    http://ia.media-imdb.com/images/M/MV5BMTQyOTc5NzA0M15BMl5BanBnXkFtZTYwODQ2MjYz._V1_SX214_AL_.jpg Producer, "Kick-Ass" Matthew Vaughn
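
    The image URL handling in the loop can be looked at in isolation. The following is a minimal sketch of that one step, assuming the `src` attribute contains the `._V1.` marker that the code splits on. The function name `poster_url` and the example URL are hypothetical and only serve as illustration.

```python
def poster_url(src, marker="._V1.", suffix="._V1_SX214_AL_.jpg"):
    """Keep the part of src before the first marker and append the suffix."""
    base = src.split(marker)[0]
    return base + suffix


# Hypothetical thumbnail URL, for illustration only.
thumb = "http://example.com/images/M/abc123._V1._SX40_CR0,0,40,54_.jpg"
print(poster_url(thumb))
# prints http://example.com/images/M/abc123._V1_SX214_AL_.jpg
```

    If the marker is missing from the URL, `split` returns the whole string, so the suffix is simply appended to the unchanged URL. Checking for the marker first may be worthwhile if the page markup changes.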
    
  • 2020-12-21 06:31

    I am working on the same assignment. The urllib library only loads the static content of a URL. Use Selenium to get the complete HTML, including the dynamically loaded content. If you use the urllib2 library, the HTML you get back contains only the loading placeholder:

    <span class="loading"></span>
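
    One way to confirm this before reaching for a browser is to check the downloaded HTML for that placeholder. The sketch below uses only the standard library and runs on two inline strings rather than a live request. The `rendered_html` markup is an assumption built from the `section.posters` and `a.poster` selectors in the other answer, and the names `PlaceholderFinder` and `has_loading_placeholder` are hypothetical.

```python
from html.parser import HTMLParser


class PlaceholderFinder(HTMLParser):
    """Records whether the markup contains <span class="loading">."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "loading" in classes:
            self.found = True


def has_loading_placeholder(html):
    """Return True if the HTML still holds the unrendered placeholder."""
    finder = PlaceholderFinder()
    finder.feed(html)
    return finder.found


# What a static fetch returns, per the snippet above (wrapper is assumed).
static_html = '<section class="posters"><span class="loading"></span></section>'
# Assumed shape of the page after JavaScript has run.
rendered_html = '<section class="posters"><a class="poster" href="#">...</a></section>'

print(has_loading_placeholder(static_html))    # prints True
print(has_loading_placeholder(rendered_html))  # prints False
```

    If the check returns True for the HTML you fetched, the content you want is loaded later by JavaScript, and a static fetch alone will not see it.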
    

    Hope it helps.
