How to crawl a website and extract data into a database with Python?

Backend · Open · 4 answers · 1320 views
心在旅途 · 2021-01-31 00:22

I'd like to build a webapp to help other students at my university create their schedules. To do that I need to crawl the master schedules (one huge html page) as well as a lin

4 Answers
  • 2021-01-31 01:09

    I liked using BeautifulSoup for extracting HTML data.

    It's as easy as this:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup  # package is "beautifulsoup4" on PyPI
    
    ur = urlopen("http://pragprog.com/podcasts/feed.rss")
    soup = BeautifulSoup(ur.read(), "html.parser")
    items = soup.find_all('item')
    
    urls = [item.enclosure['url'] for item in items]
    
  • 2021-01-31 01:10

    For this purpose there is a very useful tool called Web-Harvest; their website is http://web-harvest.sourceforge.net/. I use it to crawl web pages.

  • 2021-01-31 01:13
    • requests for downloading the pages.
      • Here's an example of how to login to a website and download pages: https://stackoverflow.com/a/8316989/311220
    • lxml for scraping the data.

    If you want a more powerful scraping framework, there's Scrapy. It has good documentation too, though it may be overkill depending on your task.
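
    As a concrete illustration of the requests + lxml combination above, here is a minimal sketch. The schedule snippet, the `schedule` table id, and the two-column layout are all hypothetical stand-ins for the real master-schedule page; in practice you would feed `extract_courses` the result of `requests.get(url).text`.

```python
from lxml import html

# Hypothetical HTML fragment standing in for the university's master schedule.
SAMPLE = """
<table id="schedule">
  <tr><td>CS101</td><td>Mon 09:00</td></tr>
  <tr><td>MATH201</td><td>Tue 11:00</td></tr>
</table>
"""

def extract_courses(page_source):
    """Return (course code, time) pairs from the schedule table."""
    doc = html.fromstring(page_source)
    rows = doc.xpath('//table[@id="schedule"]//tr')
    return [(row.xpath('td[1]/text()')[0], row.xpath('td[2]/text()')[0])
            for row in rows]

print(extract_courses(SAMPLE))
```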

  • 2021-01-31 01:19

    Scrapy is probably the best Python library for crawling. It can maintain state for authenticated sessions.
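
    Whichever library does the crawling, the extracted rows still have to land in a database, as the question asks. A minimal sketch using the stdlib sqlite3 module; the `courses` table and its columns are made up for illustration, and the rows would really come out of your scraper:

```python
import sqlite3

# Hypothetical scraped rows -- in a real app these come from your crawler.
courses = [("CS101", "Mon 09:00"), ("MATH201", "Tue 11:00")]

conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
conn.execute("CREATE TABLE courses (code TEXT, time TEXT)")
conn.executemany("INSERT INTO courses VALUES (?, ?)", courses)
conn.commit()

stored = conn.execute("SELECT code FROM courses ORDER BY code").fetchall()
print(stored)  # [('CS101',), ('MATH201',)]
```

    SQLite is a good fit for a small student-schedule webapp because it needs no server; swapping in PostgreSQL or MySQL later only changes the connection call and placeholders.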

    Binary data should be handled separately: each file type needs its own logic, but for almost any format you'll probably find a library. For instance, take a look at PyPDF for handling PDFs, or xlrd for Excel files.
