Python Web Crawlers and “getting” html source code

不知归路 2020-12-24 13:53

So my brother wanted me to write a web crawler in Python (self-taught) and I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library, but I

4 Answers
  • 2020-12-24 14:11

    If you are using Python 3.x, you don't need to install any libraries; this is built into the Python standard library. Python 2's urllib2 module was split across urllib.request and urllib.error in Python 3:

    from urllib import request
    
    response = request.urlopen("https://www.google.com")
    # set the correct charset below
    page_source = response.read().decode('utf-8')
    print(page_source)
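Regarding the "set the correct charset" comment above: here is a rough sketch of one way to pick the charset from the server's Content-Type header instead of hard-coding it. The helper names `pick_charset` and `fetch_text` are made up for illustration, and falling back to UTF-8 when the server declares nothing is an assumption, not a rule.

```python
from email.message import Message
from urllib import request


def pick_charset(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or the default."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter, or None
    return msg.get_content_charset() or default


def fetch_text(url):
    """Download url and decode the body with the server-declared charset."""
    with request.urlopen(url) as response:
        charset = pick_charset(response.headers.get("Content-Type"))
        # errors="replace" keeps going if the declared charset is wrong
        return response.read().decode(charset, errors="replace")


# Example call (needs network access):
# print(fetch_text("https://www.google.com")[:200])
```

Pages that only declare their encoding in a `<meta>` tag are not covered by this sketch.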
    
  • 2020-12-24 14:28

    Use Python 2.7, it has more third-party libraries at the moment. (Edit: see below.)

    I recommend using the stdlib module urllib2; it lets you fetch web resources comfortably. Example:

    import urllib2
    
    response = urllib2.urlopen("http://google.de")
    page_source = response.read()
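If the script has to run on both Python 2.7 and Python 3, one common pattern is to try the Python 3 import first and fall back to urllib2. This is only a sketch; `get_page_bytes` is a made-up helper name, not something from either library.

```python
try:
    # Python 3: urlopen lives in urllib.request
    from urllib.request import urlopen
except ImportError:
    # Python 2: fall back to the old urllib2 module
    from urllib2 import urlopen


def get_page_bytes(url):
    """Return the raw response body; the caller decides how to decode it."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()


# Example call (needs network access):
# page_source = get_page_bytes("http://google.de")
```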
    

    For parsing the code, have a look at BeautifulSoup.

    BTW, what exactly do you want to do? You wrote:

    "Just for background, I need to download a page and replace any img with ones I have"

    Edit: It's 2014 now, most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library which is easier to use than urllib2.
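For the "replace any img" part, here is a minimal sketch that uses only the standard library's html.parser rather than BeautifulSoup, so it runs without installing anything. It assumes Python 3, and the names `ImgSrcRewriter`, `replace_img_sources`, and `mine.png` are invented for the example. It re-emits the page as parsed and swaps the `src` of every `<img>` tag.

```python
from html import escape
from html.parser import HTMLParser


class ImgSrcRewriter(HTMLParser):
    """Re-emit the HTML it is fed, replacing the src of every <img> tag."""

    def __init__(self, new_src):
        # convert_charrefs=False so entities like &amp; are passed through as-is
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.out = []

    def _img_tag(self, attrs, closing):
        parts = ["img"]
        for name, value in attrs:
            if name == "src":
                value = self.new_src
            if value is None:
                parts.append(name)  # attribute without a value
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.out.append(self._img_tag(attrs, ">"))
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag == "img":
            self.out.append(self._img_tag(attrs, " />"))
        else:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def replace_img_sources(page_source, new_src):
    """Return page_source with every <img> src pointing at new_src."""
    parser = ImgSrcRewriter(new_src)
    parser.feed(page_source)
    parser.close()
    return "".join(parser.out)


sample = '<p>Hi &amp; bye <img src="a.png" alt="x"><br/></p>'
print(replace_img_sources(sample, "mine.png"))
```

BeautifulSoup would make this shorter and copes better with broken markup; the stdlib version is used here only to keep the example self-contained.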

  • 2020-12-24 14:30

    An example with Python 3 and the requests library, as mentioned by @leoluk:

    pip install requests
    

    Script req.py:

    import requests
    
    url = 'http://localhost'
    
    # in case you need a session
    cd = {'sessionid': '123..'}
    
    r = requests.get(url, cookies=cd)
    # or without a session: r = requests.get(url)
    
    # r.content holds the raw bytes; r.text is the decoded HTML
    print(r.text)
    

    Now execute it and you will get the HTML source of localhost:

    python3 req.py

  • 2020-12-24 14:30

    The first thing you need to do is read the HTTP spec, which explains what you can expect to receive over the wire. The data returned inside the content will be the "rendered" web page, not the source. The source could be a JSP, a servlet, a CGI script, in short, just about anything, and you have no access to that. You only get the HTML that the server sent you. For a static HTML page, yes, you will be seeing the "source". For anything else you see the generated HTML, not the source.
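To make the point about generated HTML concrete, here is a small self-contained sketch (Python 3, standard library only). It starts a throwaway local server whose handler builds the page in code, then fetches it. The client receives a status, headers, and the HTML that was produced; it never sees the handler code. The class and function names are invented for the demo.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class GeneratedPage(BaseHTTPRequestHandler):
    """Server-side "source": builds the HTML on every request."""

    def do_GET(self):
        items = "".join("<li>%d</li>" % (n * n) for n in range(3))
        body = ("<ul>%s</ul>" % items).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_from_local_server():
    """Start the demo server on a free port, fetch "/", and shut it down."""
    server = HTTPServer(("127.0.0.1", 0), GeneratedPage)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_port
        # An empty ProxyHandler makes sure the request goes straight to localhost
        opener = request.build_opener(request.ProxyHandler({}))
        with opener.open(url) as response:
            content_type = response.headers.get("Content-Type")
            return response.status, content_type, response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


status, content_type, html = fetch_from_local_server()
print(status, content_type)
print(html)
```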

    When you say "modify the page and return the modified page", what do you mean?
