I can't get the whole source code of an HTML page

前端未结

关注

 2  954

清酒与你

Using Python, I want to crawl data on a web page whose source if quite big (it is a Facebook page of some user).

Say the URL is the URL I am trying to crawl. I run the

相关标签:

2条回答

慢半拍i

2021-01-22 20:03

Facebook is heavily Javascript orientated. The page source you see in the browser is the DOM after after any JS code has run (and the page source will frequently be changing anyway). You may have to automate a browser (using Selenium), or try other tools such as mechanize... Or look into a proper FB app and use the FB API.

0 讨论(0)
发布评论:

提交评论
- 加载中...

误落风尘

2021-01-22 20:08

This page may execute some javascript and javascript generates some content.
Try Twill.
It based on Mechanize, but executes javascript.
Sample in Python:

from twill.commands import *
go("http://google.com/")
fv("f", "q", "test")
submit("btnG")
info() #shows page info
show() #shows html

Another option is to use Zombie.js on Node.js.
This library works even better then Twill and it is browserless solution.
Sample in Coffeescript:

zombie = require "zombie"
browser = new zombie()
browser.visit "https://www.google.ru/", =>
    browser.fill "q", "node.js"
    browser.pressButton "Поиск в Google", ->
        for item in browser.queryAll "h3.r a"
            console.log item.innerHTML

0 讨论(0)