How to get the html dom of a webpage and its frames

*爱你&永不变心* submitted on 2019-12-11 10:04:08

Question


I would like to get the DOM of a website after JavaScript execution. I would also like to get all the content of the iframes in the website, similar to what I see in Google Chrome's Inspect Element feature.

This is my code:

import sys
from PyQt4 import QtGui, QtCore, QtWebKit

class Sp():
  def save(self):
    print ("call")
    data = self.webView.page().currentFrame().documentElement().toInnerXml()
    print(data.encode('utf-8'))
    print ('finished')
  def main(self):
    self.webView = QtWebKit.QWebView()
    self.webView.load(QtCore.QUrl("http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_iframe_scrolling"))
    QtCore.QObject.connect(self.webView,QtCore.SIGNAL("loadFinished(bool)"),self.save)

app = QtGui.QApplication(sys.argv)
s = Sp()
s.main()
sys.exit(app.exec_())

This gives me the HTML of the website, but not the HTML inside the iframes. Is there any way that I could get the HTML of the iframes?


Answer 1:


This is a very hard problem to solve in general.

The main difficulty is that there is no way to know in advance how many frames each page has. And in addition to that, each child-frame may have its own set of frames, the number of which is also unknown. In theory, there could be an infinite number of nested frames, and the page will never finish loading (which seems no exaggeration for sites that have a lot of ads).

Anyway, below is a version of your script which gets the top-level QWebFrame object of each frame as it loads, and shows how you can access some of the things you are interested in. As you will see from the output, there are a lot of "junk" frames inserted by ads and such like that you will somehow need to filter out.

import sys, signal
from PyQt4 import QtGui, QtCore, QtWebKit

class Sp():
    def save(self, ok, frame=None):
        # Called with frame=None for the main frame, or with the child
        # QWebFrame that has just finished loading.
        if frame is None:
            print('main-frame')
            frame = self.webView.page().mainFrame()
        else:
            print('child-frame')
        print('URL: %s' % frame.baseUrl().toString())
        print('METADATA: %s' % frame.metaData())
        print('TAG: %s' % frame.documentElement().tagName())
        print()

    def handleFrameCreated(self, frame):
        # Give each newly created frame its own loadFinished handler.
        frame.loadFinished.connect(lambda: self.save(True, frame=frame))

    def main(self):
        self.webView = QtWebKit.QWebView()
        # Connect the signals before loading so that no frame is missed.
        self.webView.page().frameCreated.connect(self.handleFrameCreated)
        self.webView.page().mainFrame().loadFinished.connect(self.save)
        self.webView.load(QtCore.QUrl("http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_iframe_scrolling"))

signal.signal(signal.SIGINT, signal.SIG_DFL)
print('Press Ctrl+C to quit\n')
app = QtGui.QApplication(sys.argv)
s = Sp()
s.main()
sys.exit(app.exec_())

NB: it is important that you connect to the loadFinished signal of the main frame rather than the web-view. If you connect to the latter, it will be called multiple times if the page contains more than one frame.
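
If what you want is the HTML of each frame rather than just its URL and tag name, one possible approach is to walk the frame tree recursively once a frame has finished loading. The sketch below assumes each object behaves like a QWebFrame, i.e. it provides baseUrl(), toHtml() and childFrames(). The helper name collect_frames is made up for illustration and is not part of PyQt4. It only sees frames that exist at the moment it is called, so frames added later by scripts will be missing.

```python
def collect_frames(frame, depth=0):
    """Return a list of (depth, url, html) tuples for `frame` and all
    frames nested inside it, in depth-first order.

    `frame` is assumed to expose the QWebFrame methods baseUrl(),
    toHtml() and childFrames(); `collect_frames` itself is a
    hypothetical helper, not a PyQt4 function.
    """
    url = str(frame.baseUrl().toString())
    results = [(depth, url, frame.toHtml())]
    # Each child frame may contain frames of its own, so recurse.
    for child in frame.childFrames():
        results.extend(collect_frames(child, depth + 1))
    return results
```

It could be called from the save method above, for example as collect_frames(self.webView.page().mainFrame()), and the resulting list could then be filtered by URL to drop the "junk" frames.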



Source: https://stackoverflow.com/questions/35629073/how-to-get-the-html-dom-of-a-webpage-and-its-frames
