Parsing HTML with C++ (using Qt preferably)

Posted by 只愿长相守 on 2020-02-23 04:10:28

Question


I'm trying to parse some HTML with C++ to extract all URLs from it (the URLs can be inside href and src attributes).

I tried to use WebKit to do the heavy lifting, but when I load a frame from an HTML string the generated document is all wrong. (If I let WebKit fetch the page from the web, the generated document is fine, but WebKit then also downloads all images, styles, and scripts, which I don't want.)

Here is what I tried to do:

frame->setHtml(HTML);
QWebElement document = frame->documentElement();
QWebElementCollection links = document.findAll("a"); // Doesn't find all links
QWebElementCollection images = document.findAll("img"); // Doesn't find all images
QWebElementCollection scripts = document.findAll("script"); // Doesn't find all scripts
qDebug() << document.toInnerXml(); // Prints a completely messed-up document with several missing elements

What am I doing wrong? Is there an easy way to parse HTML with Qt (or with some other lightweight library)?


Answer 1:


You can always use XPath expressions to make your parsing life easier; take a look at this, for instance.

Or you can do something like this:

QWebView* view = new QWebView(parent);
view->load(QUrl("http://www.your_site.com"));
// load() is asynchronous: run the query only after loadFinished(bool) has been emitted
QWebElementCollection elements = view->page()->mainFrame()->findAllElements("a");
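
If a full browser engine is heavier than the task needs, the extraction the question asks for (the values of href and src attributes) can be roughly approximated with the C++ standard library alone. The following is only a minimal sketch under stated assumptions: extractUrls is a hypothetical helper name, and it uses a regular-expression scan rather than a real HTML parser, so it only sees quoted attribute values and will also match text inside comments, scripts, or attributes such as data-href.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the values of quoted href and src
// attributes found in an HTML string, in document order.
// Assumption: a regex scan is "good enough" here; it is not an HTML parser.
std::vector<std::string> extractUrls(const std::string& html)
{
    // Group 1: attribute name (href or src, case-insensitive).
    // Group 2: value in double quotes. Group 3: value in single quotes.
    static const std::regex attr(
        R"rx(\b(href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'))rx",
        std::regex::icase);

    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), attr), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        // Exactly one of the two value groups takes part in each match.
        urls.push_back(m[2].matched ? m[2].str() : m[3].str());
    }
    return urls;
}
```

Calling extractUrls on the HTML string from the question returns the attribute values exactly as written in the markup; relative URLs are not resolved against a base URL, and nothing is downloaded.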


Source: https://stackoverflow.com/questions/6086247/parsing-html-with-c-using-qt-preferably
