How would I go about parsing the "href" attributes of all the "a" HTML tags on a page full of BAD HTML, in Qt?
I would use the built-in QtWebKit module. I don't know how it performs, but since it is a full browser engine it should cope with all the "bad" HTML. Something like:
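#include <QObject>
#include <QUrl>
#include <QWebView>
#include <QWebPage>
#include <QWebFrame>
#include <QWebElement>

class MyPageLoader : public QObject
{
    Q_OBJECT
public:
    MyPageLoader();
    void loadPage(const QUrl&);
public slots:
    void replyFinished(bool);
private:
    QWebView* m_view;
};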
MyPageLoader::MyPageLoader()
{
    m_view = new QWebView();
    connect(m_view, SIGNAL(loadFinished(bool)),
            this, SLOT(replyFinished(bool)));
}
void MyPageLoader::loadPage(const QUrl& url)
{
    m_view->load(url);
}
void MyPageLoader::replyFinished(bool ok)
{
    if (!ok)
        return; // the page failed to load
    QWebElementCollection elements = m_view->page()->mainFrame()->findAllElements("a");
    foreach (QWebElement e, elements) {
        // Process element e, e.g. read e.attribute("href")
    }
}
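To use the class:
MyPageLoader loader;
loader.loadPage(QUrl("http://www.example.com"));
The page loads asynchronously, so the event loop has to be running; once replyFinished() has been called you can do whatever you like with the collection.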
This question is already quite old. Nevertheless, I hope this will help someone:
I wrote two small classes for Qt and published them on SourceForge. They let you access an HTML file in much the same way you are used to with XML.
You can find the project here:
http://sourceforge.net/projects/sgml-for-qt/
The project's wiki contains the help pages.
Drewle