Can you load a web page in c++, including JS and dynamic html and get the rendered DOM string?

Posted by 别说谁变了你拦得住时间么 on 2021-01-27 18:49:17

Question


Is it possible to load a web page in C++ and get the rendered DOM? Not just the HTTP response, but the DOM as it exists after JavaScript has run (perhaps after letting it run for some amount of time), including any dynamic HTML that has changed over time. Is there a library for this?

Or, if not C++, do you know of any other language in which this can be done?

Edit: here's an example to better illustrate why one might want to do this:

Imagine you want to crawl a website written in Angular. You can't just make an HTTP request and use the response, because most of the DOM is built afterwards by JavaScript/dynamic HTML. The initial HTTP response for an Angular site probably doesn't contain all of the content; it is requested and rendered later through JavaScript/AJAX/dynamic HTML.


Answer 1:


Since each browser implements the DOM differently, how you access it from C++ depends on the browser.

I'll give an example for IE. You can use the WebBrowser ActiveX control which exposes the IWebBrowser2 interface. From there you can call IWebBrowser2::get_Document to get an IHTMLDocument2 object, which is the root of the DOM.

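#include "StdAfx.h"

// Headers the example relies on (they may already be in your precompiled header)
#include <atlbase.h>   // CComPtr, CComBSTR, CAtlException
#include <exdisp.h>    // IWebBrowser2, CLSID_InternetExplorer
#include <mshtml.h>    // IHTMLDocument2
#include <iostream>    // wcout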

using namespace ATL;
using namespace std;

void ThrowIfFailed(HRESULT hr)
{
    if (FAILED(hr))
        throw CAtlException(hr);
}

int main()
{
    ::CoInitialize(nullptr);

    try
    {
        CComPtr<IWebBrowser2> pWebBrowser;
        HRESULT hr = ::CoCreateInstance(CLSID_InternetExplorer, nullptr, CLSCTX_LOCAL_SERVER, IID_PPV_ARGS(&pWebBrowser));
        ThrowIfFailed(hr);

        hr = pWebBrowser->put_Visible(VARIANT_TRUE);
        ThrowIfFailed(hr);

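        hr = pWebBrowser->GoHome();
        ThrowIfFailed(hr);

        // GoHome returns before the page has loaded, so poll until the
        // browser reports that the document is complete.
        READYSTATE state = READYSTATE_UNINITIALIZED;
        do
        {
            ::Sleep(100);
            hr = pWebBrowser->get_ReadyState(&state);
            ThrowIfFailed(hr);
        } while (state != READYSTATE_COMPLETE);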

        CComPtr<IDispatch> pDispatch;
        hr = pWebBrowser->get_Document(&pDispatch);
        ThrowIfFailed(hr);

        CComPtr<IHTMLDocument2> pDocument;
        hr = pDispatch->QueryInterface(&pDocument);
        ThrowIfFailed(hr);

        CComBSTR bstrTitle;
        hr = pDocument->get_title(&bstrTitle);
        ThrowIfFailed(hr);

        wcout << bstrTitle.m_str << endl;
    }
    catch (const CAtlException& e)
    {
        wcout << L"Error (" << hex << e.m_hr << L")" << endl;
    }

    ::CoUninitialize();
    return 0;
}

This code opens an IE window, navigates to the home page, and writes the page title to the console. To keep the IE window hidden, remove the call to IWebBrowser2::put_Visible.




Answer 2:


As I understand it, you are asking: "How can I manipulate the DOM of an already rendered HTML page from C++?"

If that's what you wanted to ask, here is my answer:

  • Technically, you can do it from C++, but you need the right tool, library, or framework.

  • Normally, the DOM is manipulated with JavaScript.

  • In my experience, mobile platforms provide a built-in control for loading pages, usually called a "webview". Android (Java) and iOS (Objective-C) both have one. Developers then manipulate the DOM in this manner: webview.evaluateScript("your javascript").

  • If you want to do it with C++, I think these links are worth reading:

How to embed WebKit into my C/C++/Win32 application?

How do I embed WebKit in a window?
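
If embedding a browser engine is more than you need, one alternative is to let an installed browser do the rendering and capture its output. This is a different technique from the ones in the answers above. The sketch below assumes a POSIX system (it uses popen) and a Chromium-based browser on the PATH; the --headless and --dump-dom switches are Chrome/Chromium command-line options, while the executable name and the helper names ShellQuote, BuildDumpDomCommand and RunAndCapture are made up for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Wrap a string in single quotes for a POSIX shell so that characters such
// as '&' or '?' in a URL are passed through literally.
std::string ShellQuote(const std::string& text)
{
    std::string quoted = "'";
    for (char c : text)
    {
        if (c == '\'')
            quoted += "'\\''";  // close the quote, add an escaped quote, reopen
        else
            quoted += c;
    }
    quoted += "'";
    return quoted;
}

// Build the command line that asks the browser to print the serialized DOM.
// `browser` is an assumed executable name (for example "chromium" or
// "google-chrome"); adjust it for your system.
std::string BuildDumpDomCommand(const std::string& browser, const std::string& url)
{
    return ShellQuote(browser) + " --headless --dump-dom " + ShellQuote(url);
}

// Run a command through the shell and return everything it writes to
// standard output.
std::string RunAndCapture(const std::string& command)
{
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe)
        throw std::runtime_error("popen failed");

    std::string output;
    std::array<char, 4096> buffer;
    std::size_t n;
    while ((n = std::fread(buffer.data(), 1, buffer.size(), pipe)) > 0)
        output.append(buffer.data(), n);

    pclose(pipe);
    return output;
}
```

A call such as RunAndCapture(BuildDumpDomCommand("chromium", "https://example.com/")) would then return whatever markup the browser prints. How long the browser lets scripts run before dumping the DOM is up to the browser, so pages that keep loading content for a long time may need a different approach.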



Source: https://stackoverflow.com/questions/39340643/can-you-load-a-web-page-in-c-including-js-and-dynamic-html-and-get-the-render
