pull data from website using VBA excel multiple classname

南笙酒味 提交于 2019-11-29 00:38:35

Your first option is usually preferable since it is much faster than the second method, it sends a request directly to the web server and returns the response. This is much more efficient than automating Internet Explorer (the second option); automating IE is very slow, since you are effectively just browsing the site - it will inevitably result in more downloads as it must load all the resources in the page - images, scripts, css files etc. It will also run any Javascript on the page - all of this is usually not useful and you have to wait for it to finish before parsing the page.

This however is a bit of a double edged sword - whilst much slower, if you are not familiar with html requests, automating Internet Explorer is substantially easier than the first method, especially when elements are generated dynamically or the page has a reliance on AJAX. It is also easier to automate IE when you need to access data in a site that requires you to log in since it will handle the relevant cookies for you. This is not to say that web scraping cannot be done with the first method, rather than it requires a deeper understanding of web technologies and the architecture of the site.

A better option to the first method would be to use a different object to handle the request and response, using the WinHTTP library offers more resilience than the MSXML library and will generally handle any cookies automatically as well.

As for parsing the data, in your first approach you have used late binding to create the HTML Object (htmlfile), whilst this reduces the need for a reference, it also reduces functionality. For example, when using late binding, you are missing out on the features added if the user has IE9 installed, specifically in this case the getElementsByClass name function.

As such a third option (and my preferred method):

Dim oHtml       As HTMLDocument
Dim oElement    As Object

Set oHtml = New HTMLDocument


With CreateObject("WINHTTP.WinHTTPRequest.5.1")
    .Open "GET", "http://www.someurl.com", False
    .send
    oHtml.body.innerHTML = .responseText
End With

For Each oElement In oHtml.getElementsByClassName("imageElement")
    Debug.Print oElement.Children(0).src
Next oElement

'IE 8 alternative
'For Each oElement In oHtml.getElementsByTagName("div")
'    If oElement.className = "imageElement" Then
'        Debug.Print oElement.Children(0).src
'    End If
'Next oElement

This will require a reference setting to the Microsoft HTML Object Library - it will fail if the user does not have IE9 installed, but this can be handled and is becoming increasingly less relevant

JS20'07'11

To print elements to cells replace:

For Each oElement In oHtml.getElementsByClassName("imageElement")
    Debug.Print oElement.Children(0).src
Next oElement

With:

Dim wsTarget as Worksheet
dim i as Integer
i=1
set wsTarget=activeworkbook.worksheets("SomeSheet")

For Each oElement In oHtml.getElementsByClassName("imageElement")
    wstarget.range("A" & i)=oElement.Children(0).src
    i=i+1
Next

'Corrected the syntax error on For

CSS selector:

You could also use a CSS selector of #images img[src^='img/'].

This says elements with id of images that contain tagname img with attribute src having value starting with 'img/'.

The # is for id; [] for attribute; ^ for starts with; #images img, img within images.


CSS query:


As more than one element will be matched you would use the .querySelectorAll method of document and then loop the length of the returned nodeList.

VBA Code:

Option Explicit
Public Sub test()
    Dim html As HTMLDocument
    Set html = New HTMLDocument

    With CreateObject("WINHTTP.WinHTTPRequest.5.1")
        .Open "GET", "http://www.someurl.com", False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim aNodeList As Object, iItem As Long
    Set aNodeList = html.querySelectorAll("#images img[src^='img/']")
    With ActiveSheet
        For iItem = 0 To aNodeList.Length - 1
            .Cells(iItem + 1, 1) = aNodeList.item(iItem).innerText
            '.Cells(iItem + 1, 1) = aNodeList(iItem).innerText '<== or potentially this syntax
        Next iItem
    End With
End Sub
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!