pull data from website using VBA excel multiple classname

问题

I know this has been asked many times, but haven't seen a clear answer for looping thru a div and findind tags with the same classname.

My first question:

If I have something like this:

<div id="carousel">
   <div id="images">

       <div class="imageElement">
          <img src="img/image1.jpg">
       </div>

       <div class="imageElement">
          <img src="img/image2.jpg">
       </div>

       <div class="imageElement">
           <img src="img/image3.jpg">
       </div>

   </div>

</div>

So I want to get all the img Src in the div "images" along with other stuff in the imageElement classnames and copy them to some cells in excel.

Second question: I've seen two ways in pulling web content with VBA, one using IE and another code using something but a browser.

Private Sub pullData_Click()

    Dim x As Long, y As Long
    Dim htm As Object

    Set htm = CreateObject("htmlFile")

    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "http://website.html", False
        .send
        htm.body.innerHTML = .responsetext
    End With

End Sub

And second way:

Set ie = New InternetExplorer
    With ie
        .navigate "http://eoddata.com/stockquote/NASDAQ/AAPL.htm"
        .Visible = False
        While .Busy Or .readyState <> READYSTATE_COMPLETE
           DoEvents
        Wend
        Set objHTML = .document
        DoEvents
    End With
    Set elementONE = objHTML.getElementsByTagName("TD")
    For i = 1 To elementONE.Length
        elementTWO = elementONE.Item(i).innerText           
        If elementTWO = "08/10/12" Then
            MsgBox (elementONE.Item(i + 1).innerText)
            Exit For
        End If
    Next i
    DoEvents
    ie.Quit
    DoEvents
    Set ie = Nothing

Which one is better and why?

So if you can help me I'd appreciate.

Thank you in advance.

回答1:

Your first option is usually preferable since it is much faster than the second method, it sends a request directly to the web server and returns the response. This is much more efficient than automating Internet Explorer (the second option); automating IE is very slow, since you are effectively just browsing the site - it will inevitably result in more downloads as it must load all the resources in the page - images, scripts, css files etc. It will also run any Javascript on the page - all of this is usually not useful and you have to wait for it to finish before parsing the page.

This however is a bit of a double edged sword - whilst much slower, if you are not familiar with html requests, automating Internet Explorer is substantially easier than the first method, especially when elements are generated dynamically or the page has a reliance on AJAX. It is also easier to automate IE when you need to access data in a site that requires you to log in since it will handle the relevant cookies for you. This is not to say that web scraping cannot be done with the first method, rather than it requires a deeper understanding of web technologies and the architecture of the site.

A better option to the first method would be to use a different object to handle the request and response, using the WinHTTP library offers more resilience than the MSXML library and will generally handle any cookies automatically as well.

As for parsing the data, in your first approach you have used late binding to create the HTML Object (htmlfile), whilst this reduces the need for a reference, it also reduces functionality. For example, when using late binding, you are missing out on the features added if the user has IE9 installed, specifically in this case the getElementsByClass name function.

As such a third option (and my preferred method):

Dim oHtml       As HTMLDocument
Dim oElement    As Object

Set oHtml = New HTMLDocument


With CreateObject("WINHTTP.WinHTTPRequest.5.1")
    .Open "GET", "http://www.someurl.com", False
    .send
    oHtml.body.innerHTML = .responseText
End With

For Each oElement In oHtml.getElementsByClassName("imageElement")
    Debug.Print oElement.Children(0).src
Next oElement

'IE 8 alternative
'For Each oElement In oHtml.getElementsByTagName("div")
'    If oElement.className = "imageElement" Then
'        Debug.Print oElement.Children(0).src
'    End If
'Next oElement

This will require a reference setting to the Microsoft HTML Object Library - it will fail if the user does not have IE9 installed, but this can be handled and is becoming increasingly less relevant

回答2:

To print elements to cells replace:

For Each oElement In oHtml.getElementsByClassName("imageElement")
    Debug.Print oElement.Children(0).src
Next oElement

With:

Dim wsTarget as Worksheet
dim i as Integer
i=1
set wsTarget=activeworkbook.worksheets("SomeSheet")

For Each oElement In oHtml.getElementsByClassName("imageElement")
    wstarget.range("A" & i)=oElement.Children(0).src
    i=i+1
Next

'Corrected the syntax error on For

回答3:

CSS selector:

You could also use a CSS selector of #images img[src^='img/'].

This says elements with id of images that contain tagname img with attribute src having value starting with 'img/'.

The # is for id; [] for attribute; ^ for starts with; #images img, img within images.

CSS query:

As more than one element will be matched you would use the .querySelectorAll method of document and then loop the length of the returned nodeList.

VBA Code:

Option Explicit
Public Sub test()
    Dim html As HTMLDocument
    Set html = New HTMLDocument

    With CreateObject("WINHTTP.WinHTTPRequest.5.1")
        .Open "GET", "http://www.someurl.com", False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim aNodeList As Object, iItem As Long
    Set aNodeList = html.querySelectorAll("#images img[src^='img/']")
    With ActiveSheet
        For iItem = 0 To aNodeList.Length - 1
            .Cells(iItem + 1, 1) = aNodeList.item(iItem).innerText
            '.Cells(iItem + 1, 1) = aNodeList(iItem).innerText '<== or potentially this syntax
        Next iItem
    End With
End Sub

来源：https://stackoverflow.com/questions/18709911/pull-data-from-website-using-vba-excel-multiple-classname

标签

excel

vba

excel-vba

excel-2010

getelementsbyclassname