Question
I scrape some websites for fun, using VBA as my tool. I use XMLHTTP and HTMLDocument because they are faster than InternetExplorer.Application.
Public Sub XMLhtmlDocumentHTMLSourceScraper()
    Dim XMLHTTPReq As Object
    Dim htmlDoc As HTMLDocument
    Dim postURL As String
    postURL = "http://foodffs.tumblr.com/archive/2015/11"
    Set XMLHTTPReq = New MSXML2.XMLHTTP
    With XMLHTTPReq
        .Open "GET", postURL, False
        .Send
    End With
    Set htmlDoc = New HTMLDocument
    With htmlDoc
        .body.innerHTML = XMLHTTPReq.responseText
    End With
    i = 0
    Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")
    For Each vr In varTemp
        ' The next line is important for diagnosing this issue (*1)
        Cells(1, 1) = vr.outerHTML
        Set varTemp2 = vr.getElementsByTagName("SPAN class=post_date")
        Cells(i + 1, 3) = varTemp2.Item(0).innerText
        ' The next line raises error 438
        Set varTemp2 = vr.getElementsByClassName("hover_inner")
        Cells(i + 1, 4) = varTemp2.innerText
        i = i + 1
    Next vr
End Sub
I investigated the problem with the line marked *1. Cells(1, 1) shows the following:
<DIV class="post_glass post_micro_glass" title=""><A class=hover title="" href="http://foodffs.tumblr.com/post/134291668251/sugar-free-low-carb-coffee-ricotta-mousse-really" target=_blank>
<DIV class=hover_inner><SPAN class=post_date>...............
All of the inner class attributes have lost their double quotation marks; only the class of the first (outermost) element still has them. I really don't know why this happens.
I could parse with getElementsByTagName("span"), but I would prefer to select by class.
Answer 1:
The getElementsByClassName method is not treated as a method of the element itself, only of the parent HTMLDocument. If you want to use it to locate elements within a DIV element, you need to create a sub-HTMLDocument composed of the .outerHTML of that specific DIV element.
Public Sub XMLhtmlDocumentHTMLSourceScraper()
    Dim xmlHTTPReq As New MSXML2.XMLHTTP
    Dim htmlDOC As New HTMLDocument, divSUBDOC As New HTMLDocument
    Dim iDIV As Long, iSPN As Long, iEL As Long
    Dim postURL As String, nr As Long, i As Long
    postURL = "http://foodffs.tumblr.com/archive/2015/11"
    With xmlHTTPReq
        .Open "GET", postURL, False
        .Send
    End With
    'Set htmlDOC = New HTMLDocument
    With htmlDOC
        .body.innerHTML = xmlHTTPReq.responseText
    End With
    i = 0
    With htmlDOC
        For iDIV = 0 To .getElementsByClassName("post_glass post_micro_glass").Length - 1
            nr = Sheet1.Cells(Rows.Count, 3).End(xlUp).Offset(1, 0).Row
            With .getElementsByClassName("post_glass post_micro_glass")(iDIV)
                'method 1 - run through multiples in a collection
                For iSPN = 0 To .getElementsByTagName("span").Length - 1
                    With .getElementsByTagName("span")(iSPN)
                        Select Case LCase(.className)
                            Case "post_date"
                                Cells(nr, 3) = .innerText
                            Case "post_notes"
                                Cells(nr, 4) = .innerText
                            Case Else
                                'do nothing
                        End Select
                    End With
                Next iSPN
                'method 2 - create a sub-HTML doc to facilitate getting els by classname
                divSUBDOC.body.innerHTML = .outerHTML 'only the HTML from this DIV
                With divSUBDOC
                    If CBool(.getElementsByClassName("hover_inner").Length) Then 'there is at least 1
                        'use the first
                        Cells(nr, 5) = .getElementsByClassName("hover_inner")(0).innerText
                    End If
                End With
            End With
        Next iDIV
    End With
End Sub
While other .getElementsByXXXX can readily retrieve collections within another element, the getElementsByClassName method needs to consider what it believes to be the HTMLDocument as a whole, even if you have fooled it into thinking that.
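As a rough illustration of the same idea as method 1, the class check can be pulled into a small helper that walks an element's descendants with getElementsByTagName, which does work on individual elements. This is only a sketch: the name FirstDescendantByClass and its parameters are hypothetical and do not appear in the code above, and it assumes the element is passed late-bound as an Object.

```vba
' Hypothetical helper (not part of the answer's code): returns the first
' descendant of parentEl whose class list contains wantedClass, or Nothing.
Private Function FirstDescendantByClass(ByVal parentEl As Object, _
                                        ByVal wantedClass As String) As Object
    Dim el As Object
    ' "*" asks for every descendant element, whatever its tag
    For Each el In parentEl.getElementsByTagName("*")
        ' Pad both sides with spaces so "hover" does not match "hover_inner"
        If InStr(1, " " & el.className & " ", _
                 " " & wantedClass & " ", vbTextCompare) > 0 Then
            Set FirstDescendantByClass = el
            Exit Function
        End If
    Next el
    ' No match: the function result is left as Nothing
End Function

' Possible usage inside the loop over the post DIVs (hit declared As Object):
'     Set hit = FirstDescendantByClass(vr, "hover_inner")
'     If Not hit Is Nothing Then Cells(i + 1, 4) = hit.innerText
```

Compared with building a sub-HTMLDocument for every DIV, this avoids re-parsing the HTML, at the cost of a linear scan over the descendants.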
Answer 2:
Here's an alternative approach. It's very similar to the original code but uses querySelectorAll to select the relevant span elements. One important point for this method is that vr has to be declared as being a specific element type and not as an IHTMLElement or generic Object:
Option Explicit
Public Sub XMLhtmlDocumentHTMLSourceScraper()
    ' Changed from generic Object to specific type - not
    ' strictly necessary to do this
    Dim XMLHTTPReq As MSXML2.XMLHTTP60
    Dim htmlDoc As HTMLDocument
    ' These declarations weren't included in the original code
    Dim i As Integer
    Dim varTemp As Object
    ' IMPORTANT: vr must be declared as a specific element type and not
    ' as an IHTMLElement or generic Object
    Dim vr As HTMLDivElement
    Dim varTemp2 As Object
    Dim postURL As String
    postURL = "http://foodffs.tumblr.com/archive/2015/11"
    ' Changed from XMLHTTP to XMLHTTP60 as XMLHTTP is equivalent
    ' to the older XMLHTTP30
    Set XMLHTTPReq = New MSXML2.XMLHTTP60
    With XMLHTTPReq
        .Open "GET", postURL, False
        .Send
    End With
    Set htmlDoc = New HTMLDocument
    With htmlDoc
        .body.innerHTML = XMLHTTPReq.responseText
    End With
    i = 0
    Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")
    For Each vr In varTemp
        ''''the next line is important to solve this issue *1
        Cells(1, 1) = vr.outerHTML
        Set varTemp2 = vr.querySelectorAll("span.post_date")
        Cells(i + 1, 3) = varTemp2.Item(0).innerText
        Set varTemp2 = vr.getElementsByClassName("hover_inner")
        ' incorporating correction from Jeeped's comment (#56349646)
        Cells(i + 1, 4) = varTemp2.Item(0).innerText
        i = i + 1
    Next vr
End Sub
Notes:
- XMLHTTP is equivalent to XMLHTTP30, as described here
- The apparent need to declare a specific element type is explored in this question. Note, however, that unlike getElementsByClassName, querySelectorAll doesn't exist in any version of IHTMLElement
Source: https://stackoverflow.com/questions/34302502/vba-getelementsbyclassname-htmlsources-double-quotation-marks-are-gone