How do I parse html without creating an object of internet explorer in vba?

后端未结

关注

 2  1134

I don\'t have internet explorer on any of the computers at work, therefore creating a object of internet explorer and using ie.navigate to parse the html and search for the tags

相关标签:

2条回答

梦如初夏

2021-01-21 18:23

Anyone who has done some web scraping will be familiar with creating an instance of Internet Explorer (IE) and the navigating to a web address and then once the page is ready start navigating the DOM using the 'Microsoft HTML Object Library' (MSHTML) type library. The question asks if IE is unavailable what to do. I am in the same situation for my box running Windows 10.

I had suspected it was possible to spin up an instance of MSHTML.HTMLDocument independent of IE but its creation is not obvious. Thanks to the questioner for asking this now. The answer lies in the MSHTML.IHTMLDocument4.createDocumentFromUrl method. One needs a local file to work (EDIT: actually one can put a webby url in as well!) with but we have a nice tidy Windows API function called URLDownloadToFile to download a file.

This codes runs on my Windows 10 box where Microsoft Edge is running and not Internet Explorer. This is an important find and thanks to the questioner for raising it.

Option Explicit

'* Tools->Refernces Microsoft HTML Object Library


'* MSDN - URLDownloadToFile function - https://msdn.microsoft.com/en-us/library/ms775123(v=vs.85).aspx
Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" _
        (ByVal pCaller As Long, ByVal szURL As String, ByVal szFileName As String, _
        ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long

Sub Test()

    Dim fso As Object
    Set fso = CreateObject("Scripting.FileSystemObject")

    Dim sLocalFilename As String
    sLocalFilename = Environ$("TMP") & "\urlmon.html"

    Dim sURL As String
    sURL = "https://stackoverflow.com/users/3607273/s-meaden"


    Dim bOk As Boolean
    bOk = (URLDownloadToFile(0, sURL, sLocalFilename, 0, 0) = 0)
    If bOk Then
        If fso.FileExists(sLocalFilename) Then

            '* Tools->Refernces Microsoft HTML Object Library
            Dim oHtml4 As MSHTML.IHTMLDocument4
            Set oHtml4 = New MSHTML.HTMLDocument

            Dim oHtml As MSHTML.HTMLDocument
            Set oHtml = Nothing

            '* IHTMLDocument4.createDocumentFromUrl
            '* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
            Set oHtml = oHtml4.createDocumentFromUrl(sLocalFilename, "")

            '* need to wait a little whilst the document parses
            '* because it is multithreaded
            While oHtml.readyState <> "complete"
                DoEvents  '* do not comment this out it is required to break into the code if in infinite loop
            Wend
            Debug.Assert oHtml.readyState = "complete"


            Dim sTest As String
            sTest = Left$(oHtml.body.outerHTML, 100)
            Debug.Assert Len(Trim(sTest)) > 50  '* just testing we got a substantial block of text, feel free to delete

            '* page specific logic goes here
            Dim htmlAnswers As Object 'MSHTML.DispHTMLElementCollection
            Set htmlAnswers = oHtml.getElementsByClassName("answer-hyperlink")

            Dim lAnswerLoop As Long
            For lAnswerLoop = 0 To htmlAnswers.Length - 1
                Dim vAnswerLoop
                Set vAnswerLoop = htmlAnswers.Item(lAnswerLoop)
                Debug.Print vAnswerLoop.outerText

            Next

        End If
    End If
End Sub

Thanks for asking this.

P.S. I have used TaskList to verify that IExplore.exe is not created under the hoods when this code runs.

P.P.S If you liked this then see more at my Excel Development Platform blog

0 讨论(0)

囚心锁ツ

2021-01-21 18:27

You could use XMLHTTP to retrieve the HTML source of a web page:

Function GetHTML(url As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .Send
        GetHTML = .ResponseText
    End With
End Function

I wouldn't suggest using this as a worksheet function, or else the site URL will be re-queried every time the worksheet recalculates. Some sites have logic in place to detect scraping via frequent, repeated calls, and your IP could become banned, temporarily or permanently, depending on the site.

Once you have the source HTML string (preferably stored in a variable to avoid unnecessary repeat calls), you can use basic text functions to parse the string to search for your tag.

This basic function will return the value between the <tag> and </tag>:

Public Function getTag(url As String, tag As String, Optional occurNum As Integer) As String
    Dim html As String, pStart As Long, pEnd As Long, o As Integer
    html = GetHTML(url)

    'remove <> if they exist so we can add our own
    If Left(tag, 1) = "<" And Right(tag, 1) = ">" Then
        tag = Left(Right(tag, Len(tag) - 1), Len(Right(tag, Len(tag) - 1)) - 1)
    End If

    ' default to Occurrence #1
    If occurNum = 0 Then occurNum = 1
    pEnd = 1

    For o = 1 To occurNum
        ' find start <tag> beginning at 1 (or after previous Occurence)
        pStart = InStr(pEnd, html, "<" & tag & ">", vbTextCompare)
        If pStart = 0 Then
            getTag = "{Not Found}"
            Exit Function
        End If
        pStart = pStart + Len("<" & tag & ">")

        ' find first end </tag> after start <tag>
        pEnd = InStr(pStart, html, "</" & tag & ">", vbTextCompare)
    Next o

    'return string between start <tag> & end </tag>
    getTag = Mid(html, pStart, pEnd - pStart)
End Function

This will find only basic <tag>'s but you could add/remove/change the text functions to suit your needs.

Example Usage:

Sub findTagExample()

    Const testURL = "https://en.wikipedia.org/wiki/Web_scraping"

    'search for 2nd occurence of tag: <h2> which is "Contents" :
    Debug.Print getTag(testURL, "<h2>", 2)

    '...this returns the 8th occurence, "Navigation Menu" :
    Debug.Print getTag(testURL, "<h2>", 8)

    '...and this returns an HTML <span> containing a title for the 'Legal Issues' section:
    Debug.Print getTag("https://en.wikipedia.org/wiki/Web_scraping", "<h2>", 4)

End Sub

0 讨论(0)