How do I parse html without creating an object of internet explorer in vba?

后端 未结 2 1132
我寻月下人不归
我寻月下人不归 2021-01-21 17:49

I don\'t have internet explorer on any of the computers at work, therefore creating a object of internet explorer and using ie.navigate to parse the html and search for the tags

相关标签:
2条回答
  • 2021-01-21 18:23

    Anyone who has done some web scraping will be familiar with creating an instance of Internet Explorer (IE) and the navigating to a web address and then once the page is ready start navigating the DOM using the 'Microsoft HTML Object Library' (MSHTML) type library. The question asks if IE is unavailable what to do. I am in the same situation for my box running Windows 10.

    I had suspected it was possible to spin up an instance of MSHTML.HTMLDocument independent of IE but its creation is not obvious. Thanks to the questioner for asking this now. The answer lies in the MSHTML.IHTMLDocument4.createDocumentFromUrl method. One needs a local file to work (EDIT: actually one can put a webby url in as well!) with but we have a nice tidy Windows API function called URLDownloadToFile to download a file.

    This codes runs on my Windows 10 box where Microsoft Edge is running and not Internet Explorer. This is an important find and thanks to the questioner for raising it.

    Option Explicit
    
    '* Tools->Refernces Microsoft HTML Object Library
    
    
    '* MSDN - URLDownloadToFile function - https://msdn.microsoft.com/en-us/library/ms775123(v=vs.85).aspx
    Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" _
            (ByVal pCaller As Long, ByVal szURL As String, ByVal szFileName As String, _
            ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
    
    Sub Test()
    
        Dim fso As Object
        Set fso = CreateObject("Scripting.FileSystemObject")
    
        Dim sLocalFilename As String
        sLocalFilename = Environ$("TMP") & "\urlmon.html"
    
        Dim sURL As String
        sURL = "https://stackoverflow.com/users/3607273/s-meaden"
    
    
        Dim bOk As Boolean
        bOk = (URLDownloadToFile(0, sURL, sLocalFilename, 0, 0) = 0)
        If bOk Then
            If fso.FileExists(sLocalFilename) Then
    
                '* Tools->Refernces Microsoft HTML Object Library
                Dim oHtml4 As MSHTML.IHTMLDocument4
                Set oHtml4 = New MSHTML.HTMLDocument
    
                Dim oHtml As MSHTML.HTMLDocument
                Set oHtml = Nothing
    
                '* IHTMLDocument4.createDocumentFromUrl
                '* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
                Set oHtml = oHtml4.createDocumentFromUrl(sLocalFilename, "")
    
                '* need to wait a little whilst the document parses
                '* because it is multithreaded
                While oHtml.readyState <> "complete"
                    DoEvents  '* do not comment this out it is required to break into the code if in infinite loop
                Wend
                Debug.Assert oHtml.readyState = "complete"
    
    
                Dim sTest As String
                sTest = Left$(oHtml.body.outerHTML, 100)
                Debug.Assert Len(Trim(sTest)) > 50  '* just testing we got a substantial block of text, feel free to delete
    
                '* page specific logic goes here
                Dim htmlAnswers As Object 'MSHTML.DispHTMLElementCollection
                Set htmlAnswers = oHtml.getElementsByClassName("answer-hyperlink")
    
                Dim lAnswerLoop As Long
                For lAnswerLoop = 0 To htmlAnswers.Length - 1
                    Dim vAnswerLoop
                    Set vAnswerLoop = htmlAnswers.Item(lAnswerLoop)
                    Debug.Print vAnswerLoop.outerText
    
                Next
    
            End If
        End If
    End Sub
    

    Thanks for asking this.

    P.S. I have used TaskList to verify that IExplore.exe is not created under the hoods when this code runs.

    P.P.S If you liked this then see more at my Excel Development Platform blog

    0 讨论(0)
  • 2021-01-21 18:27

    You could use XMLHTTP to retrieve the HTML source of a web page:

    Function GetHTML(url As String) As String
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", url, False
            .Send
            GetHTML = .ResponseText
        End With
    End Function
    

    I wouldn't suggest using this as a worksheet function, or else the site URL will be re-queried every time the worksheet recalculates. Some sites have logic in place to detect scraping via frequent, repeated calls, and your IP could become banned, temporarily or permanently, depending on the site.

    Once you have the source HTML string (preferably stored in a variable to avoid unnecessary repeat calls), you can use basic text functions to parse the string to search for your tag.

    This basic function will return the value between the <tag> and </tag>:

    Public Function getTag(url As String, tag As String, Optional occurNum As Integer) As String
        Dim html As String, pStart As Long, pEnd As Long, o As Integer
        html = GetHTML(url)
    
        'remove <> if they exist so we can add our own
        If Left(tag, 1) = "<" And Right(tag, 1) = ">" Then
            tag = Left(Right(tag, Len(tag) - 1), Len(Right(tag, Len(tag) - 1)) - 1)
        End If
    
        ' default to Occurrence #1
        If occurNum = 0 Then occurNum = 1
        pEnd = 1
    
        For o = 1 To occurNum
            ' find start <tag> beginning at 1 (or after previous Occurence)
            pStart = InStr(pEnd, html, "<" & tag & ">", vbTextCompare)
            If pStart = 0 Then
                getTag = "{Not Found}"
                Exit Function
            End If
            pStart = pStart + Len("<" & tag & ">")
    
            ' find first end </tag> after start <tag>
            pEnd = InStr(pStart, html, "</" & tag & ">", vbTextCompare)
        Next o
    
        'return string between start <tag> & end </tag>
        getTag = Mid(html, pStart, pEnd - pStart)
    End Function
    

    This will find only basic <tag>'s but you could add/remove/change the text functions to suit your needs.

    Example Usage:

    Sub findTagExample()
    
        Const testURL = "https://en.wikipedia.org/wiki/Web_scraping"
    
        'search for 2nd occurence of tag: <h2> which is "Contents" :
        Debug.Print getTag(testURL, "<h2>", 2)
    
        '...this returns the 8th occurence, "Navigation Menu" :
        Debug.Print getTag(testURL, "<h2>", 8)
    
        '...and this returns an HTML <span> containing a title for the 'Legal Issues' section:
        Debug.Print getTag("https://en.wikipedia.org/wiki/Web_scraping", "<h2>", 4)
    
    End Sub
    
    0 讨论(0)
提交回复
热议问题