How do I parse HTML without creating an Internet Explorer object in VBA?

我寻月下人不归 2021-01-21 17:49

I don't have Internet Explorer on any of the computers at work, so creating an Internet Explorer object and using ie.navigate to parse the HTML and search for the tags is not possible.

2 answers
  •  囚心锁ツ
    2021-01-21 18:27

    You could use XMLHTTP to retrieve the HTML source of a web page:

    Function GetHTML(url As String) As String
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", url, False
            .Send
            GetHTML = .ResponseText
        End With
    End Function
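
    The function above returns whatever the server sends back, including the body of an error page. A minimal sketch of a variant that checks the HTTP status first is shown here; the name GetHTMLChecked and the error number are illustrative assumptions, not part of the original answer:

    ```vba
    ' Hypothetical variant of GetHTML that raises an error unless the server answers 200 OK.
    Function GetHTMLChecked(url As String) As String
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", url, False          ' False = synchronous request
            .Send
            If .Status = 200 Then
                GetHTMLChecked = .ResponseText
            Else
                ' vbObjectError + 513 is an arbitrary user-defined error number (assumption)
                Err.Raise vbObjectError + 513, "GetHTMLChecked", _
                          "HTTP " & .Status & " " & .StatusText & " for " & url
            End If
        End With
    End Function
    ```

    Whether to raise an error or return an empty string on failure is a design choice; raising makes a failed request hard to mistake for an empty page.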
    

    I wouldn't suggest using this as a worksheet function, or else the site URL will be re-queried every time the worksheet recalculates. Some sites have logic in place to detect scraping via frequent, repeated calls, and your IP could become banned, temporarily or permanently, depending on the site.

    Once you have the source HTML string (preferably stored in a variable to avoid unnecessary repeat calls), you can use basic text functions to parse the string to search for your tag.
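
    As a rough illustration of "fetch once, search many times", the sketch below stores the page in a variable and then runs several text searches against it. CountOccurrences and CountTagsExample are hypothetical names introduced for this example; GetHTML is the function shown earlier in this answer:

    ```vba
    ' Hypothetical helper: counts case-insensitive occurrences of needle in html.
    Function CountOccurrences(html As String, needle As String) As Long
        Dim pos As Long
        If Len(needle) = 0 Then Exit Function      ' nothing to search for, returns 0
        pos = InStr(1, html, needle, vbTextCompare)
        Do While pos > 0
            CountOccurrences = CountOccurrences + 1
            ' continue searching just past the match that was found
            pos = InStr(pos + Len(needle), html, needle, vbTextCompare)
        Loop
    End Function

    Sub CountTagsExample()
        Dim html As String
        ' a single request; every search below reuses the stored string
        html = GetHTML("https://en.wikipedia.org/wiki/Web_scraping")
        Debug.Print CountOccurrences(html, "<h2")
        Debug.Print CountOccurrences(html, "<table")
    End Sub
    ```

    Both searches run against the same in-memory string, so the site is queried only once per run of the Sub.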

    This basic function will return the text between the opening <tag> and the closing </tag>:

    Public Function getTag(url As String, ByVal tag As String, Optional occurNum As Integer) As String
        Dim html As String, pStart As Long, pEnd As Long, o As Integer
        html = GetHTML(url)

        'remove <> if they exist so we can add our own
        If Left(tag, 1) = "<" And Right(tag, 1) = ">" Then
            tag = Mid(tag, 2, Len(tag) - 2)
        End If

        ' default to occurrence #1
        If occurNum = 0 Then occurNum = 1
        pEnd = 1

        For o = 1 To occurNum
            ' find the opening <tag>, beginning at 1 (or after the previous occurrence)
            pStart = InStr(pEnd, html, "<" & tag & ">", vbTextCompare)
            If pStart = 0 Then
                getTag = "{Not Found}"
                Exit Function
            End If
            pStart = pStart + Len("<" & tag & ">")

            ' find the first closing </tag> after the opening <tag>
            pEnd = InStr(pStart, html, "</" & tag & ">", vbTextCompare)
            If pEnd = 0 Then
                ' opening tag without a closing tag: avoid a negative length in Mid
                getTag = "{Not Found}"
                Exit Function
            End If
        Next o

        'return the string between the opening <tag> and the closing </tag>
        getTag = Mid(html, pStart, pEnd - pStart)
    End Function
    

    This will find only basic <tag>s (opening tags with no attributes), but you could add, remove, or change the text functions to suit your needs.
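
    For instance, getTag matches only an exact "<tag>" opening tag, so an element such as <h2 class="x"> is skipped. A minimal sketch of one way to relax that is shown below. GetTagFromHtml is a hypothetical name; it takes HTML that has already been downloaded, and it assumes attribute values contain no ">" character and that the element is not self-closing:

    ```vba
    ' Hypothetical variant of getTag: works on an HTML string already in memory and
    ' also matches opening tags that carry attributes, e.g. <h2 class="x">.
    Public Function GetTagFromHtml(html As String, ByVal tagName As String, _
                                   Optional ByVal occurNum As Long = 1) As String
        Dim pStart As Long, pOpenEnd As Long, pEnd As Long, o As Long
        Dim nextChar As String

        GetTagFromHtml = "{Not Found}"
        If occurNum < 1 Then occurNum = 1
        pEnd = 1

        Do While o < occurNum
            pStart = InStr(pEnd, html, "<" & tagName, vbTextCompare)
            If pStart = 0 Then Exit Function

            ' the character after the name must end the name, so "<p" does not match "<pre"
            nextChar = Mid$(html, pStart + Len(tagName) + 1, 1)
            If nextChar = ">" Or nextChar = " " Or nextChar = vbTab _
               Or nextChar = vbCr Or nextChar = vbLf Then
                pOpenEnd = InStr(pStart, html, ">")           ' end of the opening tag
                If pOpenEnd = 0 Then Exit Function
                pEnd = InStr(pOpenEnd, html, "</" & tagName & ">", vbTextCompare)
                If pEnd = 0 Then Exit Function
                o = o + 1
            Else
                pEnd = pStart + 1                             ' different tag, keep looking
            End If
        Loop

        ' text between the end of the opening tag and the start of the closing tag
        GetTagFromHtml = Mid$(html, pOpenEnd + 1, pEnd - pOpenEnd - 1)
    End Function

    Sub GetTagFromHtmlExample()
        Const sample As String = "<h2 class=""a"">One</h2><h2>Two</h2>"
        Debug.Print GetTagFromHtml(sample, "h2", 1)   ' prints One
        Debug.Print GetTagFromHtml(sample, "h2", 2)   ' prints Two
    End Sub
    ```

    Like the original, this is plain string searching, so nested elements of the same name are not handled.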

    Example Usage:

    Sub findTagExample()

        Const testURL = "https://en.wikipedia.org/wiki/Web_scraping"

        'search for 2nd occurrence of tag: <h2> which is "Contents" :
        Debug.Print getTag(testURL, "<h2>", 2)

        '...this returns the 8th occurrence, "Navigation Menu" :
        Debug.Print getTag(testURL, "<h2>", 8)

        '...and this returns an HTML <span> containing a title for the 'Legal Issues' section:
        Debug.Print getTag("https://en.wikipedia.org/wiki/Web_scraping", "<h2>", 4)

    End Sub
