I don\'t have internet explorer on any of the computers at work, therefore creating a object of internet explorer and using ie.navigate to parse the html and search for the tags
Anyone who has done some web scraping will be familiar with creating an instance of Internet Explorer (IE) and the navigating to a web address and then once the page is ready start navigating the DOM using the 'Microsoft HTML Object Library' (MSHTML) type library. The question asks if IE is unavailable what to do. I am in the same situation for my box running Windows 10.
I had suspected it was possible to spin up an instance of MSHTML.HTMLDocument independent of IE but its creation is not obvious. Thanks to the questioner for asking this now. The answer lies in the MSHTML.IHTMLDocument4.createDocumentFromUrl method. One needs a local file to work (EDIT: actually one can put a webby url in as well!) with but we have a nice tidy Windows API function called URLDownloadToFile to download a file.
This codes runs on my Windows 10 box where Microsoft Edge is running and not Internet Explorer. This is an important find and thanks to the questioner for raising it.
Option Explicit
'* Tools->Refernces Microsoft HTML Object Library
'* MSDN - URLDownloadToFile function - https://msdn.microsoft.com/en-us/library/ms775123(v=vs.85).aspx
Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" _
(ByVal pCaller As Long, ByVal szURL As String, ByVal szFileName As String, _
ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
Sub Test()
Dim fso As Object
Set fso = CreateObject("Scripting.FileSystemObject")
Dim sLocalFilename As String
sLocalFilename = Environ$("TMP") & "\urlmon.html"
Dim sURL As String
sURL = "https://stackoverflow.com/users/3607273/s-meaden"
Dim bOk As Boolean
bOk = (URLDownloadToFile(0, sURL, sLocalFilename, 0, 0) = 0)
If bOk Then
If fso.FileExists(sLocalFilename) Then
'* Tools->Refernces Microsoft HTML Object Library
Dim oHtml4 As MSHTML.IHTMLDocument4
Set oHtml4 = New MSHTML.HTMLDocument
Dim oHtml As MSHTML.HTMLDocument
Set oHtml = Nothing
'* IHTMLDocument4.createDocumentFromUrl
'* MSDN - IHTMLDocument4 createDocumentFromUrl method - https://msdn.microsoft.com/en-us/library/aa752523(v=vs.85).aspx
Set oHtml = oHtml4.createDocumentFromUrl(sLocalFilename, "")
'* need to wait a little whilst the document parses
'* because it is multithreaded
While oHtml.readyState <> "complete"
DoEvents '* do not comment this out it is required to break into the code if in infinite loop
Wend
Debug.Assert oHtml.readyState = "complete"
Dim sTest As String
sTest = Left$(oHtml.body.outerHTML, 100)
Debug.Assert Len(Trim(sTest)) > 50 '* just testing we got a substantial block of text, feel free to delete
'* page specific logic goes here
Dim htmlAnswers As Object 'MSHTML.DispHTMLElementCollection
Set htmlAnswers = oHtml.getElementsByClassName("answer-hyperlink")
Dim lAnswerLoop As Long
For lAnswerLoop = 0 To htmlAnswers.Length - 1
Dim vAnswerLoop
Set vAnswerLoop = htmlAnswers.Item(lAnswerLoop)
Debug.Print vAnswerLoop.outerText
Next
End If
End If
End Sub
Thanks for asking this.
P.S. I have used TaskList to verify that IExplore.exe is not created under the hoods when this code runs.
P.P.S If you liked this then see more at my Excel Development Platform blog
You could use XMLHTTP to retrieve the HTML source of a web page:
Function GetHTML(url As String) As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", url, False
.Send
GetHTML = .ResponseText
End With
End Function
I wouldn't suggest using this as a worksheet function, or else the site URL will be re-queried every time the worksheet recalculates. Some sites have logic in place to detect scraping via frequent, repeated calls, and your IP could become banned, temporarily or permanently, depending on the site.
Once you have the source HTML string (preferably stored in a variable to avoid unnecessary repeat calls), you can use basic text functions to parse the string to search for your tag.
This basic function will return the value between the <tag>
and </tag>
:
Public Function getTag(url As String, tag As String, Optional occurNum As Integer) As String
Dim html As String, pStart As Long, pEnd As Long, o As Integer
html = GetHTML(url)
'remove <> if they exist so we can add our own
If Left(tag, 1) = "<" And Right(tag, 1) = ">" Then
tag = Left(Right(tag, Len(tag) - 1), Len(Right(tag, Len(tag) - 1)) - 1)
End If
' default to Occurrence #1
If occurNum = 0 Then occurNum = 1
pEnd = 1
For o = 1 To occurNum
' find start <tag> beginning at 1 (or after previous Occurence)
pStart = InStr(pEnd, html, "<" & tag & ">", vbTextCompare)
If pStart = 0 Then
getTag = "{Not Found}"
Exit Function
End If
pStart = pStart + Len("<" & tag & ">")
' find first end </tag> after start <tag>
pEnd = InStr(pStart, html, "</" & tag & ">", vbTextCompare)
Next o
'return string between start <tag> & end </tag>
getTag = Mid(html, pStart, pEnd - pStart)
End Function
This will find only basic <tag>
's but you could add/remove/change the text functions to suit your needs.
Sub findTagExample()
Const testURL = "https://en.wikipedia.org/wiki/Web_scraping"
'search for 2nd occurence of tag: <h2> which is "Contents" :
Debug.Print getTag(testURL, "<h2>", 2)
'...this returns the 8th occurence, "Navigation Menu" :
Debug.Print getTag(testURL, "<h2>", 8)
'...and this returns an HTML <span> containing a title for the 'Legal Issues' section:
Debug.Print getTag("https://en.wikipedia.org/wiki/Web_scraping", "<h2>", 4)
End Sub