For starters I am by no means an expert in VBA. Just know enough to be dangerous 8).
I started out by doing a search on how to extract a table from a web page and sa
So I've written a small Sub which I think should solve your problem, if I understood you correctly. Of course you will have to invest some work, since it only reads one stage right now. But it reads the data from every group:
Option Explicit

Private Sub CommandButton1_Click()
    ' Make sure you add references to Microsoft Internet Controls (shdocvw.dll) and
    ' Microsoft HTML Object Library.
    ' The code will NOT run otherwise.
    Dim objIE As SHDocVw.InternetExplorer 'Microsoft Internet Controls (shdocvw.dll)
    Dim htmlDoc As MSHTML.HTMLDocument 'Microsoft HTML Object Library
    Dim htmlInput As MSHTML.HTMLInputElement
    Dim htmlColl As MSHTML.IHTMLElementCollection
    Set objIE = New SHDocVw.InternetExplorer
    Dim htmlCurrentDoc As MSHTML.HTMLDocument 'Microsoft HTML Object Library
    Dim RowNumber As Integer
    RowNumber = 1
    With objIE
        .Navigate "http://worldoftanks.com/en/tournaments/1000000017/" ' Main page
        .Visible = 0
        Do While .READYSTATE <> 4: DoEvents: Loop
        Application.Wait (Now + TimeValue("0:00:01"))
        Set htmlDoc = .document
        Dim ButtonRoundData As Variant
        Set ButtonRoundData = htmlDoc.getElementsByClassName("group-stage_link")
        Dim ButtonData As Variant
        Set ButtonData = htmlDoc.getElementsByClassName("groups_link")
        Dim button As HTMLLinkElement
        For Each button In ButtonData
            Debug.Print button.nodeName
            button.Click
            ' This wait is to prevent double entries, but it is not clean.
            ' You should definitely check whether the table is still the same and, if so, keep waiting.
            Application.Wait (Now + TimeValue("0:00:02"))
            Set htmlCurrentDoc = .document
            Dim RawData As HTMLTable
            Set RawData = htmlCurrentDoc.getElementsByClassName("tournament-table tournament-table__indent")(0)
            Dim ColumnNumber As Integer
            ColumnNumber = 1
            Dim hRow As HTMLTableRow
            Dim hCell As HTMLTableCell
            For Each hRow In RawData.Rows
                For Each hCell In hRow.Cells
                    Cells(RowNumber, ColumnNumber).Value = hCell.innerText
                    ColumnNumber = ColumnNumber + 1
                Next hCell
                ColumnNumber = 1
                RowNumber = RowNumber + 1
            Next hRow
            RowNumber = RowNumber + 3
        Next button
    End With
End Sub
What it does: it starts an invisible IE, reads the data, clicks the next group button, reads the next table, and so on.
For debugging I suggest setting .Visible to 1, so you can see what happens.
EDIT 1: If you get a debugging error, try to abort and run it again. It definitely needs some error handling in case the website hasn't loaded properly.
EDIT 2: Made it a bit more stable. You should really pay attention here: since the webpage takes some time to load, you MUST check whether the data has changed before writing it. If it hasn't changed, wait a second or so and then try again.
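A minimal sketch of such a check, in case it helps. The function name WaitForTableChange and its parameters are my own invention and not part of the Sub above; it simply polls the first element with the given class until its text differs from the text read before the click, or until a timeout is reached. It assumes the same library references as the Sub and that getElementsByClassName is available on the page's document, as in the code above.

```vba
' Hypothetical helper (not part of the original Sub): polls until the text of the first
' element with class strClassName differs from strOldText.
' Returns True if the text changed, False if the timeout was reached first.
Private Function WaitForTableChange(ByVal htmlDoc As MSHTML.HTMLDocument, _
        ByVal strClassName As String, _
        ByVal strOldText As String, _
        Optional ByVal lngTimeoutInSeconds As Long = 10) As Boolean
    Dim objMatches As Object
    Dim dateStart As Date

    dateStart = Now
    Do
        Set objMatches = htmlDoc.getElementsByClassName(strClassName)
        If objMatches.Length > 0 Then
            If objMatches.Item(0).innerText <> strOldText Then
                WaitForTableChange = True
                Exit Function
            End If
        End If
        DoEvents ' keep Excel responsive while polling
    Loop While Now <= DateAdd("s", lngTimeoutInSeconds, dateStart)
End Function
```

To use it, you would remember RawData.innerText (or an empty string for the first group) before button.Click and call the function instead of the fixed two-second wait. Note that two groups with identical table text would look unchanged and run into the timeout.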
Here is some sample data I got in Excel:
Although extracting data from a webpage can be automated with VBA (see below), the specific example webpage you provided comes with some obstacles:
This webpage loads and displays only a small portion of the desired data at a time. This is probably done for performance reasons, since the whole table of Teams would consist of several thousand entries. Only the Teams of the currently displayed Round and currently displayed Group are loaded. If you click on another Group, a JavaScript program (running in your browser) is started that connects to the server, fetches the Teams of that Group and replaces the data in the webpage. You can verify this by yourself if you press F12 and observe the Network tab that lists all requests to the server.
Thus, the webpage does not provide at any point a complete list of Teams. You would have to work around this:
The following will focus on just extracting content from a given page, that is, retrieving the Teams of the currently loaded Group. I won't use any of the workarounds mentioned above.
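The program basically consists of these three parts: loading the webpage, finding the table that holds the desired data, and writing its contents to Excel.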
The following is a helper function that opens a webpage and returns an object with the webpage's content. It needs references to the libraries Microsoft Internet Controls and Microsoft HTML Object Library (set them in the VBA editor under Tools > References).
' returns the document containing the DOM of the page strWebAddress
' returns Nothing if the timeout lngTimeoutInSeconds was reached
Public Function GetIEDocument(ByVal strWebAddress As String, Optional ByVal lngTimeoutInSeconds As Long = 15) As MSHTML.HTMLDocument
    Dim IE As SHDocVw.InternetExplorer
    Dim IEDocument As MSHTML.HTMLDocument
    Dim dateNow As Date

    ' create an IE application, representing a tab
    Set IE = New SHDocVw.InternetExplorer

    ' optionally make the application visible, though it will work perfectly fine in the background otherwise
    IE.Visible = True

    ' open a webpage in the tab represented by IE and wait until the main request successfully finished
    ' gives up and returns Nothing after lngTimeoutInSeconds
    IE.Navigate strWebAddress
    dateNow = Now
    Do While IE.Busy
        If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
        DoEvents ' keep Excel responsive while waiting
    Loop

    ' retrieve the webpage's content (that is, the HTML DOM) and wait until everything is loaded (images, etc.)
    ' gives up and returns Nothing after lngTimeoutInSeconds
    Set IEDocument = IE.Document
    dateNow = Now
    Do While IEDocument.ReadyState <> "complete"
        If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
        DoEvents ' keep Excel responsive while waiting
    Loop

    Set GetIEDocument = IEDocument
End Function
You can now load the webpage by using Set IEDocument = GetIEDocument("http://worldoftanks.com/en/tournaments/1000000017/"). The object IEDocument then contains everything you need to extract the desired data.
First you need to find the part that you want to extract (the critical "Tag", as you called it).
Since the content of a webpage is represented as a tree of HTML tags, you need to find the table tag that contains all other tags that you are interested in. You already spotted it in your 16/03/19 1600 update. The <table> tag contains two <tr> tags (table rows), the first being the header row filled with <th> tags (table headers), each representing the header of a single column.
The second row is a dummy row representing the entry of one Team.
The preceding line <!-- ko foreach: {data: rrBrackets().teams, as: 'team' } --> is part of the Knockout framework, a JavaScript library employed by the website to dynamically fill the bare HTML tags with content. This is the reason why there is only one row in the HTML source, but in the rendered page you see nine rows: after the page is loaded, the JavaScript code loops over the list of Teams and creates a new row for each, populated with their respective data.
This, however, does not need to concern us: IEDocument contains the final version of the HTML DOM, after all loading was done (also see edit at the bottom). The first row actually looks like this (press F12 and have a look at the DOM Explorer tab for yourself):
<tr class="tournament-table_tr" data-bind="css: {'tournament-table_tr__my-team': team.team_id === $root.currentUserTeamIdInCurrentGroup()}">
<td class="tournament-table_td" data-bind="text: team.position">1</td>
<td class="tournament-table_td" data-bind="css: {'tournament-table_td__my-team': team.team_id === $root.currentUserTeamIdInCurrentGroup()}">
<a class="tournament-table_team tournament-table_team__big" href="/en/tournaments/1000000017/team/1000006728/" target="_blank" data-bind="text: team.team_title, attr: {href: $root.getTournamentTeamUrl(team.team_id)}">Pubbies</a>
</td>
<td class="tournament-table_td" data-bind="text: team.battle_played">8</td>
<td class="tournament-table_td" data-bind="text: team.wins">7</td>
<td class="tournament-table_td tournament-table_td__mobile-hide" data-bind="text: team.losses">1</td>
<td class="tournament-table_td tournament-table_td__mobile-hide" data-bind="text: team.draws">0</td>
<td class="tournament-table_td" data-bind="text: team.extra_statistics.points">21</td>
</tr>
Programmatically finding the tag in the first place is, however, a bit more complicated. Usually structurally important tags have an id attribute that is unique. In such a case we could simply find it by using IEDocument.getElementById("id_of_table_tag").
In this case our best bet is probably searching for the heading Tournament brackets:
<div class="wrapper">
<h2 class="tournament-heading">Tournament brackets</h2>
</div>
If you inspect the tree of HTML tags that follows this heading, you will see that to get to our <table> tag we need to go one step up in the hierarchy, skip the next two tags and, from there on, descend into the first child tag twice:
' retrieve anchor element
For Each objH2 In IEDocument.getElementsByTagName("h2")
    If objH2.innerText = "Tournament brackets" Then Exit For
Next objH2

' traverse HTML tree to desired table element
' * move up one element in the hierarchy
' * skip two elements to proceed to the third (interjected each time with whitespace that is interpreted as an element of its own)
' * move down two elements in the hierarchy
' this may fail if the JavaScript code has not already populated the table
Set objTable = objH2.parentElement _
    .nextSibling.nextSibling _
    .nextSibling.nextSibling _
    .nextSibling.nextSibling _
    .children(0) _
    .children(0)
As you can imagine, this is not very robust and is bound to break as soon as the layout of the webpage changes. There are other ways to traverse the tree of HTML tags to finally reach the tag you seek. See the documentation of the Document object for more.
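As one example of such an alternative, here is a rough sketch that looks the table up by its CSS class instead of walking sibling by sibling. The function name FindTableByClass is made up for illustration, and the class name "tournament-table" is an assumption taken from the class string used in Doktor OSwaldo's answer; it may well change when the site is redesigned. It also assumes the document supports getElementsByClassName.

```vba
' Hypothetical alternative to the sibling traversal: find the first <table> carrying a given CSS class.
' Returns Nothing if no matching table exists (yet).
Public Function FindTableByClass(ByVal IEDocument As MSHTML.HTMLDocument, _
        Optional ByVal strClassName As String = "tournament-table") As MSHTML.HTMLTable
    Dim objMatches As Object
    Dim objElement As Object

    ' assumption: the class name is the one seen on the page at the time of writing
    Set objMatches = IEDocument.getElementsByClassName(strClassName)
    For Each objElement In objMatches
        If objElement.tagName = "TABLE" Then
            Set FindTableByClass = objElement
            Exit Function
        End If
    Next objElement
End Function
```

This trades one kind of fragility (position in the tree) for another (a class name), so check the result for Nothing before using it.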
All we need to do now is loop over the Rows of objTable and output each of its Cells.
As for the output, in this example, we keep it as simple as possible. Put together with the above, the following code just outputs the table to the current worksheet in Excel:
Public Sub GetTeamData()
    Dim strWebAddress As String
    Dim strH2AnchorContent As String
    Dim IEDocument As MSHTML.HTMLDocument
    Dim objH2 As MSHTML.HTMLHeaderElement
    Dim objTable As MSHTML.HTMLTable
    Dim objRow As MSHTML.HTMLTableRow
    Dim objCell As MSHTML.HTMLTableCell
    Dim lngRow As Long
    Dim lngColumn As Long

    ' initialize some variables that should probably better be passed as parameters or defined as constants
    strWebAddress = "http://worldoftanks.com/en/tournaments/1000000017/"
    strH2AnchorContent = "Tournament brackets"

    ' open page
    Set IEDocument = GetIEDocument(strWebAddress)
    If IEDocument Is Nothing Then
        MsgBox "Timeout reached opening this address:" & vbNewLine & strWebAddress, vbCritical
        Exit Sub
    End If

    ' retrieve anchor element
    For Each objH2 In IEDocument.getElementsByTagName("h2")
        If objH2.innerText = strH2AnchorContent Then Exit For
    Next objH2
    If objH2 Is Nothing Then
        MsgBox "Could not find """ & strH2AnchorContent & """ in DOM!", vbCritical
        Exit Sub
    End If

    ' traverse HTML tree to desired table element
    ' * move up one element in the hierarchy
    ' * skip two elements to proceed to the third (interjected each time with whitespace that is interpreted as an element of its own)
    ' * move down two elements in the hierarchy
    Set objTable = objH2.parentElement _
        .nextSibling.nextSibling _
        .nextSibling.nextSibling _
        .nextSibling.nextSibling _
        .children(0) _
        .children(0)

    ' iterate over the table and output its contents
    lngRow = 1
    For Each objRow In objTable.rows
        lngColumn = 1
        For Each objCell In objRow.cells
            Cells(lngRow, lngColumn) = objCell.innerText
            lngColumn = lngColumn + 1
        Next objCell
        lngRow = lngRow + 1
    Next objRow
End Sub
Although this is only a partial solution for your current problem, this offers a general solution for how to programmatically extract data from a website using VBA. As you said that you regularly encounter such problems, this might be of some use to you nonetheless.
Object. I didn't know of the Microsoft HTML Object Library. Thanks @Doktor OSwaldo. :) I incorporated the use of the library in my code above.
When objTable is set, the element might not yet exist in the DOM because the JavaScript has not yet completely filled in all the data. You could put a loop around this statement checking if objTable was indeed successfully set:
On Error Resume Next
Do
Err.Clear
Set objTable = ...
Loop While Err
On Error GoTo 0
You should probably include a timeout option as shown in function GetIEDocument(). All of this is best moved to a separate function that also clicks the Round and Group buttons as shown in Doktor OSwaldo's answer.
objCell.tagName = "TH" And objCell.children.length = 2, in which case you should use objCell.children(1).innerText instead of objCell.innerText to output to Excel.
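Put together, the two points above could look roughly like the following sketch. Both function names (WaitForTable and GetCellText) are hypothetical and not part of the code above; the traversal is the same sibling chain used in GetTeamData, and the header-cell check is just the condition quoted above wrapped in a function. The parameters are declared As Object on purpose, so that every member is resolved at run time.

```vba
' Hypothetical helper: retries the traversal from the anchor heading until the table exists
' or lngTimeoutInSeconds has passed. Returns Nothing on timeout.
Public Function WaitForTable(ByVal objH2 As Object, _
        Optional ByVal lngTimeoutInSeconds As Long = 15) As MSHTML.HTMLTable
    Dim objTable As MSHTML.HTMLTable
    Dim dateStart As Date

    dateStart = Now
    On Error Resume Next
    Do
        Err.Clear
        Set objTable = Nothing
        Set objTable = objH2.parentElement _
            .nextSibling.nextSibling _
            .nextSibling.nextSibling _
            .nextSibling.nextSibling _
            .children(0) _
            .children(0)
        If Err.Number = 0 And Not objTable Is Nothing Then Exit Do
        DoEvents ' keep Excel responsive while waiting
    Loop While Now <= DateAdd("s", lngTimeoutInSeconds, dateStart)
    On Error GoTo 0

    Set WaitForTable = objTable
End Function

' Hypothetical helper: returns the text to write to Excel for one table cell,
' using the second child element for header cells that have exactly two children.
Public Function GetCellText(ByVal objCell As Object) As String
    If objCell.tagName = "TH" And objCell.children.Length = 2 Then
        GetCellText = objCell.children(1).innerText
    Else
        GetCellText = objCell.innerText
    End If
End Function
```

In GetTeamData you would then write Set objTable = WaitForTable(objH2), check the result for Nothing, and use GetCellText(objCell) in place of objCell.innerText.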