Extract data from a web page that may not be formatted as a table


一生所求

For starters, I am by no means an expert in VBA. I just know enough to be dangerous 8).

I started out by doing a search on how to extract a table from a web page and sa


2 answers

醉酒成梦

2021-01-06 20:05

I've written a small Sub which I think should solve your problem, if I understood you correctly. Of course you will have to invest some work yourself, since it only reads one stage right now. But it reads the data from every group:

```
Option Explicit

Private Sub CommandButton1_Click()

    ' Make sure you add references to Microsoft Internet Controls (shdocvw.dll)
    ' and Microsoft HTML Object Library.
    ' The code will NOT run otherwise.

    Dim objIE As SHDocVw.InternetExplorer ' Microsoft Internet Controls (shdocvw.dll)
    Dim htmlDoc As MSHTML.HTMLDocument ' Microsoft HTML Object Library
    Dim htmlInput As MSHTML.HTMLInputElement
    Dim htmlColl As MSHTML.IHTMLElementCollection

    Set objIE = New SHDocVw.InternetExplorer

    Dim htmlCurrentDoc As MSHTML.HTMLDocument ' Microsoft HTML Object Library

    Dim RowNumber As Integer
    RowNumber = 1

    With objIE
        .Navigate "http://worldoftanks.com/en/tournaments/1000000017/" ' Main page
        .Visible = 0
        Do While .READYSTATE <> 4: DoEvents: Loop
        Application.Wait (Now + TimeValue("0:00:01"))

        Set htmlDoc = .document

        Dim ButtonRoundData As Variant
        Set ButtonRoundData = htmlDoc.getElementsByClassName("group-stage_link")

        Dim ButtonData As Variant
        Set ButtonData = htmlDoc.getElementsByClassName("groups_link")

        Dim button As HTMLLinkElement
        For Each button In ButtonData

            Debug.Print button.nodeName

            button.Click

            ' This wait prevents double entries, but it is not clean. You should
            ' definitely check whether the table is still the same and wait if so.
            Application.Wait (Now + TimeValue("0:00:02"))

            Set htmlCurrentDoc = .document
            Dim RawData As HTMLTable
            Set RawData = htmlCurrentDoc.getElementsByClassName("tournament-table tournament-table__indent")(0)

            Dim ColumnNumber As Integer
            ColumnNumber = 1

            Dim hRow As HTMLTableRow
            Dim hCell As HTMLTableCell
            For Each hRow In RawData.Rows

                For Each hCell In hRow.Cells
                    Cells(RowNumber, ColumnNumber).Value = hCell.innerText
                    ColumnNumber = ColumnNumber + 1
                Next hCell
                ColumnNumber = 1
                RowNumber = RowNumber + 1
            Next hRow

            ' leave a gap of blank rows between groups
            RowNumber = RowNumber + 3
        Next button

        .Quit ' close the hidden IE instance so it does not linger in the background
    End With

End Sub
```

What it does: it starts an invisible IE, reads the data, clicks the button, reads the next group, and so on.

For debugging I suggest setting .Visible to 1, so you can see what happens.

EDIT 1: If you get a debugging error, try to abort and run it again. It definitely needs some error handling for the case where the website isn't loaded properly.

EDIT 2: Made it a bit more stable. You should really pay attention here: since the webpage takes some time to load, you MUST check whether the data has changed before writing it. If it hasn't changed, wait a second or so and then try again.
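The check described in EDIT 2 could look roughly like the sketch below. `WaitForTableChange` is a hypothetical helper that is not part of the Sub above; it assumes the same two library references and polls the page until the text of the first table with the given class differs from the text that was read before the click.

```vba
' Hypothetical helper (an illustration, not part of the original Sub):
' polls the page until the text of the first element with class strTableClass
' differs from strPreviousText. Returns False if nothing changed in time.
Public Function WaitForTableChange(ByVal objIE As SHDocVw.InternetExplorer, _
                                   ByVal strTableClass As String, _
                                   ByVal strPreviousText As String, _
                                   Optional ByVal lngTimeoutInSeconds As Long = 10) As Boolean
    Dim objTables As Object
    Dim strCurrentText As String
    Dim dateStart As Date

    dateStart = Now
    Do
        DoEvents ' keep Excel responsive while polling
        strCurrentText = vbNullString
        Set objTables = Nothing

        ' the table may be missing for a moment while the page re-renders
        On Error Resume Next
        Set objTables = objIE.document.getElementsByClassName(strTableClass)
        If Not objTables Is Nothing Then
            If objTables.Length > 0 Then strCurrentText = objTables(0).innerText
        End If
        On Error GoTo 0

        If Len(strCurrentText) > 0 And strCurrentText <> strPreviousText Then
            WaitForTableChange = True
            Exit Function
        End If
    Loop While Now <= DateAdd("s", lngTimeoutInSeconds, dateStart)
End Function
```

Inside the loop over the buttons, one would store the table's `innerText` before `button.Click` and call `WaitForTableChange(objIE, "tournament-table tournament-table__indent", strOldText)` afterwards instead of the fixed two-second wait. Note that this times out if the clicked group is the one already displayed, or if two groups happen to contain identical text, so a `False` result needs handling of its own.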

Here is some sample data I got in Excel:


你的背包

2021-01-06 20:10
Although extracting data from a webpage can be automated with VBA (see below), the specific example webpage you provided comes with some obstacles:

This webpage loads and displays only a small portion of the desired data at a time. This is probably done for performance reasons, since the whole table of Teams would consist of several thousand entries. Only the Teams of the currently displayed Round and currently displayed Group are loaded. If you click on another Group, a JavaScript program (running in your browser) is started that connects to the server, fetches the Teams of that Group and replaces the data in the webpage. You can verify this yourself by pressing F12 and observing the Network tab, which lists all requests to the server.

Thus, the webpage does not provide at any point a complete list of Teams. You would have to work around this:
1. Make your program automatically click on each Round, then click on each Group and finally extract the 9 teams of that Group, merging everything together afterwards.
2. Hook into the JavaScript code that loads each Group's Teams and call it in a loop, or reverse-engineer the requests made by that code and try to re-create them in VBA. Although this could be an elegant solution, many website owners do not like having their API used in ways they did not intend. A misuse could create a huge server load. I would only recommend this method if the API was designed for this purpose (some websites do this, like Twitter or Steam).

The following will focus on just extracting content from a given page, that is, retrieving the Teams of the currently loaded Group. I won't use any of the workarounds mentioned above.

The program basically consists of these three parts:

Open Webpage

The following is a helper function that opens a webpage and returns an object with the webpage's content. It needs the libraries Microsoft Internet Controls and Microsoft HTML Object Library referenced (in the VBA editor, add them via Tools > References).
```
' return the document containing the DOM of the page strWebAddress
' returns Nothing if the timeout lngTimeoutInSeconds was reached
Public Function GetIEDocument(ByVal strWebAddress As String, Optional ByVal lngTimeoutInSeconds As Long = 15) As MSHTML.HTMLDocument
    Dim IE As SHDocVw.InternetExplorer
    Dim IEDocument As MSHTML.HTMLDocument
    Dim dateNow As Date

    ' create an IE application, representing a tab
    Set IE = New SHDocVw.InternetExplorer

    ' optionally make the application visible, though it will work perfectly fine in the background otherwise
    IE.Visible = True

    ' open a webpage in the tab represented by IE and wait until the main request successfully finished
    ' on timeout (after lngTimeoutInSeconds) the function exits and returns Nothing
    IE.Navigate strWebAddress
    dateNow = Now
    Do While IE.Busy
        DoEvents ' keep Excel responsive while waiting
        If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
    Loop

    ' retrieve the webpage's content (that is, the HTML DOM) and wait until everything is loaded (images, etc.)
    ' on timeout (after lngTimeoutInSeconds) the function exits and returns Nothing
    Set IEDocument = IE.Document
    dateNow = Now
    Do While IEDocument.ReadyState <> "complete"
        DoEvents ' keep Excel responsive while waiting
        If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
    Loop

    Set GetIEDocument = IEDocument
End Function
```
Extract Information

You can now load the webpage by using Set IEDocument = GetIEDocument("http://worldoftanks.com/en/tournaments/1000000017/"). The object IEDocument then contains everything you need to extract the desired data.

First you need to find the part that you want to extract (the critical "Tag", as you called it). Since the content of a webpage is represented as a tree of HTML tags, you need to find the table tag that contains all other tags that you are interested in. You already spotted it in your 16/03/19 1600 update. The <table> tag contains two <tr> tags (table row), the first being the header row filled with <th> tags (table header) representing the header of a single column. The second row is a dummy row representing the entry of one Team.

The preceding line is part of the Knockout framework, a JavaScript library employed by the website to dynamically fill the bare HTML tags with content. This is the reason why there is only one row in the HTML source, but in the rendered page you see nine rows: after the page is loaded, the JavaScript code loops over the list of Teams and creates a new row for each, populated with their respective data.

This, however, does not need to concern us: IEDocument contains the final version of the HTML DOM, after all loading was done (also see the edit at the bottom). The first row actually looks like this (press F12 and have a look at the DOM Explorer tab for yourself):
```
<tr class="tournament-table_tr" data-bind="css: {'tournament-table_tr__my-team': team.team_id === $root.currentUserTeamIdInCurrentGroup()}">
    <td class="tournament-table_td" data-bind="text: team.position">1</td>
    <td class="tournament-table_td" data-bind="css: {'tournament-table_td__my-team': team.team_id === $root.currentUserTeamIdInCurrentGroup()}">
        <a class="tournament-table_team tournament-table_team__big" href="/en/tournaments/1000000017/team/1000006728/" target="_blank" data-bind="text: team.team_title, attr: {href: $root.getTournamentTeamUrl(team.team_id)}">Pubbies</a>
    </td>
    <td class="tournament-table_td" data-bind="text: team.battle_played">8</td>
    <td class="tournament-table_td" data-bind="text: team.wins">7</td>
    <td class="tournament-table_td tournament-table_td__mobile-hide" data-bind="text: team.losses">1</td>
    <td class="tournament-table_td tournament-table_td__mobile-hide" data-bind="text: team.draws">0</td>
    <td class="tournament-table_td" data-bind="text: team.extra_statistics.points">21</td>
</tr>
```
Programmatically finding the tag in the first place is, however, a bit more complicated. Usually structurally important tags have an id attribute that is unique. In such a case we could simply find it by using IEDocument.getElementById("id_of_table_tag"). In this case our best bet is probably searching for the heading Tournament brackets:
```
<div class="wrapper">
    <h2 class="tournament-heading">Tournament brackets</h2>
</div>
```
If you inspect the following tree of HTML tags, to get to our <table> tag we need to go one step up in the hierarchy, skip the next two tags and from there on, use the first child tag for the next two tags:
```
' retrieve anchor element
For Each objH2 In IEDocument.getElementsByTagName("h2")
    If objH2.innerText = "Tournament brackets" Then Exit For
Next objH2

' traverse HTML tree to desired table element
' * move up one element in the hierarchy
' * skip two elements to proceed to the third (interjected each time with whitespace that is interpreted as an element of its own)
' * move down two elements in the hierarchy
' this may fail if the JavaScript code has not already populated the table
Set objTable = objH2.parentElement _
                    .nextSibling.nextSibling _
                    .nextSibling.nextSibling _
                    .nextSibling.nextSibling _
                    .children(0) _
                    .children(0)
```
As you can imagine, this is not very robust and is bound to break as soon as the layout of the webpage changes. There are other ways to traverse the tree of HTML tags to finally reach the tag you seek. See the documentation of the Document object for more.
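One such alternative is to look the table up by its CSS class instead of walking siblings. The sketch below is only an illustration: `FindTournamentTable` is a hypothetical helper, and the class string is the one used in the other answer in this thread, not something verified here against the live page.

```vba
' Hypothetical helper: find the standings table by class name.
' ASSUMPTION: the <table> carries the classes "tournament-table tournament-table__indent",
' as used in the other answer; adjust the string if the page differs.
Public Function FindTournamentTable(ByVal IEDocument As MSHTML.HTMLDocument) As MSHTML.HTMLTable
    Dim objMatches As Object
    Dim objCandidate As Object

    Set objMatches = IEDocument.getElementsByClassName("tournament-table tournament-table__indent")
    If objMatches.Length = 0 Then Exit Function ' returns Nothing: no match (yet)

    Set objCandidate = objMatches(0)
    ' only hand back the element if it really is a <table>
    If UCase$(objCandidate.tagName) = "TABLE" Then Set FindTournamentTable = objCandidate
End Function
```

A class-based lookup still breaks if the site renames its classes, but it no longer depends on the number of sibling elements between the heading and the table.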

All we need to do now is loop over the Rows of objTable and output each of its Cells.

Output to Excel

As for the output, in this example, we keep it as simple as possible. Put together with the above, the following code just outputs the table to the current worksheet in Excel:
```
Public Sub GetTeamData()
    Dim strWebAddress As String
    Dim strH2AnchorContent As String
    Dim IEDocument As MSHTML.HTMLDocument
    Dim objH2 As MSHTML.HTMLHeaderElement
    Dim objTable As MSHTML.HTMLTable
    Dim objRow As MSHTML.HTMLTableRow
    Dim objCell As MSHTML.HTMLTableCell
    Dim lngRow As Long
    Dim lngColumn As Long

    ' initialize some variables that should probably better be passed as parameters or defined as constants
    strWebAddress = "http://worldoftanks.com/en/tournaments/1000000017/"
    strH2AnchorContent = "Tournament brackets"

    ' open page
    Set IEDocument = GetIEDocument(strWebAddress)
    If IEDocument Is Nothing Then
        MsgBox "Timeout reached opening this address:" & vbNewLine & strWebAddress, vbCritical
        Exit Sub
    End If

    ' retrieve anchor element
    For Each objH2 In IEDocument.getElementsByTagName("h2")
        If objH2.innerText = strH2AnchorContent Then Exit For
    Next objH2
    If objH2 Is Nothing Then
        MsgBox "Could not find """ & strH2AnchorContent & """ in DOM!", vbCritical
        Exit Sub
    End If

    ' traverse HTML tree to desired table element
    ' * move up one element in the hierarchy
    ' * skip two elements to proceed to the third (interjected each time with whitespace that is interpreted as an element of its own)
    ' * move down two elements in the hierarchy
    Set objTable = objH2.parentElement _
                        .nextSibling.nextSibling _
                        .nextSibling.nextSibling _
                        .nextSibling.nextSibling _
                        .children(0) _
                        .children(0)

    ' iterate over the table and output its contents
    lngRow = 1
    For Each objRow In objTable.rows
        lngColumn = 1
        For Each objCell In objRow.cells
            Cells(lngRow, lngColumn) = objCell.innerText
            lngColumn = lngColumn + 1
        Next objCell
        lngRow = lngRow + 1
    Next
End Sub
```
Although this is only a partial solution for your current problem, this offers a general solution for how to programmatically extract data from a website using VBA. As you said that you regularly encounter such problems, this might be of some use to you nonetheless.

Edit
1. In his answer, Doktor OSwaldo rightfully declares the objects as exactly what they are - in contrast to my previous version where everything was of type Object. I didn't know of the Microsoft HTML Object Library. Thanks @Doktor OSwaldo. :) I incorporated the use of the library in my code above.
2. You should be aware that at the moment where objTable is set, the element might not yet exist in the DOM because the JavaScript may not yet have filled in all the data. You could put a loop around this statement checking if objTable was indeed successfully set:
```
On Error Resume Next
Do
    Err.Clear
    Set objTable = ...
Loop While Err
On Error GoTo 0
```
You should probably include a timeout option as shown in function GetIEDocument(). All of this is best moved to a separate function that also clicks the Round and Group buttons as shown in Doktor OSwaldo's answer.
3. As you probably have already noticed, the header columns are output twice. This is actually correct because of the way the icon is shown before the header text. You can identify this with objCell.tagName = "TH" And objCell.children.length = 2, in which case you should use objCell.children(1).innerText instead of objCell.innerText to output to Excel.
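
The retry loop from point 2, combined with the timeout idea from GetIEDocument(), might be sketched as follows. `WaitForTable` is a hypothetical name, and the traversal is simply copied from GetTeamData() above. The anchor is passed As Object here, which is an assumption made for this sketch so that the whole chain is resolved at run time.

```vba
' Hypothetical helper: repeats the sibling traversal until the table exists
' or lngTimeoutInSeconds has passed. Returns Nothing on timeout.
Public Function WaitForTable(ByVal objH2 As Object, _
                             Optional ByVal lngTimeoutInSeconds As Long = 15) As MSHTML.HTMLTable
    Dim objTable As MSHTML.HTMLTable
    Dim dateStart As Date

    dateStart = Now
    On Error Resume Next
    Do
        DoEvents ' keep Excel responsive while waiting
        Err.Clear
        Set objTable = Nothing
        ' same traversal as in GetTeamData(); raises an error while the DOM is incomplete
        Set objTable = objH2.parentElement _
                            .nextSibling.nextSibling _
                            .nextSibling.nextSibling _
                            .nextSibling.nextSibling _
                            .children(0) _
                            .children(0)
        If Err.Number = 0 And Not objTable Is Nothing Then Exit Do
    Loop While Now <= DateAdd("s", lngTimeoutInSeconds, dateStart)
    On Error GoTo 0

    Set WaitForTable = objTable
End Function
```

In GetTeamData() the `Set objTable = ...` statement would then become `Set objTable = WaitForTable(objH2)`, followed by an `Is Nothing` check like the one used after opening the page.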