Extract data from a web page that may not be formatted as a table

一生所求 2021-01-06 19:44

For starters I am by no means an expert in VBA. Just know enough to be dangerous 8).

I started out by doing a search on how to extract a table from a web page and sa

2 Answers
  • 2021-01-06 20:05

    So I've written a small Sub which I think should solve your problem, if I understood you correctly. Of course you will have to invest some work yourself, since it only reads one stage right now. But it reads the data from every group:

    Option Explicit

    Private Sub CommandButton1_Click()

        ' Make sure you add references to Microsoft Internet Controls (shdocvw.dll)
        ' and Microsoft HTML Object Library.
        ' The code will NOT run otherwise.

        Dim objIE As SHDocVw.InternetExplorer ' Microsoft Internet Controls (shdocvw.dll)
        Dim htmlDoc As MSHTML.HTMLDocument ' Microsoft HTML Object Library
        Dim htmlInput As MSHTML.HTMLInputElement
        Dim htmlColl As MSHTML.IHTMLElementCollection

        Set objIE = New SHDocVw.InternetExplorer

        Dim htmlCurrentDoc As MSHTML.HTMLDocument ' Microsoft HTML Object Library

        ' Next free worksheet row to write to
        Dim RowNumber As Integer
        RowNumber = 1

        With objIE
            .Navigate "http://worldoftanks.com/en/tournaments/1000000017/" ' Main page
            .Visible = False

            ' Wait until the page has finished loading (4 = READYSTATE_COMPLETE)
            Do While .ReadyState <> 4
                DoEvents
            Loop
            Application.Wait (Now + TimeValue("0:00:01"))

            Set htmlDoc = .document

            ' Buttons for the rounds/stages (collected, but not used yet)
            Dim ButtonRoundData As Variant
            Set ButtonRoundData = htmlDoc.getElementsByClassName("group-stage_link")

            ' Buttons for the groups of the currently displayed stage
            Dim ButtonData As Variant
            Set ButtonData = htmlDoc.getElementsByClassName("groups_link")

            Dim button As HTMLLinkElement
            For Each button In ButtonData

                Debug.Print button.nodeName

                button.Click

                ' This is to prevent double entries, but it is not clean. You should
                ' definitely check whether the table is still the same and, if so, keep waiting.
                Application.Wait (Now + TimeValue("0:00:02"))

                Set htmlCurrentDoc = .document
                Dim RawData As HTMLTable
                Set RawData = htmlCurrentDoc.getElementsByClassName("tournament-table tournament-table__indent")(0)

                Dim ColumnNumber As Integer
                ColumnNumber = 1

                ' Copy the table cell by cell to the worksheet
                Dim hRow As HTMLTableRow
                Dim hCell As HTMLTableCell
                For Each hRow In RawData.Rows
                    For Each hCell In hRow.Cells
                        Cells(RowNumber, ColumnNumber).Value = hCell.innerText
                        ColumnNumber = ColumnNumber + 1
                    Next hCell
                    ColumnNumber = 1
                    RowNumber = RowNumber + 1
                Next hRow

                ' Leave a gap before the next group's table
                RowNumber = RowNumber + 3
            Next button
        End With

    End Sub

    

    It starts an invisible IE instance, reads the data, clicks the next group button, reads the next table, and so on.

    For debugging I suggest setting .Visible to 1, so you can see what happens.

    EDIT 1: If you get a debugging error, try to abort and run it again. It definitely needs some error handling for the case that the website isn't loaded properly.

    EDIT 2: Made it a bit more stable. You should really pay attention here: since the webpage takes some time to load, you MUST check whether the data has changed before writing it. If it hasn't changed, wait a second or so and then try again.
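
    A minimal sketch of such a check, not part of the original code: the hypothetical helper WaitForTableChange polls the table until its text differs from the text read before the click, and gives up after a timeout. The function name, its parameters and the 10-second default are assumptions; the class name you pass in is the one already used in the Sub above.

    ```vba
    ' Hypothetical helper (not in the original answer).
    ' Waits until the first element with the given class name shows text that
    ' differs from strPreviousText, or until the timeout has elapsed.
    ' Returns True if the content changed, False on timeout.
    ' Requires a reference to the Microsoft HTML Object Library.
    Private Function WaitForTableChange(ByVal htmlDoc As MSHTML.HTMLDocument, _
                                        ByVal strClassName As String, _
                                        ByVal strPreviousText As String, _
                                        Optional ByVal lngTimeoutInSeconds As Long = 10) As Boolean
        Dim objTables As MSHTML.IHTMLElementCollection
        Dim dateStart As Date

        dateStart = Now
        Do
            DoEvents ' let IE and the page's JavaScript continue to run
            Set objTables = htmlDoc.getElementsByClassName(strClassName)
            If objTables.Length > 0 Then
                If objTables(0).innerText <> strPreviousText Then
                    WaitForTableChange = True
                    Exit Function
                End If
            End If
        Loop While Now < DateAdd("s", lngTimeoutInSeconds, dateStart)
    End Function
    ```

    Inside the loop you could keep the innerText of the table written in the previous iteration (an empty string before the first click) and, after button.Click, call WaitForTableChange(.document, "tournament-table tournament-table__indent", strPreviousText) instead of the fixed two-second wait. If it returns False, the table never changed within the timeout, so skipping or retrying that group is probably safer than writing the old data again.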

    Here is some sample data I got in Excel:

  • 2021-01-06 20:10

    Although extracting data from a webpage can be automated with VBA (see below), the specific example webpage you provided comes with some obstacles:

    This webpage loads and displays only a small portion of the desired data at a time. This is probably done for performance reasons, since the whole table of Teams would consist of several thousand entries. Only the Teams of the currently displayed Round and currently displayed Group are loaded. If you click on another Group, a JavaScript program (running in your browser) is started that connects to the server, fetches the Teams of that Group and replaces the data in the webpage. You can verify this yourself by pressing F12 and watching the Network tab, which lists all requests to the server.

    Thus, the webpage does not provide at any point a complete list of Teams. You would have to work around this:

    1. Make your program automatically click on each Round, then click on each Group and finally extract the 9 teams of that Group, merging everything together afterwards.
    2. Hook into the JavaScript code that loads each Group's Teams and call it in a loop, or reverse-engineer the requests made by that code and try to re-create them in VBA. Although this could be an elegant solution, many website owners do not like having their API used in ways they did not intend. A misuse could create a huge server load. I would only recommend this method if the API was designed for this purpose (some websites do this, like Twitter or Steam).

    The following will focus on just extracting content from a given page, that is, retrieving the Teams of the currently loaded Group. I won't use any of the workarounds mentioned above.

    The program basically consists of these three parts:

    Open Webpage

    The following is a helper function that opens a webpage and returns an object with the webpage's content. It needs references to the libraries Microsoft Internet Controls and Microsoft HTML Object Library (in the VBA editor, set them under Tools > References).

    ' return the document containing the DOM of the page strWebAddress
    ' returns Nothing if the timeout lngTimeoutInSeconds was reached
    Public Function GetIEDocument(ByVal strWebAddress As String, Optional ByVal lngTimeoutInSeconds As Long = 15) As MSHTML.HTMLDocument
        Dim IE As SHDocVw.InternetExplorer
        Dim IEDocument As MSHTML.HTMLDocument
        Dim dateNow As Date
    
        ' create an IE application, representing a tab
        Set IE = New SHDocVw.InternetExplorer
    
        ' optionally make the application visible, though it will work perfectly fine in the background otherwise
        IE.Visible = True
    
        ' open a webpage in the tab represented by IE and wait until the main request successfully finished
        ' after lngTimeoutInSeconds the function exits early and therefore returns Nothing
        IE.Navigate strWebAddress
        dateNow = Now
        Do While IE.Busy
            DoEvents ' keep Excel responsive while waiting
            If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
        Loop
    
        ' retrieve the webpage's content (that is, the HTML DOM) and wait until everything is loaded (images, etc.)
        ' after lngTimeoutInSeconds the function exits early and therefore returns Nothing
        Set IEDocument = IE.Document
        dateNow = Now
        Do While IEDocument.ReadyState <> "complete"
            DoEvents ' keep Excel responsive while waiting
            If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
        Loop
    
        Set GetIEDocument = IEDocument
    End Function
    

    Extract Information

    You can now load the webpage by using Set IEDocument = GetIEDocument("http://worldoftanks.com/en/tournaments/1000000017/"). The object IEDocument then contains everything you need to extract the desired data.

    First you need to find the part that you want to extract (the critical "Tag", as you called it). Since the content of a webpage is represented as a tree of HTML tags, you need to find the table tag that contains all other tags that you are interested in. You already spotted it in your 16/03/19 1600 update. The <table> tag contains two <tr> tags (table row), the first being the header row filled with <th> tags (table header) representing the header of a single column. The second row is a dummy row representing the entry of one Team.

    The preceding line <!-- ko foreach: {data: rrBrackets().teams, as: 'team' } --> is part of the Knockout framework, a JavaScript library employed by the website to dynamically fill the bare HTML tags with content. This is the reason why there is only one row in the HTML source, but in the rendered page you see nine rows: after the page is loaded, the JavaScript code loops over the list of Teams and creates a new row for each, populated with their respective data.

    This, however, does not need to concern us: IEDocument contains the final version of the HTML DOM, after all loading was done (also see edit at the bottom). The first row looks actually like this (press F12 and have a look at the DOM Explorer tab for yourself):

    <tr class="tournament-table_tr" data-bind="css: {'tournament-table_tr__my-team': team.team_id === $root.currentUserTeamIdInCurrentGroup()}">
        <td class="tournament-table_td" data-bind="text: team.position">1</td>
        <td class="tournament-table_td" data-bind="css: {'tournament-table_td__my-team': team.team_id === $root.currentUserTeamIdInCurrentGroup()}">
            <a class="tournament-table_team tournament-table_team__big" href="/en/tournaments/1000000017/team/1000006728/" target="_blank" data-bind="text: team.team_title, attr: {href: $root.getTournamentTeamUrl(team.team_id)}">Pubbies</a>
        </td>
        <td class="tournament-table_td" data-bind="text: team.battle_played">8</td>
        <td class="tournament-table_td" data-bind="text: team.wins">7</td>
        <td class="tournament-table_td tournament-table_td__mobile-hide" data-bind="text: team.losses">1</td>
        <td class="tournament-table_td tournament-table_td__mobile-hide" data-bind="text: team.draws">0</td>
        <td class="tournament-table_td" data-bind="text: team.extra_statistics.points">21</td>
    </tr>
    

    Programmatically finding the tag in the first place is, however, a bit more complicated. Usually structurally important tags have an id attribute that is unique. In such a case we could simply find it by using IEDocument.getElementById("id_of_table_tag"). In this case our best bet is probably searching for the heading Tournament brackets:

    <div class="wrapper">
        <h2 class="tournament-heading">Tournament brackets</h2>
    </div>
    

    If you inspect the following tree of HTML tags, to get to our <table> tag we need to go one step up in the hierarchy, skip the next two tags and from there on, use the first child tag for the next two tags:

    ' retrieve anchor element
    For Each objH2 In IEDocument.getElementsByTagName("h2")
        If objH2.innerText = "Tournament brackets" Then Exit For
    Next objH2
    
    ' traverse HTML tree to desired table element
    ' * move up one element in the hierarchy
    ' * skip two elements to proceed to the third (interjected each time with whitespace that is interpreted as an element of its own)
    ' * move down two elements in the hierarchy
    ' this may fail if the JavaScript code has not already populated the table
    Set objTable = objH2.parentElement _
                        .nextSibling.nextSibling _
                        .nextSibling.nextSibling _
                        .nextSibling.nextSibling _
                        .children(0) _
                        .children(0)
    

    As you can imagine, this is not very robust and is bound to break as soon as the layout of the webpage changes. There are other ways to traverse the tree of HTML tags to finally reach the tag you seek. See the documentation of the Document object for more.
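
    One such alternative, sketched under assumptions: instead of walking siblings, look the table up by its CSS class. The class name "tournament-table" is taken from the code in the other answer and is not verified against the page here, and FindTableByClass is a hypothetical helper name.

    ```vba
    ' Hypothetical alternative to the sibling traversal (not in the original answer).
    ' Returns the first <table> element carrying the given CSS class,
    ' or Nothing if no such element exists (yet).
    ' Requires a reference to the Microsoft HTML Object Library.
    Public Function FindTableByClass(ByVal IEDocument As MSHTML.HTMLDocument, _
                                     ByVal strClassName As String) As MSHTML.HTMLTable
        Dim objMatches As MSHTML.IHTMLElementCollection
        Dim objElement As Object

        Set objMatches = IEDocument.getElementsByClassName(strClassName)
        For Each objElement In objMatches
            ' other tags may share the class, so only accept a real table
            If UCase$(objElement.tagName) = "TABLE" Then
                Set FindTableByClass = objElement
                Exit Function
            End If
        Next objElement
    End Function
    ```

    With this, the traversal would shrink to Set objTable = FindTableByClass(IEDocument, "tournament-table"), followed by a check for Nothing. It still depends on the site keeping that class name, so it only trades one assumption about the layout for another.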

    All we need to do now is loop over the Rows of objTable and output each of its Cells.

    Output to Excel

    As for the output, in this example, we keep it as simple as possible. Put together with the above, the following code just outputs the table to the current worksheet in Excel:

    Public Sub GetTeamData()
        Dim strWebAddress As String
        Dim strH2AnchorContent As String
        Dim IEDocument As MSHTML.HTMLDocument
        Dim objH2 As MSHTML.HTMLHeaderElement
        Dim objTable As MSHTML.HTMLTable
        Dim objRow As MSHTML.HTMLTableRow
        Dim objCell As MSHTML.HTMLTableCell
        Dim lngRow As Long
        Dim lngColumn As Long
    
        ' initialize some variables that should probably better be passed as parameters or defined as constants
        strWebAddress = "http://worldoftanks.com/en/tournaments/1000000017/"
        strH2AnchorContent = "Tournament brackets"
    
        ' open page
        Set IEDocument = GetIEDocument(strWebAddress)
        If IEDocument Is Nothing Then
            MsgBox "Timeout reached opening this address:" & vbNewLine & strWebAddress, vbCritical
            Exit Sub
        End If
    
        ' retrieve anchor element
        For Each objH2 In IEDocument.getElementsByTagName("h2")
            If objH2.innerText = strH2AnchorContent Then Exit For
        Next objH2
        If objH2 Is Nothing Then
            MsgBox "Could not find """ & strH2AnchorContent & """ in DOM!", vbCritical
            Exit Sub
        End If
    
        ' traverse HTML tree to desired table element
        ' * move up one element in the hierarchy
        ' * skip two elements to proceed to the third (interjected each time with whitespace that is interpreted as an element of its own)
        ' * move down two elements in the hierarchy
        Set objTable = objH2.parentElement _
                            .nextSibling.nextSibling _
                            .nextSibling.nextSibling _
                            .nextSibling.nextSibling _
                            .children(0) _
                            .children(0)
    
        ' iterate over the table and output its contents
        lngRow = 1
        For Each objRow In objTable.rows
            lngColumn = 1
            For Each objCell In objRow.cells
                Cells(lngRow, lngColumn) = objCell.innerText
                lngColumn = lngColumn + 1
            Next objCell
            lngRow = lngRow + 1
        Next objRow
    End Sub
    

    Although this is only a partial solution for your current problem, this offers a general solution for how to programmatically extract data from a website using VBA. As you said that you regularly encounter such problems, this might be of some use to you nonetheless.


    Edit

    1. In his answer, Doktor OSwaldo rightfully declares the objects as exactly what they are - in contrast to my previous version where everything was of type Object. I didn't know of the Microsoft HTML Object Library. Thanks @Doktor OSwaldo. :) I incorporated the use of the library in my code above.
    2. You should be aware that at the moment objTable is set, the element might not yet exist in the DOM, because the JavaScript may not yet have filled in all the data. You could put a loop around this statement that checks whether objTable was indeed successfully set:

           On Error Resume Next
           Do
               Err.Clear
               Set objTable = ...
           Loop While Err
           On Error GoTo 0

       You should probably include a timeout option as shown in function GetIEDocument(). All of this is best moved to a separate function that also clicks the Round and Group buttons as shown in Doktor OSwaldo's answer.
    3. As you probably have already noticed, the header columns are output twice. This is actually correct because of the way the icon is shown before the header text. You can identify this with objCell.tagName = "TH" And objCell.children.length = 2, in which case you should use objCell.children(1).innerText instead of objCell.innerText to output to Excel.
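
    Points 2 and 3 could be sketched as two small helpers. Both names (WaitForTable, GetCellText) and the 15-second default are assumptions for illustration; the traversal chain and the header check are the ones described above. The anchor is passed As Object here so that the traversal is resolved at run time and any element can serve as anchor.

    ```vba
    ' Hypothetical helper for point 2 (not in the original answer).
    ' Repeats the sibling traversal starting at the anchor element until it
    ' succeeds or lngTimeoutInSeconds have passed. Returns Nothing on timeout.
    Public Function WaitForTable(ByVal objAnchor As Object, _
                                 Optional ByVal lngTimeoutInSeconds As Long = 15) As MSHTML.HTMLTable
        Dim objTable As MSHTML.HTMLTable
        Dim dateStart As Date

        dateStart = Now
        On Error Resume Next
        Do
            Err.Clear
            Set objTable = Nothing
            Set objTable = objAnchor.parentElement _
                                .nextSibling.nextSibling _
                                .nextSibling.nextSibling _
                                .nextSibling.nextSibling _
                                .children(0) _
                                .children(0)
            If Err.Number = 0 And Not objTable Is Nothing Then Exit Do
            DoEvents ' keep Excel responsive and let the page's JavaScript run
        Loop While Now < DateAdd("s", lngTimeoutInSeconds, dateStart)
        On Error GoTo 0

        Set WaitForTable = objTable
    End Function

    ' Hypothetical helper for point 3 (not in the original answer).
    ' Returns the text to write to Excel for one table cell. Header cells with
    ' two children (icon plus text) only contribute the text of the second child.
    Public Function GetCellText(ByVal objCell As MSHTML.HTMLTableCell) As String
        If UCase$(objCell.tagName) = "TH" And objCell.children.Length = 2 Then
            GetCellText = objCell.children(1).innerText
        Else
            GetCellText = objCell.innerText
        End If
    End Function
    ```

    In GetTeamData you would then write Set objTable = WaitForTable(objH2), show a message and exit if the result Is Nothing, and output GetCellText(objCell) instead of objCell.innerText.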