Is there a way to slow down a Web Scraper so it will pick up the code?


I wrote a macro to go to WU (Weather Underground) to get historical data and, for the most part, it works. However, I believe that the macro is running too fast for it to pick up the data from the page.

1 Answer

    One way around this is to call the API that the page is using to get that info.

    The API returns json which you can parse with a json parser. I use jsonconverter.bas. After installing the code from that link in a standard module called JsonConverter, go to VBE > Tools > References > Add a reference to Microsoft Scripting Runtime.
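    A minimal sketch to confirm the setup works (the json literal and Sub name here are illustrative only):

    Public Sub TestJsonConverter()
        'quick sanity check that JsonConverter and the Scripting Runtime reference are in place
        Dim json As Object, obs As Object, item As Object
        Set json = JsonConverter.ParseJson("{""observations"":[{""temp"":42}]}")
        Set obs = json("observations")           'a Collection of Dictionaries
        For Each item In obs
            Debug.Print item("temp")             'prints 42 to the Immediate window
        Next
    End Sub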


    Finding the API:

    If you press F12 to open the developer tools, go to the Network tab, and then press F5 to refresh the url of interest, you will see the recorded web traffic. You can find the API call there.

    See my answer here on how to search the network traffic using a specific observation value you expect to see in the response - this will filter the list of network traffic to those items containing the value of interest. Be sensible in selecting the value - you want something unlikely to occur elsewhere. You can also filter the network traffic to XHR only.


    The API response:

    The API returns json. More specifically, it returns a dictionary containing 2 keys. The second key, "observations", can be used to return a collection (denoted by []) of dictionaries (denoted by {}). Each dictionary represents a row of the table (daily observations). You can loop this collection, and then loop the inner dictionaries, to access the table row values and reconstruct the table by populating an array. You can explore an example json response here.
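    For orientation, the response has roughly this shape ("observations" and the two unix timestamp fields are confirmed by the code below; "metadata" and the remaining field names are illustrative assumptions):

        {
            "metadata": { ... },
            "observations": [
                {"valid_time_gmt": 1606521600, "expire_time_gmt": 1606528800, "temp": 57, ...},
                {"valid_time_gmt": 1606523400, "expire_time_gmt": 1606530600, "temp": 58, ...}
            ]
        }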


    Explanation of json structure:

    [Image of the annotated json structure omitted.]


    Explanation of code:

    The code is broken down into a number of helper subs and functions, allocating certain tasks to each, to make the code easier to debug and follow, as well as to better align with object-oriented programming principles.

    Overall the process is:

    1. Gather the urls from Worksheets("Sheet1"). This is handled by helper function GetAllUrls.
    2. Process those urls and only retain the dates which correspond with Tue-Thu. These are kept as strings formatted as "yyyymmdd" so they can be passed to the API later. This is handled by helper functions GetOnlyQualifyingUrlsDates and IncludeThisDate. IncludeThisDate performs the check for whether to include a date; GetOnlyQualifyingUrlsDates handles the looping and formatting of results.
    3. Issue xmlhttp requests by looping over the qualifying url dates and concatenating each into the url for the API call. This is performed by the main sub GetTables.
    4. Sheet creation, for output, is handled by helper function CreateWorksheet. This function calls another helper function, SheetExists, to ensure sheets are only created if they don't already exist, otherwise, the existing sheet by that name is used.
    5. The resultant json response, from step 3, is passed to a helper sub WriteOutResults which accepts the json variable and the output sheet object as arguments. It extracts all the info from the json response, essentially reconstructing the table, and adds the table and headers to the appropriate sheet. It calls helper function Epoch2Date, which handles the unix timestamp to datetime conversion for the two unix fields in the json object (see the quick example after this list).
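
    For example, with the code below in place, a quick check in the Immediate window (1606521600 is the unix timestamp for 2020-11-28 00:00:00 UTC; the exact date display depends on your regional settings):

        ?Epoch2Date(1606521600)
        28/11/2020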

    TODO:

    1. The API key may be time limited. Add a helper function which returns the current valid key.
    2. The API accepts start date and end date parameters in the url. It would be far better to issue one request for the entire range if possible, or for chunked ranges e.g. months, to reduce the number of requests made; this would also reduce the likelihood of being blocked. Some additional code would then be needed, before writing out results, to ensure only dates of interest are written to sheets. Alternatively, you could write everything out, then loop the sheets and delete those that aren't wanted (perfectly doable if we are talking about 365 dates in total). Personally, I would handle the date filtering while constructing the table from a single request (if possible) whose start and end date parameters are the min and max dates across all the listed urls, and then write a single flat table to one sheet, as this will be much easier for later data analysis (see the sketch after this list).
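
    A minimal sketch of the single-request idea from point 2 (the Sub name and date range are illustrative; the url construct and API key are taken from the main code below):

    Public Sub GetTablesBatched()
        'sketch only: one request covering the whole range rather than one request per date
        Dim json As Object, url As String, startDate As String, endDate As String
        startDate = "20180101"                   'illustrative: min date across all urls
        endDate = "20181231"                     'illustrative: max date across all urls
        url = "https://api.weather.com/v1/geocode/31.76/-106.49/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&startDate=" & startDate & "&endDate=" & endDate & "&units=e"
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", url, False
            .send
            Set json = JsonConverter.ParseJson(.responseText)("observations")
        End With
        'json now holds every observation in the range; filter to qualifying dates
        'here, then write a single flat table out to one sheet
    End Sub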

    VBA:

    Option Explicit
    
    Public Sub GetTables()
        'VBE > Tools > References > Microsoft Scripting Runtime
        Dim json As Object, qualifyingUrlsDates(), urls(), url As String
        Dim ws As Worksheet, wsOutput As Worksheet, i As Long, startDate As String, endDate As String
    
        Set ws = ThisWorkbook.Worksheets("Sheet1")
        urls = GetAllUrls(2, ws, "A")
        qualifyingUrlsDates = GetOnlyQualifyingUrlsDates(urls)
    
        'The API key may not be valid over time, so consider obtaining it with a prior request
    
        With CreateObject("MSXML2.XMLHTTP")          'issue xmlhttp request for each valid date
                                                     '(better done with startDate/endDate batches
                                                     'e.g. months within the total range, to cut down on requests)
            For i = LBound(qualifyingUrlsDates) To UBound(qualifyingUrlsDates)
                startDate = qualifyingUrlsDates(i)
                endDate = startDate                 ' a little verbose but useful for explaining
                url = "https://api.weather.com/v1/geocode/31.76/-106.49/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&startDate=" & startDate & "&endDate=" & endDate & "&units=e"
                .Open "GET", url, False
                .send
                Set json = JsonConverter.ParseJson(.responseText)("observations")
                Set wsOutput = CreateWorksheet(qualifyingUrlsDates(i))
                WriteOutResults wsOutput, json
            Next
        End With
    End Sub
    
    Public Sub WriteOutResults(ByVal wsOutput As Worksheet, ByVal json As Object)
    'json is a collection of dictionaries. Each dictionary is a time period reading from the day i.e. one row in output
        Dim results(), item As Object, headers(), r As Long, c As Long, key As Variant
        headers = json.item(1).keys 'get the headers which are the keys of each dictionary
        ReDim results(1 To json.Count, 1 To UBound(headers) + 1)
        For Each item In json
            r = r + 1: c = 0 'increase row in results array to store results for table row
            For Each key In item.keys
                c = c + 1 'increase column number in results array for writing out results
                Select Case key
                Case "valid_time_gmt", "expire_time_gmt" 'convert unix timestamp fields to datetime
                    results(r, c) = Epoch2Date(item(key))
                Case Else
                    results(r, c) = item(key)
                End Select
            Next
        Next
        With wsOutput
            .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
            .Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
        End With
    End Sub
    
    Public Function GetOnlyQualifyingUrlsDates(ByVal urls As Variant) As Variant
        Dim i As Long, output(), counter As Long
        ReDim output(1 To UBound(urls))
    
        For i = LBound(urls) To UBound(urls)
            If IncludeThisDate(urls(i)) Then 'check if weekday is to be included
                counter = counter + 1
                output(counter) = Format$(Right$(urls(i), 8), "yyyymmdd") 'if included, add the formatted date string to the output array
            End If
        Next
        ReDim Preserve output(1 To counter)
        GetOnlyQualifyingUrlsDates = output
    End Function
    
    Public Function IncludeThisDate(ByVal url As String) As Boolean
        'tue, wed, thu are valid (Weekday 3, 4, 5 with vbSunday as the first day)
        'assumes the last 8 characters of the url form a date string VBA can parse e.g. "2018-7-3"
        IncludeThisDate = Not IsError(Application.Match(Weekday(Right$(url, 8), vbSunday), Array(3, 4, 5)))
    End Function
    
    Public Function SheetExists(ByVal sheetName As String) As Boolean '<==  function by @Rory
        SheetExists = Evaluate("ISREF('" & sheetName & "'!A1)")
    End Function
    
    Public Function GetAllUrls(ByVal startRow As Long, ByVal ws As Worksheet, ByVal columnName As String) As Variant
        'transpose used on the premise there are no more than a couple of years of dates
        'startRow is the first row of urls, ws is the sheet containing them, columnName is the column letter e.g. "A"
        With ws
            GetAllUrls = Application.Transpose(.Range(columnName & startRow & ":" & columnName & .Cells(.rows.Count, columnName).End(xlUp).Row).Value)
        End With
    End Function
    
    Public Function CreateWorksheet(ByVal sheetName As String) As Worksheet
        Dim ws As Worksheet
        If SheetExists(sheetName) Then
            Set ws = ThisWorkbook.Worksheets(sheetName)
            'do something.... clear it? Then add new data to it?
        Else
            Set ws = ThisWorkbook.Worksheets.Add
            ws.Name = sheetName
        End If
        Set CreateWorksheet = ws
    End Function
    
    Public Function Epoch2Date(ByVal E As Currency, Optional msFrac) As Date '@ Schmidt http://www.vbforums.com/showthread.php?805245-EPOCH-to-Date-and-vice-versa
        Const Estart As Double = #1/1/1970#
        msFrac = 0
        If E > 10000000000@ Then E = E * 0.001: msFrac = E - Int(E)
        Epoch2Date = Estart + (E - msFrac) / 86400
    End Function
    