Scraping data from all asp.net pages with AJAX pagination implemented

后端 未结 4 1498
生来不讨喜
生来不讨喜 2021-02-01 23:09

I want to scrap a webpage containing a list of user with addresses, email etc. webpage contain list of user with pagination i.e. page contains 10 users when I click on page 2 li

相关标签:
4条回答
  • 2021-02-01 23:41

    I got some test code working using yours as a basis and the only issue I found was this line.

    CURLOPT_POSTFIELDS => urlencode('ctl00%24scriptManager=ctl00%24cphMainContent%24ctl00%24cphMainContent%24rdpMembersPanel%7C' . $pageNumberCodes[0] . '&__EVENTTARGET=' . $pageNumberCodes[0] . '&__EVENTARGUMENT=' . '&__VIEWSTATE=' . $viewstate . '&__EVENTVALIDATION=' . $eventvalidation . "&google=" . '&ctl00%24cphMainContent%24txtZip=' . '&ctl00%24cphMainContent%24cboRadius=Exact' . '&ctl00%24cphMainContent%24txtMemberName=' . '&ctl00%24cphMainContent%24txtCity=Honolulu' . '&ctl00%24cphMainContent%24cboState=HI' . '&ctl00%24cphMainContent%24txtAddress=' . '&ctl00_cphMainContent_rdpMembers_ClientState=' . '&ctl00%24cphMainContent%24ddList=-Select%20field%20to%20sort-' . '&ctl00_cphMainContent_ddList_ClientState=' . '&ctl00_cphMainContent_rdlMembers_ClientState=' . '&ctl00_cphMainContent_ddList_ClientState=' . '&ctl00_cphMainContent_rdlMembers_ClientState=' . '&ctl00_cphMainContent_rdpMembers1_ClientState=' . '&__ASYNCPOST=true' . 'RadAJAXControlID=ctl00_cphMainContent_RadAjaxManager1')
    

    would need the urlencode moved to look like so

    CURLOPT_POSTFIELDS => 'ctl00%24scriptManager=ctl00%24cphMainContent%24ctl00%24cphMainContent%24rdpMembersPanel%7C' . $pageNumberCodes[0] . '&__EVENTTARGET=' . $pageNumberCodes[0] . '&__EVENTARGUMENT=' . '&__VIEWSTATE=' . rawurlencode($viewstate) . '&__EVENTVALIDATION=' . rawurlencode($eventvalidation) . "&google=" . '&ctl00%24cphMainContent%24txtZip=' . '&ctl00%24cphMainContent%24cboRadius=Exact' . '&ctl00%24cphMainContent%24txtMemberName=' . '&ctl00%24cphMainContent%24txtCity=Honolulu' . '&ctl00%24cphMainContent%24cboState=HI' . '&ctl00%24cphMainContent%24txtAddress=' . '&ctl00_cphMainContent_rdpMembers_ClientState=' . '&ctl00%24cphMainContent%24ddList=-Select%20field%20to%20sort-' . '&ctl00_cphMainContent_ddList_ClientState=' . '&ctl00_cphMainContent_rdlMembers_ClientState=' . '&ctl00_cphMainContent_ddList_ClientState=' . '&ctl00_cphMainContent_rdlMembers_ClientState=' . '&ctl00_cphMainContent_rdpMembers1_ClientState=' . '&__ASYNCPOST=true' . 'RadAJAXControlID=ctl00_cphMainContent_RadAjaxManager1'
    
    0 讨论(0)
  • 2021-02-01 23:50

    My best advice is to use iMacros https://addons.mozilla.org/en-US/firefox/addon/imacros-for-firefox/

    iMacros :

    1. Record your flow of page downloading. http://wiki.imacros.net/First_Steps
    2. Save web page to local directory. http://wiki.imacros.net/SAVEAS
    3. Scrap email, addresses etc using PHP script.

    No matter whether it's ajax - .aspx, .jsp or .php.

    0 讨论(0)
  • 2021-02-01 23:51

    In general, in order to fake the ASP.NET web site to think that you actually pressed a button (in more general terms - performed a postback), you need to do the following:

    1. Get the value of every single INPUT and SELECT element on the page. It might not be required in every scenario, but you should always at least get the values of all hidden fields where the name starts with "__" (such as __VIEWSTATE). You don't really need to know what is written in them - just that the value in them has to be sent back to the server unchanged.

    2. Create a POST request to the server. You need to use the classic POST, avoiding any AJAX requests. Using some browser plugins (in Firefox or Chrome) it might be possible to disable XMLHttpRequest so you can then intercept the non-AJAX request with tools like Fiddler.

    3. Add every value from #1 to that post request. There are only two values you need to overwrite: __EVENTTARGET and __EVENTARGUMENT. You would leave those empty except if the link or button that you try to imitate has a onclick handler like <a href="javascript:__doPostBack('ctl00$login','')">. If it is, parse the values from this link - the first one is the event target (it usually will match the ID of some element on the page), the second is the event argument.

    4. If you executed the request correctly, you should get back HTML page. If you get a partial response, check if you didn't pass the HTTP header that asks for async result.

    0 讨论(0)
  • 2021-02-01 23:59

    I would recommend branching out into Ruby and trying Capybara which is a sane way of using Selenium. It lets you do a visit of a page, then examine the actual DOM. You can click on everything, wait for events, etc. It uses a real browser.

    visit "http://www.google.com" 
    page.find("button[name=btnK]")
    
    0 讨论(0)
提交回复
热议问题