Website scraping using jQuery and Ajax

孤独总比滥情好 2020-12-24 09:46

I want to be able to manipulate the HTML of a given URL, something like HTML scraping. I know this can be done using curl or some scraping library. But I would like to know i

6 Answers
  • 2020-12-24 10:06

    http://www.nathanm.com/ajax-bypassing-xmlhttprequest-cross-domain-restriction/

    The only problem is that, for security reasons, both Internet Explorer and Firefox prevent the XMLHttpRequest object from making cross-domain, cross-protocol, or cross-port requests.
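    The three restrictions named above (domain, protocol, port) can be illustrated with a small check. This is only a sketch of the comparison, not browser code; `sameOrigin` is a hypothetical helper name, and it relies on the standard `URL` class available in browsers and Node.js.

```javascript
// Two URLs share an origin only if protocol, host name, and port all match.
// sameOrigin is a hypothetical helper written for illustration.
function sameOrigin(a, b) {
  const first = new URL(a);
  const second = new URL(b);
  return (
    first.protocol === second.protocol && // e.g. "http:" vs "https:"
    first.hostname === second.hostname && // e.g. "example.com" vs "api.example.com"
    first.port === second.port            // default ports are normalized to ""
  );
}
```

    A request from a page to a URL for which this check is false is the kind of request the quoted passage says XMLHttpRequest refuses to make.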

  • 2020-12-24 10:10

    You cannot make an Ajax request to a domain other than the one your website is on, because of the Same Origin Policy. That means you will not be able to do what you want, at least not directly.

    A solution would be to:

    • have some kind of "proxy" on your own server;
    • send your Ajax request to that proxy;
    • have the proxy fetch the page from the other domain and return it to your JS code as the response to the Ajax request.

    This can be done in a couple of lines in almost any language (PHP with curl, for instance). Alternatively, you might be able to use a feature of your web server (see mod_proxy and mod_proxy_http for Apache, for instance).
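    The proxy steps above can be sketched as follows. The answer suggests PHP with curl; this sketch uses Node.js instead so that it stays in JavaScript. Everything named here is an assumption made for illustration: the `/proxy?url=...` query format, the `ALLOWED_HOSTS` list, the port, and the use of the global `fetch` (Node.js 18 or later).

```javascript
// Minimal proxy sketch for Node.js 18+ (global fetch assumed).
const http = require("http");

// Hypothetical allowlist: only these hosts may be fetched through the proxy.
const ALLOWED_HOSTS = ["www.somedomain.com"];

// Read the "url" query parameter and return it as a URL object,
// or null when it is missing, malformed, or not on the allowlist.
function parseTarget(requestUrl) {
  const target = new URL(requestUrl, "http://localhost").searchParams.get("url");
  if (!target) return null;
  let parsed;
  try {
    parsed = new URL(target);
  } catch (err) {
    return null; // not an absolute URL
  }
  const okProtocol = parsed.protocol === "http:" || parsed.protocol === "https:";
  return okProtocol && ALLOWED_HOSTS.includes(parsed.hostname) ? parsed : null;
}

// Fetch the remote page on the server side and relay its body to the caller.
async function handleRequest(req, res) {
  const target = parseTarget(req.url);
  if (!target) {
    res.writeHead(400, { "Content-Type": "text/plain" });
    res.end("Missing or disallowed url parameter");
    return;
  }
  try {
    const remote = await fetch(target);
    const body = await remote.text();
    res.writeHead(remote.status, { "Content-Type": "text/html; charset=utf-8" });
    res.end(body);
  } catch (err) {
    res.writeHead(502, { "Content-Type": "text/plain" });
    res.end("Upstream fetch failed");
  }
}

// Call startProxy(3000) to listen; nothing is started automatically.
function startProxy(port) {
  return http.createServer(handleRequest).listen(port);
}
```

    If this server also serves your page, the browser-side call stays same-origin, for example `$.get("/proxy", { url: "http://www.somedomain.com/" }, callback)`. Restricting the proxy to known hosts is a design choice of this sketch, made so the proxy cannot be used to fetch arbitrary addresses.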

  • 2020-12-24 10:10

    I do this with a small PHP proxy that temporarily strips IMG tags to speed up load times. I've wrapped it in a jQuery plugin that makes it relatively easy to use; see here for the demo/GitHub link.
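    The plugin's own code is not shown here, so the following is only a guess at what stripping IMG tags could look like, written in JavaScript rather than PHP. `stripImgTags` is a hypothetical name, and a regular expression is a crude stand-in for a real HTML parser.

```javascript
// Remove every <img ...> tag from an HTML string so that inserting the
// markup into a page does not trigger image downloads.
// Hypothetical helper; a regex is a rough approximation of real HTML parsing.
function stripImgTags(html) {
  return html.replace(/<img\b[^>]*>/gi, "");
}
```

    Running the fetched markup through such a function before handing it to jQuery would keep the browser from requesting the remote images while you query the rest of the page.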

  • 2020-12-24 10:17

    Instead of curl, you could use a tool like Selenium, which automates loading the page in a browser. You can run JavaScript with it.

  • 2020-12-24 10:27

    I would like to point out that there are situations where it is perfectly acceptable to use jQuery to scrape screens across domains. Windows Sidebar gadgets run in a 'Local Machine Zone' that allows cross-domain scripting.

    jQuery can also apply selectors to retrieved HTML content: append the selector to the URL argument of the load() method, separated by a space.

    The example gadget code below checks this page every hour and reports the total number of page views.

    <html>
    <head>
        <script type="text/javascript" src="jquery.min.js"></script>
        <style>
            body { 
                height: 120px;
                width: 130px;
                background-color: white;
            }
        </style>
    </head>
    
    <body>
    Question Viewed:
    <div id="data"></div>
    
    <script type="text/javascript">
    
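        // Page to scrape; the gadget's Local Machine Zone permits the cross-domain request.
        var url = "http://stackoverflow.com/questions/1936495/website-scraping-using-jquery-and-ajax";
        var intervalID;

        // Load only the elements matching the selector after the space into #data.
        function updateGadget() {
            $("#data").load(url + " .label-value:contains('times')");
        }

        // Refresh once the DOM is ready, then once every hour.
        $(document).ready(function () {
            updateGadget();
            intervalID = setInterval(updateGadget, 60 * 60 * 1000);
        });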
    
    </script>
    
    </body>
    </html>
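
    To make the URL-plus-selector argument passed to load() more concrete, here is a sketch of how such a string divides into its two parts. jQuery's documentation states that everything after the first space is treated as a selector; `splitLoadTarget` is a hypothetical helper written for illustration and is not jQuery's own code.

```javascript
// Split a load()-style argument into the URL to request and the selector
// used to filter the response. Hypothetical helper, not part of jQuery.
function splitLoadTarget(target) {
  const space = target.indexOf(" ");
  if (space === -1) {
    return { url: target, selector: "" }; // no selector: whole response is used
  }
  return {
    url: target.slice(0, space),
    selector: target.slice(space).trim(),
  };
}
```

    For the gadget, the URL part is the question page and the selector part is `.label-value:contains('times')`, so only the matching elements end up in `#data`.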
    
  • 2020-12-24 10:27

    It's not that difficult.

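    $(document).ready(function() {
      // Page to fetch; the same-origin restrictions described in the other answers still apply.
      var baseUrl = "http://www.somedomain.com/";
      $.ajax({
        url: baseUrl,
        type: "get",
        dataType: "html", // treat the response as an HTML string
        success: function(data) {
          // data holds the page markup; wrap it with $(data) to query it with selectors
        }
      });
    });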
    

    I think this can give you a good clue - http://jsfiddle.net/skelly/m4QCt/
