How to use cURL to fetch specific data from a website and then save it to my database using PHP

闹比i 2020-12-03 06:15

Can anyone tell me how to use cURL or file_get_contents to download specific data from a website and then save that data into my MySQL database? I want to get

2 Answers
  • 2020-12-03 06:40

    Using cURL:

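    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_URL, 'http://www.something.com');
    // return the response as a string instead of printing it
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true);
    
    // $content holds the page HTML, or false if the request failed
    $content = curl_exec($ch);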
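
    The question also mentions file_get_contents. As a rough sketch of the same fetch without cURL, something like the following might work, assuming allow_url_fopen is enabled in php.ini. The function name fetch_contents, the 10-second timeout, and the user agent string are all hypothetical choices, not anything from the original answer.

    ```php
    <?php
    // Hypothetical helper: fetch a URL (or local path) and return its body,
    // or null on failure. Remote URLs need allow_url_fopen to be enabled.
    function fetch_contents(string $url, int $timeoutSeconds = 10): ?string
    {
        $context = stream_context_create([
            'http' => [
                'timeout'    => $timeoutSeconds,
                'user_agent' => 'ExampleScraper/1.0', // placeholder value
            ],
        ]);

        // file_get_contents() returns false on failure; @ silences the
        // warning it emits in that case so the caller can handle it.
        $content = @file_get_contents($url, false, $context);

        return $content === false ? null : $content;
    }
    ```

    Calling fetch_contents('http://www.something.com') would then give either the page HTML or null, which can be checked before parsing.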
    

    Then you can load the returned HTML into a DOMDocument object and parse the DOM for the specific data. You could also try to parse the data using search strings, but using regex on HTML is highly frowned upon.

    $dom = new DOMDocument();
    $dom->loadHTML( $content );
    
    // Parse the dom for your desired content
    
    • http://www.php.net/manual/en/class.domdocument.php
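
    To make the "parse the DOM" step concrete, here is a minimal sketch using DOMXPath. The sample markup and the XPath expression are assumptions made up for illustration; a real page would need its own expression, and $content would come from curl_exec().

    ```php
    <?php
    // Hypothetical sample markup standing in for a fetched page.
    $content = '<html><body>'
        . '<div class="info"><h2><a href="/trailer/overnight-2012/trailer">Overnight</a></h2></div>'
        . '<div class="info"><h2><a href="/trailer/example-film/teaser">Example Film</a></h2></div>'
        . '</body></html>';

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($content);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $rows = [];

    // Select every link inside an <h2> inside a div whose class is exactly "info".
    foreach ($xpath->query('//div[@class="info"]//h2/a') as $a) {
        $rows[] = [
            'title' => trim($a->textContent),
            'url'   => $a->getAttribute('href'),
        ];
    }

    print_r($rows);
    ```

    Each entry in $rows then holds one title and its relative link, ready to be turned into an absolute URL and stored.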
  • 2020-12-03 06:54

    This should work, but it's messy, and it may break if the site you are scraping changes its markup:

    $sites[0] = 'http://www.traileraddict.com/';
    
    // use this if you want to retrieve more than one page:
    // $sites[1] = 'http://www.traileraddict.com/trailers/2';
    
    // connect once, before the loops. The mysql_* functions were removed in PHP 7,
    // so this uses mysqli with a prepared statement instead
    // ($host, $user, $password and $database are placeholders you must define)
    $db_conn = mysqli_connect($host, $user, $password, $database) or die(mysqli_connect_error());
    $stmt = mysqli_prepare($db_conn, 'INSERT INTO trailers (url, title) VALUES (?, ?)') or die(mysqli_error($db_conn));
    
    foreach ($sites as $site)
    {
        $ch = curl_init($site);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $html = curl_exec($ch);
    
        // skip this site if the request failed
        if ($html === false) {
            continue;
        }
    
        // ok, you have the whole page in the $html variable
        // now you need to find the common div that contains all the review info
        // and that appears to be <div class="info"> (I think you could use abstract as well)
        $title_start = '<div class="info">';
    
        $parts = explode($title_start, $html);
    
        // now you have an array of the info divs on the page
        // (the first element is everything before the first info div, so skip it)
        foreach (array_slice($parts, 1) as $part)
        {
            // so now you just need to get your title and link from each part
            $link = explode('<a href="/trailer/', $part);
            $title = explode('<h2>', $part);
    
            // skip parts that have no trailer link or no title
            if (!isset($link[1]) || !isset($title[1])) {
                continue;
            }
    
            // this means you now have part of the trailer url, you just need to cut off the end which you don't need:
            $link = explode('">', $link[1]);
    
            // this should give something of the form:
            // overnight-2012/trailer
            // so just make an absolute url out of it:
            $url = 'http://www.traileraddict.com/trailer/'.$link[0];
    
            // now for the title we need to follow a similar process:
            $title = explode('</h2>', $title[1]);
            $title = strip_tags($title[0]);
    
            // insert the row; bound parameters keep quotes in titles from breaking the query
            mysqli_stmt_bind_param($stmt, 'ss', $url, $title);
            mysqli_stmt_execute($stmt) or die(mysqli_stmt_error($stmt));
        }
    }
    

    That should be it. You now have variables for the link and title that you can insert into your database.
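
    For the insert step itself, a prepared statement is one option worth considering, since scraped titles often contain quotes. The sketch below uses PDO with an in-memory SQLite database purely so it runs without a server; that substitution, the table layout, and the sample rows are assumptions. For MySQL only the DSN and credentials would differ.

    ```php
    <?php
    // Assumption: in-memory SQLite stands in for MySQL so the sketch runs anywhere.
    // For MySQL the DSN would look like 'mysql:host=...;dbname=...;charset=utf8mb4'.
    $pdo = new PDO('sqlite::memory:');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $pdo->exec('CREATE TABLE trailers (url TEXT, title TEXT)');

    // Hypothetical scraped values; the apostrophe would break a concatenated query.
    $scraped = [
        ['http://www.traileraddict.com/trailer/overnight-2012/trailer', 'Overnight'],
        ['http://www.traileraddict.com/trailer/example/trailer', "Logan's Run"],
    ];

    // Prepare once, execute once per row with bound values.
    $stmt = $pdo->prepare('INSERT INTO trailers (url, title) VALUES (:url, :title)');
    foreach ($scraped as [$url, $title]) {
        $stmt->execute([':url' => $url, ':title' => $title]);
    }

    $count = (int) $pdo->query('SELECT COUNT(*) FROM trailers')->fetchColumn();
    echo $count, " rows inserted\n";
    ```

    Because the values are bound rather than concatenated into the SQL string, the title with the apostrophe is stored unchanged.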

    DISCLAIMER

    I wrote this off the top of my head at work, so I apologise if it doesn't work straight off the bat. Let me know if it doesn't and I will try to help further.

    Also, I am aware this could be done more cleverly and in fewer steps, but that would involve more thinking on my part, and the OP can do this once they have understood the code I have written. I assume it is more important that they understand what I have done and can edit it themselves.
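
    As one possible shape for the "fewer steps" version, the repeated explode calls could be folded into a small helper. The function text_between is hypothetical, and the sample markup is only modelled on the structure described in this answer, not taken from the live page.

    ```php
    <?php
    // Hypothetical helper: return the text between the first $start and the
    // next $end after it, or null when either marker is missing.
    function text_between(string $haystack, string $start, string $end): ?string
    {
        $from = strpos($haystack, $start);
        if ($from === false) {
            return null;
        }
        $from += strlen($start);

        $to = strpos($haystack, $end, $from);
        if ($to === false) {
            return null;
        }

        return substr($haystack, $from, $to - $from);
    }

    // Sample markup (an assumption) for one info div.
    $part = '<h2><a href="/trailer/overnight-2012/trailer">Overnight</a></h2>';

    $path  = text_between($part, '<a href="/trailer/', '">');
    $title = strip_tags((string) text_between($part, '<h2>', '</h2>'));

    echo $path, "\n";
    echo $title, "\n";
    ```

    The null return makes a missing marker explicit, so a part without a link or title can be skipped instead of triggering an undefined-index notice.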

    Also, I would advise scraping the site at night so as not to burden it with extra traffic, and I would suggest asking that site for permission as well, since if they catch you they will be able to put an end to your scraping :(

    To answer your final point: to run this on a schedule, you would use a cron job.
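
    A sketch of how the script might be prepared for cron, assuming it is saved as scrape.php. The crontab line in the comment, the paths, and the lock file name are all hypothetical. The lock is an extra precaution so that a slow run does not overlap with the next scheduled one.

    ```php
    <?php
    // Hypothetical crontab entry (paths are assumptions) for a nightly run at 03:00:
    //   0 3 * * * /usr/bin/php /path/to/scrape.php

    // Try to take an exclusive, non-blocking lock on a file.
    // Returns the open handle on success, or null if the lock is unavailable.
    function acquire_lock(string $path)
    {
        $handle = fopen($path, 'c');
        if ($handle === false) {
            return null;
        }
        if (!flock($handle, LOCK_EX | LOCK_NB)) {
            fclose($handle);
            return null;
        }
        return $handle;
    }

    $lock = acquire_lock(sys_get_temp_dir() . '/scrape.lock');
    if ($lock === null) {
        exit("Previous run still in progress\n");
    }

    // ... run the scraping loop here ...

    // Release the lock so the next scheduled run can start.
    flock($lock, LOCK_UN);
    fclose($lock);
    ```

    Scheduling the job for the small hours also fits the advice above about not burdening the site with extra traffic.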
