Best way to parse multiple plain text websites that contain no HTML

陌路散爱 submitted on 2019-12-11 14:37:47

Question


I am looking for a way to read multiple (over 50) plain text websites and parse only certain information into an HTML table or a CSV file. By "plain text" I mean that while each one is a web address, it has no HTML associated with it. Below is an example of the source. I am pretty new to this and am looking for help seeing how this could be done.

update-token:179999210
vessel-name:Name Here
vessel-length:57.30
vessel-beam:14.63
vessel-draft:3.35
vessel-airdraft:0.00
time:20140104T040648.259Z
position:25.04876667 -75.57001667 GPS
river-mile:sd 178.71
rate-of-turn:0.0
course-over-ground:58.5
speed-over-ground:0.0
ais-367000000 {
    pos:45.943912 -87.384763 DGPS
    cog:249.8
    sog:0.0
    name:name here
    call:1113391
    imo:8856857
    type:31
    dim:10 20 4 5
    draft:3.8
    destination:
}
ais-367000000 {
    pos:25.949652 -86.384535 DGPS
    cog:105.6
    sog:0.0
    name:CHRISTINE
    call:5452438
    type:52
    status:0
    dim:1 2 3 4
    draft:3.0
    destination:IMTT ST.ROSE
    eta:06:00
}

Thanks for any suggestions you guys might have.


Answer 1:


I may be completely missing the point here, but here is how you could take the contents (assuming you had them as a string) and put them into a PHP key/value array. I "hard-coded" the string you had, and changed one value (the key ais-367000000 was repeated, which makes the second object overwrite the first).

This is a very basic parser that assumes a format like you described above. I give the output below the code:

<?php
echo "<html>";
$s="update-token:179999210
vessel-name:Name Here
vessel-length:57.30
vessel-beam:14.63
vessel-draft:3.35
vessel-airdraft:0.00
time:20140104T040648.259Z
position:25.04876667 -75.57001667 GPS
river-mile:sd 178.71
rate-of-turn:0.0
course-over-ground:58.5
speed-over-ground:0.0
ais-367000000 {
    pos:45.943912 -87.384763 DGPS
    cog:249.8
    sog:0.0
    name:name here
    call:1113391
    imo:8856857
    type:31
    dim:10 20 4 5
    draft:3.8
    destination:
}
ais-367000001 {
    pos:25.949652 -86.384535 DGPS
    cog:105.6
    sog:0.0
    name:CHRISTINE
    call:5452438
    type:52
    status:0
    dim:1 2 3 4
    draft:3.0
    destination:IMTT ST.ROSE
    eta:06:00
}";
$lines = explode("\n", $s);
$output = Array();
$thisElement = & $output;
foreach($lines as $line) {
  $elements = explode(":", $line);
  if (count($elements) > 1) {
    $thisElement[trim($elements[0])] = $elements[1];
  }
  if(strstr($line, "{")) {
      $elements = explode("{", $line);
      $key = trim($elements[0]);
      $output[$key] = Array();
      $thisElement = & $output[$key];
  }
  if(strstr($line, "}")) {
      $thisElement = & $output;
  }
}
echo '<pre>';
print_r($output);
echo '</pre>';
echo '</html>';
?>

Output of the above (can be seen working at http://www.floris.us/SO/ships.php):

Array
(
    [update-token] => 179999210
    [vessel-name] => Name Here
    [vessel-length] => 57.30
    [vessel-beam] => 14.63
    [vessel-draft] => 3.35
    [vessel-airdraft] => 0.00
    [time] => 20140104T040648.259Z
    [position] => 25.04876667 -75.57001667 GPS
    [river-mile] => sd 178.71
    [rate-of-turn] => 0.0
    [course-over-ground] => 58.5
    [speed-over-ground] => 0.0
    [ais-367000000] => Array
        (
            [pos] => 45.943912 -87.384763 DGPS
            [cog] => 249.8
            [sog] => 0.0
            [name] => name here
            [call] => 1113391
            [imo] => 8856857
            [type] => 31
            [dim] => 10 20 4 5
            [draft] => 3.8
            [destination] => 
        )

    [ais-367000001] => Array
        (
            [pos] => 25.949652 -86.384535 DGPS
            [cog] => 105.6
            [sog] => 0.0
            [name] => CHRISTINE
            [call] => 5452438
            [type] => 52
            [status] => 0
            [dim] => 1 2 3 4
            [draft] => 3.0
            [destination] => IMTT ST.ROSE
            [eta] => 06
        )

)
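One thing to notice in the output above: eta comes out as 06 rather than 06:00, because explode(":", $line) splits on every colon and only the second piece is kept. A minimal sketch of one way around this, assuming the key itself never contains a colon, is to pass a limit of 2 so the line is split at the first colon only. The helper name splitKeyValue is hypothetical and not part of the code above:

```php
<?php
// Hypothetical helper: split "key:value" at the first colon only.
// Returns array(key, value), or null when the line has no colon.
function splitKeyValue($line) {
  $elements = explode(":", $line, 2); // limit 2: the value keeps any later colons
  if (count($elements) < 2) {
    return null;
  }
  // trim() also removes a trailing "\r" left over from Windows line endings
  return array(trim($elements[0]), trim($elements[1]));
}

print_r(splitKeyValue("    eta:06:00"));
print_r(splitKeyValue("    destination:IMTT ST.ROSE"));
```

The same limit argument could be dropped into the parsing loop in place of the two-argument explode call.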

A better approach would be to turn the string into "properly formed JSON", then use json_decode. That might look like the following:

<?php
echo "<html>";
$s="update-token:179999210
vessel-name:Name Here
vessel-length:57.30
vessel-beam:14.63
vessel-draft:3.35
vessel-airdraft:0.00
time:20140104T040648.259Z
position:25.04876667 -75.57001667 GPS
river-mile:sd 178.71
rate-of-turn:0.0
course-over-ground:58.5
speed-over-ground:0.0
ais-367000000 {
    pos:45.943912 -87.384763 DGPS
    cog:249.8
    sog:0.0
    name:name here
    call:1113391
    imo:8856857
    type:31
    dim:10 20 4 5
    draft:3.8
    destination:
}
ais-367000001 {
    pos:25.949652 -86.384535 DGPS
    cog:105.6
    sog:0.0
    name:CHRISTINE
    call:5452438
    type:52
    status:0
    dim:1 2 3 4
    draft:3.0
    destination:IMTT ST.ROSE
    eta:06:00
}";

echo '<pre>';
print_r(parseString($s));
echo '</pre>';

function parseString($s) {
  $lines = explode("\n", $s);
  $jstring = "{ ";
  $comma = "";
  foreach($lines as $line) {
    $elements = explode(":", $line);
    if (count($elements) > 1) {
      $jstring = $jstring . $comma . '"' . trim($elements[0]) . '" : "' . $elements[1] .'"';
      $comma = ",";
    }
    if(strstr($line, "{")) {
      $elements = explode("{", $line);
      $key = trim($elements[0]);
      $jstring = $jstring . $comma . '"' . $key .'" : {';
      $comma = "";
    }
    if(strstr($line, "}")) {
      $jstring = $jstring . '} ';
      $comma = ",";
    }
  }
  $jstring = $jstring ."}";
  return json_decode($jstring);
}
echo '</html>';
?>

Demo at http://www.floris.us/SO/ships2.php ; note that I use the variable $comma to make sure that commas are either included, or not included, at various points in the string.

Output of this code looks similar to what we had before:

stdClass Object
(
    [update-token] => 179999210
    [vessel-name] => Name Here
    [vessel-length] => 57.30
    [vessel-beam] => 14.63
    [vessel-draft] => 3.35
    [vessel-airdraft] => 0.00
    [time] => 20140104T040648.259Z
    [position] => 25.04876667 -75.57001667 GPS
    [river-mile] => sd 178.71
    [rate-of-turn] => 0.0
    [course-over-ground] => 58.5
    [speed-over-ground] => 0.0
    [ais-367000000] => stdClass Object
        (
            [pos] => 45.943912 -87.384763 DGPS
            [cog] => 249.8
            [sog] => 0.0
            [name] => name here
            [call] => 1113391
            [imo] => 8856857
            [type] => 31
            [dim] => 10 20 4 5
            [draft] => 3.8
            [destination] => 
        )

    [ais-367000001] => stdClass Object
        (
            [pos] => 25.949652 -86.384535 DGPS
            [cog] => 105.6
            [sog] => 0.0
            [name] => CHRISTINE
            [call] => 5452438
            [type] => 52
            [status] => 0
            [dim] => 1 2 3 4
            [draft] => 3.0
            [destination] => IMTT ST.ROSE
            [eta] => 06
        )

)
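Because parseString assembles the JSON text by hand, a value that contains a double quote or a backslash would produce invalid JSON, and json_decode would then return null. A hedged sketch of one way to guard against that is to let json_encode quote and escape each key and value. The helper name jsonMember and the sample line are made up for illustration:

```php
<?php
// Hypothetical helper: turn one "key:value" line into a JSON member,
// letting json_encode() add the quotes and escape special characters.
function jsonMember($line) {
  $elements = explode(":", $line, 2); // split at the first colon only
  return json_encode(trim($elements[0])) . ' : ' . json_encode(trim($elements[1]));
}

// A value with embedded double quotes, which would break hand-built JSON.
$jstring = '{ ' . jsonMember('name:the "CHRISTINE"') . ' }';
print_r(json_decode($jstring, true));
```

This keeps the build-a-string-then-decode approach intact and only changes how each member is written.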

But maybe your question is "how do I get the text into php in the first place". In that case, you might look at something like this:

<?php
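// one url per line; file() drops the line endings and skips blank lines,
// so a trailing newline does not produce an empty url
$urls = file('/path/to/urlFile.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);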

$responses = Array();

// loop over the urls, and get the information
// then parse it into the $responses array
$i = 0;
foreach($urls as $url) {
  $responses[$i] = parseString(file_get_contents($url));
  $i = $i + 1;
}


function parseString($s) {
  $lines = explode("\n", $s);
  $jstring = "{ ";
  $comma = "";
  foreach($lines as $line) {
    $elements = explode(":", $line);
    if (count($elements) > 1) {
      $jstring = $jstring . $comma . '"' . trim($elements[0]) . '" : "' . $elements[1] .'"';
      $comma = ",";
    }
    if(strstr($line, "{")) {
      $elements = explode("{", $line);
      $key = trim($elements[0]);
      $jstring = $jstring . $comma . '"' . $key .'" : {';
      $comma = "";
    }
    if(strstr($line, "}")) {
      $jstring = $jstring . '} ';
      $comma = ",";
    }
  }
  $jstring = $jstring ."}";
  return json_decode($jstring);
}
?>

I include the same parsing function as before; it's possible to make it much better, or leave it out altogether. Hard to know from your question.

Questions welcome.

UPDATE

Based on the comments, I have added a function that uses curl to fetch the file; let me know if this works for you. I have created a file http://www.floris.us/SO/ships.txt that is an exact copy of the file you showed above, and a page http://www.floris.us/SO/ships3.php that contains the following source code; you can run it and see that it works. Note that this version does not read anything from a .csv file, since you already know how to do that. It just takes the array of URLs, fetches each text file, and converts it to a data structure you can use (to display, or whatever):

<?php
$urls = Array();
$urls[0] = "http://www.floris.us/SO/ships.txt";

$responses = Array();

// loop over the urls, and get the information
// then parse it into the $responses array
$i = 0;
foreach($urls as $url) {
//  $responses[$i] = parseString(file_get_contents($url));
  $responses[$i] = parseString(myCurl($url));
  $i = $i + 1;
}
echo '<html><body><pre>';
print_r($responses);
echo '</pre></body></html>';

function parseString($s) {
  $lines = explode("\n", $s);
  $jstring = "{ ";
  $comma = "";
  foreach($lines as $line) {
    $elements = explode(":", $line);
    if (count($elements) > 1) {
      $jstring = $jstring . $comma . '"' . trim($elements[0]) . '" : "' . $elements[1] .'"';
      $comma = ",";
    }
    if(strstr($line, "{")) {
      $elements = explode("{", $line);
      $key = trim($elements[0]);
      $jstring = $jstring . $comma . '"' . $key .'" : {';
      $comma = "";
    }
    if(strstr($line, "}")) {
      $jstring = $jstring . '} ';
      $comma = ",";
    }
  }
  $jstring = $jstring ."}";
  return json_decode($jstring);
}

function myCurl($f) {
// create curl resource 
   $ch = curl_init();
// set url 
   curl_setopt($ch, CURLOPT_URL, $f); 

//return the transfer as a string 
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 

// $output contains the output string 
   $output = curl_exec($ch); 

// close curl resource to free up system resources 
   curl_close($ch);    
   return $output;
}
?>

Note - because two entries have the same "tag", the second one overwrites the first when using the original source data. If that is a problem let me know. Also if you have ideas on how you actually want to display the data, try to mock up something and I can help you get it right.
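If the overwriting is a problem, one option is to store the { ... } blocks in a list instead of keying them by their tag. The following is only a sketch under that assumption; the function name parseKeepingDuplicates and the 'fields' / 'blocks' / 'tag' layout are invented here for illustration:

```php
<?php
// Sketch: parse the text into top-level fields plus a list of blocks,
// so two blocks with the same tag are both kept.
function parseKeepingDuplicates($s) {
  $result = array('fields' => array(), 'blocks' => array());
  $current = null; // index into $result['blocks'] while inside { ... }
  foreach (preg_split('/\r\n|\r|\n/', $s) as $line) {
    $line = trim($line);
    if ($line === '') {
      continue;
    }
    if (substr($line, -1) === '{') {
      // start of a block such as "ais-367000000 {"
      $result['blocks'][] = array('tag' => trim(substr($line, 0, -1)), 'fields' => array());
      $current = count($result['blocks']) - 1;
    } elseif ($line === '}') {
      $current = null; // back to the top level
    } else {
      $parts = explode(':', $line, 2);
      if (count($parts) < 2) {
        continue; // not a key:value line
      }
      if ($current === null) {
        $result['fields'][trim($parts[0])] = trim($parts[1]);
      } else {
        $result['blocks'][$current]['fields'][trim($parts[0])] = trim($parts[1]);
      }
    }
  }
  return $result;
}

$sample = "vessel-name:Name Here\nais-367000000 {\n    name:name here\n}\nais-367000000 {\n    name:CHRISTINE\n    eta:06:00\n}";
$parsed = parseKeepingDuplicates($sample);
echo count($parsed['blocks']), " blocks\n";
```

Each entry in 'blocks' carries its own tag, so repeated tags no longer collide.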

On the topic of time-outs

There are several possible timeout mechanisms that can be causing you problems; depending on which it is, one of the following solutions may help you:

  1. If the browser doesn't get any response from the server, it will eventually time out. This is almost certainly not your problem right now; but it might become your issue if you fix the other problems
  2. php scripts typically have a built in "maximum time to run" before they decide you sent them into an infinite loop. If you know you will be making lots of requests, and these requests will take a lot of time, you may want to set the time-out higher. See http://www.php.net/manual/en/function.set-time-limit.php for details on how to do this. I would recommend setting the limit to a "reasonable" value inside the curl loop - so the counter gets reset for every new request.
  3. Your attempt to connect to the server may take too long (this is the most likely problem as you said). You can set the value (time you expect to wait to make the connection) to something "vaguely reasonable" like 10 seconds; this means you won't wait forever for the servers that are offline. Use

    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);

for a 10 second wait. See "Setting Curl's Timeout in PHP".

Finally you will want to handle the errors gracefully - if the connection did not succeed, you don't want to process the response. Putting all this together gets you something like this:

$i = 0;
foreach($urls as $url) {
  $temp = myCurl($url);
  if (strlen($temp) == 0) {
    echo 'no response from '.$url.'<br>';
  }
  else {
    // reuse the response already fetched above instead of requesting the url twice
    $responses[$i] = parseString($temp);
    $i = $i + 1;
  }
}

echo '<html><body><pre>';
print_r($responses);
echo '</pre></body></html>';

function myCurl($f) {
// create curl resource 
   $ch = curl_init();
// set url 
   curl_setopt($ch, CURLOPT_URL, $f); 

//return the transfer as a string 
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
   curl_setopt($ch, CURLOPT_NOSIGNAL, 1);
   curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // try for 10 seconds to get a connection
   curl_setopt($ch, CURLOPT_TIMEOUT, 30);        // try for 30 seconds to complete the transaction

// $output contains the output string 
   $output = curl_exec($ch); 

// see if any error was set:
   $curl_errno = curl_errno($ch);

// close curl resource to free up system resources 
   curl_close($ch);    

// make response depending on whether there was an error
   if($curl_errno > 0) {
      return '';
   }
   else {
      return $output;
  }
}
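The loop above sets the curl timeouts but does not reset PHP's own execution timer (point 2 in the list). A minimal sketch of doing that inside the loop follows. The function name fetchAll, the 45 second figure, and the example.com URLs are assumptions for illustration; the fetcher is passed in as a callable so the loop can be tried without network access:

```php
<?php
// Sketch: reset PHP's execution timer before each fetch.
// $fetch stands in for the myCurl() function above.
function fetchAll($urls, $fetch, $secondsPerUrl = 45) {
  $responses = array();
  foreach ($urls as $url) {
    set_time_limit($secondsPerUrl); // restarts the timeout counter from zero
    $body = $fetch($url);
    if (strlen($body) > 0) {
      $responses[$url] = $body;     // keep only urls that answered
    }
  }
  return $responses;
}

// Stand-in fetcher for illustration only; real code would pass 'myCurl'.
$fake = function ($url) {
  return $url === 'http://example.com/up.txt' ? "vessel-name:Name Here\n" : '';
};
$got = fetchAll(array('http://example.com/up.txt', 'http://example.com/down.txt'), $fake);
echo count($got), " response(s)\n";
```

The per-URL limit is chosen here to be a little longer than the 10 second connect timeout plus the 30 second transfer timeout.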

Last update? I have updated the code one more time. It now

  1. Reads a list of URLs from a file (one URL per line - fully formed)
  2. Tries to fetch the contents from each file in turn, handling time-outs and echoing progress to the screen
  3. Creates tables with some of the information from the files (including a reformatted time stamp)

To make this work, I had the following files:

www.floris.us/SO/ships.csv containing three lines with

http://www.floris.us/SO/ships.txt
http://floris.dnsalias.com/noSuchFile.html
http://www.floris.us/SO/ships2.txt

Files ships.txt and ships2.txt at the same location (almost identical copies but for name of ship) - these are like your plain text files.

File ships3.php in the same location. This contains the following source code, that performs the various steps described earlier, and attempts to string it all together:

<?php
$urlstring = file_get_contents('http://www.floris.us/SO/ships.csv');
$urls = explode("\n", $urlstring); // one url per line

$responses = Array();

// loop over the urls, and get the information
// then parse it into the $responses array
$i = 0;
foreach($urls as $url) {
 $temp = myCurl($url);
  if(strlen($temp) > 0) {
    $responses[$i] = parseString($temp);
    $i = $i + 1;
  }
  else {
    echo "URL ".$url." did not respond<br>";
  }
}

// produce the actual output table:
echo '<html><body>';
writeTable($responses);
echo '</body></html>';

// ------------ support functions -------------
function parseString($s) {
  $lines = explode("\n", $s);
  $jstring = "{ ";
  $comma = "";
  foreach($lines as $line) {
    $elements = explode(":", $line);
    if (count($elements) > 1) {
      $jstring = $jstring . $comma . '"' . trim($elements[0]) . '" : "' . $elements[1] .'"';
      $comma = ",";
    }
    if(strstr($line, "{")) {
      $elements = explode("{", $line);
      $key = trim($elements[0]);
      $jstring = $jstring . $comma . '"' . $key .'" : {';
      $comma = "";
    }
    if(strstr($line, "}")) {
      $jstring = $jstring . '} ';
      $comma = ",";
    }
  }
  $jstring = $jstring ."}";
  return json_decode($jstring, true);
}

function myCurl($f) {
// create curl resource 

   $ch = curl_init();
// set url 
   curl_setopt($ch, CURLOPT_URL, $f); 

//return the transfer as a string 
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
   curl_setopt($ch, CURLOPT_NOSIGNAL, 1);
   curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // try for 10 seconds to get a connection
   curl_setopt($ch, CURLOPT_TIMEOUT, 30);        // try for 30 seconds to complete the transaction

// $output contains the output string 
   $output = curl_exec($ch); 

// see if any error was set:
   $curl_errno = curl_errno($ch);
   $curl_error = curl_error($ch);

// close curl resource to free up system resources 
   curl_close($ch);    

// make response depending on whether there was an error
   if($curl_errno > 0) {
      echo 'Curl reported error '.$curl_error.'<br>';
      return '';
   }
   else {
      echo 'Successfully fetched '.$f.'<br>';
      return $output;
  }
}

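function writeTable($r) {
  echo 'The following ships reported: <br>';
  echo '<table border=1>';
  foreach($r as $value) {
    // skip responses that failed to parse or carry no vessel name
    if (isset($value["vessel-name"]) && strlen($value["vessel-name"]) > 0) {
      // one outer row per ship; the nested table sits inside a cell
      echo '<tr><td><table border=1><tr>';
      echo '<td>Vessel Name</td><td>'.$value["vessel-name"].'</td></tr>';
      echo '<tr><td>Time:</td><td>'.dateFormat($value["time"]).'</td></tr>';
      echo '<tr><td>Position:</td><td>'.$value["position"].'</td></tr>';
      echo '</table></td></tr>';
    }
  }
  // close the outer table once, after all ships have been written
  echo '</table>';
}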

function dateFormat($d) {
  // with input yyyymmddThhmmss.sssZ (e.g. 20140104T040648.259Z)
  // return dd/mm/yy hh:mm
  $date = substr($d, 6, 2) ."/". substr($d, 4, 2) ."/". substr($d, 2, 2) ." ". substr($d, 9, 2) . ":" . substr($d, 11, 2);
  return $date;
}
?>

Output of this is: (screenshot not preserved)

You can obviously make this prettier, and include other fields etc. I think this should get you a long way there, though. You might consider (if you can) having a script run in the background to create these tables every 30 minutes or so, and saving the resulting html tables to a local file on your server; then, when people want to see the result, they would not have to wait for the (slow) responses of the different remote servers, but get an "almost instant" result.
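The background-script idea could be sketched roughly as follows: capture whatever the table-writing code echoes and save it to a file that visitors are served directly. The function name cacheOutput, the trivial renderer, and the temp-directory path are all assumptions for illustration:

```php
<?php
// Sketch: capture the output of a rendering function and save it to a
// file, so visitors get the saved copy instead of waiting on remote servers.
function cacheOutput($render, $cacheFile) {
  ob_start();            // buffer output instead of sending it
  $render();             // e.g. a function that calls writeTable($responses)
  $html = ob_get_clean();
  // write to a temp file first, then rename, so readers never see a half-written page
  $tmp = $cacheFile . '.tmp';
  file_put_contents($tmp, $html);
  rename($tmp, $cacheFile);
  return $html;
}

// Illustration with a trivial renderer and a temp-directory path.
$cacheFile = sys_get_temp_dir() . '/ships_cache.html';
cacheOutput(function () { echo '<table><tr><td>Name Here</td></tr></table>'; }, $cacheFile);
echo file_get_contents($cacheFile), "\n";
```

A scheduler such as cron could run a script like this every 30 minutes, with the public page simply reading the cached file.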

But that's somewhat far removed from the original question. If you are able to implement all this in a workable fashion, and then want to come back and ask a follow-up question (if you're still stuck / not happy with the outcome), that is probably the way to go. I think we've pretty much beaten this one to death now.




Answer 2:


First combine the websites into a csv or hard coded array, then file_get_contents() / file_put_contents() on each. Essentially:

$file = 'dataFile.csv';
foreach($arrayOfSites as $site){

    // fetch the raw text of each site and append it to the data file
    $data = file_get_contents($site);
    file_put_contents($file, $data . "\n", FILE_APPEND);

}
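Note that appending the raw text this way gives a file of concatenated responses rather than comma-separated values. If an actual CSV is wanted, a minimal sketch using fputcsv might look like the following, assuming each response has already been parsed into an associative array. The function name writeCsv, the column choice, and the output path are made up for illustration:

```php
<?php
// Sketch: write selected fields of each parsed response as one CSV row.
// $responses is assumed to be an array of associative arrays.
function writeCsv($responses, $columns, $file) {
  $out = fopen($file, 'w');
  fputcsv($out, $columns, ',', '"', '\\');   // header row
  foreach ($responses as $r) {
    $row = array();
    foreach ($columns as $c) {
      $row[] = isset($r[$c]) ? $r[$c] : '';  // blank when a field is missing
    }
    fputcsv($out, $row, ',', '"', '\\');
  }
  fclose($out);
}

$file = sys_get_temp_dir() . '/ships.csv';
$responses = array(
  array('vessel-name' => 'Name Here', 'time' => '20140104T040648.259Z'),
);
writeCsv($responses, array('vessel-name', 'time', 'position'), $file);
echo file_get_contents($file);
```

fputcsv takes care of quoting fields that contain spaces, commas, or quotes, which plain string concatenation would not.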

Edit: Sorry was trying to do this fast. here is the full



Source: https://stackoverflow.com/questions/20924100/best-way-to-parse-multiple-plain-text-websites-that-contain-no-html
