PHP script that can extract text between multiple title tags of a certain website?


Hello, I found a few scripts and tried a few, but nothing really works for me. The best one I found was able to extract the title of the page, but there are many title tags on the page and it extrac


4 answers

旧巷少年郎

2021-01-16 19:00
I'm sorry, I made a big mistake: I do not need the title tag, but something different. In the site's source, the relevant part of the HTML looks like this:
```
<td><a title="Ravellavegas.com Analysis" href="http://www.statscrop.com/www/ravellavegas.com">
```
From it I need to extract only the web address, so from this example, only ravellavegas.com.

半阙折子戏

2021-01-16 19:08

If it's HTML, there should only be one title tag... but, granted, it could be XML with an XSLT. In that case, instead of mucking about with regular expressions to attempt to parse it, it's generally better to create a DOMDocument object and use that instead:

Of course, if the document isn't well-formed XML, this is going to fall over.

```php
// Taken from the comments on the PHP documentation at:
//   http://uk3.php.net/manual/en/domdocument.load.php
// so that you can load an XML file over HTTP.

$opts = array(
    'http' => array(
        'user_agent' => 'PHP libxml agent',
    )
);

$context = stream_context_create($opts);
libxml_set_streams_context($context);

// Request a file through HTTP. load() is an instance method,
// so create the DOMDocument object first instead of calling it statically.
$xml = new DOMDocument();
$xml->load('http://www.example.com/file.xml');

// Added this bit to get the <title> elements.
$aTitles = $xml->getElementsByTagName('title');

// Loop and output.
foreach ($aTitles as $oTitle) {
    echo "<p>{$oTitle->nodeValue}</p>\n";
}
```
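
Since load() falls over on markup that isn't well-formed XML, here is a minimal sketch of the more tolerant route, assuming the page is ordinary HTML. It uses DOMDocument::loadHTML with libxml's internal error handling switched on; the sample markup and the $titles variable are made up for illustration.

```php
<?php
// Sample markup (an assumption for this sketch): valid-looking HTML that is
// not well-formed XML because <p> and <br> are never closed.
$html = '<html><head><title>Example page</title></head>'
      . '<body><p>Unclosed paragraph<br>More text</body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);   // HTML parser, tolerant of unclosed tags
libxml_clear_errors();

// Gather the text of every <title> element that the parser found.
$titles = [];
foreach ($doc->getElementsByTagName('title') as $node) {
    $titles[] = trim($node->textContent);
}

foreach ($titles as $title) {
    echo $title, "\n";
}
```

For a live page you would fetch the markup first (for example with file_get_contents) and pass the resulting string to loadHTML in the same way.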


闹比i

2021-01-16 19:17

Use preg_match_all; it'll give you an array of matches, and you can then work with each one.
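
As a rough sketch of that idea applied to the markup from the updated question, the snippet below pulls the site name out of each title attribute. The pattern, the $names variable, and the second link (with its statscrop.com URL) are assumptions made up for illustration; a real page may need a different pattern.

```php
<?php
// Two links in the shape shown in the question. The second one is hypothetical.
$html = '<td><a title="Ravellavegas.com Analysis" href="http://www.statscrop.com/www/ravellavegas.com">'
      . '<td><a title="Articlesiteslist.com Analysis" href="http://www.statscrop.com/www/articlesiteslist.com">';

// Capture whatever precedes " Analysis" inside each title attribute.
preg_match_all('/<a\s[^>]*title="([^"]+?)\s+Analysis"/i', $html, $matches);

// $matches[0] holds the full matches, $matches[1] the captured site names.
$names = array_map('strtolower', $matches[1]);

foreach ($names as $name) {
    echo $name, "\n";
}
```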


臣服心动

2021-01-16 19:18

Try this solution

```php
$text = file_get_contents("http://www.example.com");
preg_match_all('/<title>.*?<\/title>/is', $text, $matches);
foreach ($matches[0] as $m) {
    echo htmlentities($m)."<br />";
}
```

For example:

```php
// input text
$text = <<<EOT
<title>Lorem ipsum dolor</title>
sit amet, consectetur adipisicing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua.
Ut enim <title>ad minim</title> veniam,
quis nostrud exercitation ullamco laboris nisi ut
aliquip <title>ex ea</title> commodo consequat.
EOT;

// solution
preg_match_all('/<title>(.+?)<\/title>/is', $text, $matches);
foreach ($matches[0] as $m) {
    echo htmlentities($m)."<br />";
}
```

Output:

```
<title>Lorem ipsum dolor</title>
<title>ad minim</title>
<title>ex ea</title>
```
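
If you only want the text between the tags rather than the tags themselves, the capture group in the pattern already collects it. A small sketch, assuming the same kind of input as above (the shortened $text and the $inner variable are just for illustration):

```php
<?php
// Shortened version of the sample input.
$text = '<title>Lorem ipsum dolor</title> sit amet '
      . '<title>ad minim</title> veniam <title>ex ea</title>';

preg_match_all('/<title>(.+?)<\/title>/is', $text, $matches);

// $matches[0] contains the full matches including the tags;
// $matches[1] contains only what the first capture group matched.
$inner = $matches[1];

foreach ($inner as $t) {
    echo htmlentities($t), "\n";
}
```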

POST UPDATED (to reflect the changes in the question).

For example, suppose you want to load the following "a.html" file:

```html
<html>
<body>
Lorem ipsum dolor
<a title="Ravellavegas.com Analysis" href="http://somewebsite.com/" />
sit amet, consectetur adipisicing elit, sed do eiusmod tempor
<a title="Articlesiteslist.com Analysis" href="http://someanotherwebsite.com/" />
incididunt ut labore et dolore magna aliqua.
</body>
</html>
```

Then, you have to write the script as follows:

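```php
<?php

$dom = new DOMDocument();
// load() parses the file as XML, so a.html must be well formed.
$dom->load('a.html');

// Print the title attribute of every <a> element.
foreach ($dom->getElementsByTagName('a') as $tag) {
    echo $tag->getAttribute('title').'<br/>';
}

?>
```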

This outputs:

```
Ravellavegas.com Analysis
Articlesiteslist.com Analysis
```
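
To get only the web address the question asks for, one possible approach is to read the href instead of the title and keep its last path segment. This is a sketch under the assumption that every link follows the http://www.statscrop.com/www/<domain> shape shown in the question; the surrounding table markup, the link text, and the $domains variable are made up for illustration.

```php
<?php
// Markup modelled on the snippet from the question, wrapped in a table
// so that it forms a complete fragment.
$html = '<table><tr><td>'
      . '<a title="Ravellavegas.com Analysis" href="http://www.statscrop.com/www/ravellavegas.com">Ravellavegas.com</a>'
      . '</td></tr></table>';

libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$domains = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    // Path part of the href, e.g. "/www/ravellavegas.com".
    $path = parse_url($a->getAttribute('href'), PHP_URL_PATH);
    if (is_string($path) && $path !== '') {
        // Last path segment, assumed here to be the analysed domain.
        $domains[] = basename($path);
    }
}

foreach ($domains as $domain) {
    echo $domain, "\n";
}
```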
