PHP script to extract the text between multiple title tags on a website?

独厮守ぢ 2021-01-16 18:28

Hello, I found a few scripts and tried them, but nothing really works for me. The best one I found was able to extract the title of the page, but there are many title tags on the page and it extrac

4 answers
  • 2021-01-16 19:00

    I'm sorry, I made a big mistake: I do not need the title tag, it is something different. In the site's code, the relevant part of the HTML looks like this:

    <td><a title="Ravellavegas.com Analysis" href="http://www.statscrop.com/www/ravellavegas.com">
    

    From it I need to extract only the web address, so from this example only ravellavegas.com.

  • 2021-01-16 19:08

    If it's HTML there should only be one <title> tag, but, granted, it could be XML with an XSLT. In that case, instead of mucking about with regular expressions to attempt to parse it, it's generally better to create a DOMDocument object and use that instead. Of course, if the document isn't well-formed XML, this is going to fall over:

    //taken directly from the comments on PHP documentation at : 
    //  http://uk3.php.net/manual/en/domdocument.load.php
    //  so that you can load in an XML file over HTTP
    
    $opts = array(
        'http' => array(
            'user_agent' => 'PHP libxml agent',
        )
    );
    
    $context = stream_context_create($opts);
    libxml_set_streams_context($context);
    
    // request a file through HTTP
    // load() is an instance method; calling it statically throws an Error in PHP 8
    $xml = new DOMDocument();
    $xml->load('http://www.example.com/file.xml');
    
    
    // added this bit to get the <title> elements
    $aTitles = $xml->getElementsByTagName('title');
    
    //  loop and output
    foreach($aTitles as $oTitle) {
      // escape the text so markup inside a title is not rendered
      echo '<p>' . htmlspecialchars($oTitle->nodeValue) . "</p>\n";
    }
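
Since load() fails on markup that is not well-formed XML, one possible workaround is DOMDocument's lenient HTML parser, loadHTML(). The following is only a minimal sketch: the inline $html string is a hypothetical sample (unclosed <p>, bare <br>), not content from the site in the question.

```php
<?php
// Hypothetical sample markup: valid-looking HTML, but not well-formed XML
// (the <p> is never closed and <br> has no end tag).
$html = '<html><head><title>Example page</title></head>'
      . '<body><p>Some text<br>more text</body></html>';

// Collect parser warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);   // HTML parser, tolerant of unclosed tags

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Gather the text of every <title> element found.
$titles = [];
foreach ($doc->getElementsByTagName('title') as $node) {
    $titles[] = trim($node->textContent);
}

print_r($titles);
```

For a page fetched over HTTP, the same approach should work by passing the downloaded string to loadHTML(), or by calling loadHTMLFile() with the URL.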
    
  • 2021-01-16 19:17

    Use preg_match_all; it gives you an array of matches, and you can then work with each one.
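
To illustrate the shape of the array that preg_match_all fills in, here is a small sketch that captures title="..." attribute values. The input string is an assumption modelled on the snippet in the question; the second link (Example.org) is made up for illustration.

```php
<?php
// Hypothetical input: two table cells in the shape shown in the question.
$html = '<td><a title="Ravellavegas.com Analysis" '
      . 'href="http://www.statscrop.com/www/ravellavegas.com"></a></td>'
      . '<td><a title="Example.org Analysis" '
      . 'href="http://www.statscrop.com/www/example.org"></a></td>';

// Capture the value of every title="..." attribute.
// $matches[0] holds the full matches, $matches[1] the first capture group.
$count = preg_match_all('/\btitle="([^"]*)"/i', $html, $matches);

foreach ($matches[1] as $title) {
    echo $title, PHP_EOL;
}
```

Regular expressions like this are fragile against real-world HTML (single-quoted attributes, unusual spacing), so a DOM parser is usually the safer choice for anything beyond a quick one-off.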

  • 2021-01-16 19:18

    Try this solution:

    $text = file_get_contents("http://www.example.com");
    preg_match_all('/<title>.*?<\/title>/is', $text, $matches);
    foreach($matches[0] as $m)
    {
        echo htmlentities($m)."<br />";
    }
    

    For example:

    // input text
    $text = <<<EOT
    <title>Lorem ipsum dolor</title>
    sit amet, consectetur adipisicing elit, sed do eiusmod tempor
    incididunt ut labore et dolore magna aliqua.
    Ut enim <title>ad minim</title> veniam,
    quis nostrud exercitation ullamco laboris nisi ut
    aliquip <title>ex ea</title> commodo consequat.
    EOT;
    
    // solution
    preg_match_all('/<title>(.+?)<\/title>/is', $text, $matches);
    foreach($matches[0] as $m)
    {
        echo htmlentities($m)."<br />";
    }
    

    Output:

    <title>Lorem ipsum dolor</title>
    <title>ad minim</title>
    <title>ex ea</title>
    

    POST UPDATED (to reflect the changes in the question).

    For example, suppose you want to load a file named "a.html":

    <html>
    <body>
    Lorem ipsum dolor
    <a title="Ravellavegas.com Analysis" href="http://somewebsite.com/" />
    sit amet, consectetur adipisicing elit, sed do eiusmod tempor
    <a title="Articlesiteslist.com Analysis" href="http://someanotherwebsite.com/" />
    incididunt ut labore et dolore magna aliqua.
    </body>
    </html>
    

    Then, you have to write the script as follows:

    <?php
    
    $dom = new DOMDocument();
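    // load() parses XML, so this works only because a.html above is well-formed;
    // for ordinary HTML, loadHTMLFile() is the more forgiving choice
    $dom->load('a.html');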
    
    foreach ($dom->getElementsByTagName('a') as $tag) {
        echo $tag->getAttribute('title').'<br/>';
    }
    
    ?>
    

    This outputs:

    Ravellavegas.com Analysis
    Articlesiteslist.com Analysis
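
The question asks for only the web address (ravellavegas.com), not the whole title attribute. A possible sketch, assuming the domain is always the last path segment of the href, as in the snippet from the question: the second link's href is hypothetical, and $domains is just an illustrative variable name.

```php
<?php
// Hypothetical input modelled on the snippet in the question;
// the second href is assumed to follow the same pattern.
$html = '<table><tr>'
      . '<td><a title="Ravellavegas.com Analysis" '
      . 'href="http://www.statscrop.com/www/ravellavegas.com">x</a></td>'
      . '<td><a title="Articlesiteslist.com Analysis" '
      . 'href="http://www.statscrop.com/www/articlesiteslist.com">y</a></td>'
      . '</tr></table>';

// Parse leniently and keep parser warnings out of the output.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$domains = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    // Take the path part of the link, e.g. "/www/ravellavegas.com".
    $path = parse_url($a->getAttribute('href'), PHP_URL_PATH);
    if (is_string($path) && $path !== '') {
        // The last path segment is assumed to be the domain.
        $domains[] = basename($path);
    }
}

print_r($domains);
```

If the links on the real page are laid out differently, stripping the trailing " Analysis" from the title attribute would be an alternative way to get the same value.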
    