Using regular expressions to extract the first image source from HTML code?

深忆病人 2020-12-05 01:07

I would like to know how this can be achieved.

Assume: There's a lot of HTML code containing tables, divs, images, etc.

Problem: How can I get matches

10 Answers
  • 2020-12-05 01:33

    I really don't think you can predict all the cases with one regular expression.

    The best way is to use the DOM with the PHP5 class DOMDocument and XPath. It's the cleanest way to do what you want.

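    libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect parse warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML( $htmlContent );
    $xml = simplexml_import_dom($dom);
    $images = $xml->xpath('//img/@src'); // every src attribute, in document order
    $first = $images ? (string) $images[0] : null; // first image source, or null if there is none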
    
  • 2020-12-05 01:35
    <?php    
    /* PHP Simple HTML DOM Parser @ http://simplehtmldom.sourceforge.net */
    
    require_once('simple_html_dom.php');
    
    $html = file_get_html('http://example.com');
    $image = $html->find('img', 0)->src; // index 0 returns only the first <img> element
    
    echo "<img src='{$image}'/>"; // BOOM!
    

    PHP Simple HTML DOM Parser will do the job in a few lines of code.

  • 2020-12-05 01:36

    I agree with Andrew Moore. Using the DOM is much, much better. The HTML DOM images collection (document.images) gives you a reference to every image object on the page.

    Let's say in your header you have,

    <script type="text/javascript">
        function getFirstImageSource()
        {
            var img = document.images[0].src;
            return img;
        }
    </script>
    

    and then in your body you have,

    <script type="text/javascript">
      alert(getFirstImageSource());
    </script>
    

    This will return the first image source. You can also loop through all of them, along these lines (in the head section):

    function getAllImageSources()
    {
        var returnString = "";
        for (var i = 0; i < document.images.length; i++)
        {
            returnString += document.images[i].src + "\n";
        }
        return returnString;
    }
    

    (in body)

    <script type="text/javascript">
      alert(getAllImageSources());
    </script>
    

    If you're using JavaScript to do this, remember that you can't call a function that reads the images collection directly from a script in the head. In other words, you can't do something like this,

    <script type="text/javascript">
        function getFirstImageSource()
        {
            var img = document.images[0].src;
            return img;
        }
        var src = getFirstImageSource();  // bad: runs before the body is parsed
    
    </script>
    

    because this won't work. The img elements haven't been parsed yet when the head script runs, so document.images is still empty and reading document.images[0].src throws a TypeError. Call the function from a script placed after the images in the body, or from a window.onload handler.

    Hopefully this can help in some way. If possible, I'd make use of the DOM. You'll find that a good deal of your work is already done for you.
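
    As a small defensive variant of the functions above, here is a minimal sketch that returns null instead of throwing when the page has no images. The name firstImageSource and the idea of passing the document in as a parameter are assumptions made for illustration, not part of the original answer.

```javascript
// Hypothetical helper: takes the document as a parameter so it can also be
// exercised outside a browser with a stand-in object.
function firstImageSource(doc) {
  const first = doc.images[0]; // undefined when the page has no <img>
  return first ? first.src : null;
}

// In a browser, wait until the markup has been parsed before reading the collection.
if (typeof document !== "undefined") {
  document.addEventListener("DOMContentLoaded", function () {
    console.log(firstImageSource(document));
  });
}
```

    Registering the handler for DOMContentLoaded is one way to make sure the img elements exist before the collection is read; placing the script at the end of the body should work just as well.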

  • 2020-12-05 01:37

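    Since you're not worried about validating the HTML, you might try running strip_tags() on the text first, with '<img>' as the allowed tag, to clear out most of the cruft. (Called without an allowed-tags argument, strip_tags() removes the img tags as well.)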

    Then you can search for an expression like

    "/\<img .+ \/\>/i"
    

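    The backslash before / is needed because / is also the pattern delimiter; < and > are not special characters, so escaping them is harmless but unnecessary. .+ insists that there be one or more characters inside the img tag. It is greedy, so when the text holds several tags it stretches to the last " />" it can reach; .+? or [^>]+ keeps the match inside one tag. You can capture part of the expression by putting parentheses around it, e.g. (.+) captures the middle part of the img tag.

    Once you decide which part of the middle you want to capture, you can change the (.+) to something more specific.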
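
    To make "something more specific" concrete, here is a rough sketch of a pattern that captures only the src value of the first img tag. It is written in JavaScript, like the DOM answer in this thread; the function name firstImageSrc is made up for illustration. It is still a regex, so an "<img" inside a comment or a script block can fool it, which is why the DOM-based answers remain the safer route.

```javascript
// Hypothetical helper: returns the src of the first <img> tag in an HTML string,
// or null when no <img> with a src attribute is found.
function firstImageSrc(html) {
  // [^>]*? keeps the match inside a single tag; \s before "src" skips
  // look-alike attributes such as data-src. The value may be double-quoted,
  // single-quoted, or unquoted.
  const pattern = /<img\b[^>]*?\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/i;
  const match = pattern.exec(html);
  if (match === null) {
    return null;
  }
  // Exactly one of the three capture groups takes part in a match.
  return match[1] ?? match[2] ?? match[3];
}

console.log(firstImageSrc('<div><img class="a" src="one.png"><img src="two.png"></div>'));
```

    Without the g flag, exec() stops at the first match, so only the first image in document order is reported.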
