Regular expression to replace an <a> with respective <img>

一个人的身影 2021-01-17 01:50

I'm looking for a PHP preg_replace() solution to find links to images and replace them with the respective image tags.

Find:



        
3 Answers
  • 2021-01-17 02:00

    I would suggest using this more flexible non-greedy regex:

    <a[^>]+?href=\"(http:\/\/[^\"]+?\/([^\"]*?)\.(jpg|jpeg|png|gif))[^>]*?>[^<]*?<\/a>
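A minimal sketch of how a pattern of this shape could be applied with preg_replace(). It is a variant, not the pattern exactly as posted: the file-name group excludes "/" so that $2 holds only the base name, the closing quote of href is matched explicitly, and "~" is used as the delimiter to avoid escaping slashes. The sample markup, including the class attribute, is made up for illustration.

```php
<?php
// Variant of the pattern above (an adjustment, see the note before this block):
//   $1 = full image URL, $2 = file name without path or extension.
$pattern = '~<a[^>]+?href="(http://[^"]+?/([^"/]*?)\.(jpg|jpeg|png|gif))"[^>]*?>[^<]*?</a>~i';

// Hypothetical sample input.
$html = '<p><a class="pic" href="http://www.domain.tld/any/valid/path/to/imagefile.jpg">This will be ignored.</a></p>';

// Replace every matching anchor with an image tag built from the captured groups.
$result = preg_replace($pattern, '<img src="$1" alt="$2" />', $html);

echo $result, PHP_EOL;
```

Like the original, this only handles double-quoted href values and anchors whose content is plain text.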
    

    And a more complex regex (including PHP test code) to hopefully please Gumbo :)

    <?php
    $test_data = <<<END
    <a blabla="asldlsaj" alksjada="aslkdj" href="http://www.domain.tld/any/valid/path/to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
    Lorem ipsum..
    <a    blabla=asldlsaj alksjada="aslkdj" href="http://www.domain.tld/any/valid/path/to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
    <a lkjafs='asdsa> ' blabla="asldlksjada=>"aslkdj" href="http://www.domain.tld/any/valid/path/to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
    <a    blabla="ajada="aslk href="http://www.domain.tld/any/valid/path>/to/imagefile.jpg" lkjasd>asdlaskjd>This will be ignored.</a>
    <a    blabla="asldlsaj>" aslkdj href="http://www.domain.tld/any/valid/path/ to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
    Something:
    <a    blabla='asldls<ajslkdj' href="http://www.domain.tld/any/valid'/path/to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
    <a    blabla=  asldlsadj href="http://www.domain.tld/any/valid/path/to/imagefile.jpg" lkjasd>This will be ignored.</a>
    <a blabla="asldlsaj" alksjslkdj" href='http://www.domain.tld/any/valid/path/to/imagefile.jpg' lkjasdskjd>This will be ignored.</a>
    Something else...
    <a    blabla="asldlsaj" alksjslkdj" href='http://www.domain.tld/any/valid/path/to/imagefile.jpg' lkjasdskjd>This will be ignored.</a>
    <a    blabla="asldlsaj" alksjada="aslkdj" href=http://www.domain.tld/any/valid/path/to/imagefile.jpg lkjdlaskjdll> be ignored.</a>
    END;
    $regex = "/<a\s(\s*\w+(\s*=\s*(\".*?\"|'.*?'|[^'\">\s]+))?)+?\s+href\s*=\s*(\"(http:\/\/[^\"]+\/(.*?)\.(jpg|jpeg|png|gif))\"|'(http:\/\/[^']+\/(.*?)\.(jpg|jpeg|png|gif))'|(http:\/\/[^'\">\s]+\/([^'\">\s]+)\.(jpg|jpeg|png|gif)))\s(\s*\w+(\s*=\s*(\".*?\"|'.*?'|[^'\">\s]+))?)+>[^<]*?<\/a>/i";
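    // Groups 5/8/11 hold the URL and groups 6/9/12 the file name (without extension) for the
    // double-quoted, single-quoted and unquoted href forms; only one set is non-empty per match.
    $replaced = preg_replace($regex, '<img src="$5$8$11" alt="$6$9$12" />', $test_data);
    
    echo '<pre>'.htmlentities($replaced).'</pre>';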
    ?>
    
  • 2021-01-17 02:10

    Congratulations, you are the one millionth customer to ask Stack Overflow how to parse HTML with regex!

    [X][HT]ML is not a regular language and cannot reliably be parsed with regex. Use an HTML parser. PHP itself gives you DOMDocument, or you may prefer simplehtmldom.

    Incidentally, you cannot tell what type a file is by looking at its URL. There is no reason a JPEG has to have ‘.jpeg’ as its extension, and indeed no guarantee that a file with a ‘.jpeg’ extension will actually be a JPEG. The only way to be certain is to fetch the resource (e.g. using a HEAD request) and look at the Content-Type header.
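A rough sketch of such a check, assuming the curl extension is available. Both function names are hypothetical helpers introduced for illustration, and the timeout and redirect settings are arbitrary choices.

```php
<?php
// Hypothetical helper: true when a Content-Type value names an image type.
function isImageContentType(?string $contentType): bool
{
    if ($contentType === null) {
        return false;
    }
    // Drop parameters such as "; charset=binary" before comparing.
    $mediaType = strtolower(trim(explode(';', $contentType)[0]));
    return strncmp($mediaType, 'image/', 6) === 0;
}

// Hypothetical helper: sends a HEAD request and returns the reported
// Content-Type, or null when the request fails or no type is reported.
function fetchContentType(string $url): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true, // HEAD instead of GET
        CURLOPT_RETURNTRANSFER => true, // do not print the response
        CURLOPT_FOLLOWLOCATION => true, // follow redirects to the final resource
        CURLOPT_TIMEOUT        => 5,    // seconds; arbitrary choice
    ]);
    $ok = curl_exec($ch);
    $type = ($ok === false) ? null : curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    return is_string($type) ? $type : null;
}
```

A link would then be treated as an image only when isImageContentType(fetchContentType($url)) returns true. Some servers answer HEAD requests differently from GET, so this is a heuristic too, just a better-informed one.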

  • 2021-01-17 02:15

    Ahh, my daily DOM practice. You should use DOM to parse HTML, and regex to parse strings such as HTML attribute values.

    Note: I have some basic regexes that could surely be improved upon by some wizards :)

    Note #2: Although it adds overhead, you could use something like curl to check thoroughly whether the href is an actual image by sending a HEAD request and looking at the Content-Type. The extension check used below should still work in 80-90% of cases.

    <?php
    
    $content = '
    
    <a href="http://www.domain.tld/any/valid/path/to/imagefile.ext">This will be ignored.</a>
    <br>
    
    <a href="http://col.stb.s-msn.com/i/43/A4711309495C88F8CD154C99FCE.jpg">this will not be ignored</a>
    
    <br>
    
    <a href="http://col.stb.s-msn.com/i/A0/8E9A454F701E4F5F89E58E14B532C.jpg">bah</a>
    ';
    
    $dom = new DOMDocument();
    $dom->loadHTML($content);
    
    $anchors = $dom->getElementsByTagName('a');
    
    // Walk the live node list backwards so that replacing a node does not
    // shift the indexes of the anchors still to be visited.
    $i = $anchors->length - 1;
    
    $protocol = '/^http:\/\//';
    // Capture the file name (without extension) for use as the alt text.
    $ext = '/([\w+]+)\.(?:gif|jpg|jpeg|png)$/';
    
    while ( $i > -1 ) {
        $anchor = $anchors->item($i);
        if ( $anchor->hasAttribute('href') ) {
            $link = $anchor->getAttribute('href');
    
            // Replace only http:// links that end in a known image extension.
            if (
                preg_match( $protocol, $link ) &&
                preg_match( $ext, $link, $matches )
            ) {
                $image = $dom->createElement('img');
                $image->setAttribute('alt', $matches[1]);
                $image->setAttribute('src', $link);
                $anchor->parentNode->replaceChild( $image, $anchor );
            }
        }
        $i--;
    }
    
    echo $dom->saveHTML();
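
Note that saveHTML() without arguments prints a whole document, including a doctype and html/body wrappers. A possible sketch for fragments, assuming the input is ASCII or otherwise handled for encoding: replaceImageLinks() is a hypothetical helper, the wrapper div and its id are made up, and accepting https as well as http is an assumption beyond the code above.

```php
<?php
// Hypothetical helper: replaces image links in an HTML fragment and returns
// the fragment without the doctype/html/body wrapper.
function replaceImageLinks(string $fragment): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom->loadHTML('<div id="wrapper">' . $fragment . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // XPath results are a snapshot, so nodes can be replaced while iterating.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $href      = $anchor->getAttribute('href');
        $scheme    = strtolower((string) parse_url($href, PHP_URL_SCHEME));
        $path      = (string) parse_url($href, PHP_URL_PATH);
        $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

        if (!in_array($scheme, ['http', 'https'], true)
            || !in_array($extension, ['gif', 'jpg', 'jpeg', 'png'], true)) {
            continue; // not an image link by this heuristic
        }

        $image = $dom->createElement('img');
        $image->setAttribute('src', $href);
        $image->setAttribute('alt', pathinfo($path, PATHINFO_FILENAME));
        $anchor->parentNode->replaceChild($image, $anchor);
    }

    // Serialize only the children of the wrapper element.
    $wrapper = $xpath->query('//div[@id="wrapper"]')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Using parse_url() and pathinfo() instead of regexes keeps query strings out of the extension check, at the cost of two extra calls per link.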
    