How to remove images and text from RSS feed description tag?

Question

I'm getting the description from some RSS feed websites. Some of these descriptions contain images and specific text that I want to remove.

The code to get the feed:

$rss = simplexml_load_file($website);
foreach ($rss->channel->item as $item) {
    $description = (string)$item->description;
}

These are the different formats I get:

<description><![CDATA[
    <p> //Post Description </p>
    <p>The post <a rel="nofollow" href="">
        //Post Title.</a> appeared first on 
        <a rel="nofollow" href="">//Feed Website.</a>.
    </p>
]]></description>
_________________________________________________________________
<description><![CDATA[
    <div>
        <strong>//Some Text.</strong>
    </div>
    <div>
        &nbsp;
    </div>
    <div>//Some Text.</div>
    <div>
        <img alt="" src="" style="width: 640px; height: 427px;" />
    </div>
]]></description>
_______________________________________________________________
<description>
    &lt;img style="margin:0 1em 1em 0;" align="left" src=""/&gt;
    „//Some Text. 
</description>

To remove images:

$description = (string)strip_tags($item->description);

The text I want to remove is "The post (Post Title) appeared first on (Website)".

To remove that text I use:

if (strpos($description, 'appeared first')) {
    $siteNames = array('a.com', 'b.com', 'c.com');
    foreach ($siteNames as $siteName) {
        if(strpos($description, $siteName)){
            $appeared = 'The post '.$item->title.' appeared first on '.$siteName;
            $description = str_replace($appeared, '', $description);
        }

    }
}

So for example if the description contains:

 <p>The post 
    <a rel="nofollow" href="http://a.com/what-is-php">What is PHP.</a> 
    appeared first on 
    <a rel="nofollow" href="http://a.com">a.com.</a>.
</p>

Then that text should be removed.

When I use strip_tags($item->description), no images are shown.

But when I use the code to remove the string, it doesn't work for all of the descriptions; some of them still contain the string.

UPDATE:

<description><![CDATA[
    <p>Við vorum að fá inn til okkar forfallaholl í Laugardalsá á best tíma. Annarsvegar er um að ræða hollið 18-21. júlí og síðan hollið 24-27. júlí. Bæði eru hollin á frábærum tíma í ánn. Þó svo um 3ja daga holl sé að ræða, er að hægt að skoða staka daga eða 1 1/2 eða 2
    </p>
    <p>The post <a rel="nofollow" href="https://a.com/post-title/">Laugardalsá &#8211; forfallaholl á besta tíma</a> appeared first on <a rel="nofollow" href="https://a.com">a.com</a>.</p>
]]></description>

Answer 1:

Code: (Demo)

$xml = '<![CDATA[
    <p>Við vorum að fá inn til okkar forfallaholl í Laugardalsá á best tíma. Annarsvegar er um að ræða hollið 18-21. júlí og síðan hollið 24-27. júlí. Bæði eru hollin á frábærum tíma í ánn. Þó svo um 3ja daga holl sé að ræða, er að hægt að skoða staka daga eða 1 1/2 eða 2
    </p>
    <p>The post <a rel="nofollow" href="https://a.com/post-title/">Laugardalsá &#8211; forfallaholl á besta tíma</a> appeared first on <a rel="nofollow" href="https://a.com">a.com</a>.</p>
]]>';

$finds = [
    '~<p>The post <a rel="nofollow" href="https?://[a-z]+\.com[^"]*">.*?</a> appeared first on <a rel="nofollow" href="https?://[a-z]+\.com[^"]*">.*?</a>\.</p>~iu',
    '~^<!\[CDATA\[~',
    '~\]\]>$~'
];

var_export(trim(strip_tags(preg_replace($finds, '', $xml))));

Output:

'Við vorum að fá inn til okkar forfallaholl í Laugardalsá á best tíma. Annarsvegar er um að ræða hollið 18-21. júlí og síðan hollið 24-27. júlí. Bæði eru hollin á frábærum tíma í ánn. Þó svo um 3ja daga holl sé að ræða, er að hægt að skoða staka daga eða 1 1/2 eða 2'

I expect this should largely handle your data in the way that you require. The first regex pattern is certainly the hairiest one (see the link for pattern explanation). You will need to adjust the [a-z]+\.com part to suit your needs, potentially using something like (?:test\.com|example\.net|sample\.co\.uk). Until you get it "just right", feed some input data into regex101 and keep tweaking your pattern until it works.
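One way to build that alternation from a list of hosts is sketched below. The $siteNames values and the sample $html are hypothetical stand-ins, and preg_quote() is used here so that the dots in each host are matched literally; treat it as a starting point rather than a finished filter.

```php
<?php
// Hypothetical list of feed hosts; replace with the sites you actually read.
$siteNames = ['test.com', 'example.net', 'sample.co.uk'];

// Escape each host for use inside a pattern delimited by "~".
$hosts = implode('|', array_map(function ($host) {
    return preg_quote($host, '~');
}, $siteNames));

// Same shape as the first pattern above, with the host part swapped for the alternation.
$pattern = '~<p>The post <a rel="nofollow" href="https?://(?:' . $hosts . ')[^"]*">.*?</a>'
    . ' appeared first on <a rel="nofollow" href="https?://(?:' . $hosts . ')[^"]*">.*?</a>\.</p>~iu';

// Hypothetical description body with one paragraph to keep and one footer to drop.
$html = '<p>Kept text.</p>'
    . '<p>The post <a rel="nofollow" href="https://example.net/post/">Title</a>'
    . ' appeared first on <a rel="nofollow" href="https://example.net">example.net</a>.</p>';

$clean = trim(strip_tags(preg_replace($pattern, '', $html)));
var_export($clean);
```

A footer that links to a host outside the list is left untouched by this pattern, so the list needs to cover every feed you read.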

The 2nd and 3rd patterns are just to clear away the text wrappers. While the 2nd one is not truly necessary because strip_tags() will clean that substring away, the 3rd is critical because strip_tags() will leave a dangling ]]>.
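The difference the wrapper patterns make can be checked with a short comparison. This is a minimal sketch using a made-up $wrapped string; it runs strip_tags() once on the raw input and once after the 2nd and 3rd patterns have been applied.

```php
<?php
// Made-up sample: a CDATA-wrapped description with a single paragraph.
$wrapped = "<![CDATA[\n    <p>Some text.</p>\n]]>";

// strip_tags() alone: the closing "]]>" is not a tag, so it is left behind.
$withoutCleanup = trim(strip_tags($wrapped));

// Wrapper patterns first, then strip_tags(): only the text should remain.
$withCleanup = trim(strip_tags(preg_replace(['~^<!\[CDATA\[~', '~\]\]>$~'], '', $wrapped)));

var_export([$withoutCleanup, $withCleanup]);
```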

The first pattern is case-insensitive (i) and unicode-tolerant (u) for best results.

^ and $ anchor a match to the beginning and end of the string. If they are not suitable for your actual data, they can be removed. These steps are just attempts to "mop up" any unwanted residual substrings. The trim() call is certainly something that I would include so that the stored data is as clean as it can be.
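When the description is read through SimpleXML as in the question, casting the element to a string normally yields the CDATA content without the wrapper, so in that setup the two anchored patterns may have nothing to remove. The sketch below assumes a hypothetical one-item feed held in $feed (standing in for the real feed URL) and applies only the first pattern before strip_tags() and trim().

```php
<?php
// Hypothetical one-item feed standing in for the real feed URL.
$feed = <<<'XML'
<rss version="2.0"><channel><item>
<title>What is PHP</title>
<description><![CDATA[
    <p>Post description.</p>
    <p>The post <a rel="nofollow" href="https://a.com/what-is-php">What is PHP</a> appeared first on <a rel="nofollow" href="https://a.com">a.com</a>.</p>
]]></description>
</item></channel></rss>
XML;

// The first pattern from the answer, unchanged.
$footer = '~<p>The post <a rel="nofollow" href="https?://[a-z]+\.com[^"]*">.*?</a> appeared first on <a rel="nofollow" href="https?://[a-z]+\.com[^"]*">.*?</a>\.</p>~iu';

$rss = simplexml_load_string($feed);
$descriptions = [];
foreach ($rss->channel->item as $item) {
    // The string cast returns the text inside the CDATA section.
    $raw = (string) $item->description;
    $descriptions[] = trim(strip_tags(preg_replace($footer, '', $raw)));
}

var_export($descriptions);
```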

If the specific <p> tagged substring to be removed is nested between two substrings to be kept, you may like to add another pattern to condense runs of whitespace (\s{2,}) into a single space, OR you might write \s* at the end of my first pattern to capture trailing whitespace. Only you will know which suits your data.
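The whitespace-condensing option could look like the following sketch. The $text value is a made-up example of what remains once a footer paragraph between two kept paragraphs has been removed and the tags stripped.

```php
<?php
// Made-up leftover: two kept paragraphs separated by the gap the footer used to fill.
$text = "First kept paragraph.\n    \n    Second kept paragraph.";

// Collapse any run of two or more whitespace characters into one space.
$condensed = trim(preg_replace('~\s{2,}~u', ' ', $text));

var_export($condensed);
```

Note that this also flattens line breaks between paragraphs, which may or may not be what you want for stored descriptions.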

Source: https://stackoverflow.com/questions/51154846/how-to-remove-images-and-text-from-rss-feed-description-tag

Tags: php, regex, xml, rss, feed