Pulling an XML feed and detecting changes/deletions (PHP)

Submitted by 时光毁灭记忆、已成空白 on 2019-12-08 05:36:12

Question


I want to set up an XML feed polling system which would download an XML feed from a given URL every hour and detect whether the feed has changed. If it has, it would need to do a few things.

How can I efficiently accomplish this? The feed I would be pulling would have thousands of items inside and every item may have quite a bit of data in it.

I want to be able to detect any new data/item and save it to a database.
I want to be able to detect any modified data/item and update the database accordingly.
I want to be able to detect any deleted data/item and update the database accordingly.

The order of items doesn't matter to me, so if the order changes but nothing else does, then we can say the feeds are identical.

I've seen a few people mention hashing the items and the whole feed to compare against the previously downloaded one. If there are many items, this could potentially take a long time.

Would there be an easy way to do a diff on the last downloaded feed and new one to then somehow remove all identical items? And maybe then go through the items that are left and do the comparison?

I'm not sure what the right approach would be. Any suggestions would be greatly appreciated.

An example of a similar feed I would be pulling would be:

<properties>
<property>
<location>
<unit-number>301</unit-number>
<street-address>123 Main St</street-address>
<city-name>San Francisco</city-name>
<zipcode>94123</zipcode>
<county>San Francisco</county>
<state-code>California</state-code>
<street-intersection>Broadway</street-intersection>
<parcel-id>359-02-4158</parcel-id>
<building-name>The Avalon</building-name>
<subdivision></subdivision>
<neighborhood-name>Marina</neighborhood-name>
<neighborhood-description>The Marina is a neighborhood on the Northern part of San
Francisco</neighborhood-description>
<elevation>10</elevation>
<longitude>-70.1200</longitude>
<latitude>30.0000</latitude>
<geocode-type>exact</geocode-type>
<display-address>yes</display-address>
<directions>Take 101 North to Lombard St. Make a left on Lombard and 3rd right
onto Main. 123 is at the end of the block on the right. </directions>
</location>
<details>
<listing-title>A great deal in the Marina</listing-title>
<price>725000</price>
<year-built>1928</year-built>
<num-bedrooms>3</num-bedrooms>
<num-full-bathrooms>2</num-full-bathrooms>
<num-half-bathrooms>1</num-half-bathrooms>
<num-bathrooms></num-bathrooms>
<lot-size>0.25</lot-size>
<living-area-square-feet>1720</living-area-square-feet>
<date-listed>2010-06-20</date-listed>
<date-available></date-available>
<date-sold></date-sold>
<sale-price></sale-price>
<property-type>condo</property-type>
<description>Newly remodeled condo in great location.</description>
<mlsId>582649</mlsId>
<mlsName>SFAR</mlsName>
<provider-listingid>258136842</provider-listingid>
</details>
<landing-page>
<lp-url>http://www.BrokerRealty.com/listing?id=123456&amp;source=Trulia</lp-url>
</landing-page>
<listing-type>resale</listing-type>
<status>for sale</status>
<foreclosure-status></foreclosure-status>
<site>
<site-url>http://www.BrokerRealty.com</site-url>
<site-name>Broker Realty</site-name>
</site>

etc..


Answer 1:


Would there be an easy way to do a diff on the last downloaded feed and new one to then somehow remove all identical items?

Sure, in fact it should be pretty easy. It looks like these are real estate listings, right? If so, the name of the MLS provider and the identifier that they issue for the listing forms a unique key:

<details>
    <!-- ... -->
    <mlsId>582649</mlsId>
    <mlsName>SFAR</mlsName>
    <provider-listingid>258136842</provider-listingid>
</details>

Now that you can uniquely identify each listing, it should be pretty trivial to decide how you will detect changes. I'd personally mangle the XML into a multidimensional associative array, sort every level by key name, then serialize it and run it through a hash routine (say, md5), for that oh-so-attractive sloppy-but-it-works effect. In fact, you already had that idea, kind of:
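That "mangle, sort, serialize, hash" idea can be sketched in a few lines. This is a minimal illustration, not a drop-in importer; it assumes each `<property>` element is handed in as a `SimpleXMLElement` and that element names are unique within each level (repeated siblings would need array handling):

```php
<?php
// Recursively convert a SimpleXMLElement into an associative array,
// sorting every level by key so element order never affects the hash.
function xmlToSortedArray(SimpleXMLElement $node): array
{
    $out = [];
    foreach ($node->children() as $name => $child) {
        // Simplification: assumes sibling element names are unique.
        $out[$name] = $child->count() > 0
            ? xmlToSortedArray($child)
            : (string) $child;
    }
    ksort($out);
    return $out;
}

function listingHash(SimpleXMLElement $property): string
{
    return md5(serialize(xmlToSortedArray($property)));
}

$xml = simplexml_load_string(
    '<property><details><mlsName>SFAR</mlsName><mlsId>582649</mlsId></details></property>'
);
// Same data, different element order: the hashes still match.
$reordered = simplexml_load_string(
    '<property><details><mlsId>582649</mlsId><mlsName>SFAR</mlsName></details></property>'
);
var_dump(listingHash($xml) === listingHash($reordered)); // bool(true)
```

Because the keys are sorted at every level before serializing, a feed that merely reorders elements hashes identically, which matches the requirement that order doesn't matter.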

I've seen a few people mention hashing the items and the whole feed to compare to the previous downloaded one. If there are many items, this could potentially take long..

By hashing each unique entry in the document, you avoid having to reimport the entire thing when a single entry changes. Stick the per-entry hash in with the rest of the data in your database, with the information that makes up the unique key. When the hash changes, the XML has changed, and it's worth re-importing.
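The per-entry comparison then reduces to one lookup per listing. Here's a sketch of that step using an in-memory SQLite database for illustration; the `listings` table name, its columns, and the `syncListing` helper are all assumptions, not part of any real schema:

```php
<?php
// Assumed table: one row per listing, keyed on (mls_name, mls_id),
// storing the hash of the listing's sorted, serialized data.
$pdo = new PDO('sqlite::memory:');
$pdo->exec('CREATE TABLE listings (
    mls_name TEXT, mls_id TEXT, hash TEXT,
    PRIMARY KEY (mls_name, mls_id)
)');

function syncListing(PDO $pdo, string $mlsName, string $mlsId, string $hash): string
{
    $stmt = $pdo->prepare('SELECT hash FROM listings WHERE mls_name = ? AND mls_id = ?');
    $stmt->execute([$mlsName, $mlsId]);
    $stored = $stmt->fetchColumn();

    if ($stored === false) {
        // No matching key: a new listing, import the full record here.
        $pdo->prepare('INSERT INTO listings (mls_name, mls_id, hash) VALUES (?, ?, ?)')
            ->execute([$mlsName, $mlsId, $hash]);
        return 'inserted';
    }
    if ($stored !== $hash) {
        // Hash changed: the XML changed, re-import and refresh the hash.
        $pdo->prepare('UPDATE listings SET hash = ? WHERE mls_name = ? AND mls_id = ?')
            ->execute([$hash, $mlsName, $mlsId]);
        return 'updated';
    }
    return 'unchanged'; // identical hash: skip this listing entirely
}

echo syncListing($pdo, 'SFAR', '582649', 'aaa'), "\n"; // inserted
echo syncListing($pdo, 'SFAR', '582649', 'aaa'), "\n"; // unchanged
echo syncListing($pdo, 'SFAR', '582649', 'bbb'), "\n"; // updated
```

The unchanged case is the common one on an hourly poll, so most listings cost one indexed SELECT and nothing more.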

And again, once you have that unique key, it's amazingly easy to detect new listings. No matching key in the database? Import.

Likewise, it's amazingly easy to detect deleted listings. Key's in the database but isn't in the XML? Maybe it should be nuked.
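The deletion check is just a set difference over the keys. A minimal sketch, with the feed keys collected into a lookup array while parsing and the database keys fetched separately (the `MLS:ID` key format here is only illustrative):

```php
<?php
// Keys seen in the current feed, built while walking the XML.
$feedKeys = ['SFAR:582649' => true, 'SFAR:582650' => true];

// Keys currently in the database, e.g. from SELECT mls_name, mls_id FROM listings.
$dbKeys = ['SFAR:582649', 'SFAR:999999'];

// Anything in the database but absent from the feed is a deletion candidate.
$deleted = array_values(array_filter(
    $dbKeys,
    fn (string $k) => !isset($feedKeys[$k])
));

print_r($deleted); // contains SFAR:999999 -- remove it, or soft-delete/flag it
```

Using the keys as array indexes makes each membership test O(1), so the diff stays cheap even with thousands of listings. Whether a missing listing should be hard-deleted or just flagged is a policy decision; hence the "maybe it should be nuked" above.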



Source: https://stackoverflow.com/questions/13930485/pulling-xml-feed-and-detecting-changes-deletion-php
