Extracting data from HTML using PHP and xPath

轻奢々 2020-12-21 15:45

I am trying to extract data from a webpage to insert it into a database. The data I'm interested in is in the divs that have class="company". On one webpage there are ...

2 Answers
  • 2020-12-21 16:31

    To check whether a node exists, verify that the length property of the DOMNodeList returned by the query equals 1:

    if ($company_name->length == 1) {
       $object->company_name = trim($company_name->item(0)->nodeValue);
    }
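
    For context, here is a minimal sketch of where that check might sit. The sample markup and the $names array are assumptions made for illustration, since the real page is not shown; only DOMDocument, DOMXPath::query() and the length/item() members come from PHP itself.

```php
<?php
// Hypothetical sample markup; the real page structure is not shown in the question.
$html = '<div class="company" id="company-1"><a href="/acme">Acme</a></div>'
      . '<div class="company" id="company-2"></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$names = [];
foreach ($xpath->query('//div[@class="company"]') as $company) {
    // Query relative to the current company node (note the leading dot).
    $company_name = $xpath->query('.//a[1]', $company);

    // length is 0 when nothing matched, so the name is only read when the node exists.
    $names[$company->getAttribute('id')] = $company_name->length === 1
        ? trim($company_name->item(0)->nodeValue)
        : null;
}

print_r($names);
```

    If an expression can match more than one node, testing for length > 0 instead of exactly 1 may be the safer choice.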
    
  • 2020-12-21 16:41

    Each company can be represented by a context node, with each property defined by an XPath expression relative to that node:

    Company company-6666:
     ->id ....... = "company-6666"    --    string(@id)
     ->name ..... = "Company Name"    --    .//a[1]/text()
     ->href ..... = "/company-name"    --    .//a[1]/@href
     ->img ...... = "/graphics/company/logo/listing/123456.jpg?_ts=1365390237"    --    .//img[1]/@src
     ->address .. = "StreetName 500, 7777 City, County"    --    .//*[@class="address"]/text()
     ...
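
    Without any wrapper classes, the same mapping can be sketched with plain DOMXPath by passing each company node as the context node to evaluate(). The markup below is a hypothetical reconstruction from the listing above, and the $expressions and $companies names are assumptions; each expression is wrapped in string() here so that evaluate() returns a plain string.

```php
<?php
// Hypothetical markup modelled on the listing above; the real page is not shown.
$html = '<div class="company" id="company-6666">'
      . '<a href="/company-name">Company Name</a>'
      . '<img src="/graphics/company/logo/listing/123456.jpg?_ts=1365390237">'
      . '<span class="address">StreetName 500, 7777 City, County</span>'
      . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Property name => XPath expression relative to the company node.
$expressions = [
    'id'      => 'string(@id)',
    'name'    => 'string(.//a[1]/text())',
    'href'    => 'string(.//a[1]/@href)',
    'img'     => 'string(.//img[1]/@src)',
    'address' => 'string(.//*[@class="address"]/text())',
];

$companies = [];
foreach ($xpath->query('//*[@class="company"]') as $node) {
    $company = [];
    foreach ($expressions as $name => $expression) {
        // The second argument makes $node the context node for the expression.
        $company[$name] = trim($xpath->evaluate($expression, $node));
    }
    $companies[] = $company;
}

print_r($companies);
```

    Keeping the expressions in one array means a new property is a one-line addition, which is the same idea the object wrapper builds on.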
    

    If you wrap that into objects, this is pretty nifty to use:

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    
    /* @var $companies DOMValueObject[] */
    $companies = new Companies($doc);
    
    foreach ($companies as $company) {
        printf("Company %s:\n", $company->id);
        foreach ($company->getObjectProperties() as $name => $value) {
            $expression = $company->getPropertyExpression($name);
            printf(" ->%'.-10s = \"%s\"    --    %s\n", $name.' ', $value, $expression);
        }
    }
    

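    This works with DOMValueCollection and DOMValueObject, which are custom helper classes (not part of PHP's DOM extension; their definitions are not included here), defining your own type: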

    class Companies extends DOMValueCollection
    {
        public function __construct(DOMDocument $doc) {
            parent::__construct($doc, '//*[@class="company"]');
        }
    
        /**
         * @return DOMValueObject
         */
        public function current() {
            $object = parent::current();
            $object->defineProperty('id', 'string(@id)');
            $object->defineProperty('name', './/a[1]/text()');
            $object->defineProperty('href', './/a[1]/@href');
            $object->defineProperty('img', './/img[1]/@src');
            $object->defineProperty('address', './/*[@class="address"]/text()');
            # ... add your definitions
            return $object;
        }
    }
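
    Since the helper classes themselves are not shown, here is a rough sketch of how a value object of this kind might work. SketchValueObject is a hypothetical stand-in, not the actual DOMValueObject; it only mirrors the method names used above (defineProperty(), getPropertyExpression(), getObjectProperties()) and evaluates each expression lazily against its context node.

```php
<?php
// Hypothetical stand-in for DOMValueObject; the real helper class is not shown in the answer.
class SketchValueObject
{
    private $xpath;
    private $node;
    private $expressions = [];

    public function __construct(DOMXPath $xpath, DOMNode $node)
    {
        $this->xpath = $xpath;
        $this->node = $node;
    }

    // Register a property name together with its context-relative XPath expression.
    public function defineProperty($name, $expression)
    {
        $this->expressions[$name] = $expression;
    }

    public function getPropertyExpression($name)
    {
        return $this->expressions[$name];
    }

    // Evaluate the expression only when the property is read, e.g. $object->name.
    public function __get($name)
    {
        $result = $this->xpath->evaluate($this->expressions[$name], $this->node);
        if ($result instanceof DOMNodeList) {
            // Node-set result: use the first node's text, or null if nothing matched.
            return $result->length > 0 ? trim($result->item(0)->nodeValue) : null;
        }
        return $result; // string() and similar functions return scalars directly
    }

    // Return all defined properties as name => value pairs.
    public function getObjectProperties()
    {
        $values = [];
        foreach (array_keys($this->expressions) as $name) {
            $values[$name] = $this->__get($name);
        }
        return $values;
    }
}

// Hypothetical sample markup for demonstration.
$doc = new DOMDocument();
$doc->loadHTML('<div class="company" id="company-6666"><a href="/company-name">Company Name</a></div>');
$xpath = new DOMXPath($doc);

$node = $xpath->query('//*[@class="company"]')->item(0);
$company = new SketchValueObject($xpath, $node);
$company->defineProperty('id', 'string(@id)');
$company->defineProperty('name', './/a[1]/text()');
$company->defineProperty('href', './/a[1]/@href');

print_r($company->getObjectProperties());
```

    A collection class would then create one such object per node matched by '//*[@class="company"]'; that part is left out of the sketch.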
    

    And for your array requirements there is a getArrayCopy() method:

    echo "\nGet Array Copy:\n\n";
    
    print_r($companies->getArrayCopy());
    

    Output:

    Get Array Copy:
    
    Array
    (
        [0] => Array
            (
                [id] => company-6666
                [name] => Company Name
                [href] => /company-name
                [img] => /graphics/company/logo/listing/123456.jpg?_ts=1365390237
                [address] => StreetName 500, 7777 City, County
            )
    
    )
    