How can I extract structured text from an HTML list in PHP?

名媛妹妹 2021-01-16 21:46

I have this string:

  • Page 1
  • Page 2
    • Sub Page A
3 answers
  • 2021-01-16 22:15

    I am leaving a second answer because this one shows how to do it with a single mapping (in pseudocode):

    foreach //li ::
        ID       := string(./@id)
        ParentID := string(./ancestor::li[1]/@id)
        Label    := normalize-space(./text()[1])
        Order    := count(./preceding-sibling::li)+1
        Children := implode(",", ./ul/li/@id)
    

    Because this can be evaluated for each li node independently, in any order, it is a good fit for an Iterator. Here is the current() function:

    public function current() {
    
        return [
            'ID'       => $this->evaluate('number(./@id)'),
            'label'    => $this->evaluate('normalize-space(./text()[1])'),
            'order'    => $this->evaluate('count(./preceding-sibling::li)+1'),
            'parentID' => $this->evaluate('number(concat("0", ./ancestor::li[1]/@id))'),
            'children' => $this->implodeNodes(',', './ul/li/@id'),
        ];
    }
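
    The concat("0", ...) prefix in parentID deserves a note: in XPath 1.0, number() of an empty string is NaN, so a top-level item without an li ancestor would otherwise get NaN rather than 0. A minimal sketch of the difference, assuming a hypothetical two-level list (the markup, ids and variable names are illustrative, not taken from the question):

```php
<?php
// Hypothetical two-level list; ids and labels are assumptions for illustration.
$html = '<ul><li id="2">Page 2<ul><li id="3">Sub Page A</li></ul></li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$top = $xp->query('//li[@id="2"]')->item(0); // has no <li> ancestor
$sub = $xp->query('//li[@id="3"]')->item(0); // nested below id 2

// Without the prefix, the missing ancestor id becomes "" and number("") is NaN.
$plain = $xp->evaluate('number(./ancestor::li[1]/@id)', $top);

// With the prefix, the string is "0" for top-level items and "02" below id 2.
$topParent = $xp->evaluate('number(concat("0", ./ancestor::li[1]/@id))', $top);
$subParent = $xp->evaluate('number(concat("0", ./ancestor::li[1]/@id))', $sub);

var_dump(is_nan($plain), $topParent, $subParent);
```

    The same reasoning does not apply to ID, because every li in this sketch carries an id attribute.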
    

    Full example (Demo) output and code:

    +----+----------------+-------+--------+----------+
    | ID |     LABEL      | ORDER | PARENT | CHILDREN |
    +----+----------------+-------+--------+----------+
    |  1 | Page 1         |   1   |    0   |          |
    |  2 | Page 2         |   2   |    0   | 3,4,5    |
    |  3 | Sub Page A     |   1   |    2   |          |
    |  4 | Sub Page B     |   2   |    2   |          |
    |  5 | Sub Page C     |   3   |    2   | 6        |
    |  6 | Sub Sub Page I |   1   |    5   |          |
    |  7 | Page 3         |   3   |    0   | 8        |
    |  8 | Sub Page D     |   1   |    7   |          |
    |  9 | Page 4         |   4   |    0   |          |
    +----+----------------+-------+--------+----------+
    
    
    class HtmlListIterator extends IteratorIterator
    {
        private $xpath;
    
        public function __construct($html) {
    
            $doc = new DOMDocument();
            $doc->loadHTML($html);
            $this->xpath = new DOMXPath($doc);
            parent::__construct($this->xpath->query('//li'));
        }
    
        private function evaluate($expression) {
    
            return $this->xpath->evaluate($expression, parent::current());
        }
    
        private function implodeNodes($glue, $expression) {
    
            // evaluate() already uses the current <li> as context, so it takes only the expression
            return implode(
                $glue, array_map(function ($a) {
    
                    return $a->nodeValue;
                }, iterator_to_array($this->evaluate($expression)))
            );
        }
    
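        // On PHP 8.1+, suppress the return-type deprecation notice for this override
        #[\ReturnTypeWillChange]
        public function current() {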
    
            return [
                'ID'       => $this->evaluate('number(./@id)'),
                'label'    => $this->evaluate('normalize-space(./text()[1])'),
                'order'    => $this->evaluate('count(./preceding-sibling::li)+1'),
                'parentID' => $this->evaluate('number(concat("0", ./ancestor::li[1]/@id))'),
                'children' => $this->implodeNodes(',', './ul/li/@id'),
            ];
        }
    }
    
    print_result(new HtmlListIterator($html));
    
    function print_result($result) {
    
        echo '+----+----------------+-------+--------+----------+
    | ID |     LABEL      | ORDER | PARENT | CHILDREN |
    +----+----------------+-------+--------+----------+
    ';
        foreach ($result as $line) {
            vprintf("| %' 2d | %' -14s |  %' 2d   |   %' 2d   | %-8s |\n", $line);
        }
        echo '+----+----------------+-------+--------+----------+
    ';
    }
    
  • 2021-01-16 22:35

    That's not "a string", it's HTML. You need to use an HTML parser like DOMDocument or simple_html_dom.

    See examples at http://htmlparsing.com/php.html
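
    As a minimal sketch of that approach with DOMDocument and DOMXPath (the sample markup, its ids and its labels are assumptions for illustration, since the question's original HTML is not shown here):

```php
<?php
// Hypothetical sample markup; the ids and labels are assumptions for illustration.
$html = '<ul>
  <li id="1">Page 1</li>
  <li id="2">Page 2
    <ul>
      <li id="3">Sub Page A</li>
    </ul>
  </li>
</ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);          // parse the markup into a DOM tree
$xpath = new DOMXPath($doc);

$labels = [];
foreach ($xpath->query('//li') as $li) {
    // text()[1] is the item's own first text node, so nested lists are not included
    $labels[] = $xpath->evaluate('normalize-space(./text()[1])', $li);
}

print_r($labels);
```

    The query //li visits the items in document order, nested ones included, which is what makes a flat listing of a nested list possible.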

  • 2021-01-16 22:39

    You can divide the problem into two parts. The first is parsing the HTML, which is most easily done with DOMDocument and DOMXPath. The second is running a mapping in the context of each node returned by another XPath expression / query. That may sound a bit complicated, but it is not. A simplified variant is outlined in a previous answer to Get parent element through xpath and all child elements.

    Your case is a bit more complicated. Here is some pseudo-code; I added the label because it makes things more visible for demonstration purposes:

    foreach //li ::
        ID       := string(./@id)
        ParentID := string(./ancestor::li[1]/@id)
        Label    := normalize-space(./text()[1])
    

    As this shows, this returns the bare data only. You also want the Order and the Children. Normally the Children listing is not needed (I keep it here anyway). What the Order value and the Children value have in common is that both are retrieved from context.

    E.g. while traversing the //li node list in document order, the children of each parent can be numbered if a counter is kept per ParentID.

    The Children value is similar: like a counter, it needs to be built up while iterating over the list. Only at the very end is the correct value available for each list item.

    So those two values live in a context, which I create in the form of an array keyed by ParentID: $parents. For each ID it can contain two entries: 0 holding the counter for Order and 1 holding an array of the IDs of its Children (if any).

    Note: Technically this is not totally correct. The Order and Children should be expressible in pure XPath as well; I just didn't do it in this example, in order to show how to add your own non-XPath context, e.g. if you want a different ordering or children handling.
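
    For comparison, here is a rough sketch of how Order and Children could be computed with XPath alone for each li node. The sample markup and the variable names are assumptions for illustration:

```php
<?php
// Hypothetical sample markup (an assumption, not the question's original input).
$html = '<ul>
  <li id="1">Page 1</li>
  <li id="2">Page 2
    <ul>
      <li id="3">Sub Page A</li>
      <li id="4">Sub Page B</li>
    </ul>
  </li>
</ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$rows = [];
foreach ($xp->query('//li') as $li) {
    // Order: position among sibling <li> elements, counted from 1.
    $order = (int) $xp->evaluate('count(./preceding-sibling::li) + 1', $li);

    // Children: ids of the <li> elements in a directly nested <ul>.
    $children = [];
    foreach ($xp->query('./ul/li/@id', $li) as $attr) {
        $children[] = $attr->nodeValue;
    }

    $rows[$li->getAttribute('id')] = [$order, implode(',', $children)];
}

print_r($rows);
```

    With this variant no shared state is needed between iterations, at the cost of re-querying siblings and children for every node.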

    Enough with the theory. Considering the standard setup:

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xp = new DOMXPath($doc);
    

    This mapping, including its context, can be written as an anonymous function:

    $parents = [];
    
    $map = function (DOMElement $li) use ($xp, &$parents) {
    
        $id       = (int)$xp->evaluate('string(./@id)', $li);
        $parentId = (int)$xp->evaluate('string(./ancestor::li[1]/@id)', $li);
        $label    = $xp->evaluate('normalize-space(./text()[1])', $li);
    
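        // Bump the per-parent counter; its new value is this item's Order
        $parents[$parentId][0] = ($parents[$parentId][0] ?? 0) + 1;
        $order                   = $parents[$parentId][0];
        // Register this item as a child of its parent
        $parents[$parentId][1][] = $id;
        // Make sure a Children slot exists so it can be returned by reference
        isset($parents[$id][1]) || $parents[$id][1] = [];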
    
        return array($id, $label, $order, $parentId, &$parents[$id][1]);
    };
    

    As you can see, the first part retrieves the values as in the pseudo-code, and the second part handles the context values. It mainly initializes the context for the ID / ParentID if it does not exist yet.

    This mapping needs to be applied:

    $result = [];
    foreach ($xp->query('//li') as $li) {
        list($id) = $array = $map($li);
        $result[$id] = $array;
    }
    

    This makes $result contain the listing of items and $parents the context data. Because each row holds a reference to its Children array, the Children values need to be imploded now; after that the references can be removed:

    foreach ($parents as &$parent) {
        $parent[1] = implode(',', $parent[1]);
    }
    unset($parent, $parents);
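
    To see why imploding inside $parents also changes the rows, here is a minimal sketch of the reference mechanism in isolation. The variable names and values are illustrative assumptions, not part of the code above:

```php
<?php
// Minimal sketch: variable names and values are illustrative only.
$context = [7 => []];

// The row stores a reference to the children slot, not a copy of it.
$row = [7, 'Page 3', &$context[7]];

// A child discovered later is still visible through the row.
$context[7][] = 8;

// Replacing the slot's value in a by-reference loop updates the row as well.
foreach ($context as &$slot) {
    $slot = implode(',', $slot);
}
unset($slot, $context); // drop the extra references; $row keeps the value

var_dump($row[2]);
```

    Storing a copy instead of a reference would freeze the row's children at the moment the row was created, before any child had been visited.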
    

    This then makes $result the final result which can be output:

    echo '+----+----------------+-------+--------+----------+
    | ID |     LABEL      | ORDER | PARENT | CHILDREN |
    +----+----------------+-------+--------+----------+
    ';
    foreach ($result as $line) {
        vprintf("| %' 2d | %' -14s |  %' 2d   |   %' 2d   | %-8s |\n", $line);
    }
    echo '+----+----------------+-------+--------+----------+
    ';
    

    Which then looks like:

    +----+----------------+-------+--------+----------+
    | ID |     LABEL      | ORDER | PARENT | CHILDREN |
    +----+----------------+-------+--------+----------+
    |  1 | Page 1         |   1   |    0   |          |
    |  2 | Page 2         |   2   |    0   | 3,4,5    |
    |  3 | Sub Page A     |   1   |    2   |          |
    |  4 | Sub Page B     |   2   |    2   |          |
    |  5 | Sub Page C     |   3   |    2   | 6        |
    |  6 | Sub Sub Page I |   1   |    5   |          |
    |  7 | Page 3         |   3   |    0   | 8        |
    |  8 | Sub Page D     |   1   |    7   |          |
    |  9 | Page 4         |   4   |    0   |          |
    +----+----------------+-------+--------+----------+
    

    You can find the Demo online here.
