问题
There's a bug in Traverse()
that's causing it to iterate nodes more than once.
Bugged Code
public IEnumerable<HtmlNode> Traverse()
{
foreach (var node in _context)
{
yield return node;
foreach (var child in Children().Traverse())
yield return child;
}
}
public SharpQuery Children()
{
return new SharpQuery(_context.SelectMany(n => n.ChildNodes).Where(n => n.NodeType == HtmlNodeType.Element), this);
}
public SharpQuery(IEnumerable<HtmlNode> nodes, SharpQuery previous = null)
{
if (nodes == null) throw new ArgumentNullException("nodes");
_previous = previous;
_context = new List<HtmlNode>(nodes);
}
Test Code
static void Main(string[] args)
{
var sq = new SharpQuery(@"
<a>
<b>
<c/>
<d/>
<e/>
<f>
<g/>
<h/>
<i/>
</f>
</b>
</a>");
var nodes = sq.Traverse();
Console.WriteLine("{0} nodes: {1}", nodes.Count(), string.Join(",", nodes.Select(n => n.Name)));
Console.ReadLine();
Output
19 nodes: #document,a,b,c,g,h,i,d,g,h,i,e,g,h,i,f,g,h,i
Expected Output
Each letter a-i printed once.
Can't seem to figure out where's it going wrong... node.ChildNodes
does return just direct children, right? (from HtmlAgilityPack)
Full class (condensed) if you want to try and run it yourself.
public class SQLite
{
private readonly List<HtmlNode> _context = new List<HtmlNode>();
private readonly SQLite _previous = null;
public SQLite(string html)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
_context.Add(doc.DocumentNode);
}
public SQLite(IEnumerable<HtmlNode> nodes, SQLite previous = null)
{
if (nodes == null) throw new ArgumentNullException("nodes");
_previous = previous;
_context = new List<HtmlNode>(nodes);
}
public IEnumerable<HtmlNode> Traverse()
{
foreach (var node in _context)
{
yield return node;
foreach (var child in Children().Traverse())
yield return child;
}
}
public SQLite Children()
{
return new SQLite(_context.SelectMany(n => n.ChildNodes).Where(n => n.NodeType == HtmlNodeType.Element), this);
}
}
回答1:
First, note that everything goes according to plan as long as we're iterating over elements that don't have any sibling. As soon as we hit element <c>
, things start going haywire. It's also interesting that <c>
, <d>
and <e>
all think they contain <f>
's children.
Let's take a closer look at your call to SelectMany()
:
_context.SelectMany(n => n.ChildNodes)
That iterates through the items in _context
and accumulates the child nodes of each item. Let's have a look at _context
. Everything should be okay, since it's a List
of length 1
. Or is it?
I suspect your SharpQuery(string)
constructor stores sibling elements inside the same list. In that case, _context
may not be of length 1
anymore and, remember, SelectMany()
accumulates the child nodes of each item in the list.
For example, if _context
is a list containing <c>
, <d>
, <e>
and <f>
, only <f>
has children, and SelectMany()
is called once for each element, it will accumulate and return the children of <f>
four times.
I think that's your bug.
EDIT: Since you've posted the full class, I don't have to guess anymore. Look at the sequence of operations when you iterate over <b>
(iterators replaced by lists for better comprehension):
- Call
Children()
on<b>
, which returns<c>
,<d>
,<e>
and<f>
, - Call
Traverse()
on the resulting list:- Call
Children()
on<c>
, but_context
actually contains<c>
,<d>
,<e>
and<f>
, not only<c>
, so that returns<g>
,<h>
and<i>
, - Call
Children()
on<d>
, same thing since_context
is the same for both<c>
and<d>
(and<e>
, and<f>
), - Lather, rinse, repeat.
- Call
回答2:
Children() is broken, it selects children of siblings as well. I'd rewrite to something like this:
public IEnumerable<HtmlNode> Traverse(IEnumerable<HtmlNode> nodes)
{
foreach (var node in nodes)
{
yield return node;
foreach (var child in Traverse(ChildNodes(node)))
yield return child;
}
}
private IEnumerable<HtmlNode> ChildNodes(HtmlNode node)
{
return node.ChildNodes.Where(n => n.NodeType == HtmlNodeType.Element);
}
public SharpQuery(IEnumerable<HtmlNode> nodes, SharpQuery previous = null)
{
if (nodes == null) throw new ArgumentNullException("nodes");
_previous = previous; // Is this necessary?
_context = new List<HtmlNode>(nodes);
}
回答3:
Shouldn't this be enough?
public IEnumerable<HtmlNode> Traverse()
{
foreach (var node in _context)
{
Console.WriteLine(node.Name);
yield return node;
foreach (var child in Children().Traverse())
{} //yield return child;
}
}
回答4:
I can't tell exactly, but you can see the pattern is that once your stuff encounter the "/" of the "c", it start to consider the "g,h,i" to be part of every subsequent letter until it encounter the "/" of the "f".
Most likely you have an extra loop that you should get ride of.
Or for some reason, it doesnt compute correctly the "/>" parts.
回答5:
I would do something like this and see if it fixes the issue:
public IEnumerable<HtmlNode> Traverse()
{
foreach (var node in _context)
{
Console.WriteLine(node.Name);
if(!node.HasChildren) {
yield return node;
}
else {
foreach (var child in Children().Traverse())
yield return child;
}
}
}
}
回答6:
public IEnumerable<HtmlNode> All()
{
var queue = new Queue<HtmlNode>(_context);
while (queue.Count > 0)
{
var node = queue.Dequeue();
yield return node;
foreach (var next in node.ChildNodes.Where(n => n.NodeType == HtmlNodeType.Element))
queue.Enqueue(next);
}
}
courtesy link
来源:https://stackoverflow.com/questions/4138803/wheres-the-bug-in-this-tree-traversal-code