How much work should the constructor for an HTML parsing class do?

How much work is it reasonable for an object constructor to do? Should it simply initialize fields and not actually perform any operations on data, or is it okay to have it

19 answers

我寻月下人不归

2020-12-23 10:41

In this particular case, I would say there are two classes here: a parser and a parse result.

```
// Parser.java
public class Parser {
    public Parser() {
        // Do what is necessary to construct a parser.
        // Perhaps we need to initialize a Unicode library, UTF-8 decoder, etc.
    }

    // ParsingException is assumed to be a checked exception defined elsewhere.
    public ParseResult parseHTMLString(final String html) throws ParsingException {
        // The parser would do the actual work here.
        return new ParseResult(1, 2);
    }
}

// ParseResult.java
public class ParseResult {
    private final int field1;
    private final int field2;

    public ParseResult(int field1, int field2) {
        this.field1 = field1;
        this.field2 = field2;
    }

    public int getField1() {
        return field1;
    }

    public int getField2() {
        return field2;
    }
}
```

If your parser could work on partial sets of data, I'd suspect it would be suitable to add another class into the mix. Possibly a PartialParseResult?

轻奢々

2020-12-23 10:42
I normally follow one easy principle:

Everything that is mandatory for the correct existence and behavior of the class instance should be passed to, and done in, the constructor.

Every other activity is done by other methods.

The constructor should never:
- call other methods of the class in order to rely on overriding behavior
- act on its private attributes via methods
I learned the hard way that while you are in the constructor, the object is in an incoherent, intermediate state that is too dangerous to handle. Some of the resulting unexpected behavior can be anticipated from your own code, and some comes from the language architecture and compiler decisions. Never guess, stay safe, be minimal.

In your case, I would use a Parser::parseHtml(file) method. Instantiating the parser and parsing are two different operations. When you instantiate a parser, the constructor puts it in a condition to perform its job (parsing). Then you call its method to perform the parsing. You then have two choices:
1. Either you allow the parser to hold the results of the parsing and give clients an interface to retrieve the parsed information (e.g. Parser::getFooValue()). These methods return null if you haven't performed parsing yet, or if the parsing failed.
2. Or your Parser::parseHtml() returns a ParsingResult instance containing what the Parser found.
The second strategy gives you better granularity, as the Parser is now stateless and the client interacts with the methods of the ParsingResult interface. The Parser interface remains sleek and simple. The internals of the Parser class will tend to follow the Builder pattern.

You comment: "I feel as though returning an instance of a parser that hasn't parsed anything (as you suggest), a constructor that's lost its purpose. There's no use in initializing a parser without the intent of actually parsing the information. So if parsing is going to happen for sure, should we parse as early as possible and report any errors early, such as during the construction of the parser? I feel as though initializing a parser with invalid data should result in an error being thrown."

Not really. If you return an instance of a Parser, of course it's going to parse. In Qt, when you instantiate a button, of course it's going to be shown. Still, you have to call QWidget::show() yourself before anything is visible to the user.

Any object in OOP has two concerns: initialization and operation (ignore finalization, it's not under discussion right now). If you keep these two operations together, you both risk trouble (having an incomplete object operating) and lose flexibility. There are plenty of reasons why you would perform intermediate setup of your object before calling parseHtml(). Example: suppose you want to configure your Parser to be strict (so that it fails if a given column in a table contains a string instead of an integer) or permissive. Or you want to register a listener object that is notified every time a parse starts or ends (think GUI progress bar). These are optional pieces of information, and if your architecture makes the constructor the übermethod that does everything, you end up with a huge list of optional parameters and conditions to handle in a method that is inherently a minefield.
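
To make the configure-then-parse idea concrete, here is a minimal Java sketch. Every name in it (`ConfigurableParser`, `setStrict`, `addListener`, `ParseListener`) is hypothetical, and the "parsing" is a toy stand-in that reads a single table cell as an integer. It only illustrates how optional setup can live in ordinary methods instead of constructor parameters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener, notified when a parse starts and ends (think GUI progress bar).
interface ParseListener {
    void parseStarted(String input);
    void parseFinished(String input, boolean success);
}

// Hypothetical parser: construction only brings the object to a valid, idle state.
class ConfigurableParser {
    private boolean strict = false; // permissive by default
    private final List<ParseListener> listeners = new ArrayList<>();

    // Optional configuration lives in ordinary methods, not in constructor parameters.
    void setStrict(boolean strict) { this.strict = strict; }
    void addListener(ParseListener listener) { listeners.add(listener); }

    // Toy stand-in for real parsing: reads one table cell as an integer.
    // Strict mode rejects non-numeric text; permissive mode falls back to 0.
    int parseCell(String cell) {
        for (ParseListener l : listeners) l.parseStarted(cell);
        boolean ok = false;
        try {
            int value;
            try {
                value = Integer.parseInt(cell.trim());
            } catch (NumberFormatException e) {
                if (strict) throw new IllegalArgumentException("Not an integer: " + cell, e);
                value = 0;
            }
            ok = true;
            return value;
        } finally {
            // Listeners are told about the outcome whether parsing succeeded or threw.
            for (ParseListener l : listeners) l.parseFinished(cell, ok);
        }
    }
}
```

A client that does not care about strictness or progress reporting just writes `new ConfigurableParser().parseCell("42")`. A client that does care calls `setStrict(true)` or `addListener(...)` first, and the constructor signature never has to change.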

"Caching should not be the responsibility of a parser. If data is to be cached, a separate cache class should be created to provide that functionality."

On the contrary. If you know that you are going to use the parsing functionality on a lot of files, and there's a significant chance that the files will be accessed and parsed again later on, it is the internal responsibility of the Parser to perform smart caching of what it has already seen. The client is totally oblivious to whether this caching is performed or not. It still calls the parser and still obtains a result object, but it gets the answer much faster. I think there's no better demonstration of separation of concerns than this. You boost performance with absolutely no change to the contract interface or the overall software architecture.
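
As a rough Java sketch of that kind of transparent caching: the names `CachingParser` and `CachedParseResult` are hypothetical, the "parse" just counts `<` characters, and the cache is keyed on the whole input string for simplicity. A real implementation would more likely key on something like a file path plus modification time and would bound the cache size.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical result type; a real one would hold whatever the parser extracts.
final class CachedParseResult {
    final int openBracketCount;
    CachedParseResult(int openBracketCount) { this.openBracketCount = openBracketCount; }
}

class CachingParser {
    private final Map<String, CachedParseResult> cache = new HashMap<>();
    private int parseRuns = 0; // counts real parses; exposed only to illustrate the cache

    // Same contract with or without the cache: HTML in, result out.
    CachedParseResult parseHtml(String html) {
        return cache.computeIfAbsent(html, this::doParse);
    }

    int getParseRuns() { return parseRuns; }

    // Toy stand-in for expensive parsing: counts '<' characters.
    private CachedParseResult doParse(String html) {
        parseRuns++;
        int count = 0;
        for (int i = 0; i < html.length(); i++) {
            if (html.charAt(i) == '<') count++;
        }
        return new CachedParseResult(count);
    }
}
```

Calling `parseHtml` twice with the same input returns the same result object and runs `doParse` only once, yet the caller's code is identical to what it would be without the cache.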

However, note that I am not advocating that you should never use a constructor call to perform parsing. I am just claiming that it's potentially dangerous and that you lose flexibility. There are plenty of examples out there where the constructor is at the center of the actual activity of the object, but there are also plenty of examples of the opposite. Example (although biased, since it arises from C style): in Python, I would consider something like this very weird
```
f = file()
f.setReadOnly()
f.open(filename)
```
instead of the actual
```
f = open(filename, "r")
```
But I am sure there are I/O libraries that use the first approach (with the second offered as syntactic sugar).

Edit: finally, remember that while adding a constructor "shortcut" later is easy and backward compatible, removing that functionality is not possible if you later find it dangerous or problematic. Additions to an interface are much easier than removals, for obvious reasons. Sugary behavior must be weighed against the future support you will have to provide for it.

孤独总比滥情好

2020-12-23 10:45

A constructor should do whatever is necessary to put that instance into a runnable, valid, ready-to-use state. If that means some validation or analysis, I'd say it belongs there. Just be careful about how much the constructor does.

There might be other places in your design where validation fits as well.

If the input values are coming from a UI, I'd say that it should have a hand in ensuring valid input.

If the input values are being unmarshalled from an incoming XML stream, I'd think about using schemas to validate it.
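
As a small Java illustration of a constructor that validates just enough to leave the instance ready to use: `HtmlSource` is a hypothetical class, and the two checks are only examples of the kind of validation that could belong there.

```java
// Hypothetical class: holds HTML that is guaranteed non-null and non-blank once constructed.
final class HtmlSource {
    private final String html;

    HtmlSource(String html) {
        // Validate in the constructor so that an invalid instance can never exist.
        if (html == null) {
            throw new IllegalArgumentException("html must not be null");
        }
        if (html.trim().isEmpty()) {
            throw new IllegalArgumentException("html must not be blank");
        }
        this.html = html;
    }

    String getHtml() { return html; }
}
```

The constructor stays cheap here: it rejects input that can never be valid, and leaves anything expensive to other methods.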

天涯浪人

2020-12-23 10:46

I personally put nothing in constructors and have a set of initialization functions. I find standard constructor methods have limited and cumbersome reuse.

孤独总比滥情好

2020-12-23 10:47
It is a good rule of thumb to only initialize fields in constructors, and otherwise do as little as possible to initialize the Object. Using Java as an example, you could run into problems if you call methods in your constructor, especially if you subclass your Object. This is because, due to the order of operations in the instantiation of Objects, the subclass's instance field initializers do not run until after the super constructor has finished. If an overridden method reads such a field while the super constructor is still running, it sees the default value (null for object references), and dereferencing it throws a NullPointerException.

Suppose you have a superclass:
```
class Test {

   Test () {
      doSomething();
   }

   void doSomething() {
     ...
   }
 }
```
and you have a subclass:
```
class SubTest extends Test {
    Object myObj = new Object();

    @Override
    void doSomething() {
        System.out.println(myObj.toString()); // throws a NullPointerException          
    }
 }
```
This is an example specific to Java, and while different languages handle this sort of ordering differently, it serves to drive the point home.
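
For anyone who wants to observe the ordering without crashing, here is a small runnable variant of the same idea. The class names `BaseDemo` and `DerivedDemo` are hypothetical. Instead of dereferencing the field, the override records whether the field was still null when the super constructor called it.

```java
// Hypothetical names; mirrors the Test/SubTest example in runnable form.
class BaseDemo {
    BaseDemo() {
        describe(); // dispatches to the subclass override before subclass fields are initialized
    }

    void describe() { }
}

class DerivedDemo extends BaseDemo {
    private final Object myObj = new Object(); // assigned only after BaseDemo() returns

    // Deliberately has no initializer: an initializer would run after the super
    // constructor and overwrite the value recorded in describe().
    boolean wasNullInSuperConstructor;

    @Override
    void describe() {
        // Called from BaseDemo(), while myObj still holds its default value.
        wasNullInSuperConstructor = (myObj == null);
    }

    boolean isInitializedNow() { return myObj != null; }
}
```

After `new DerivedDemo()`, `wasNullInSuperConstructor` is true even though `myObj` is a final field with an initializer, while `isInitializedNow()` returns true once construction has completed.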

Edit, in answer to your comment:

Though I would normally shy away from calling methods in constructors, in this case you have a few options:
1. In your constructor, set the HTML string as a field in your Class, and parse every time your getters are called. This most likely will not be very efficient.
2. Set the HTML as a field on your object, and then introduce a dependency on parse(): either it must be called right after the constructor has finished, or you add some sort of lazy parsing by putting something like 'ensureParsed()' at the head of your accessors. I don't like this all that much, as you could still have the HTML around after you've parsed, and your ensureParsed() call could be coded to set all of your parsed fields, thereby introducing a side effect into your getters.
3. You could call parse() from your constructor and run the risk of throwing an exception. As you say, you are setting the fields to initialize the Object, so this is really OK. With regard to the Exception, reporting that an illegal argument was passed into a constructor is acceptable. If you do this, you should make sure you understand the way your language handles the creation of Objects, as discussed above. To follow up on the Java example above, you can do this without fear if you ensure that only private methods, which subclasses cannot override, are called from within a constructor.
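
A minimal Java sketch of option 3, under stated assumptions: `ParsedHtml` and `parseTitle` are hypothetical names, and the "parsing" is a toy that only extracts the text between `<title>` and `</title>`. The helper is private (and here also static), so no subclass override can run against a half-constructed object, and bad input surfaces as an IllegalArgumentException from the constructor.

```java
// Hypothetical class illustrating option 3: parse eagerly, from a private method.
final class ParsedHtml {
    private final String title;

    ParsedHtml(String html) {
        // parseTitle() is private and static, so it cannot be overridden and
        // never touches instance state that is not yet initialized.
        this.title = parseTitle(html);
    }

    String getTitle() { return title; }

    // Toy stand-in for real parsing: extracts the text between <title> and </title>.
    private static String parseTitle(String html) {
        if (html == null) {
            throw new IllegalArgumentException("html must not be null");
        }
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("no <title> element found");
        }
        return html.substring(start + "<title>".length(), end);
    }
}
```

With this shape the getters have no side effects and the object is immutable once built. The trade-off is the one the option describes: constructing the object can fail.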

一整个雨季

2020-12-23 10:48

I'd probably just pass enough to initialize the object and then have a 'parse' method. The idea is that expensive operations should be as obvious as possible.
