How much work is it reasonable for an object constructor to do? Should it simply initialize fields and not actually perform any operations on data, or is it okay to have it
As quite a few have commented, the general rule is to only do initialization in constructors and never call, say, virtual methods (you may get a warning from your compiler or a static analyzer if you try; pay attention to that warning :) ). In your specific case I wouldn't go for the parseHTML method either. An object should be in a valid state once it's constructed; you shouldn't have to do stuff to the object before you can really use it.
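To make the virtual-method warning concrete, here is a minimal C# sketch. The class names `Page` and `TitledPage` are made up for illustration. The base constructor calls a virtual method, so the override runs before the derived constructor body has assigned its field.

```csharp
using System;

public class Page
{
    public string Summary { get; }

    public Page()
    {
        // Virtual call from a constructor: this dispatches to the most
        // derived override, even though that class's constructor body
        // has not run yet.
        Summary = Describe();
    }

    protected virtual string Describe() => "generic page";
}

public class TitledPage : Page
{
    private readonly string _title;

    public TitledPage(string title)
    {
        _title = title; // runs only after Page() has already finished
    }

    // Sees _title as null when called from the Page constructor.
    protected override string Describe() => _title ?? "(title not set yet)";
}
```

With these classes, `new TitledPage("Home").Summary` is `"(title not set yet)"` rather than `"Home"`, which is exactly the kind of surprise the warning is trying to save you from.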
Personally I'd go for a factory method: expose a class with no public constructors and create instances through a factory method instead. Let your factory method do the parsing and pass the parsed result to a private/protected constructor.
Take a look at System.Net.WebRequest if you want to see a sample of some similar logic.
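A rough sketch of that factory-method shape in C#, under some assumptions: `ScrapedPage`, `Parse`, and `ExtractTitle` are hypothetical names, and the "parsing" is a stand-in that only pulls out the page title. The point is only that the constructor receives finished data and the HTML string is never stored.

```csharp
using System;

public sealed class ScrapedPage
{
    public string Title { get; }

    // Private constructor: receives already-parsed data and does no work.
    private ScrapedPage(string title)
    {
        Title = title;
    }

    // Factory method: does the parsing, then builds a fully valid object.
    // The (possibly large) html string is not kept as a field.
    public static ScrapedPage Parse(string html)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));
        return new ScrapedPage(ExtractTitle(html));
    }

    // Placeholder "parser": returns the text between <title> and </title>,
    // or an empty string if either tag is missing.
    private static string ExtractTitle(string html)
    {
        const string open = "<title>";
        const string close = "</title>";
        int start = html.IndexOf(open, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return string.Empty;
        start += open.Length;
        int end = html.IndexOf(close, start, StringComparison.OrdinalIgnoreCase);
        return end < 0 ? string.Empty : html.Substring(start, end - start).Trim();
    }
}
```

Callers write `ScrapedPage.Parse(html)` instead of `new ScrapedPage(...)`, so any parse failure surfaces from a method whose name says it parses.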
Misko Hevery has a nice story on this subject, from a unit testing perspective.
You should try to keep the constructor from doing unnecessary work. In the end, it all depends on what the class should do, and how it should be used.
For instance, will all the accessors be called after constructing your object? If not, then you've processed data unnecessarily. Also, there's a bigger risk of throwing a "senseless" exception (oh, while trying to create the parser, I got an error because the file was malformed, but I didn't even ask it to parse anything...)
On second thought, you might need fast access to this data once the object is built, even if building the object takes a long time. Doing the work up front might be OK in that case.
Anyway, if the building process is complicated, I'd suggest using a creational pattern (factory, builder).
In my case, the entire contents of the HTML file are passed through a String. The string is no longer required once it is parsed and is fairly large (a few hundred kilobytes). So it would be best to not keep it in memory. The object shouldn't be used for other cases. It was designed to parse a certain page. Parsing something else should prompt the creation of a different object to parse that.
It sounds very much as though your object isn't really a parser. Does it just wrap a call to a parser and present the results in a (presumably) more usable fashion? If so, you need to call the parser in the constructor, as your object would be in a non-useful state otherwise.
I'm not sure how the "object-oriented" part helps here. If there's only one object and it can only process one specific page then it's not clear why it needs to be an object. You could do this just as easily in procedural (i.e. non-OO) code.
For languages that only have objects (e.g. Java) you could just create a static method in a class that has no accessible constructor, and then invoke the parser and return all of the parsed values in a Map or similar collection.
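Here is what that static-method approach might look like, written in C# to match the other code in this thread (a `Dictionary` plays the role of Java's `Map`). `PageParser`, `Between`, and the keys `"title"` and `"heading"` are invented for the example, and the tag matching is deliberately naive.

```csharp
using System;
using System.Collections.Generic;

// A static class cannot be instantiated, so there is no constructor to worry about.
public static class PageParser
{
    // Parses the page and returns every extracted value in one collection.
    public static IReadOnlyDictionary<string, string> Parse(string html)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));

        var values = new Dictionary<string, string>
        {
            ["title"] = Between(html, "<title>", "</title>"),
            ["heading"] = Between(html, "<h1>", "</h1>"),
        };
        return values;
    }

    // Returns the text between the first 'open' and the next 'close',
    // or an empty string if either marker is missing.
    private static string Between(string text, string open, string close)
    {
        int start = text.IndexOf(open, StringComparison.Ordinal);
        if (start < 0) return string.Empty;
        start += open.Length;
        int end = text.IndexOf(close, start, StringComparison.Ordinal);
        return end < 0 ? string.Empty : text.Substring(start, end - start);
    }
}
```

The trade-off is that callers get string keys instead of typed accessors, so a misspelled key only shows up at run time.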
I would not do the parsing in the constructor. I would do everything necessary to validate the constructor parameters, and to ensure that the HTML can be parsed when needed.
But I'd have the accessor methods do the parsing if the HTML is not parsed by the time they need it to be. The parsing can wait until that time - it does not need to be done in the constructor.
Suggested code, for discussion purposes:
    using System.IO;

    public class MyHtmlScraper {
        private TextReader _htmlFileReader;
        private bool _parsed;

        public MyHtmlScraper(string htmlFilePath) {
            // Opening the reader validates that the file exists and is readable.
            _htmlFileReader = new StreamReader(htmlFilePath);
            // If done in the constructor, DoTheParse would be called here
        }

        private string _parsedValue1;
        public string Accessor1 {
            get {
                EnsureParsed();
                return _parsedValue1;
            }
        }

        private string _parsedValue2;
        public string Accessor2 {
            get {
                EnsureParsed();
                return _parsedValue2;
            }
        }

        // Parses on first use only; later calls return immediately.
        private void EnsureParsed() {
            if (_parsed) return;
            DoTheParse();
            _parsed = true;
        }

        private void DoTheParse() {
            // parse the file here, using _htmlFileReader
            // parse into _parsedValue1, 2, etc.
        }
    }
With this code in front of us, we can see there's very little difference between doing all the parsing in the constructor, and doing it on demand. There's a test of a boolean flag, and the setting of the flag, and the extra calls to EnsureParsed in each accessor. I'd be surprised if that extra code were not inlined.
This isn't a huge big deal, but my inclination is to do as little as possible in the constructor. That allows for scenarios where the construction needs to be fast. These will no doubt be situations you have not considered, like deserialization.
Again, it's not a huge big deal, but you can avoid doing the work in the constructor, and it's not expensive to do the work elsewhere. I admit, it's not like you're off doing network I/O in the constructor (unless, of course, a UNC file path is passed in), and you're not going to have to wait long in the constructor (unless there are networking problems, or you generalize the class to be able to read the HTML from places other than a file, some of which might be slow).
But since you don't have to do it in the constructor, my advice is simply - don't.
And if you do, it could be years before it causes an issue, if at all.
In general, a constructor should:
However, I would not use the constructor in the way you have. Parsing should be separated from using the parsing results.
Generally when I write a parser I write it as a singleton. I don't store any fields in the object except the single instance; instead, I only use local variables within the methods. Theoretically these could just be static (class-level) methods, but that would mean that I couldn't make them virtual.
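A minimal sketch of what I mean, with made-up names (`HtmlParser`, `ParseResult`, `ExtractTitle`) and a toy title extraction standing in for real parsing. The parser keeps no per-parse state, and the results live in a separate object.

```csharp
using System;

// Holds the results; knows nothing about how they were produced.
public sealed class ParseResult
{
    public string Title { get; }
    public ParseResult(string title) { Title = title; }
}

public class HtmlParser
{
    // The only field: the single shared instance.
    public static readonly HtmlParser Instance = new HtmlParser();

    // Protected so that subclasses can exist but callers cannot use 'new'.
    protected HtmlParser() { }

    // All working state lives in local variables.
    public ParseResult Parse(string html)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));
        string title = ExtractTitle(html);
        return new ParseResult(title);
    }

    // Virtual so a subclass can replace this step,
    // which a static method would not allow.
    protected virtual string ExtractTitle(string html)
    {
        const string open = "<title>";
        const string close = "</title>";
        int start = html.IndexOf(open, StringComparison.Ordinal);
        if (start < 0) return string.Empty;
        start += open.Length;
        int end = html.IndexOf(close, start, StringComparison.Ordinal);
        return end < 0 ? string.Empty : html.Substring(start, end - start);
    }
}
```

Usage is `HtmlParser.Instance.Parse(html).Title`. Because nothing is stored on the instance between calls, sharing it is harmless in this sketch.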