What are some common strategies for refactoring large \"state-only\" objects?
I am working on a specific soft-real-time decision support system which do
Splitting a Large Data Object is very similar to Normalizing a Large Relational Table (first and second normal form). Follow the rules to reach at least second normal form and you may have a good decomposition of the original class.
From experience working also with R&D stuff with soft real-time performance constrains (and sometimes monster fat classes), I would suggest NOT to use OR mappers. In such situations, you'll be better off dealing "touching the metal" and working directly with JDBC result sets. This is my suggestion for apps with soft real-time constrains and massive amounts of data items per package. More importantly, if the number of distinct classes (not class instances, but class definitions) that need to persisted is large, and you also have memory constrains in your specs, you will also want to avoid ORMs like Hibernate.
Going back to your original question:
What you seem to have is a typical problem of 1) mapping multiple data items into a OO model and 2) such multiple data items do not exhibit a good way of grouping or segregation (and any attempt to grouping tends simply not to feel right.) Sometimes the domain model does not lend itself for such aggregation, and coming up with an artificial way of doing so typically ends up in compromises that don't satisfy all design requirements and desires.
To make matters worse, a OO model typically requires/expects you to have all the items present in a class as class' fields. Such a class is typically without behavior, so it is just a struct
-like construct, aka data envelope
or data shuttle
.
But such situations beg the following questions:
Does your application need to read/write all 40, 50+ data items at once, always? *Must all data items be always present?*
I do not know the specifics of your problem domain, but in general I've found that we rarely ever need to deal with all data items at once. This is where a relational model shines because you don't have to query all rows from a table at once. You only pulls those you need as projections of the table/view in question.
In a situation where we have a potentially large number of data items, but on average the number of data items being passed down the wire is less than the maximum, you'd be better off using a Properties pattern.
Instead of defining a monster envelope class holding all items :
// java pseudocode
class envelope
{
field1, field2, field3... field_n;
...
setFields(m1,m2,m3,...m_n){field1=m1; .... };
...
}
Define a dictionary (based on a map for example):
// java pseudocode
public enum EnvelopeField {field1, field2, field3,... field_n);
interface Envelope //package visible
{
// typical map-based read fields.
Object get(EnvelopeField field);
boolean isEmpty();
// new methods similar to existing ones in java.lang.Map, but
// more semantically aligned with envelopes and fields.
Iterator<EnvelopeField> fields();
boolean hasField(EnvelopeField field);
}
// a "marker" interface
// code that only needs to read envelopes must operate on
// these interfaces.
public interface ReadOnlyEnvelope extends Envelope {}
// the read-write version of envelope, notice that
// it inherits from Envelope, but not from ReadOnlyEnvelope.
// this is done to make it difficult (but not impossible
// unfortunately) to "cast-up" a read only envelope into a
// mutable one.
public interface MutableEnvelope extends Envelope
{
Object put(EnvelopeField field);
// to "cast-down" or "narrow" into a read only version type that
// cannot directly be "cast-up" back into a mutable.
ReadOnlyEnvelope readOnly();
}
// the standard interface for map-based envelopes.
public interface MapBasedEnvelope extends
Map<EnvelopeField,java.lang.Object>
MutableEnvelope
{
}
// package visible, not public
class EnvelopeImpl extends HashMap<EnvelopeField,java.lang.Object>
implements MapBasedEnvelope, ReadOnlyEnvelope
{
// get, put, isEmpty are automatically inherited from HashMap
...
public Iterator<EnvelopeField> fields(){ return this.keySet().iterator(); }
public boolean hasField(EnvelopeField field){ return this.containsKey(field); }
// the typecast is redundant, but it makes the intention obvious in code.
public ReadOnlyEnvelope readOnly(){ return (ReadOnlyEnvelope)this; }
}
public class final EnvelopeFactory
{
static public MapBasedEnvelope new(){ return new EnvelopeImpl(); }
}
No need to set up read-only
internal flags. All you need to do is downcast your envelope instances as Envelope
instances (that only provide getters).
Code that expects to read should operate on read-only envelopes and code that expects to change fields should operate on mutable envelopes. Creation of the actual instances would be compartmentalized in factories.
That is, you use the compiler to enforce things to be read-only (or allow things to be mutable) by establishing some code conventions, rules governing what interfaces to use where and how.
You can layer your code into sections that need to write separate from code that only needs to read. Once that's done, simple code reviews (or even grep) can identify code that is using the wrong interface.)
Non-public Parent Interface:
Envelope
is not declared as a public interface to prevent erroneous/malicious code from casting a read-only envelope down to a base envelope and then back to a mutable envelope. The intended flow is from mutable to read-only only - it is not intended to be bi-directional.
The problem here is that extension of Envelope
is restricted to the package that contains it. Whether that is a problem will depend on the particular domain and intended usage.
Factories:
The problem is that factories can (and most likely will) be very complex. Again, the nature of the beast.
Validation:
Another problem introduced with this approach is that now you have to worry about code that expects field X to be present. Having the original monster envelope class partially frees you from that worry because, at least syntactically, all fields are there...
... whether the fields are set or not, that was another matter that still remains with this new model I'm proposing.
So if you have client code that expects to see field X, the client code has to throw some type of exception if the field is not present (or to computer or read a sensible default somehow.) In such cases, you will have to
Identify patterns of field presence. Clients that expect field X to be present might be grouped separately (layered apart) from clients that expect some other field to be present.
Associate custom validators (proxies to read-only envelope interfaces) that either throw exceptions or compute default values for missing fields according to some rules (rules provided programmatically, with an interpreter, or with a rules engine.)
Lack of Typing:
This might be debatable, but people used to work with static typing might feel uneasy with losing the benefits of static typing by going to a loosely typied map-based approach. The counter-argument of this is that most of the web works on a loose typing approach, even on the Java side (JSTL, EL.)
Problems aside, the larger the maximum number of possible fields and the lower the average number of fields present at any given time, the most effective wrt performance this approach will be. It adds additional code complexity, but that's the nature of the beast.
That complexity doesn't go away, and either will be present in your class model or in your validation code. Serialization and transferring down the wire is much more efficient, though, specially if you expect massive numbers of individual data transfers.
Hope it helps.
I had much of the same problem you did.
At least I think I did, sounds like I did. Representation was different, but at 10,000 feet, sounds pretty much the same. Crapload of discrete, "arbitrary" variables and a bunch of ad hoc relationships among them (essentially business driven), subject to change at a moment's notice.
You also have another issue, which you sorta mentioned, and that was the performance requirement. Sounds like faster is better, and likely a slow perfect solution would be tossed out for the fast lousy one, simply because the slower one can't meet a baseline performance requirement, no matter how good it is.
To put it simply, what I did was I designed a simple domain specific rule language for my system.
The entire point of the DSL was to implicitly express relationships and package them up in to modules.
Very crude, contrived example:
D = 7
C = A + B
B = A / 5
A = 10
RULE 1: IF (C < 10) ALERT "C is less than 10"
RULE 2: IF (C > 5) ALERT "C is greater than 5"
RULE 3: IF (D > 10) ALERT "D is greater than 10"
MODULE 1: RULE 1
MODULE 2: RULE 3
MODULE 3: RULE 1, RULE 2
First, this is not representative of my syntax.
But you can see from the Modules, that it is 3, simple rules.
The key though, is that it's obvious from this that Rule 1 depends on C, which depends on A and B, and B depends on A. Those relationships are implied.
So, for that module, all of those dependencies "come with it". You can see if I generated code for Module 1 it might look something like:
public void module_1() {
int a = 10;
int b = a / 5;
int c = a + b;
if (c < 10) {
alert("C is less than 10");
}
}
Whereas if I created Module 2, all I would get is:
public void module_2() {
int d = 7;
if (d > 10) {
alert("D is greater than 10.");
}
}
In Module 3 you see the "free" reuse:
public void module_3() {
int a = 10;
int b = a / 5;
int c = a + b;
if (c < 10) {
alert("C is less than 10");
}
if (c > 5) {
alert("C is greater than 5");
}
}
So, even though I have one "soup" of rules, the Modules root the base of the dependencies, and thus filter out the stuff it doesn't care about. Grab a module, shake the tree and keep what's left hanging.
My system used the DSL to generate source code, but you can easily have it create a mini runtime interpreter as well.
Simple topological sorting handled the dependency graph for me.
So, the nice thing about this is that while there was inevitable duplication in the final, generated logic, at least across modules, there wasn't any duplication in the rule base. What you as a developer/knowledge worker maintain is the rule base.
What is also nice is that you can change an equation, and not worry so much about the side effects. For example, if I change do C = A / 2, then, suddenly, B drops out completely. But the rule for IF (C < 10) doesn't change at all.
With a few simple tools, you can show the entire dependency graph, you can find orphaned variables (like B), etc.
By generating source code, it's going to run as fast as you want.
In my case, it was interesting to see a rule drop a single variable and see 500 lines of source code vanish from the resulting module. That's 500 lines I didn't have to crawl through by hand and remove during maintenance and development. All I had to do was change a single rule in my rule base and let "magic" happen.
I was even able to do some simple peephole optimization and eliminate variables.
It's not that hard to do. Your rule language can be XML, or a simple expression parser. No reason to go full boat Yacc or ANTLR on it if you don't want to. I'll put a plug in for S-Expressions, no grammar needed, brain dead parsing.
Spreadsheets also make a great input tool, actually. Just be strict on the formatting. Kind of sucks for merging in SVN (so, Don't Do That), but end users love it.
You may well be able to get away with an actual rule based system. My system wasn't dynamic at runtime, and didn't really need sophisticated goal seeking and inference, so I didn't need the overhead of such a system. But if one works for you out of the box, then happy day.
Oh, and for an implementation note, for those who don't believe you can hit the 64K code limit in a Java method, well I can assure you it can be done :).
Actually this looks like a frequent problem that game developers face, bloated classes holding numerous variables and methods because of a deep inheritance tree etc.
There's this blog post about how and why to select composition over inheritance, maybe it would help.
One way you may be able to intelligently break up a large data class is to look at patterns of access by client classes. For example, if a set of classes only accesses fields 1-20 and another set of classes only accesses fields 25-30, maybe those groups of fields belong in separate classes.