I'm not aware of any software for this, or previous work done on it. And, fundamentally, I don't think you can get answers of the form "O(whatever)" that are trustworthy. Your measurements are noisy, you might be trying to distinguish n*log(n) operations from n*sqrt(n) operations, and unlike a nice clean mathematical analysis, all of the dropped constants are still floating around messing with you.
That said, the process I would go through if I wanted to come up with a best estimate:
- Making sure to record as much information as possible along the way, I'd run the thing I want to measure on as many inputs (and input sizes) as I could before I got bored, probably overnight, taking repeated measurements for each input and size (see the first sketch after this list).
- Shovel the resulting input-size-versus-time data into a trial copy of Eureqa and see what pops out.
- If I'm not satisfied, get more data, continue to shovel it into Eureqa and see if the situation is improving.
- Assuming Eureqa doesn't give an answer I like before I get bored of it consuming all of my CPU time and power, I'd switch over to Bayesian methods.
- Using something like pymc, I'd attempt to model the data with a bunch of likely-looking complexity functions (n, n^2, n^3, n*log(n), n^2*log(n), n^2*log(n)^2, etc., etc.); see the second sketch after this list.
- Compare the DIC (deviance information criterion; smaller is better) of each model, looking for the best few.
- Plot the best few, look for spots where data and model disagree.
- Collect more data near disagreements. Recompute the models.
- Repeat the modelling, comparison, plotting, and extra-data-collection steps until bored.
- Finally, collect some new data points at larger input sizes and see which model(s) best predict those points (see the last sketch after this list).
- Choose to believe that one of them is true enough.
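To make the measurement step concrete, here's a minimal sketch of the kind of harness I mean, assuming the routine under test is a placeholder called `work(n)` (a name I've made up) and that dumping (size, seconds) pairs to a CSV is good enough for the later analysis:

```python
import csv
import random
import time

def work(n):
    # Stand-in for the routine being measured; swap in the real thing.
    return sorted(random.random() for _ in range(n))

def measure(sizes, repeats=20, path="timings.csv"):
    """Record one (n, seconds) row per repetition, for each input size."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["n", "seconds"])
        for n in sizes:
            for _ in range(repeats):
                start = time.perf_counter()
                work(n)
                writer.writerow([n, time.perf_counter() - start])

if __name__ == "__main__":
    # Spread sizes over a few orders of magnitude; leave it running overnight.
    measure([2 ** k for k in range(8, 18)])
```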
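For the Bayesian modelling step, something along these lines is what I'd start from. A caveat: current PyMC doesn't report DIC directly, so this sketch leans on ArviZ's `az.compare` (LOO/WAIC by default) as the "smaller/better" model-comparison stand-in; the candidate list, the priors, and the `fit_models` helper are my own choices, not anything canonical.

```python
import numpy as np
import pymc as pm
import arviz as az

# Candidate complexity shapes; each maps input size n -> expected growth.
CANDIDATES = {
    "n":          lambda n: n,
    "n*log(n)":   lambda n: n * np.log(n),
    "n^2":        lambda n: n ** 2,
    "n^2*log(n)": lambda n: n ** 2 * np.log(n),
}

def fit_models(sizes, seconds):
    """Fit seconds ~ a*f(n) + b + noise for each candidate f; return traces."""
    sizes = np.asarray(sizes, dtype=float)
    seconds = np.asarray(seconds, dtype=float)
    traces = {}
    for name, f in CANDIDATES.items():
        x = f(sizes)
        x = x / x.max()  # rescale so one set of priors works for every shape
        with pm.Model():
            a = pm.HalfNormal("a", sigma=10)
            b = pm.HalfNormal("b", sigma=10)
            noise = pm.HalfNormal("noise", sigma=1)
            pm.Normal("t", mu=a * x + b, sigma=noise, observed=seconds)
            traces[name] = pm.sample(
                1000, tune=1000, progressbar=False,
                idata_kwargs={"log_likelihood": True},
            )
    return traces

# Rank the candidates; the best few are the ones worth plotting.
# print(az.compare(fit_models(sizes, seconds)))
```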
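And for the final check at larger input sizes, a cheap stand-in that skips the full posterior machinery: fit `a*f(n) + b` by least squares on the data you already have, then score each candidate on the new, larger-size measurements. The helper name is mine; the idea is just "whichever shape extrapolates best wins".

```python
import numpy as np

def extrapolation_error(train_sizes, train_times, test_sizes, test_times, f):
    """Fit a*f(n)+b on the existing data, then return the RMS prediction
    error on held-out measurements taken at larger sizes, for one candidate f."""
    X = np.column_stack([f(np.asarray(train_sizes, float)),
                         np.ones(len(train_sizes))])
    coef, *_ = np.linalg.lstsq(X, np.asarray(train_times, float), rcond=None)
    pred = coef[0] * f(np.asarray(test_sizes, float)) + coef[1]
    return np.sqrt(np.mean((pred - np.asarray(test_times, float)) ** 2))

# Example: compare two shapes on the held-out large-size data.
# err_nlogn = extrapolation_error(ns, ts, big_ns, big_ts, lambda n: n * np.log(n))
# err_n2    = extrapolation_error(ns, ts, big_ns, big_ts, lambda n: n ** 2)
```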