Measuring the time complexity empirically can be very difficult (if it is possible at all), and I have never seen this done in algorithm papers. If you cannot derive the time complexity from the (pseudo-)code or the algorithm description, then maybe you can use a heuristic to simplify the analysis.
Maybe you can also calculate the complexity of some parts of the algorithm and ignore other parts if their complexity is obviously much smaller.
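A related heuristic, if you do end up running the code, is to count only the operation you expect to dominate instead of measuring wall-clock time, which at least removes the hardware effects discussed below. A minimal sketch in Python, with insertion sort as a stand-in for the algorithm in question:

```python
import random

def insertion_sort_counting(a):
    """Insertion sort that counts only the dominant operation
    (element comparisons) instead of measuring wall-clock time."""
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1             # the operation we expect to dominate
            if a[j] <= key:
                break
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return comparisons

# The counts should grow roughly quadratically for random inputs:
for n in (100, 200, 400, 800):
    print(n, insertion_sort_counting([random.random() for _ in range(n)]))
```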
If nothing else helps, the usual way is to show how the algorithm scales on a real machine, just as you wrote.
But there are many things that affect the results. To name just a few:
- Memory hierarchy: If your input is small enough to fit into the L1 cache, your algorithm runs very fast simply because the memory is fast. If the input gets bigger, so that it no longer fits into the L1 cache, it is stored in the L2 cache, and if it gets even bigger it ends up in RAM. Each time this happens, your program slows down by a huge factor (in addition to the factor caused by the growing input). The worst case is when the input gets so big that part of it has to be stored on the hard disk.
- Multitasking: If your OS decides to hand the CPU over to another program, your algorithm appears to slow down. This is also hard to control for.
- Hardware: In big-O notation every operation counts as one unit of time. If your algorithm performs many operations that your CPU happens to be optimized for, this also affects your measurement.
- Software: Software can affect your measurement the same way hardware does. E.g., if the program performs many big-integer operations through a library, you can massively speed it up by switching to GMP.
- Warmup: Before you start measuring, you have to warm up the CPU first. Run the algorithm on a larger input first, without measuring (a timing sketch addressing this follows after this list).
- Input cases: You can only run your program on some chosen or randomly generated inputs of a given length. In most cases it is hard to tell (or simply impossible) whether a particular input causes a shorter or longer run-time, so you may end up testing the wrong examples. If you use random inputs, you get a wider spread of results.
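For the warmup and multitasking points, a minimal timing harness could look like the sketch below. `algorithm` and `make_input` are placeholders for your own code, and the numbers of warmup and measured runs are arbitrary choices:

```python
import random
import time

def measure(algorithm, make_input, n, warmup=3, runs=10):
    """Time `algorithm` on one input of size n.

    Warmup runs are not measured.  The minimum of the measured runs is
    least disturbed by other processes; the average shows the typical
    behaviour."""
    data = make_input(n)
    for _ in range(warmup):                      # warm up CPU and caches
        algorithm(data)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        algorithm(data)
        times.append(time.perf_counter() - start)
    return min(times), sum(times) / len(times)

# Example usage with sorted() as a stand-in for your own algorithm:
for n in (10_000, 20_000, 40_000):
    best, avg = measure(sorted, lambda size: [random.random() for _ in range(size)], n)
    print(f"n={n}: min={best:.4f}s  avg={avg:.4f}s")
```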
All in all: I think you can only get an idea of how your algorithm scales, but you cannot derive an exact upper bound on the complexity by measuring the run-time. Maybe this works for really small examples, but for bigger ones you will not get reliable results.
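One way to turn such measurements into that rough idea of the scaling is to compare consecutive run-times on a log-log scale: the slope approximates the exponent k if the run-time grows roughly like n^k. A minimal sketch with made-up placeholder numbers:

```python
import math

# Hypothetical measurements (input size, run-time in seconds);
# replace them with your own data.
measurements = [(1_000, 0.020), (2_000, 0.081), (4_000, 0.330), (8_000, 1.300)]

# On a log-log scale, the slope between two points approximates the
# exponent k if the run-time grows roughly like n^k.
for (n1, t1), (n2, t2) in zip(measurements, measurements[1:]):
    k = math.log(t2 / t1) / math.log(n2 / n1)
    print(f"{n1:>5} -> {n2:>5}: empirical exponent ~ {k:.2f}")
```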
The best you can do is:
- Write down the exact hardware and software of the computer you use for the measurements (see the sketch below this list).
- Repeat the tests multiple times (in different orders).
- If you change hardware or software, you should start from the beginning.
- Only use inputs that are all stored in the same type of memory, i.e. skip all cases that fit into the cache.
This way you can see whether changes have improved the algorithm, and others can verify your results.
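A minimal sketch of how you could record the environment and run the test cases in different (shuffled) orders; `sorted()` is only a stand-in algorithm and the number of passes is an arbitrary choice:

```python
import platform
import random
import sys
import time

def environment_info():
    """Record the exact hardware/software context alongside the results."""
    return {
        "machine": platform.machine(),
        "processor": platform.processor(),
        "os": platform.platform(),
        "python": sys.version,
    }

def run_suite(algorithm, make_input, sizes, passes=3):
    """Run every input size once per pass, each pass in a freshly
    shuffled order, and collect the measured times per size."""
    results = {n: [] for n in sizes}
    for _ in range(passes):
        order = list(sizes)
        random.shuffle(order)            # a different order on every pass
        for n in order:
            data = make_input(n)
            start = time.perf_counter()
            algorithm(data)
            results[n].append(time.perf_counter() - start)
    return environment_info(), results

# Example usage with sorted() as a stand-in algorithm:
info, results = run_suite(sorted, lambda n: [random.random() for _ in range(n)],
                          [10_000, 20_000, 40_000])
print(info)
print({n: sum(ts) / len(ts) for n, ts in results.items()})
```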
About the input:
- You should use worst-case inputs if possible. If you cannot tell whether an input is a worst case, you should use many different cases or random inputs (if possible).
- You have to run the tests (for each input length) until the average of the run-times stabilizes (see the sketch below).
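A minimal sketch of the "repeat until the average stabilizes" idea: keep adding timing runs until the running average changes by less than some tolerance. The tolerance and the run limits are arbitrary choices, and `sorted()` is again only a stand-in algorithm:

```python
import random
import time

def stable_average(algorithm, data, tolerance=0.01, min_runs=5, max_runs=200):
    """Keep adding timing runs until the running average changes by less
    than `tolerance` (relative), then return the average."""
    times = []
    previous_avg = None
    avg = None
    for _ in range(max_runs):
        start = time.perf_counter()
        algorithm(data)
        times.append(time.perf_counter() - start)
        avg = sum(times) / len(times)
        if previous_avg is not None and len(times) >= min_runs:
            if abs(avg - previous_avg) / previous_avg < tolerance:
                break
        previous_avg = avg
    return avg

# Example usage: average run-time of sorted() on one fixed random input.
data = [random.random() for _ in range(50_000)]
print(f"stable average: {stable_average(sorted, data):.4f}s")
```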