ok, so i have been working on my chess program for a while and i am beginning to hit a wall. i have done all of the standard optimizations (negascout, iterative deepening, kille
An old question, but same techniques apply now as for 5 years ago. Aren't we all writing our own chess engines, I have my own called "Norwegian Gambit" that I hope will eventually compete with other Java engines on the CCRL. I as many others use Stockfish for ideas since it is so nicely written and open. Their testing framework Fishtest and it's community also gives a ton of good advice. It is worth comparing your evaluation scores with what Stockfish gets since how to evaluate is probably the biggest unknown in chess-programming still and Stockfish has gone away from many traditional evals which have become urban legends (like the double bishop bonus). The biggest difference however was after I implemented the same techniques as you mention, Negascout, TT, LMR, I started using Stockfish for comparison and I noticed how for the same depth Stockfish had much less moves searched than I got (because of the move ordering).
The one thing that is easily forgotten is good move-ordering. For the Alpha Beta cutoff to be efficient it is essential to get the best moves first. On the other hand it can also be time-consuming so it is essential to do it only as necessary.
The sorting should be done as needed, usually it is enough to sort the captures, and thereafter you could run the more expensive sorting of checks and PSQT only if needed.
Programming techniques are the same for Java as in the excellent answer by Adam Berent who used C#. Additionally to his list I would mention avoiding Object arrays, rather use many arrays of primitives, but contrary to his suggestion of using bytes I find that with 64-bit java there's little to be saved using byte and int instead of 64bit long. I have also gone down the path of rewriting to C/C++/Assembly and I am having no performance gain whatsoever. I used assembly code for bitscan instructions such as LZCNT and POPCNT, but later I found that Java 8 also uses those instead of the methods on the Long object. To my surprise Java is faster, the Java 8 virtual machine seems to do a better job optimizing than a C compiler can do.