Your problem is a very common one in NLP - don't start by reinventing the wheel - it will take you a long time and not be as good as what is out there already.
You should certainly start by seeing what the NLP libraries have to offer: http://en.wikipedia.org/wiki/Natural_language_processing and http://en.wikipedia.org/wiki/Category:Natural_language_processing_toolkits. Your problem is a common one and there are different approaches which you will need to explore for your corpus.
Your wordsplitting may be found under hyphenation routines. Two possible approaches are n-grams (where the frequency of (say) 4-character substrings are used to predict boundaries) and tries which show common starts or ends to words. Some of these may help with misspellings.
But there is no trivial answer - find what works best for you.