I'm working on an automatic image annotation problem in which I'm trying to associate tags with images. For that I'm trying to use SIFT features for learning. But the problem is a
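You can represent each SIFT descriptor as a "visual word", which is a single number (the index of its nearest cluster center), and use the histogram of visual words in an image as the SVM input. I think that is what you need. The vocabulary of visual words is usually built by k-means clustering.
This method is called "bag-of-words" and is described in this paper.
Here is a short presentation reviewing the method.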
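To make the visual-word idea concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the 8-dimensional random vectors are a toy stand-in for real 128-dimensional SIFT descriptors, the function names are made up for illustration, and the farthest-first initialization is a simplification I chose so the toy run is deterministic.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(vector, centers):
    """Index of the closest center; this index is the 'visual word'."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(vector, centers[i]))

def farthest_first_init(vectors, k):
    """Deterministic start: first vector, then repeatedly the farthest one."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centers))
        centers.append(list(farthest))
    return centers

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means; returns k cluster centers (the vocabulary)."""
    centers = farthest_first_init(vectors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest_center(v, centers)].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster went empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def bow_histogram(descriptors, centers):
    """Normalized histogram of visual words for one image."""
    counts = [0] * len(centers)
    for d in descriptors:
        counts[nearest_center(d, centers)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

# Toy stand-in for SIFT descriptors: two well-separated groups of vectors.
rng = random.Random(1)

def fake_descriptors(n, offset):
    return [[offset + rng.random() for _ in range(8)] for _ in range(n)]

image_a = fake_descriptors(30, 0.0) + fake_descriptors(10, 5.0)
image_b = fake_descriptors(5, 0.0) + fake_descriptors(35, 5.0)

vocabulary = kmeans(image_a + image_b, k=2)
hist_a = bow_histogram(image_a, vocabulary)
hist_b = bow_histogram(image_b, vocabulary)
print(hist_a)  # [0.75, 0.25]
print(hist_b)  # [0.125, 0.875]
```

Each image ends up as a fixed-length vector regardless of how many descriptors it had, which is what lets you feed it to an SVM. With real data you would use a much larger vocabulary and a library k-means rather than this hand-written one.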
You should read the original SIFT paper. It explains what SIFT is and how to use it. Read chapter 7 carefully, and the rest of the paper to understand how to use it in practice. Here is the link to the original paper.
SIFT and SURF are invariant feature extractors. Therefore matching their features helps solve lots of problems.
But there is a matching problem, since not all points will be the same in two different images (the same holds for the similarity problem). Therefore you should use only the features that are matched and discard the others.
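One common way to keep only reliable matches is the nearest-neighbor ratio test: accept a match only if the best candidate is clearly closer than the second best. Below is a minimal sketch with 2-D toy vectors instead of real descriptors. The function name is made up for illustration, and the 0.8 threshold is the value I recall from the SIFT paper, so treat it as a tunable assumption.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ratio_test_matches(query, train, ratio=0.8):
    """Return (query_index, train_index) pairs that pass the ratio test."""
    matches = []
    for qi, q in enumerate(query):
        # Rank candidate indices by distance to this query descriptor.
        ranked = sorted(range(len(train)),
                        key=lambda ti: squared_distance(q, train[ti]))
        if len(ranked) < 2:
            continue  # the test needs a second-best candidate
        best, second = ranked[0], ranked[1]
        d1 = squared_distance(q, train[best]) ** 0.5
        d2 = squared_distance(q, train[second]) ** 0.5
        # Keep the match only if the best is clearly better than the runner-up.
        if d1 < ratio * d2:
            matches.append((qi, best))
    return matches

# Toy descriptors: the third query point is equally close to two candidates,
# so it is ambiguous and gets rejected.
query = [[0.0, 0.0], [10.0, 10.0], [5.0, 5.0]]
train = [[0.1, 0.0], [10.0, 9.9], [4.0, 5.0], [6.0, 5.0]]

matches = ratio_test_matches(query, train)
print(matches)  # [(0, 0), (1, 1)]
```

This brute-force search compares every query descriptor with every train descriptor, so it illustrates the idea but does not scale to large collections.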
Another problem is that these algorithms extract lots of features, which makes direct matching impractical on large datasets.
There is a good solution to those problems, which is called "Bag of Visual Words".
In https://github.com/dermotte/LIRE a complete bag of visual words is fully implemented. Here is the LIRE demo site.
The code is very simple; if you know the bag of visual words approach, you can also modify it.
After getting the visual words you should use the information retrieval approaches used in search engines. By the way, LIRE also includes an information retrieval library called Lucene. You should follow the LIRE way until you get the complete idea and can implement your own.
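As an illustration of a search-engine style approach on visual words, here is a minimal sketch of tf-idf weighting with cosine-similarity ranking in plain Python. It is only one common technique and not necessarily what LIRE or Lucene do internally; the function names and the toy word counts are made up for the example.

```python
import math

def tf_idf_vectors(word_counts):
    """word_counts: one list of visual-word counts per image."""
    n_images = len(word_counts)
    n_words = len(word_counts[0])
    # Document frequency: in how many images each visual word occurs.
    df = [sum(1 for counts in word_counts if counts[w] > 0)
          for w in range(n_words)]
    # Words that occur in every image get weight 0 (they carry no information).
    idf = [math.log(n_images / d) if d else 0.0 for d in df]
    vectors = []
    for counts in word_counts:
        total = sum(counts) or 1
        vectors.append([(c / total) * idf[w] for w, c in enumerate(counts)])
    return vectors

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank(query_index, vectors):
    """Indices of the other images, most similar to the query first."""
    scores = [(cosine(vectors[query_index], v), i)
              for i, v in enumerate(vectors) if i != query_index]
    return [i for _, i in sorted(scores, reverse=True)]

# Toy counts over a 4-word vocabulary; the last word occurs in every image.
word_counts = [
    [5, 0, 1, 2],  # image 0
    [4, 0, 0, 2],  # image 1, similar to image 0
    [0, 6, 1, 2],  # image 2
    [0, 5, 0, 2],  # image 3, similar to image 2
]

vectors = tf_idf_vectors(word_counts)
ranking = rank(0, vectors)
print(ranking)  # [1, 2, 3]
```

A real search engine would store these vectors in an inverted index so that only images sharing at least one visual word with the query are scored.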
You can use the Bag of Words approach, which you can read about in the following post:
http://gilscvblog.wordpress.com/2013/08/23/bag-of-words-models-for-visual-categorization/