I am trying to build a ranker for a demonstration. I did the "automatic training" and I got OK results (could be better). I am trying to go into manual training but I am confus
The training data is meant to train a learning-to-rank (L2R) algorithm. The L2R approach is to first take a list of candidate answers (e.g., documents in a search result page) that were generated in response to a query (aka question) and represent each query-answer pair as a set of features. Each feature hopefully captures some representation of how well that particular candidate answer matches the query. Each line in the training data represents the feature values belonging to one of these query-answer pairs.
Because the training data contains feature vectors from lots of different queries (and corresponding search results), the first column uses a query id to tie together different candidate answers that were generated in response to a single query.
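To make the layout concrete, here is a minimal sketch in Python of what such a file could look like and how the query id groups the rows. The column names (query_id, feature_a, feature_b, relevance), the header row, and all of the values are made up for illustration; your actual feature columns will depend on how your features are generated.

```python
import csv
import io
from collections import defaultdict

# Hypothetical layout: query id first, feature values in the middle,
# human relevance label last. Names and numbers are invented for illustration.
HEADER = ["query_id", "feature_a", "feature_b", "relevance"]
ROWS = [
    ["q1", 0.82, 0.10, 1],
    ["q1", 0.35, 0.40, 0],
    ["q2", 0.91, 0.75, 1],
    ["q2", 0.12, 0.05, 0],
    ["q2", 0.50, 0.33, 0],
]


def write_training_csv(header, rows):
    """Serialize the rows to CSV text, one query-answer pair per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def group_by_query(csv_text):
    """Group (feature vector, label) pairs by the query id in the first column."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the (assumed) header row
    groups = defaultdict(list)
    for row in reader:
        query_id, *features, label = row
        groups[query_id].append(([float(f) for f in features], int(label)))
    return dict(groups)


csv_text = write_training_csv(HEADER, ROWS)
groups = group_by_query(csv_text)
for query_id, candidates in groups.items():
    print(query_id, len(candidates), "candidates")
```

The ranker only ever compares candidates that share a query id, which is why the rows for one query need to be tied together this way.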
As you said, the last column simply captures whether a human annotator believed that the answer was actually relevant to the question or not. The 0-4 scale is not mandatory. 0 always represents irrelevant, but beyond that you can use whatever scale makes sense for your use case (people often just use a binary 0-1 scale when there is limited data, since this reduces complexity).
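If you already have graded judgments and want to try the binary scale, one way to collapse them is sketched below. The to_binary helper and its threshold parameter are hypothetical names for this example, not part of any ranking service; the only rule taken from above is that 0 stays irrelevant.

```python
def to_binary(label, threshold=1):
    """Map a graded relevance label to 0 or 1.

    Labels below `threshold` become 0 (irrelevant); the rest become 1.
    `threshold` must be at least 1 so that 0 always stays irrelevant.
    """
    if label < 0:
        raise ValueError("relevance labels must be non-negative")
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    return 1 if label >= threshold else 0


graded = [0, 1, 2, 3, 4]
binary = [to_binary(g) for g in graded]
print(binary)
```

Raising the threshold (for example to_binary(g, threshold=3)) treats only the strongest judgments as relevant, which may suit cases where weak matches should not be rewarded.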
The Python script available on the documentation page that you referenced goes through the process of generating candidate answers and their corresponding feature vectors, given a file containing different queries. You may wish to step through the code in that script to get a better idea of how you might create your training data.