Question
I want to create a document-term matrix. In my case it is not documents x words but sentences x words, so each sentence acts as a document. I apply 'l2' normalization after building the document-term matrix.
The term counts matter to me because I use SVD for summarization in later steps.
My question is which axis is appropriate for the 'l2' normalization. From my research I understood:
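- axis=1: gives the importance of a word within a sentence (row-wise normalization, each sentence vector is scaled to unit length)
- axis=0: gives the importance of a word across the whole document (column-wise normalization, each word column is scaled to unit length)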
Even knowing the theory, I cannot decide which alternative to choose, and the choice will greatly affect my summarization results. Please suggest a solution along with the reasoning behind it.
Answer 1:
By L2 normalization, do you mean division by the total count?
If you normalize along axis=0, then the value x_{i,j} is the probability of word j in sentence i taken over all sentences (the count divided by the global count of that word). This value depends on the length of the sentence: a longer sentence can repeat a word over and over, so it contributes a lot to the global word count and gets a much higher probability for that word.
If you normalize along axis=1, then you're asking whether sentences have the same composition of words, because you normalize over the length of each sentence.
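To make the two options concrete, here is a minimal NumPy sketch. The count matrix is made up for illustration (in practice it would come from CountVectorizer), and l2_normalize is a hypothetical helper, not a library function. It follows the same axis convention as sklearn.preprocessing.normalize, where axis=1 scales each row (sentence) and axis=0 scales each column (word). The last lines show division by the total count, which is L1 rather than L2 normalization.

```python
import numpy as np

# Hypothetical sentence-term count matrix: rows are sentences, columns are words.
counts = np.array([
    [2, 1, 0, 0],   # sentence 0
    [1, 0, 3, 0],   # sentence 1
    [0, 1, 1, 4],   # sentence 2
], dtype=float)

def l2_normalize(matrix, axis):
    """Divide by the Euclidean norm taken along `axis`; all-zero vectors stay zero."""
    norms = np.linalg.norm(matrix, axis=axis, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero
    return matrix / norms

# axis=1: every sentence (row) is scaled to unit length,
# so sentence length no longer affects the values.
per_sentence = l2_normalize(counts, axis=1)

# axis=0: every word (column) is scaled to unit length,
# so values compare one word's usage across sentences.
per_word = l2_normalize(counts, axis=0)

# For comparison: dividing by the total count per sentence is L1 normalization.
per_sentence_l1 = counts / counts.sum(axis=1, keepdims=True)

print(np.linalg.norm(per_sentence, axis=1))  # row norms
print(np.linalg.norm(per_word, axis=0))      # column norms
```

With axis=1 the rows have unit norm, which matches the "composition of words per sentence" reading above; with axis=0 the columns have unit norm, so a long sentence that repeats a word still dominates that word's column.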
Source: https://stackoverflow.com/questions/60793533/in-count-vectorizer-which-axis-to-use