What does “document” mean in an NLP context?

僤鯓⒐⒋嵵緔 submitted on 2019-12-13 14:23:03

Question


While reading about tf–idf on Wikipedia, I was confused about what the word "document" means. Does it mean a paragraph?

"The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient."

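The quoted definition can be written compactly as a formula. The symbols below ($t$ for a term, $D$ for the collection of documents, $N = |D|$ for the number of documents) are notation assumed here for illustration and are not part of the quote, which also does not fix the base of the logarithm:

```latex
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
```
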

Answer 1:


In the tf-idf context, a document can typically be thought of as a bag of words. In a vector space model, each word is a dimension in a very high-dimensional space, and a document's component along a word's dimension is the number of occurrences of that word (term) in the document. A document-term matrix is a matrix whose rows represent documents and whose columns represent terms, with each cell holding the number of occurrences of the term in the document. Hope that's clear.
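
As a rough illustration of the document-term matrix described above, here is a minimal sketch in plain Python. The function name `document_term_matrix`, the two sample sentences, and the lowercase whitespace tokenization are all assumptions made for this example; real pipelines usually tokenize more carefully.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Assumption: a "term" is a lowercase, whitespace-separated token.
    tokenized = [doc.lower().split() for doc in documents]
    # One column per distinct term, in sorted order so columns are stable.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # bag of words: order is discarded
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical two-document collection.
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab, dtm = document_term_matrix(docs)

print(vocab)
for row in dtm:
    print(row)
```

Each row is one document's vector, and each column is one term's dimension, so word order inside a document plays no role.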




Answer 2:


A "document" is a distinct text. This generally means that each article, book, or so on is its own document.

If you wanted, you could treat an individual paragraph or even sentence as a "document". It's all a matter of perspective.
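
To see why the choice of "document" matters, the sketch below computes the inverse document frequency of one term twice over the same text: once with paragraphs as documents and once with sentences as documents. The sample text, the helper name `inverse_document_frequency`, the regex tokenizer, and the naive sentence splitter are assumptions for illustration only; the natural logarithm is used, since the definition does not fix a base.

```python
import math
import re


def inverse_document_frequency(term, documents):
    """log(N / n_t): N documents in total, n_t of them containing the term."""
    # Assumption: tokens are runs of ASCII letters, compared in lowercase.
    containing = sum(
        1 for doc in documents if term in re.findall(r"[a-z]+", doc.lower())
    )
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(documents) / containing)


# Hypothetical text: two paragraphs, four sentences.
text = "The cat sat. The cat slept.\n\nThe dog barked. A bird sang."

paragraphs = text.split("\n\n")              # each paragraph is a document
sentences = re.split(r"(?<=\.)\s+", text)    # each sentence is a document

idf_by_paragraph = inverse_document_frequency("the", paragraphs)
idf_by_sentence = inverse_document_frequency("the", sentences)

print(idf_by_paragraph)  # "the" occurs in 2 of 2 paragraphs
print(idf_by_sentence)   # "the" occurs in 3 of 4 sentences
```

With paragraphs as documents, "the" appears in every document and its idf is zero; with sentences as documents, it is missing from one of four and its idf becomes positive. The same text yields different weights depending on where the document boundaries are drawn.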



Source: https://stackoverflow.com/questions/41749471/what-does-document-mean-in-a-nlp-context
