Is there an open source Java library/algorithm for finding if a particular piece of text is a question or not?
I am working on a question answering system that needs t
Many quasi-questions/requests-for-info are posed in the grammatical form of a statement; e.g. "I would like to know who stole my bike".
I would abandon all hope of determining from its structure whether the user's input is a question, and simply assume it is a question unless it is unambiguously not a question. You could take an iterative, interactive approach so the system could refine its "understanding" of the user's input:
User: I would like to know how many angels fit on the head of a pin.
System: Do you have a question about angels?
User: Yes.
System: Do you want to know if angels are fit?
User: No.
System: Do you want to know if angels have heads?
User: Possibly.
System: Do you want to know if angels have pins?
User: No.
System: Do you want to know if angels are numerous?
User: No.
System: Do you want to know the dimensions of an angel?
User: Yes.
System: Do you mean angels in paintings?
User: No.
System: Do you mean angels in myth and religious writing?
User: Yes.
System: Angels are metaphysical beings.
User: I hear that Pennsylvania was founded by William Penn. Is that true?
System: Do you have a question about hearing?
User: No.
System: Do you have a question about Pennsylvania?
User: Yes.
System: Pennsylvania was founded by William Penn.
User: When?
System: 1682.
User: What does the name mean?
System: What name?
User: Pennsylvania!
System: Do you want to know the meaning of Pennsylvania?
User: Yes.
System: Pennsylvania means Penn's Woods.
Take a look at Link Grammar Parser. It is a multi-language parser based on the concept of links between pairs of related words in a sentence. It is written in C, but has a Java JNI interface as well.
In a syntactic parse of a question, the correct structure will be in the form of:
    (SBARQ (WH+ (W+) ...)
           (SQ ...*
               (V+) ...*)
           (?))
So, using any one of the available syntactic parsers, a tree with an SBARQ node (optionally with an embedded SQ) indicates that the input is a question. The WH+ node (WHNP/WHADVP/WHADJP) contains the question stem (who/what/when/where/why/how) and the SQ holds the inverted phrase.
For example:
    (SBARQ
      (WHNP
        (WP What))
      (SQ
        (VBZ is)
        (NP
          (DT the)
          (NN question)))
      (. ?))
Of course, a lot of preceding clauses will cause errors in the parse (which can be worked around), as will really poorly written questions. For example, the title of this post, "How to find out if a sentence is a question?", will have an SBARQ but not an SQ.
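To make the SBARQ/SQ check concrete, here is a minimal sketch in plain Java. It assumes you already have the parser's output as a Penn Treebank style bracketed string (most constituency parsers can print one) and simply scans it for node labels. The class name QuestionTreeCheck and its methods are hypothetical helpers written for this illustration, not part of any parser's API, and the statement tree in main is a made-up example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: scans a bracketed parse string for node labels. */
final class QuestionTreeCheck {

    /** Collects every node label, i.e. the token directly after each "(". */
    static List<String> labels(String parse) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < parse.length()) {
            if (parse.charAt(i) == '(') {
                int start = i + 1;
                int end = start;
                // A label runs until whitespace or the next bracket.
                while (end < parse.length()
                        && !Character.isWhitespace(parse.charAt(end))
                        && parse.charAt(end) != '('
                        && parse.charAt(end) != ')') {
                    end++;
                }
                if (end > start) {
                    found.add(parse.substring(start, end));
                }
                i = end;
            } else {
                i++;
            }
        }
        return found;
    }

    /** True if some node in the tree carries exactly this label. */
    static boolean hasLabel(String parse, String label) {
        return labels(parse).contains(label);
    }

    /** Treats any tree containing an SBARQ node as a question; SQ is optional. */
    static boolean looksLikeQuestion(String parse) {
        return hasLabel(parse, "SBARQ");
    }

    public static void main(String[] args) {
        // The example tree from above, written on one line.
        String question =
            "(SBARQ (WHNP (WP What)) (SQ (VBZ is) (NP (DT the) (NN question))) (. ?))";
        // Made-up declarative tree, for contrast only.
        String statement =
            "(S (NP (PRP I)) (VP (VBP like) (NP (NNS bikes))) (. .))";

        System.out.println(looksLikeQuestion(question));
        System.out.println(hasLabel(question, "SQ"));
        System.out.println(looksLikeQuestion(statement));
    }
}
```

Because only tokens that follow an opening bracket count as labels, a word in the sentence that happens to be "SQ" or "SBARQ" is not mistaken for a node. This flat scan does not verify that the SQ is actually nested inside the SBARQ; if that matters, you would need to walk the tree structure instead.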