Question
I want to parse Malayalam (an Indian language) text corpora to develop a question answering system. Are there any NLP tools for semantic parsing of languages other than English?
Answer 1:
This might sound big and scary.
As far as I know, there is no free software question/answering system you can study, even if it's documented.
There are two parts to question answering:
- understanding the question
- looking up the response in some preprocessed dataset (say wikidata.org)
Both steps require similar algorithms.
Vertical question-answering pipeline
To implement a vertical question-answering system you will need to be able to parse Malayalam and other Indian languages at a high level, which means at least the following:
- Split text into paragraphs, then into sentences, and then into words. You must be able to tell where a sentence ends. For instance, depending on the language, sentences might not end with the same character. Abbreviations like "i.e." are not the end of a sentence, and "I.B.M." is not three sentences. You must also know how sentences start: in English a sentence starts with an upper-case letter, but not every upper-case letter starts a sentence, as with the proper noun in "Is Chomsky alive?".
- Part-of-speech tagging: tell common nouns from proper nouns, from verbs, etc.
- Create a named entity recognizer: identify names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.
- Build semantic dependency trees: e.g. which named entity does "she" or "he" refer to? Who is the subject of the sentence, what is the complement, etc.
- Create a text generation tool. Once your program has understood the question and found a possible answer, it must format that answer in natural language.
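As a rough illustration of the first step, here is a minimal sentence-splitter sketch. The abbreviation list and the rule for dotted initials are my own assumptions for the example, not a tested solution for Malayalam; a real system would need language-specific rules or a trained model.

```python
import re

# Hypothetical abbreviation list; a real one must be built per language.
ABBREVIATIONS = {"i.e.", "e.g.", "etc.", "dr.", "mr."}

def split_sentences(text):
    """Split on ., ? or ! followed by whitespace, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.?!]+\s+", text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1].lower()
        # A known abbreviation, or dotted initials such as "I.B.M.",
        # is not treated as a sentence boundary.
        if last_token in ABBREVIATIONS or re.fullmatch(r"(?:\w\.){2,}", last_token):
            continue
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("I.B.M. is a company, i.e. a business. Is Chomsky alive? Yes."))
```

Note the limitation: an abbreviation that really does end a sentence is wrongly merged with the next one, which is exactly the kind of ambiguity that makes this step harder than it looks.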
Luckily there is no shortage of examples of doing this for English, which you can take inspiration from. If you want to work in Python you will want to study spaCy (which aims to be a fast and current NLP library) and NLTK, which comes with a book.
Algorithms can be shared between languages.
A more narrow approach
If you don't want to do all these steps and only want to solve the sub-problem of answering questions, you need to simplify the problem and eliminate variables and unknowns:
You must build a database of facts that are already split and tagged, so that you can answer questions with a simple SQL query. For instance, given the following fact tuple:
WHO:India WHAT:win WHAT:Cricket Championship WHEN:2015
Here, I simplify the tagging to WHO, WHEN and WHAT.
It's easy to answer the question:
WHO:? WHAT:win WHAT:Cricket Championship WHEN:2015
i.e. "Who won the Cricket Championship in 2015?"
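To make the lookup concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and column names (facts, who, action, what, year) are illustrative assumptions; the single row is the example fact from above.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (who TEXT, action TEXT, what TEXT, year INTEGER)")
conn.execute(
    "INSERT INTO facts VALUES (?, ?, ?, ?)",
    ("India", "win", "Cricket Championship", 2015),
)

def answer_who(action, what, year):
    """Return every WHO whose fact matches the known slots of the question."""
    rows = conn.execute(
        "SELECT who FROM facts WHERE action = ? AND what = ? AND year = ?",
        (action, what, year),
    ).fetchall()
    return [row[0] for row in rows]

print(answer_who("win", "Cricket Championship", 2015))
```

The unknown slot (WHO:?) becomes the selected column, and every known slot becomes a WHERE condition.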
Again, questions must be "predictable" and easy to parse. Other examples:
WHO did WHAT WHEN
WHAT is WHAT
WHEN WHAT WHO
This can work if you can recognize and parse a WHO and a WHEN, and guess what is a WHAT, in a sentence provided by the user. You can simplify further and say that WHENs can only be four digits, i.e. years, and further constrain the kinds of question to simplify the parsing part.
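A constrained question parser along these lines could be sketched with a single regular expression. The template below only handles English questions shaped like "Who <verb> [the] <something> in <year>?"; the pattern and the slot names are my own assumptions for illustration.

```python
import re

# One hypothetical template for questions shaped like "WHO did WHAT WHEN".
WHO_QUESTION = re.compile(
    r"^who\s+(?P<action>\w+)\s+(?:the\s+)?(?P<what>.+?)\s+in\s+(?P<when>\d{4})\s*\?*$",
    re.IGNORECASE,
)

def parse_question(question):
    """Return the slots of a supported question, or None if it does not fit."""
    match = WHO_QUESTION.match(question.strip())
    if match is None:
        return None
    return {
        "who": None,  # the unknown slot the system has to fill in
        "action": match.group("action").lower(),
        "what": match.group("what"),
        "when": int(match.group("when")),  # WHEN restricted to a 4-digit year
    }

print(parse_question("Who won the Cricket Championship in 2015?"))
```

The parsed action is still the inflected form ("won"), so matching it against a stored fact tagged "win" would additionally need a lemmatizer or a small lookup table of verb forms.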
This will lead you to a program that can actually answer questions in a way that is more natural and more correct than an Information Retrieval (IR) system like raw Elasticsearch or PostgreSQL would.
Facts databases
You will probably need to work with a semantic network. Look at the free ConceptNet (and send a message to the mailing list if you need help or want to contribute an Indian Wiktionary) or BabelNet. There is also WordNet.
Course
I liked Jurafsky's course a lot; it has a specific chapter about QA. Jurafsky also wrote a full book introducing NLP.
Search Tips
Search the World Wide Web for information about NLP algorithms in the language you are interested in. For a French lemmatizer, say, I search on a French research portal or through a search engine using native (i.e. French) wording. American search engines are not really as good in languages other than English, so be prepared to page through results (and use the search tools).
Culture
Build a good culture in natural language processing and artificial intelligence. Look at summarization or information retrieval (it's easy) and you will learn methods that can be reused on other problems. For instance, if you look at rule-based machine translation, you learn that the industry used simplified, unambiguous natural language grammars to translate documentation accurately. That documentation is written in a simple English (e.g. SUBJECT VERB NOUN) for which a computer grammar can easily be created (like programming language grammars) and which can be translated mostly word for word. This is an instance of solving a sub-problem to achieve higher quality. This is how I came up with the narrow approach above.
Algorithms
Last but not least, most sub-problem solutions fall into one of the three following algorithm categories:
- Algebraic and graph theory: these try to make sense of the data and can explain their results, e.g. PageRank, SimRank, CoSimRank, logic programming.
- Statistical: I compare this to thermodynamics, where basically "you get the problem solved but don't know why". This is what is called "machine learning" and is mainly used in the industry to solve problems that are actually narrow compared to NLP. Still, machine learning algorithms exist to solve natural language problems, e.g. topic modeling, but it's not the only example. Statistical programming is popular.
- Hybrid: a mix of both methods.
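To give a feel for the graph-based family, here is a minimal PageRank sketch using plain power iteration. The damping factor of 0.85, the fixed iteration count, and the toy graph are assumptions for illustration, not tuned values.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict mapping node -> list of outgoing links.

    Every link target must also appear as a key of the dict.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank

toy_graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(toy_graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The result is explainable in the sense described above: a node ranks high because other well-ranked nodes link to it, and you can trace exactly which links contributed.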
Read "On Chomsky and the Two Cultures of Statistical Learning" for more insight about the dichotomy and the research/engineering background.
General Tips
You don't need to know and understand every algorithm and its scientific grounds, as long as you understand their limitations and how to use them.
Something I've figured out: even though I mostly read English, reading in my "native" language, French, widens my understanding.
Save the papers and resources you find; things come and go.
Answer 2:
When I was trying to find a semantic parser for a language other than English, I faced two major issues: finding semantically annotated training data and finding a FrameNet-like semantic database.
Since it is virtually impossible to find a good substitute for FrameNet even for popular languages, I suspect that this is off the table in your case.
One possibility would be to use semantic role labeling (SRL). This shallow semantic parsing is not as detailed as a real semantic parser, but maybe it is a step in the right direction. The SRL parsers I know are not for English per se; rather, they come with an already trained language model for English. So the biggest challenge is to train such a model for your language, and that requires a large corpus with semantic annotation. Most parsers need this data in the CoNLL format, because corpora in that format are available for multiple languages.
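For orientation, CoNLL-style files put one token per line with tab-separated columns and separate sentences with a blank line. The reader below is a minimal sketch of that general layout; the five columns in the sample (ID, FORM, POS, HEAD, DEPREL) are a simplified assumption, and the real column set differs between CoNLL shared tasks, so check the specification of the parser you train.

```python
def read_conll(lines):
    """Group tab-separated token rows into sentences (blank line = boundary)."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # comment lines, used by some variants of the format
        current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences

# Simplified, made-up sample: ID, FORM, POS, HEAD, DEPREL
sample = "1\tIndia\tNNP\t2\tSBJ\n2\twon\tVBD\t0\tROOT\n\n1\tYes\tUH\t0\tROOT\n"
for sentence in read_conll(sample.splitlines()):
    print([token[1] for token in sentence])
```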
mate-tools
They focus mostly on dependency parsing, but also contain a semantic role labeller that can be trained. They are stand-alone tools written in Java, so if you want to integrate them into your program you may have to build a toolchain out of shell scripts.
ClearNLP
Unlike mate-tools, ClearNLP can be used as a Java library, and the project is also actively developed. It also supports training a new SRL model, a feature that is relatively new.
Since I doubt that you have the time and resources to create a semantically annotated corpus for your language, a possible workaround would be to use a very good machine translation system to translate your corpus to English and then apply semantic parsing. This approach is very error-prone, so I strongly suggest correcting the translated output, even if this is a time-consuming task.
I am currently working with the SEMAFOR parser (GitHub). It is written in Java, comes with pre-trained models for English and creates FrameNet annotations in an easy-to-read XML file. It is far from perfect (like all semantic parsers currently), but it creates output you can work with. If you choose to work with it, it is best to ignore the precompiled jars and download the newest code from their GitHub page.
I know that this is not the perfect solution, but it is the fastest and most reliable approach, and the one I would choose.
Source: https://stackoverflow.com/questions/32432719/is-there-any-nlp-tools-for-semantic-parsing-for-languages-other-than-english