Question
I am looking for a solution to extract the list of concepts that a text (or HTML) document is about. I'd like the concepts to be Wikidata topics (or Freebase or DBpedia).
For example "Bad is a song by Mikael Jackson" should return Michael Jackson (the artist, Wikidata Q2831) and Bad (the song, Wikidata Q275422). As this example shows, the system should be robust to spelling mistakes (Mikael) and ambiguity (Bad).
Ideally the system should work across multiple languages, it should work on both short and long texts, and when it is unsure it should return multiple topics (e.g. Bad the song + Bad the album). Also, it should ideally be open source and have a Python API.
Yes, that sounds like a list for Santa Claus. Any ideas?
Edit
I checked out a few solutions, but no silver bullet so far.
- NLTK parses text and extracts "named entities" (AFAIU, parts of a sentence that refer to a name), but it does not return Wikidata topics, just plain text (see the sketch after this list). This means it will likely not understand that "I shot the sheriff" is the title of a song by Bob Marley; it will instead treat it as an ordinary sentence.
- OpenNLP does roughly the same.
- Wikidata has a search API, but it's just one term at a time, and it does not handle disambiguation.
- There are a few commercial services (OpenCalais, AlchemyAPI, CogitoAPI...) but none really shines, IMHO.
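To illustrate the NLTK point above, here is a minimal sketch of the pipeline I tried (assuming the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words data packages have been fetched with nltk.download beforehand):

import nltk

sentence = "Bad is a song by Mikael Jackson"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

# ne_chunk returns a Tree whose named-entity subtrees are just labeled
# text spans, e.g. (PERSON Mikael/NNP Jackson/NNP) -- there is no link
# to Wikidata or any other knowledge base.
for subtree in tree.subtrees():
    if subtree.label() != "S":
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))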
Answer 1:
You can use spaCy to retrieve named entities, then link them to Wikidata using the search API.
For what remains of the sentence, i.e. the parts that spaCy does not match as named entities, you can create a list of n-grams from the sentence and, starting with the biggest n-gram, use the Wikidata search API to look up Wikidata topics.
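Here is a minimal sketch of that pipeline (assuming spaCy with the en_core_web_sm model and the requests package; the endpoint is Wikidata's public wbsearchentities API, and the n-gram fallback is a simplification that ignores gaps between leftover tokens):

import requests
import spacy

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def wikidata_candidates(term, language="en", limit=5):
    """Return candidate Wikidata topics for a term via wbsearchentities."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "format": "json",
        "limit": limit,
    }
    response = requests.get(WIKIDATA_API, params=params)
    response.raise_for_status()
    return response.json().get("search", [])

nlp = spacy.load("en_core_web_sm")
doc = nlp("Bad is a song by Michael Jackson")

# Step 1: link spaCy's named entities to Wikidata.
covered = set()
for ent in doc.ents:
    covered.update(range(ent.start, ent.end))
    for c in wikidata_candidates(ent.text):
        print(ent.text, "->", c["id"], c.get("description", ""))

# Step 2: n-grams over the tokens spaCy did not cover, biggest first.
leftover = [t for t in doc if t.i not in covered
            and not t.is_punct and not t.is_stop]
for n in range(len(leftover), 0, -1):
    for i in range(len(leftover) - n + 1):
        ngram = " ".join(t.text for t in leftover[i : i + n])
        for c in wikidata_candidates(ngram):
            print(ngram, "->", c["id"], c.get("description", ""))

Note that when a term is ambiguous (e.g. "Bad"), wbsearchentities returns several candidates up to the given limit, which fits the requirement of returning multiple topics when unsure.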
POS tagging can be put to good use; that said, syntactic parse information is more powerful, since it tells you the relations between the words. For instance, given the following output from link-grammar:
Found 8 linkages (8 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS= 0.15 LEN=9)
+-------------------------Xp-------------------------+
+----------->WV---------->+ |
+-------Wd------+ +---------Osn--------+ |
| +---G---+----Ss---+----Os----+ | |
| | | | | | |
LEFT-WALL Bob.m Marley[!] wrote.v-d Natural[!] Mystic[!] .
You can tell that the subject is “Bob Marley” because
- “wrote” is connected to “Marley” with an S link, which connects subject nouns to finite verbs.
- “Marley” is connected to “Bob” with a G link, which connects proper nouns together.
So “Bob Marley” is a good candidate for an entity (also, both words are capitalized).
Given the above parse “tree”, it is difficult to tell whether “Natural” and “Mystic” are related, even though they are on the same side of the sentence.
The second parse provided by link-grammar has the same cost vector and links “Natural Mystic” together, again with a G link.
Here it is:
Linkage 2, cost vector = (UNUSED=0 DIS= 0.15 LEN=9)
+-------------------------Xp-------------------------+
+----------->WV---------->+ |
+-------Wd------+ +---------Os---------+ |
| +---G---+----Ss---+ +----G----+ |
| | | | | | |
LEFT-WALL Bob.m Marley[!] wrote.v-d Natural[!] Mystic[!] .
So in my opinion “Bob Marley” and “Natural Mystic” are good candidates for a Wikidata search.
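If you want to experiment with these parses from Python, link-grammar ships its own bindings; a minimal sketch, assuming a link-grammar installation that provides the linkgrammar Python module:

from linkgrammar import Dictionary, ParseOptions, Sentence

# Parse the example sentence with the default English dictionary.
po = ParseOptions()
sentence = Sentence("Bob Marley wrote Natural Mystic.", Dictionary(), po)
linkages = sentence.parse()

# Each linkage carries the diagram shown above, including the G, Ss
# and Os links used to spot entity candidates.
for linkage in linkages:
    print(linkage.diagram())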
That was the easy case, where grammar and spelling are correct.
Here is one parse out of 11 for the same sentence in lowercase:
Linkage 1, cost vector = (UNUSED=1 DIS= 0.15 LEN=14)
+------------------------Xp------------------------+
+----------------------Wa---------------------+ |
| +------------------AN-----------------+ |
| | +-------------AN-------------+ |
| | | +----AN---+ |
| | | | | |
LEFT-WALL Bob.m marley[?].n [wrote] natural.n mystic.n .
Link-grammar doesn't even recognize the verb.
Source: https://stackoverflow.com/questions/40492475/how-do-you-find-the-list-of-wikidata-or-freebase-or-dbpedia-topics-that-a-text