For each concept in my dataset, I have stored the corresponding Wikipedia categories. For example, consider the following 5 concepts and their corresponding Wikipedia categories.
You could try to classify the Wikipedia categories using the MediaWiki links and backlinks returned for each category:
import re
from mediawiki import MediaWiki

# TermFind reports whether a given term occurs in any string of a list
def TermFind(term, termList):
    for val in termList:
        # re.escape makes the term match literally, wherever it occurs in the string
        if re.search(re.escape(term), val):
            return True
    return False

# BoundedTerm reports whether both the links and the backlinks of a page contain the term
def BoundedTerm(wikiPage, term):
    return TermFind(term, wikiPage.links) and TermFind(term, wikiPage.backlinks)

termlist = []  # fill in with the Wikipedia category titles to classify
term = 'medicine'  # the shared term to look for
container = []
wikipedia = MediaWiki()
for val in termlist:
    cpage = wikipedia.page(val)
    if BoundedTerm(cpage, term):
        container.append('medical')
    else:
        container.append('nonmedical')
The idea is to guess a term that is shared by most of the categories. I tried biology, medicine and disease, with good results. You could also make multiple calls to BoundedTerm, or a single call that checks several terms, and combine the results for the classification. Hope it helps.
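One way the "single call for multiple terms" variant might look is sketched below. This is only an illustration, not a tested recipe: the helper names (contains_term, matched_terms, classify), the min_hits threshold and the sample titles are all made up for the example, and unlike the code above the matching here ignores case. The functions take plain lists of titles, so in practice you would pass cpage.links and cpage.backlinks.

```python
def contains_term(term, titles):
    """Return True if `term` occurs, ignoring case, in any of `titles`."""
    needle = term.lower()
    return any(needle in title.lower() for title in titles)


def matched_terms(links, backlinks, terms):
    """Return the terms found in both the links and the backlinks."""
    return [
        term for term in terms
        if contains_term(term, links) and contains_term(term, backlinks)
    ]


def classify(links, backlinks, terms, min_hits=1):
    """Label a page 'medical' when at least `min_hits` terms are matched."""
    hits = matched_terms(links, backlinks, terms)
    return 'medical' if len(hits) >= min_hits else 'nonmedical'


# Made-up titles for illustration only; in practice pass
# cpage.links and cpage.backlinks for each category page.
TERMS = ['biology', 'medicine', 'disease']
sample_links = ['Infectious disease', 'Internal medicine', 'Hospital']
sample_backlinks = ['Outline of medicine', 'List of diseases']

label = classify(sample_links, sample_backlinks, TERMS)
strict_label = classify(sample_links, sample_backlinks, TERMS, min_hits=3)
```

Raising min_hits makes the classifier stricter, which may help if a single shared term produces too many false positives. A sensible threshold is something you would have to find by checking against categories you have labelled by hand.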