Wikipedia Category Hierarchy from dumps

Anonymous (unverified), submitted 2019-12-03 02:51:02

Question:

Using Wikipedia's dumps I want to build a hierarchy for its categories. I have downloaded the main dump (enwiki-latest-pages-articles) and the category SQL dump (enwiki-latest-category). But I can't find the hierarchy information.

For example, the category SQL dump has an entry for each category, but I can't find anything about how the categories relate to each other.

The other dump (enwiki-latest-pages-articles) gives the parent categories of each page, but in an unordered way: it just lists all the parents.

I have seen wikiprep's category hierarchy (http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/)... How is that one constructed? Wikiprep lists the category ID, not its name. Is there a way to get the name for each ID?

Answer 1:

The category hierarchy information in MediaWiki is stored in the categorylinks table, so you're going to need the categorylinks dump.
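As a rough illustration of what that table gives you, here is a minimal sketch. It assumes the categorylinks rows have already been parsed out of the SQL dump into (cl_from, cl_to, cl_type) tuples, following the classic MediaWiki schema in which cl_from is the member page's ID, cl_to is the parent category's title, and cl_type is 'page', 'subcat' or 'file'. Newer schema versions may lay these columns out differently, and the function name and sample rows are hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumed row shape: (cl_from, cl_to, cl_type), already parsed from the dump.
Row = Tuple[int, str, str]


def subcategory_edges(rows: Iterable[Row]) -> List[Tuple[int, str]]:
    """Keep only the rows whose member page is itself a category.

    Each result is (child_page_id, parent_category_title).
    """
    return [
        (cl_from, cl_to)
        for cl_from, cl_to, cl_type in rows
        if cl_type == "subcat"
    ]


# Hypothetical sample rows, not taken from a real dump.
sample_rows = [
    (100, "Science", "subcat"),  # category page 100 sits under "Science"
    (200, "Physics", "page"),    # an article in "Physics": not a hierarchy edge
    (101, "Physics", "subcat"),  # category page 101 sits under "Physics"
]

edges = subcategory_edges(sample_rows)
```

Note that the child side of each edge is a page ID while the parent side is a title, which is why the IDs still have to be resolved to names.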

You're also going to need the page dump (enwiki-latest-page, not pages-articles) to map page IDs to titles.
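Putting the two dumps together might look like the sketch below. It assumes the page rows are available as (page_id, page_namespace, page_title) tuples and the subcategory links as (child_page_id, parent_category_title) pairs, both already parsed from the SQL dumps. Category pages live in namespace 14. The function name and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CATEGORY_NAMESPACE = 14  # MediaWiki namespace number for Category pages


def build_hierarchy(
    page_rows: Iterable[Tuple[int, int, str]],
    subcat_edges: Iterable[Tuple[int, str]],
) -> Dict[str, List[str]]:
    """Map each parent category title to the titles of its subcategories."""
    # Page ID -> title, restricted to category pages.
    id_to_title = {
        page_id: title
        for page_id, namespace, title in page_rows
        if namespace == CATEGORY_NAMESPACE
    }

    children: Dict[str, List[str]] = defaultdict(list)
    for child_id, parent_title in subcat_edges:
        child_title = id_to_title.get(child_id)
        if child_title is None:
            continue  # ID missing from the page rows; skip the edge
        children[parent_title].append(child_title)
    return dict(children)


# Hypothetical sample data, not taken from a real dump.
page_rows = [
    (100, 14, "Physics"),
    (101, 14, "Quantum_mechanics"),
    (200, 0, "Albert_Einstein"),  # an article, ignored by the ID lookup
]
subcat_edges = [(100, "Science"), (101, "Physics"), (999, "Physics")]

hierarchy = build_hierarchy(page_rows, subcat_edges)
```

The same ID-to-title dictionary also answers the question about Wikiprep's output, provided its category IDs are MediaWiki page IDs, which is an assumption here.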


