Extract Wikipedia articles belonging to a category from offline dumps

為{幸葍}努か submitted on 2019-12-12 04:25:11

Question


I have Wikipedia article dumps in different languages. I want to filter them down to the articles that belong to a category (specifically Category:WikiProject_Biography).

I found a number of similar questions, for example:

  1. Wikipedia API to get articles belonging to a category
  2. How do I get all articles about people from Wikipedia?

However, I would like to do it all offline, that is, using only the dumps, and for different languages.

I have also looked at the category table and the categorylinks table (see MediaWiki_1.28.0_database_schema).


Answer 1:


Fetch the page and categorylinks tables from the dump, then run

SELECT
    page_namespace,
    page_title
FROM
    page
    JOIN categorylinks ON page_id = cl_from
WHERE
    cl_to = 'WikiProject_Biography'
;

to get the list of pages.
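
As a rough illustration of what the join does, here is a minimal Python sketch that runs the same query against an in-memory SQLite database. This is not how the dumps are normally loaded (they are MySQL-format SQL files); the two tables below are cut down to only the columns the query touches, and the sample rows and the helper name pages_in_category are made up for the example.

```python
import sqlite3


def pages_in_category(conn, category):
    """Return (page_namespace, page_title) rows for pages linked to `category`.

    Hypothetical helper: it wraps the query from the answer and passes the
    category name as a bound parameter instead of a string literal.
    """
    query = """
        SELECT page_namespace, page_title
        FROM page
        JOIN categorylinks ON page_id = cl_from
        WHERE cl_to = ?
        ORDER BY page_namespace, page_title
    """
    return conn.execute(query, (category,)).fetchall()


# Toy stand-in for the imported dump tables (assumption: only the columns
# used by the query are modelled here; the real tables have more).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        page_namespace INTEGER,
        page_title TEXT
    );
    CREATE TABLE categorylinks (
        cl_from INTEGER,
        cl_to TEXT
    );
""")

# Made-up sample rows, purely for illustration.
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
    (1, 0, "Ada_Lovelace"),
    (2, 1, "Ada_Lovelace"),
    (3, 0, "Paris"),
])
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    (2, "WikiProject_Biography"),
    (3, "Capitals_in_Europe"),
])

rows = pages_in_category(conn, "WikiProject_Biography")
print(rows)
```

Because page_namespace is part of the result, it is easy to check which namespace the matching pages live in before deciding how to map them back to article text in each language's dump.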



Source: https://stackoverflow.com/questions/43178266/extract-wikipedia-articles-belonging-to-a-category-from-offline-dumps
