How to export text from all pages of a MediaWiki?

Backend · open question · 6 answers · 1475 views
醉酒成梦 asked 2020-12-31 08:22

I have a MediaWiki running which represents a dictionary of German terms and their translation to a local dialect. Each page holds one term, its translation and a number of additional details. How can I export the text of all pages?

6 Answers
  • 2020-12-31 08:55

    You can export the page content directly from the database. It will be the raw wiki markup, as when using Special:Export. But it will be easier to script the export, and you don't need to make sure all your pages are in some special category.

    Here is an example:

    SELECT page_title, page_touched, old_text
    FROM revision,page,text
    WHERE revision.rev_id=page.page_latest
    AND text.old_id=revision.rev_text_id;
    

    If your wiki uses PostgreSQL, the table "text" is named "pagecontent", and you may need to specify the schema. In that case, the same query would be:

    SET search_path TO mediawiki,public;
    
    SELECT page_title, page_touched, old_text 
    FROM revision,page,pagecontent
    WHERE revision.rev_id=page.page_latest
    AND pagecontent.old_id=revision.rev_text_id;
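
    To script the export rather than running the query by hand, something like the minimal Python sketch below can dump every page's wikitext to one file. The connection details, output file name, and the use of the pymysql package are assumptions; it also relies on the same revision/page/text layout as the queries above (note that newer MediaWiki versions changed this schema).

    # Minimal sketch: dump the latest wikitext of every page to a single text file.
    # Host, credentials, database name and output file are placeholders.
    import pymysql

    conn = pymysql.connect(host="localhost", user="wikiuser",
                           password="secret", database="wikidb")
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT page_title, page_touched, old_text
                FROM revision, page, text
                WHERE revision.rev_id = page.page_latest
                  AND text.old_id = revision.rev_text_id
            """)
            with open("wiki_dump.txt", "w", encoding="utf-8") as out:
                for title, touched, text in cur:
                    # page_title and old_text are binary columns; decode defensively.
                    if isinstance(title, (bytes, bytearray)):
                        title = title.decode("utf-8", errors="replace")
                    if isinstance(text, (bytes, bytearray)):
                        text = text.decode("utf-8", errors="replace")
                    out.write(f"== {title} ==\n{text}\n\n")
    finally:
        conn.close()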
    
  • 2020-12-31 08:57

    This worked very well for me. Notice I redirected the output to the file backup.xml. From a Windows Command Processor (CMD.exe) prompt:

    cd \PATH_TO_YOUR_WIKI_INSTALLATION\maintenance
    \PATH_OF_PHP.EXE\php dumpBackup.php --full > backup.xml
    
  • 2020-12-31 09:00

    You can use the special page Special:Export to export to XML; Wikipedia's version is at https://en.wikipedia.org/wiki/Special:Export.

    You might also consider Extension:Collection if you eventually want it in human-readable (e.g. PDF) form.
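
    If you would rather script Special:Export than paste titles into the form, it also accepts its inputs as request parameters (pages with newline-separated titles, curonly for the latest revision only, action=submit to trigger the export; see Manual:Parameters to Special:Export and verify against your version). A rough sketch using Python's requests package, with the wiki URL and page titles as placeholders:

    # Rough sketch: fetch an XML export of a list of pages via Special:Export.
    # The wiki URL and page titles are placeholders; requests is a third-party package.
    import requests

    WIKI = "http://yourwiki.example/index.php"
    titles = ["Term_One", "Term_Two", "Term_Three"]

    resp = requests.post(WIKI, data={
        "title": "Special:Export",
        "action": "submit",
        "pages": "\n".join(titles),   # one title per line, as in the form's text box
        "curonly": "1",               # latest revision only
    })
    resp.raise_for_status()

    with open("export.xml", "wb") as f:
        f.write(resp.content)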

  • 2020-12-31 09:05

    Export

    cd maintenance
    php5 ./dumpBackup.php --current > /path/wiki_dump.xml
    

    Import

    cd maintenance
    php5 ./importDump.php < /path/wiki_dump.xml
    
  • 2020-12-31 09:09

    It looks less than simple. http://meta.wikimedia.org/wiki/Help:Export might help, but probably not.

    If the pages are all structured in the same way, you might be able to write a web scraper with something like Scrapy; a rough sketch follows.
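
    Everything in this sketch is an assumption to adapt: the wiki URL, the use of Special:AllPages as the entry point, and the CSS selectors (recent default skins list pages in ul.mw-allpages-chunk and wrap content in #mw-content-text, but your skin and version may differ). Pagination of Special:AllPages is not handled.

    # Very rough Scrapy sketch; start URL and CSS selectors are assumptions that
    # must be adapted to the wiki's skin and MediaWiki version.
    import scrapy


    class WikiTermSpider(scrapy.Spider):
        name = "wiki_terms"
        # Special:AllPages lists the pages of the main namespace (placeholder URL).
        start_urls = ["http://yourwiki.example/wiki/Special:AllPages"]

        def parse(self, response):
            # Follow every page listed on Special:AllPages (pagination not handled here).
            for href in response.css("ul.mw-allpages-chunk li a::attr(href)").getall():
                yield response.follow(href, callback=self.parse_term)

        def parse_term(self, response):
            # Grab the title and the rendered body text; refine the selectors to
            # pull out just the term and its translation.
            yield {
                "title": response.css("h1#firstHeading ::text").get(),
                "text": " ".join(response.css("div#mw-content-text ::text").getall()),
            }

    It could then be run with something like "scrapy runspider wiki_terms.py -o terms.json" (file names again placeholders).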

  • 2020-12-31 09:13

    I'm not completely satisfied with the solution, but I ended up assigning a common category to all pages; I can then add this category, and with it all of the contained page names, in the Special:Export box. It seems to work, although I'm not sure whether it will still work once I reach a few thousand pages.
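
    If the manual Special:Export form becomes impractical at a few thousand pages, the same category-based export can be scripted against the web API: action=query with generator=categorymembers plus the export flag returns the same XML. A rough sketch with Python's requests, where the API URL and the category name are placeholders:

    # Rough sketch: export all pages of one category as Special:Export-style XML.
    # The API URL and category name are placeholders; requests is a third-party package.
    import requests

    API = "http://yourwiki.example/api.php"

    resp = requests.get(API, params={
        "action": "query",
        "generator": "categorymembers",
        "gcmtitle": "Category:Dictionary",   # the common category mentioned above
        "gcmlimit": "max",                   # as many members as the wiki allows per request
        "export": "1",                       # include an XML export of the selected pages
        "exportnowrap": "1",                 # return bare XML instead of a wrapped API result
    })
    resp.raise_for_status()

    with open("category_export.xml", "wb") as f:
        f.write(resp.content)

    Note that with exportnowrap there is no continuation data in the response, so a category larger than one batch would need paging via gcmcontinue (and export without exportnowrap) instead.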
