How to get full Wikipedia revision-history list from some article?

后端未结

关注

 2  1690

How can I get the full Wikipedia revision-history list? (Don\'t want to scrape)

import wapiti
import pdb
import pylab as plt  
client = wapiti.WapitiClient(\


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  鱼传尺愫        
                
              
                            
                2021-01-06 01:24
              
            
            
                                                                       
If you use pywikibot you can pull a generator that will run through the full revision history for you.  For example, to get a generator that will step through all the revisions (including their content) for the page "pagename" in English Wikipedia, use:  

site = pywikibot.Site("en", "wikipedia")
page = pywikibot.Page(site, "pagename")
revs = page.revisions(content=True)


There's a lot more parameters you can apply to the query. You can find the API documentation here

Of note is:


  revisions(reverse=False, total=None, content=False, rollback=False, starttime=None, endtime=None)
  
  Generator which loads the version history as Revision instances.


pywikibot appears to be the approach taken by many wikipedia editors to automate editing. 
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  天涯浪人        
                
              
                            
                2021-01-06 01:44
              
            
            
                                                                       
If you need more than 500 revision entries you will have to use MediaWiki API with action query, property revisions and parameter rvcontinue, which is taken from the previous request, so you can't get the whole list only with one request:

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Coffee&rvcontinue=...


To get more specific information of your choice you'll have to use also rvprop parameter:

&rvprop=ids|flags|timestamp|user|userid|size|sha1|contentmodel|comment|parsedcomment|content|tags|parsetree|flagged


Summary of all available parameters you can find here.

This is how to get the full Wikipedia's page revision history in C#:

private static List<XElement> GetRevisions(string pageTitle)
{
    var url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + pageTitle;
    var revisions = new List<XElement>();
    var next = string.Empty;
    while (true)
    {
        using (var webResponse = (HttpWebResponse)WebRequest.Create(url + next).GetResponse())
        {
            using (var reader = new StreamReader(webResponse.GetResponseStream()))
            {
                var xElement = XElement.Parse(reader.ReadToEnd());
                revisions.AddRange(xElement.Descendants("rev"));

                var cont = xElement.Element("continue");
                if (cont == null) break;

                next = "&rvcontinue=" + cont.Attribute("rvcontinue").Value;
            }
        }
    }

    return revisions;
}


Currently for "Coffee" this returns 10 414 revisions.



Edit: Here is a Python version:

import urllib2
import re

def GetRevisions(pageTitle):
    url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + pageTitle
    revisions = []                                        #list of all accumulated revisions
    next = ''                                             #information for the next request
    while True:
        response = urllib2.urlopen(url + next).read()     #web request
        revisions += re.findall('<rev [^>]*>', response)  #adds all revisions from the current request to the list

        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont:                                      #break the loop if 'continue' element missing
            break

        next = "&rvcontinue=" + cont.group(1)             #gets the revision Id from which to start the next request

    return revisions;


How you see the logic is absolutely the same. The difference with C# is that in C# I parsed the XML response and here I use regex to match the all rev and continue elements from it.

So, the idea is that I make a main request from which I get all revisions (the maximum is 500) into revisions array. Also I check the continue xml element to know are there more revisions, get the value of the rvcontinue property and use it in next variable (for this example from the first request it is 20150127211200|644458070) to make another request to take the next 500 revisions. I repeat all this until the continue element is available. If it missing, this means that no more revisions after the last one in the response's revision list are left, so I exit from the loop.

revisions = GetRevisions("Coffee")

print(len(revisions))
#10418


Here are the last 10 revisions for "Coffee" article (they are returned from the API in reversed order), and don't forget that if you need more specific revision information you can use rvprop parameter in your request.

for i in revisions[0:10]:
    print(i)

#<rev revid="698019402" parentid="698018324" user="Termininja" timestamp="2016-01-03T13:51:27Z" comment="short link" />
#<rev revid="698018324" parentid="697691358" user="AXRL" timestamp="2016-01-03T13:39:14Z" comment="/* See also */" />
#<rev revid="697691358" parentid="697690475" user="Zekenyan" timestamp="2016-01-01T05:31:33Z" comment="first coffee trade" />
#<rev revid="697690475" parentid="697272803" user="Zekenyan" timestamp="2016-01-01T05:18:11Z" comment="since country of origin is not first sighting of someone drinking coffee I have removed the origin section completely" />
#<rev revid="697272803" parentid="697272470" minor="" user="Materialscientist" timestamp="2015-12-29T11:13:18Z" comment="Reverted edits by [[Special:Contribs/Media3dd|Media3dd]] ([[User talk:Media3dd|talk]]) to last version by Materialscientist" />
#<rev revid="697272470" parentid="697270507" user="Media3dd" timestamp="2015-12-29T11:09:14Z" comment="/* External links */" />
#<rev revid="697270507" parentid="697270388" minor="" user="Materialscientist" timestamp="2015-12-29T10:45:46Z" comment="Reverted edits by [[Special:Contribs/89.197.43.130|89.197.43.130]] ([[User talk:89.197.43.130|talk]]) to last version by Mahdijiba" />
#<rev revid="697270388" parentid="697265765" user="89.197.43.130" anon="" timestamp="2015-12-29T10:44:02Z" comment="/* See also */" />
#<rev revid="697265765" parentid="697175433" user="Mahdijiba" timestamp="2015-12-29T09:45:03Z" comment="" />
#<rev revid="697175433" parentid="697167005" user="EvergreenFir" timestamp="2015-12-28T19:51:25Z" comment="Reverted 1 pending edit by [[Special:Contributions/2.24.63.78|2.24.63.78]] to revision 696892548 by Zefr: [[WP:CENTURY]]" />

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复