wikimedia-dumps

Extract related articles in different languages using Wikidata Toolkit

随声附和 submitted on 2020-01-23 17:57:07
Question: I'm trying to extract interlanguage related articles from the Wikidata dump. After searching the internet, I found a tool named Wikidata Toolkit that helps to work with this type of data. But there is no information about how to find related articles in different languages. For example, the article "Dresden" in English is related to the article "Dresda" in Italian; I mean the second one is the translated version of the first one. I tried to use the toolkit, …

Wikidata Dumps - Long QIDs

时光怂恿深爱的人放手 submitted on 2019-12-24 15:13:23
Question: While playing with the dump, I came across many entries like the following: http://www.wikidata.org/entity/Q12258SCD97A47E-A0CA-453F-B01A-DEE8829139BF http://www.wikidata.org/entity/P646v "/m/021821" . What does Q12258SCD97A47E-A0CA-453F-B01A-DEE8829139BF stand for? It seems to be a QID followed by something else. What does the property P646v stand for? It doesn't seem to be valid. Answer 1: Each Wikidata item has statements, and each statement consists of a statement property and one or …
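The answer is cut off above. As a rough illustration (not taken from the original answer): in the older RDF exports, a statement node appears as the item ID followed by "S" and the statement's UUID, and properties carry a one-letter suffix (c for claim, s for statement, v for value), so P646v would be the value form of P646, the Freebase ID property. A minimal Python sketch for splitting such URIs apart; the helper name parse_entity_uri is made up for illustration:

```python
import re

# Hypothetical helper for the old Wikidata RDF export format, where a statement
# node looks like http://www.wikidata.org/entity/Q12258S<UUID> and a property
# URI may carry a one-letter suffix (c = claim, s = statement, v = value).
ENTITY_PREFIX = "http://www.wikidata.org/entity/"

def parse_entity_uri(uri):
    local = uri[len(ENTITY_PREFIX):] if uri.startswith(ENTITY_PREFIX) else uri
    # Statement node: item ID followed by 'S' and a 36-character UUID.
    m = re.match(r"^(Q\d+)S([0-9A-Fa-f-]{36})$", local)
    if m:
        return {"kind": "statement", "item": m.group(1), "uuid": m.group(2)}
    # Suffixed property: P646c / P646s / P646v all refer to property P646.
    m = re.match(r"^(P\d+)([csv])$", local)
    if m:
        return {"kind": "property", "property": m.group(1), "suffix": m.group(2)}
    m = re.match(r"^([QP]\d+)$", local)
    if m:
        return {"kind": "entity", "id": m.group(1)}
    return {"kind": "unknown", "raw": local}

if __name__ == "__main__":
    print(parse_entity_uri("http://www.wikidata.org/entity/Q12258SCD97A47E-A0CA-453F-B01A-DEE8829139BF"))
    print(parse_entity_uri("http://www.wikidata.org/entity/P646v"))
```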

Parsing a Wikipedia dump

一世执手 submitted on 2019-12-17 10:59:18
Question: For example, using this Wikipedia dump: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=lebron%20james&rvprop=content&redirects=true&format=xmlfm Is there an existing library for Python that I can use to create an array with the mapping of subjects and values? For example: {height_ft, 6}, {nationality, American} Answer 1: It looks like you really want to be able to parse MediaWiki markup. There is a Python library designed for this purpose called mwlib. You can use Python's built- …

Spark and Python trying to parse wikipedia using gensim

末鹿安然 submitted on 2019-12-14 02:53:29
Question: Based on my previous question, Spark and Python use custom file format/generator as input for RDD, I think that I should be able to parse basically any input with sc.textFile() and then apply my own or some library's custom functions. Now I am specifically trying to parse the Wikipedia dump using the gensim framework. I have already installed gensim on my master node and all my worker nodes, and now I would like to use gensim's built-in functions for parsing Wikipedia pages, inspired by this question: List …
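Not the asker's setup, but a rough sketch of one way to combine the two: let gensim's WikiCorpus do the dump parsing and tokenisation, and hand the resulting token lists to Spark. The dump path and the sample size are placeholders, and the parsing happens on the driver rather than on the workers, which is the main limitation of this shortcut:

```python
from gensim.corpora.wikicorpus import WikiCorpus
from pyspark import SparkContext

sc = SparkContext(appName="wiki-gensim-sketch")

# Assumed location of the pages-articles dump on the driver node.
dump_path = "enwiki-latest-pages-articles.xml.bz2"

# WikiCorpus parses the MediaWiki XML dump and yields one token list per
# article. Passing dictionary={} skips the slow vocabulary-building pass.
corpus = WikiCorpus(dump_path, dictionary={})

# Pull a manageable slice of articles and ship it to the cluster; the parsing
# itself still happens on the driver in this sketch.
sample = [tokens for _, tokens in zip(range(10000), corpus.get_texts())]
rdd = sc.parallelize(sample, numSlices=100)

# Example downstream job: basic statistics over article lengths in tokens.
print(rdd.map(len).stats())
```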

How to improve the performance when working with wikipedia data and huge no. of webpages?

拟墨画扇 submitted on 2019-12-13 12:50:21
Question: I am supposed to extract representative terms from an organisation's website using Wikipedia's article-link data dump. To achieve this I have crawled and downloaded the organisation's webpages (~110,000) and created a dictionary of Wikipedia IDs and terms/titles (~40 million records). Now I'm supposed to process each of the webpages using the dictionary to recognise terms and track their term IDs and frequencies. For the dictionary to fit in memory, I've split it into smaller files. Based …
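The question is cut off at "Based", but one common answer to the memory problem it describes is to stop splitting the dictionary and put it behind an on-disk index instead. A hedged sketch using SQLite from the standard library; the file names and the tab-separated "id<TAB>title" input format are assumptions:

```python
import sqlite3

def build_index(tsv_path="wiki_dictionary.tsv", db_path="wiki_dictionary.db"):
    """One-off step: load "id<TAB>title" lines into an indexed on-disk table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS terms (title TEXT PRIMARY KEY, wiki_id INTEGER)")
    with open(tsv_path, encoding="utf-8") as fh:
        # split gives [id, title]; reversed to match the (title, wiki_id) columns.
        rows = (line.rstrip("\n").split("\t", 1)[::-1] for line in fh if "\t" in line)
        con.executemany("INSERT OR IGNORE INTO terms (title, wiki_id) VALUES (?, ?)", rows)
    con.commit()
    return con

def lookup(con, term):
    """Indexed lookup of a single term; returns its Wikipedia ID or None."""
    row = con.execute("SELECT wiki_id FROM terms WHERE title = ?", (term,)).fetchone()
    return row[0] if row else None

if __name__ == "__main__":
    con = build_index()
    print(lookup(con, "Dresden"))
```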

Extract wikipedia articles belonging to a category from offline dumps

為{幸葍}努か submitted on 2019-12-12 04:25:11
Question: I have Wikipedia article dumps in different languages. I want to filter them for articles which belong to a category (specifically Category:WikiProject_Biography). I could find a lot of similar questions, for example: Wikipedia API to get articles belonging to a category, and How do I get all articles about people from Wikipedia? However, I would like to do it all offline, that is, using dumps, and also for different languages. Other things I explored are the category table and the category link table.
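One offline angle on the category-link table mentioned above: each language's dump set includes a categorylinks SQL dump whose rows pair a page ID (cl_from) with a category name (cl_to). A rough sketch that scans it for one category; the file name and the exact tuple layout are assumptions, and mapping the returned page IDs back to titles would need the matching page SQL dump:

```python
import gzip
import re

def pages_in_category(dump_path, category):
    """Collect page IDs (cl_from) whose cl_to matches the given category name."""
    # Each row in the INSERT statements starts as (cl_from,'cl_to',... and we
    # only need those first two fields.
    row_re = re.compile(r"\((\d+),'" + re.escape(category) + r"',")
    page_ids = set()
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if not line.startswith("INSERT INTO"):
                continue
            for match in row_re.finditer(line):
                page_ids.add(int(match.group(1)))
    return page_ids

if __name__ == "__main__":
    ids = pages_in_category("enwiki-latest-categorylinks.sql.gz",
                            "WikiProject_Biography")
    print(len(ids), "pages link to the category")
```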

python extract words from xml

三世轮回 submitted on 2019-12-12 01:53:15
Question: How do I extract all words between <text></text> from a big XML file (1.3 GB) whose structure is as follows: <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="pl"> <siteinfo> <sitename>Wikipedia</sitename> <dbname>plwiki</dbname> <base>http://pl.wikipedia.org/wiki/Wikipedia:Strona_g%C5%82%C3%B3wna< …
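The question is cut off above, but since the file is too big to parse in one piece, a streaming approach is the usual answer. A minimal sketch using ElementTree's iterparse from the standard library; the dump file name is a placeholder, and the namespace is taken from the snippet above:

```python
import re
import xml.etree.ElementTree as ET

# Namespace from the export-0.10 schema shown in the question.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"
WORD = re.compile(r"\w+", re.UNICODE)

def iter_words(xml_path):
    """Stream the dump and yield the words inside each <text> element."""
    for event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == NS + "text" and elem.text:
            for word in WORD.findall(elem.text):
                yield word
        elem.clear()  # release parsed elements so memory stays flat

if __name__ == "__main__":
    for i, word in enumerate(iter_words("plwiki-pages-articles.xml")):
        if i >= 20:
            break
        print(word)
```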

Extract related articles in different languages using Wikidata Toolkit

放肆的年华 submitted on 2019-12-07 13:05:40
I'm trying to extract interlanguage related articles from the Wikidata dump. After searching the internet, I found a tool named Wikidata Toolkit that helps to work with this type of data. But there is no information about how to find related articles in different languages. For example, the article "Dresden" in English is related to the article "Dresda" in Italian; I mean the second one is the translated version of the first one. I tried to use the toolkit, but I couldn't find any solution. Please give an example of how to find these related articles. You …
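Wikidata Toolkit itself is a Java library, so to stay with the Python used elsewhere on this page, here is a sketch of the same idea against the raw data it reads: the correspondence between "Dresden" and "Dresda" is stored as sitelinks (enwiki and itwiki) on the item, and the JSON dump exposes them directly. The dump file name and its one-entity-per-line layout are assumptions about the standard wikidata JSON dump:

```python
import bz2
import json

def iter_entities(dump_path):
    """Stream entities out of the wikidata JSON dump (one JSON object per line)."""
    with bz2.open(dump_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

def related_titles(dump_path, wiki_a="enwiki", wiki_b="itwiki"):
    """Yield (item ID, title in wiki_a, title in wiki_b) for items linked in both."""
    for entity in iter_entities(dump_path):
        sitelinks = entity.get("sitelinks", {})
        if wiki_a in sitelinks and wiki_b in sitelinks:
            yield entity["id"], sitelinks[wiki_a]["title"], sitelinks[wiki_b]["title"]

if __name__ == "__main__":
    for qid, en, it in related_titles("wikidata-latest-all.json.bz2"):
        print(qid, en, "<->", it)   # e.g. one pair would be Dresden <-> Dresda
```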

Parsing a Wikipedia dump

℡╲_俬逩灬. submitted on 2019-11-27 14:02:42
For example, using this Wikipedia dump: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=lebron%20james&rvprop=content&redirects=true&format=xmlfm Is there an existing library for Python that I can use to create an array with the mapping of subjects and values? For example: {height_ft, 6}, {nationality, American} It looks like you really want to be able to parse MediaWiki markup. There is a Python library designed for this purpose called mwlib. You can use Python's built-in XML packages to extract the page content from the API's response, then pass that content into mwlib's …
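The answer above points at mwlib; as a concrete sketch, here is the same idea using mwparserfromhell instead (a different, commonly used wikitext parser) together with the API call from the question. The infobox field names depend on the template the article actually uses, so height_ft and nationality are only examples:

```python
import requests
import mwparserfromhell

API = "https://en.wikipedia.org/w/api.php"

def fetch_wikitext(title):
    """Fetch the current wikitext of a page via the query/revisions API."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "redirects": 1,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["slots"]["main"]["*"]

def infobox_fields(title):
    """Return the parameters of the first infobox template as a plain dict."""
    code = mwparserfromhell.parse(fetch_wikitext(title))
    for template in code.filter_templates():
        if template.name.strip().lower().startswith("infobox"):
            return {str(p.name).strip(): str(p.value).strip() for p in template.params}
    return {}

if __name__ == "__main__":
    fields = infobox_fields("LeBron James")
    print({k: fields[k] for k in ("height_ft", "nationality") if k in fields})
```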