wikimedia-dumps

Extract related articles in different languages using Wikidata Toolkit

随声附和 submitted on 2020-01-23 17:57:07
Question: I'm trying to extract interlanguage related articles from the Wikidata dump. After searching the internet, I found a tool named Wikidata Toolkit that helps to work with this type of data. But there is no information about how to find related articles in different languages. For example, the article "Dresden" in English is related to the article "Dresda" in Italian; I mean the second one is the translated version of the first one. I tried to use the toolkit, …

Wikidata Dumps - Long QIDs

时光怂恿深爱的人放手 submitted on 2019-12-24 15:13:23
Question: While playing with the dump, I came across many entries like the following: http://www.wikidata.org/entity/Q12258SCD97A47E-A0CA-453F-B01A-DEE8829139BF http://www.wikidata.org/entity/P646v "/m/021821" . What does Q12258SCD97A47E-A0CA-453F-B01A-DEE8829139BF stand for? It seems to be a QID followed by something else. What does the property P646v stand for? It doesn't seem to be valid. Answer 1: Each Wikidata item has statements, and each statement consists of a statement property and one or …
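The answer is cut off above. As a rough illustration (not taken from the original answer): in the older RDF exports, a statement node appears as the item ID followed by "S" and the statement's UUID, and properties carry a one-letter suffix (c for claim, s for statement, v for value), so P646v would be the value form of P646, the Freebase ID property. A minimal Python sketch for splitting such URIs apart; the helper name parse_entity_uri is made up for illustration:

```python
import re

# Hypothetical helper for the old Wikidata RDF export format, where a statement
# node looks like http://www.wikidata.org/entity/Q12258S<UUID> and a property
# URI may carry a one-letter suffix (c = claim, s = statement, v = value).
ENTITY_PREFIX = "http://www.wikidata.org/entity/"

def parse_entity_uri(uri):
    local = uri[len(ENTITY_PREFIX):] if uri.startswith(ENTITY_PREFIX) else uri
    # Statement node: item ID followed by 'S' and a 36-character UUID.
    m = re.match(r"^(Q\d+)S([0-9A-Fa-f-]{36})$", local)
    if m:
        return {"kind": "statement", "item": m.group(1), "uuid": m.group(2)}
    # Suffixed property: P646c / P646s / P646v all refer to property P646.
    m = re.match(r"^(P\d+)([csv])$", local)
    if m:
        return {"kind": "property", "property": m.group(1), "suffix": m.group(2)}
    m = re.match(r"^([QP]\d+)$", local)
    if m:
        return {"kind": "entity", "id": m.group(1)}
    return {"kind": "unknown", "raw": local}

if __name__ == "__main__":
    print(parse_entity_uri("http://www.wikidata.org/entity/Q12258SCD97A47E-A0CA-453F-B01A-DEE8829139BF"))
    print(parse_entity_uri("http://www.wikidata.org/entity/P646v"))
```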

Parsing a Wikipedia dump

一世执手 submitted on 2019-12-17 10:59:18
Question: For example, using this Wikipedia dump: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=lebron%20james&rvprop=content&redirects=true&format=xmlfm Is there an existing library for Python that I can use to create an array with the mapping of subjects and values? For example: {height_ft, 6}, {nationality, American} Answer 1: It looks like you really want to be able to parse MediaWiki markup. There is a Python library designed for this purpose called mwlib. You can use Python's built- …

Spark and Python trying to parse wikipedia using gensim

末鹿安然 submitted on 2019-12-14 02:53:29
Question: Based on my previous question, Spark and Python use custom file format/generator as input for RDD, I think that I should be able to parse basically any input with sc.textFile() and then apply my own or some library's custom functions. Now I am specifically trying to parse the Wikipedia dump using the gensim framework. I have already installed gensim on my master node and all my worker nodes, and now I would like to use gensim's built-in functions for parsing Wikipedia pages, inspired by this question: List …
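Not the asker's setup, but a rough sketch of one way to combine the two: let gensim's WikiCorpus do the dump parsing and tokenisation, and hand the resulting token lists to Spark. The dump path and the sample size are placeholders, and the parsing happens on the driver rather than on the workers, which is the main limitation of this shortcut:

```python
from gensim.corpora.wikicorpus import WikiCorpus
from pyspark import SparkContext

sc = SparkContext(appName="wiki-gensim-sketch")

# Assumed location of the pages-articles dump on the driver node.
dump_path = "enwiki-latest-pages-articles.xml.bz2"

# WikiCorpus parses the MediaWiki XML dump and yields one token list per
# article. Passing dictionary={} skips the slow vocabulary-building pass.
corpus = WikiCorpus(dump_path, dictionary={})

# Pull a manageable slice of articles and ship it to the cluster; the parsing
# itself still happens on the driver in this sketch.
sample = [tokens for _, tokens in zip(range(10000), corpus.get_texts())]
rdd = sc.parallelize(sample, numSlices=100)

# Example downstream job: basic statistics over article lengths in tokens.
print(rdd.map(len).stats())
```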

How to improve the performance when working with wikipedia data and huge no. of webpages?

拟墨画扇 submitted on 2019-12-13 12:50:21
Question: I am supposed to extract representative terms from an organisation's website using Wikipedia's article-link data dump. To achieve this I have crawled and downloaded the organisation's webpages (~110,000) and created a dictionary of Wikipedia IDs and terms/titles (~40 million records). Now I'm supposed to process each of the webpages using the dictionary to recognise terms and track their term IDs and frequencies. For the dictionary to fit in memory, I've split it into smaller files. Based …
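The question is cut off at "Based", but one common answer to the memory problem it describes is to stop splitting the dictionary and put it behind an on-disk index instead. A hedged sketch using SQLite from the standard library; the file names and the tab-separated "id<TAB>title" input format are assumptions:

```python
import sqlite3

def build_index(tsv_path="wiki_dictionary.tsv", db_path="wiki_dictionary.db"):
    """One-off step: load "id<TAB>title" lines into an indexed on-disk table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS terms (title TEXT PRIMARY KEY, wiki_id INTEGER)")
    with open(tsv_path, encoding="utf-8") as fh:
        # split gives [id, title]; reversed to match the (title, wiki_id) columns.
        rows = (line.rstrip("\n").split("\t", 1)[::-1] for line in fh if "\t" in line)
        con.executemany("INSERT OR IGNORE INTO terms (title, wiki_id) VALUES (?, ?)", rows)
    con.commit()
    return con

def lookup(con, term):
    """Indexed lookup of a single term; returns its Wikipedia ID or None."""
    row = con.execute("SELECT wiki_id FROM terms WHERE title = ?", (term,)).fetchone()
    return row[0] if row else None

if __name__ == "__main__":
    con = build_index()
    print(lookup(con, "Dresden"))
```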

Extract wikipedia articles belonging to a category from offline dumps

為{幸葍}努か submitted on 2019-12-12 04:25:11
Question: I have Wikipedia article dumps in different languages. I want to filter them for articles which belong to a category (specifically Category:WikiProject_Biography). I could find a lot of similar questions, for example: Wikipedia API to get articles belonging to a category, and How do I get all articles about people from Wikipedia? However, I would like to do it all offline, that is, using dumps, and also for different languages. Other things I explored are the category table and the category link table.
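One offline angle on the category-link table mentioned above: each language's dump set includes a categorylinks SQL dump whose rows pair a page ID (cl_from) with a category name (cl_to). A rough sketch that scans it for one category; the file name and the exact tuple layout are assumptions, and mapping the returned page IDs back to titles would need the matching page SQL dump:

```python
import gzip
import re

def pages_in_category(dump_path, category):
    """Collect page IDs (cl_from) whose cl_to matches the given category name."""
    # Each row in the INSERT statements starts as (cl_from,'cl_to',... and we
    # only need those first two fields.
    row_re = re.compile(r"\((\d+),'" + re.escape(category) + r"',")
    page_ids = set()
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if not line.startswith("INSERT INTO"):
                continue
            for match in row_re.finditer(line):
                page_ids.add(int(match.group(1)))
    return page_ids

if __name__ == "__main__":
    ids = pages_in_category("enwiki-latest-categorylinks.sql.gz",
                            "WikiProject_Biography")
    print(len(ids), "pages link to the category")
```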

python extract words from xml

三世轮回 submitted on 2019-12-12 01:53:15
Question: How do I extract all words between <text></text> from a big XML file (1.3 GB) whose structure is as follows: <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="pl"> <siteinfo> <sitename>Wikipedia</sitename> <dbname>plwiki</dbname> <base>http://pl.wikipedia.org/wiki/Wikipedia:Strona_g%C5%82%C3%B3wna< …
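The question is cut off above, but since the file is too big to parse in one piece, a streaming approach is the usual answer. A minimal sketch using ElementTree's iterparse from the standard library; the dump file name is a placeholder, and the namespace is taken from the snippet above:

```python
import re
import xml.etree.ElementTree as ET

# Namespace from the export-0.10 schema shown in the question.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"
WORD = re.compile(r"\w+", re.UNICODE)

def iter_words(xml_path):
    """Stream the dump and yield the words inside each <text> element."""
    for event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == NS + "text" and elem.text:
            for word in WORD.findall(elem.text):
                yield word
        elem.clear()  # release parsed elements so memory stays flat

if __name__ == "__main__":
    for i, word in enumerate(iter_words("plwiki-pages-articles.xml")):
        if i >= 20:
            break
        print(word)
```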

Extract related articles in different languages using Wikidata Toolkit

放肆的年华 submitted on 2019-12-07 13:05:40
I'm trying to extract interlanguage related articles from the Wikidata dump. After searching the internet, I found a tool named Wikidata Toolkit that helps to work with this type of data. But there is no information about how to find related articles in different languages. For example, the article "Dresden" in English is related to the article "Dresda" in Italian; I mean the second one is the translated version of the first one. I tried to use the toolkit, but I couldn't find any solution. Please give an example of how to find these related articles. You …
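Wikidata Toolkit itself is a Java library, so to stay with the Python used elsewhere on this page, here is a sketch of the same idea against the raw data it reads: the correspondence between "Dresden" and "Dresda" is stored as sitelinks (enwiki and itwiki) on the item, and the JSON dump exposes them directly. The dump file name and its one-entity-per-line layout are assumptions about the standard wikidata JSON dump:

```python
import bz2
import json

def iter_entities(dump_path):
    """Stream entities out of the wikidata JSON dump (one JSON object per line)."""
    with bz2.open(dump_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

def related_titles(dump_path, wiki_a="enwiki", wiki_b="itwiki"):
    """Yield (item ID, title in wiki_a, title in wiki_b) for items linked in both."""
    for entity in iter_entities(dump_path):
        sitelinks = entity.get("sitelinks", {})
        if wiki_a in sitelinks and wiki_b in sitelinks:
            yield entity["id"], sitelinks[wiki_a]["title"], sitelinks[wiki_b]["title"]

if __name__ == "__main__":
    for qid, en, it in related_titles("wikidata-latest-all.json.bz2"):
        print(qid, en, "<->", it)   # e.g. one pair would be Dresden <-> Dresda
```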

Parsing a Wikipedia dump

℡╲_俬逩灬. submitted on 2019-11-27 14:02:42
For example, using this Wikipedia dump: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=lebron%20james&rvprop=content&redirects=true&format=xmlfm Is there an existing library for Python that I can use to create an array with the mapping of subjects and values? For example: {height_ft, 6}, {nationality, American} It looks like you really want to be able to parse MediaWiki markup. There is a Python library designed for this purpose called mwlib. You can use Python's built-in XML packages to extract the page content from the API's response, then pass that content into mwlib's …
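The answer above points at mwlib; as a concrete sketch, here is the same idea using mwparserfromhell instead (a different, commonly used wikitext parser) together with the API call from the question. The infobox field names depend on the template the article actually uses, so height_ft and nationality are only examples:

```python
import requests
import mwparserfromhell

API = "https://en.wikipedia.org/w/api.php"

def fetch_wikitext(title):
    """Fetch the current wikitext of a page via the query/revisions API."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "redirects": 1,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["slots"]["main"]["*"]

def infobox_fields(title):
    """Return the parameters of the first infobox template as a plain dict."""
    code = mwparserfromhell.parse(fetch_wikitext(title))
    for template in code.filter_templates():
        if template.name.strip().lower().startswith("infobox"):
            return {str(p.name).strip(): str(p.value).strip() for p in template.params}
    return {}

if __name__ == "__main__":
    fields = infobox_fields("LeBron James")
    print({k: fields[k] for k in ("height_ft", "nationality") if k in fields})
```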