How to retrieve Wiktionary word content?

£可爱£侵袭症+ submitted on 2019-11-27 05:53:49
Michael Mrozek

The Wiktionary API can be used to query whether or not a word exists.

Examples for existing and non-existing pages:

http://en.wiktionary.org/w/api.php?action=query&titles=test
http://en.wiktionary.org/w/api.php?action=query&titles=testx

The response to the first link also describes other output formats (for example format=json) that might be easier to parse.
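As a minimal sketch of the existence check, the snippet below requests JSON with format=json and looks for the "missing" marker that the API attaches to pages that do not exist. The names build_query_url, page_exists and word_exists are illustrative and not part of any library, the User-Agent string is a placeholder, and error handling is left out.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"


def build_query_url(title):
    """Build the same action=query request as above, asking for JSON output."""
    params = {"action": "query", "titles": title, "format": "json"}
    return API_URL + "?" + urllib.parse.urlencode(params)


def page_exists(response):
    """Return True if the decoded API response describes an existing page.

    Nonexistent pages carry a "missing" key; invalid titles carry "invalid".
    """
    pages = response.get("query", {}).get("pages", {})
    return any(
        "missing" not in page and "invalid" not in page
        for page in pages.values()
    )


def word_exists(title):
    """Fetch the API response over the network and interpret it."""
    request = urllib.request.Request(
        build_query_url(title),
        # Placeholder User-Agent (an assumption); use one that identifies your tool.
        headers={"User-Agent": "wiktionary-existence-check-example/0.1"},
    )
    with urllib.request.urlopen(request) as reply:
        return page_exists(json.load(reply))
```

Calling word_exists("test") performs a real network request, while page_exists can be exercised offline on a saved response.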

If more than existence is required, request the printable version of the page to retrieve the word's data in a small XHTML format:

http://en.wiktionary.org/w/index.php?title=test&printable=yes
http://en.wiktionary.org/w/index.php?title=testx&printable=yes

These can then be parsed with any standard XML parser.
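A rough sketch of pulling section headings out of a downloaded page follows. It uses Python's tolerant html.parser rather than a strict XML parser, on the assumption that the markup may not always be well-formed XML, and it assumes that language sections are rendered as h2 headings. Both assumptions should be checked against a real page. HeadingCollector and extract_headings are illustrative names.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._in_h2 = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self._in_h2 = False
            self.headings.append("".join(self._buffer).strip())

    def handle_data(self, data):
        # Text inside nested tags (e.g. a <span> in the heading) is kept too.
        if self._in_h2:
            self._buffer.append(data)


def extract_headings(html_text):
    """Return the text of all <h2> headings found in html_text."""
    collector = HeadingCollector()
    collector.feed(html_text)
    collector.close()
    return collector.headings
```

The same pattern extends to other elements by tracking their tag names instead of h2.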

There are a few caveats to just checking that Wiktionary has a page with the name you are looking for:

Caveat #1: Every Wiktionary, including the English Wiktionary, aims to include every word in every language. The API call above therefore only tells you that the word exists in at least one language, not necessarily English: http://en.wiktionary.org/w/api.php?action=query&titles=dicare

Caveat #2: A redirect may exist from one word to another. It might come from an alternative spelling, or from an error of some kind. The API call above does not distinguish a redirect from an article: http://en.wiktionary.org/w/api.php?action=query&titles=profilemetry

Caveat #3: Some Wiktionaries, including the English Wiktionary, have entries for "common misspellings": http://en.wiktionary.org/w/api.php?action=query&titles=fourty

Caveat #4: Some Wiktionaries allow stub entries that have little or no information about the term. This used to be common on several Wiktionaries but not on the English Wiktionary, though it now seems to have spread there as well: https://en.wiktionary.org/wiki/%E6%99%B6%E7%90%83 (permalink to a stub revision, in case the entry has since been filled in: https://en.wiktionary.org/w/index.php?title=%E6%99%B6%E7%90%83&oldid=39757161)

If you need to exclude any of these cases, you will have to load and parse the wikitext itself, which is not a trivial task.
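To give a feel for what such wikitext checks could look like, here is a small sketch that works on the raw page source (obtainable, for example, by adding action=raw to the index.php URL). It detects redirects, lists the level-2 language headings, and flags likely misspelling entries. The function names are illustrative, and the misspelling check assumes such entries are marked with a {{misspelling of|...}} template, which should be verified against real entries.

```python
import re

REDIRECT_RE = re.compile(r"^\s*#REDIRECT\b", re.IGNORECASE)
# Level-2 headings such as "==English=="; deeper levels ("===Noun===") are skipped.
LANGUAGE_RE = re.compile(r"^==[ \t]*([^=\n]+?)[ \t]*==[ \t]*$", re.MULTILINE)


def is_redirect(wikitext):
    """True if the page source is a redirect (it starts with #REDIRECT)."""
    return bool(REDIRECT_RE.match(wikitext))


def language_sections(wikitext):
    """Return the level-2 headings, e.g. ["English", "Italian"]."""
    return LANGUAGE_RE.findall(wikitext)


def looks_like_misspelling(wikitext):
    """Assumption: misspelling entries use a {{misspelling of|...}} template."""
    return "{{misspelling of|" in wikitext


def is_english_entry(wikitext):
    """True for a non-redirect page that has an English section."""
    return not is_redirect(wikitext) and "English" in language_sections(wikitext)
```

These checks are deliberately coarse: a page can contain a misspelling sense in one language and a regular entry in another, so a fuller solution would inspect each language section separately.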

You can download a dump of Wiktionary data. There's more information in the FAQ. For your purposes, the definitions dump is probably a better choice than the XML dump.

To keep it really simple, extract the words from the dump like this:

bzcat pages-articles.xml.bz2 | grep '<title>[^[:space:][:punct:]]*</title>' | sed 's:.*<title>\(.*\)</title>.*:\1:' > words
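For readers who prefer to stay in Python, the following sketch does roughly the same as the shell pipeline: it streams the compressed dump line by line and keeps titles that contain no whitespace or punctuation. The name iter_words is illustrative, and the pattern only excludes ASCII punctuation, which may differ slightly from grep's locale-dependent [:punct:] class.

```python
import bz2
import re
import string

# A title made only of characters that are neither whitespace nor ASCII
# punctuation, mirroring the grep pattern above.
TITLE_RE = re.compile(
    r"<title>([^\s" + re.escape(string.punctuation) + r"]*)</title>"
)


def iter_words(dump_path):
    """Yield matching page titles from a bzip2-compressed dump, line by line."""
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            match = TITLE_RE.search(line)
            if match:
                yield match.group(1)
```

Because it is a generator, the dump is never loaded into memory as a whole; writing the result to a file is a matter of looping over iter_words("pages-articles.xml.bz2").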

If you are using Python, you can use WiktionaryParser by Suyash Behera.

You can install it with pip:

pip install wiktionaryparser

Example usage:

>>> from wiktionaryparser import WiktionaryParser
>>> parser = WiktionaryParser()
>>> word = parser.fetch('test')
>>> another_word = parser.fetch('test', 'french')
>>> parser.set_default_language('french')

Here's a start to parsing etymology and pronunciation data:

function parsePronunciationLine(line) {
  // Each pattern captures the IPA transcription without its slashes.
  // Patterns are tried in order; a later match overrides an earlier one.
  const patterns = [
    [/\{\{\s*a\s*\|UK\s*\}\}\s*\{\{IPA\|\/?([^\/\|]+)\/?\|lang=en\}\}/, 'uk'],
    [/\{\{\s*a\s*\|US\s*\}\}\s*\{\{IPA\|\/?([^\/\|]+)\/?\|lang=en\}\}/, 'us'],
    // enPR followed by IPA, with no accent label, is treated as US
    [/\{\{enPR\|[^\}]+\}\},?\s*\{\{IPA\|\/?([^\/\|]+)\/?\|lang=en\}\}/, 'us'],
    // `.*` also skips an intervening {{enPR|...}} template
    [/\{\{a\|GA\}\},?.*\{\{IPA\|\/?([^\/\|]+)\/?\|lang=en\}\}/, 'ga'],
  ]
  // Sample lines (the RP line and the line with a second, bracketed
  // transcription are not handled by the patterns above yet):
  // {{a|GA}} {{IPA|/ˈhæpi/|lang=en}}
  // * {{a|RP}} {{IPA|/pliːz/|lang=en}}
  // * {{a|GA}} {{enPR|plēz}}, {{IPA|/pliz/|[pʰliz]|lang=en}}

  let result
  for (const [re, type] of patterns) {
    const m = line.match(re)
    if (m) result = { val: m[1], type }
  }
  return result // undefined when no pattern matches
}

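// `langs` is not defined in this snippet; it is assumed to be a lookup object
// keyed by Wiktionary language code (en, enm, la, ...).
function parseEtymologyPiece(piece) {
  // Strip the surrounding {{ }} if present, then split the template arguments.
  const parts = piece.replace(/^\{\{|\}\}$/g, '').split('|')
  parts.shift() // drop the template name (inh, der, m, cog, ...)
  // Up to two leading language codes; the last one is the term's language.
  const codes = []
  while (codes.length < 2 && langs[parts[0]]) {
    codes.push(parts.shift())
  }
  const lang = codes.pop()
  const term = parts.shift()
  return [ lang, term ]
  // Sample inputs:
  // {{inh|en|enm|poisoun}}
  // {{m|enm|poyson}}
  // {{der|en|la|pōtio|pōtio, pōtiōnis|t=drink, a draught, a poisonous draught, a potion}}
  // {{m|la|pōtō|t=I drink}}
  // {{der|en|enm|happy||fortunate, happy}}
  // {{cog|is|heppinn||lucky}}
}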

Update: Here is a gist with it more fleshed out.
