Question
I am trying to find the page IDs for a list of Wikipedia pages. The format is:
input: list of Wikipedia page titles
output: list of Wikipedia page IDs
So far, I've gone through the MediaWiki API documentation to understand how to proceed, but couldn't find a correct way to implement the function. Can anyone suggest how to get the list of page IDs?
Answer 1:
Query basic page information:
import requests

page_titles = ['A', 'B', 'C', 'D']

# Build one query for all titles; the API expects them separated by '|'.
url = (
    'https://en.wikipedia.org/w/api.php'
    '?action=query'
    '&prop=info'
    '&inprop=subjectid'
    '&titles=' + '|'.join(page_titles) +
    '&format=json')
json_response = requests.get(url).json()

# The keys of json_response['query']['pages'] are the page IDs (as strings).
title_to_page_id = {
    page_info['title']: page_id
    for page_id, page_info in json_response['query']['pages'].items()}

print(title_to_page_id)
# Page IDs in the same order as the input titles.
print([title_to_page_id[title] for title in page_titles])
This will print:
{'A': '290', 'B': '34635826', 'C': '5200013', 'D': '8123'}
['290', '34635826', '5200013', '8123']
If you have many titles, you have to split them across multiple requests, because a single request can query at most 50 titles (500 for bots).
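The batching described above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the names chunked, fetch_batch and get_page_ids_batched are hypothetical, and the sketch passes the query through the params argument of requests.get instead of building the URL by hand. It assumes titles that the API returns unchanged, as in the example above.

```python
from typing import Callable, Dict, Iterator, List

API_URL = 'https://en.wikipedia.org/w/api.php'
MAX_TITLES_PER_REQUEST = 50  # limit for regular clients; 500 applies to bots


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_batch(titles: List[str]) -> dict:
    """Send one query for a single batch of titles and return the decoded JSON."""
    import requests  # imported here so the helpers above work without it

    params = {
        'action': 'query',
        'prop': 'info',
        'titles': '|'.join(titles),
        'format': 'json',
    }
    return requests.get(API_URL, params=params).json()


def get_page_ids_batched(
        page_titles: List[str],
        fetch: Callable[[List[str]], dict] = fetch_batch) -> Dict[str, str]:
    """Query the titles in batches and merge the results into one dict."""
    title_to_page_id: Dict[str, str] = {}
    for batch in chunked(page_titles, MAX_TITLES_PER_REQUEST):
        response = fetch(batch)
        for page_id, page_info in response['query']['pages'].items():
            title_to_page_id[page_info['title']] = page_id
    return title_to_page_id
```

Calling get_page_ids_batched(page_titles) with a list of, say, 120 titles would send three requests (50, 50 and 20 titles). The fetch parameter exists only so that the batching logic can be tried out without network access.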
Answer 2:
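The answer provided by AXO works as long as you don't have unnormalized titles, such as the category page "Category:Computer_storage_devices", or special characters like &. The API normalizes titles (for example, underscores become spaces), so the title in the response no longer matches the title you asked for, and an unencoded & breaks the query string.
In that case you need to URL-encode the titles and map the normalized titles in the response back to the original ones, as follows: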
def get_page_ids(page_titles):
    import requests

    # URL-encode each title so characters such as & do not break the query string.
    page_titles_encoded = [requests.utils.quote(x) for x in page_titles]
    url = (
        'https://en.wikipedia.org/w/api.php'
        '?action=query'
        '&prop=info'
        '&inprop=subjectid'
        '&titles=' + '|'.join(page_titles_encoded) +
        '&format=json')
    json_response = requests.get(url).json()

    # Map every normalized title back to the title that was requested.
    # Titles that needed no normalization map to themselves.
    page_normalized_titles = {x: x for x in page_titles}
    if 'normalized' in json_response['query']:
        for mapping in json_response['query']['normalized']:
            page_normalized_titles[mapping['to']] = mapping['from']

    result = {}
    for page_id, page_info in json_response['query']['pages'].items():
        normalized_title = page_info['title']
        page_title = page_normalized_titles[normalized_title]
        result[page_title] = page_id
    return result

print(get_page_ids(page_titles=['Category:R&J_Records_artists', 'Category:Computer_storage_devices', 'Category:Main_topic_classifications']))
This will print:
{'Category:R&J_Records_artists': '33352333', 'Category:Computer_storage_devices': '895945', 'Category:Main_topic_classifications': '7345184'}
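To see the title mapping in isolation, here is a small offline sketch that runs on a hand-written sample response instead of a live request. The function name map_ids_to_requested_titles and the sample data are made up for illustration (only the ID 895945 comes from the output above). It also assumes that, in the default JSON format, a title without a page is reported under a negative key with a 'missing' entry, and it skips such entries.

```python
def map_ids_to_requested_titles(requested_titles, json_response):
    """Return {requested title: page id}, skipping titles that have no page."""
    query = json_response['query']

    # 'normalized' holds {'from': requested, 'to': normalized} pairs and is
    # absent when no title had to be changed.
    normalized_to_requested = {title: title for title in requested_titles}
    for mapping in query.get('normalized', []):
        normalized_to_requested[mapping['to']] = mapping['from']

    result = {}
    for page_id, page_info in query['pages'].items():
        # Assumption: nonexistent pages carry a 'missing' key (and a negative id).
        if 'missing' in page_info:
            continue
        requested = normalized_to_requested.get(page_info['title'])
        if requested is not None:
            result[requested] = page_id
    return result


# Hand-written sample in the shape of an API response (not real output).
sample_response = {
    'query': {
        'normalized': [
            {'from': 'Category:Computer_storage_devices',
             'to': 'Category:Computer storage devices'},
        ],
        'pages': {
            '895945': {'pageid': 895945, 'ns': 14,
                       'title': 'Category:Computer storage devices'},
            '-1': {'ns': 0, 'title': 'No such page here', 'missing': ''},
        },
    },
}

print(map_ids_to_requested_titles(
    ['Category:Computer_storage_devices', 'No such page here'],
    sample_response))
# prints {'Category:Computer_storage_devices': '895945'}
```

Whether missing titles should be dropped, reported as None, or raised as an error depends on the caller; dropping them is just the simplest choice for this sketch.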
Source: https://stackoverflow.com/questions/52787504/how-to-get-page-id-from-wikipedia-page-title