The problem is that the page contains irrelevant li tags that don't hold the data you need. Be more specific. For example, to get the list of events from the "20th century" section, first find the header, then get the list of events from its parent's following ul sibling. Also, not every item in the list has its date in the %B %d, %Y format, so you need to handle that with a try/except block:
import urllib2
from datetime import datetime

from bs4 import BeautifulSoup

page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)

events = soup.find('span', id='20th_century').parent.find_next_sibling('ul')
for event in events.find_all('li'):
    try:
        date_string, rest = event.text.split(':', 1)
        print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    except ValueError:
        print event.text
Prints:
19/09/1902
30/12/1903
11/01/1908
24/12/1913
23/10/1942
09/03/1946
1954 500-800 killed at Kumbha Mela, Allahabad.
01/01/1956
02/01/1971
03/12/1979
20/10/1982
29/05/1985
13/03/1988
20/08/1988
Updated version (getting all ul groups under a century):
events = soup.find('span', id='20th_century').parent.find_next_siblings()
for tag in events:
    if tag.name == 'h2':
        break

    for event in tag.find_all('li'):
        try:
            date_string, rest = event.text.split(':', 1)
            print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
        except ValueError:
            print event.text
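
The date handling inside both loops can also be pulled out into a small helper, which makes it easier to see why a single except ValueError covers both failure cases. The following is a minimal sketch written for Python 3 (unlike the Python 2 snippets above). The name format_event is hypothetical, and the sample strings are placeholders apart from the 1954 line taken from the output shown earlier.

```python
from datetime import datetime


def format_event(text):
    """Return the leading date as DD/MM/YYYY, or the text unchanged on failure."""
    try:
        # With no colon, split() returns one element and the unpacking
        # itself raises ValueError.
        date_string, _rest = text.split(':', 1)
        # strptime raises ValueError when the prefix is not '%B %d, %Y'.
        return datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    except ValueError:
        return text


# Placeholder items; the descriptions after the colon are invented.
samples = [
    "September 19, 1902: placeholder description",
    "1954 500-800 killed at Kumbha Mela, Allahabad.",
    "Spring 1950: placeholder description",
]
for sample in samples:
    print(format_event(sample))
```

Note that %B matches month names in the current locale, so this sketch assumes an English (or default C) locale.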