Web crawler to extract from list elements

野的像风 2021-01-27 09:18

I am trying to extract the dates from <li> tags and store them in an Excel file. A sample item:

  <li>January 13, 1991: At least 40 people
    1 answer
    • 2021-01-27 09:52

      The problem is that the page contains irrelevant li tags that don't hold the data you need.

      Be more specific. For example, to get the list of events from the "20th century" section, first find the header, then take the events from the ul that follows its parent as a sibling. Also, not every item in the list has its date in the %B %d, %Y format, so handle those items with a try/except block:

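      # Python 3 version: urllib2 became urllib.request, and print is a function
      from datetime import datetime
      from urllib.request import urlopen

      from bs4 import BeautifulSoup


      page1 = urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
      soup = BeautifulSoup(page1, 'html.parser')  # name the parser explicitly

      # The header span sits inside a heading tag; the events are in the next ul sibling
      events = soup.find('span', id='20th_century').parent.find_next_sibling('ul')
      for event in events.find_all('li'):
          try:
              # ValueError is raised if there is no colon or the date has another format
              date_string, rest = event.text.split(':', 1)
              print(datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y'))
          except ValueError:
              print(event.text)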
      

      Prints:

      19/09/1902
      30/12/1903
      11/01/1908
      24/12/1913
      23/10/1942
      09/03/1946
      1954 500-800 killed at Kumbha Mela, Allahabad.
      01/01/1956
      02/01/1971
      03/12/1979
      20/10/1982
      29/05/1985
      13/03/1988
      20/08/1988
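
The fallback line in that output (the 1954 entry) comes from the try/except. Here is a minimal offline sketch of the same parsing step, using two item strings quoted in this thread. The helper name format_event_date is only an illustration and is not part of the original answer:

```python
from datetime import datetime


def format_event_date(text):
    """Return the leading date of a list item as DD/MM/YYYY, or None."""
    try:
        # Without a colon, split() returns one piece and the
        # two-name unpacking raises ValueError.
        date_string, _rest = text.split(':', 1)
        # A prefix that is not shaped like "January 13, 1991"
        # makes strptime raise ValueError as well.
        parsed = datetime.strptime(date_string.strip(), '%B %d, %Y')
        return parsed.strftime('%d/%m/%Y')
    except ValueError:
        return None


print(format_event_date("January 13, 1991: At least 40 people"))
print(format_event_date("1954 500-800 killed at Kumbha Mela, Allahabad."))
```

Both failure modes raise ValueError, which is why a single except clause is enough. Note that %B matches English month names only under an English (or C) locale.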
      

      Updated version (getting all ul groups under a century):

      # Python 3 syntax; reuses soup and datetime from the first snippet
      events = soup.find('span', id='20th_century').parent.find_next_siblings()
      for tag in events:
          # Stop at the next top-level section heading
          if tag.name == 'h2':
              break
          for event in tag.find_all('li'):
              try:
                  date_string, rest = event.text.split(':', 1)
                  print(datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y'))
              except ValueError:
                  print(event.text)
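
The question also asks about storing the dates in an Excel file, which the code above does not cover. One possible approach, sketched here with assumptions, is to write a CSV file with the standard csv module, since Excel can open CSV directly. This swaps CSV in for a native .xlsx workbook. The names to_row and write_events, the file name stampedes.csv, and the column headers are all hypothetical, and the sample list stands in for the event.text values from the loop:

```python
import csv
from datetime import datetime


def to_row(text):
    """Split one list item into (date, description).

    The date is an empty string when the item has no parsable date.
    """
    try:
        date_string, rest = text.split(':', 1)
        date = datetime.strptime(date_string.strip(), '%B %d, %Y')
        return date.strftime('%d/%m/%Y'), rest.strip()
    except ValueError:
        return '', text.strip()


def write_events(items, path):
    """Write a header row plus one row per list item to a CSV file."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['date', 'description'])
        writer.writerows(to_row(item) for item in items)


# Stand-in for the event.text values collected from the page.
sample = [
    "January 13, 1991: At least 40 people",
    "1954 500-800 killed at Kumbha Mela, Allahabad.",
]
write_events(sample, 'stampedes.csv')
```

Items without a parsable date keep their full text in the description column, so nothing from the page is silently dropped.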
      