Question
I want to extract the date on which a news article was published on a website. For some websites I know the exact HTML element that holds the date/time (div, p, time), but for others I do not.
Here are links to some of those websites (German-language sites):
(3 Nov 2020) http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226
(Dec. 1, 2020) http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0&sq=&kategorie_id=&date_from=&date_to=
(10/22/2020) http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905
I have tried three different approaches with Python libraries such as requests, htmldate and date_guesser, but I always get None or, in the case of htmldate, always the same date (2020.1.1).
from bs4 import BeautifulSoup
import requests
from htmldate import find_date
from date_guesser import guess_date, Accuracy
# Lib find_date
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
response = requests.get(url)
my_date = find_date(response.content, extensive_search=True)
print(my_date, '\n')
# Lib guess_date
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
my_date = guess_date(url=url, html=requests.get(url).text)
print(my_date.date, '\n')
# Lib requests: the response headers do not include a Last-Modified field
my_date = requests.head('http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226')
print(my_date.headers, '\n')
Am I doing something wrong?
Can you please tell me whether there is a way to extract the publication date from websites like these (where I do not have specific div, p, or datetime elements)?
IMPORTANT! I want a universal date extraction, so that I can put these links in a for loop and run the same function on each of them.
Answer 1:
I have never had much success with some of the date parsing libraries, so I usually go another route. I believe the best way to extract the date strings from the sites in your question is with regular expressions.
website: linden.ch
import requests
import re as regex
from bs4 import BeautifulSoup
from datetime import datetime
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_body = soup.find('body')
find_date = regex.search(r'(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})', str(page_body))
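# groups()[1] is the date text; %b is locale-dependent, so the month abbreviation is parsed according to the current locale
reformatted_timestamp = datetime.strptime(find_date.groups()[1], '%d. %b. %Y').strftime('%d-%m-%Y')
print(reformatted_timestamp)
# print output
# 03-11-2020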
website: buchholterberg.ch
import requests
import re as regex
from bs4 import BeautifulSoup
from datetime import datetime
url = "http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_body = soup.find('body')
find_date = regex.search(r'(Veröffentlicht)\s\w+:\s(\d{1,2}:\d{1,2}:\d{1,2})\s(\d{1,2}\.\d{1,2}\.\d{4})', str(page_body))
# groups()[2] is the dd.mm.yyyy date that follows the time; the dots are escaped so they match only a literal "."
reformatted_timestamp = datetime.strptime(find_date.groups()[2], '%d.%m.%Y').strftime('%d-%m-%Y')
print(reformatted_timestamp)
# print output
# 22-10-2020
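The linden.ch snippet relies on strptime's %b directive, which follows the process locale, so German abbreviations such as "Dez." or "Okt." may fail to parse on a machine with an English locale. Below is a minimal sketch of a locale-independent alternative. GERMAN_MONTHS, TEXTUAL_DATE and parse_german_date are hypothetical names, not part of any library, and the pattern only assumes dates shaped like "3. Nov. 2020".

```python
import re
from datetime import date

# German month names reduced to their first three letters, lowercased.
# "mär" and "mrz" both stand for März.
GERMAN_MONTHS = {
    "jan": 1, "feb": 2, "mär": 3, "mrz": 3, "apr": 4, "mai": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "okt": 10, "nov": 11, "dez": 12,
}

# Matches day, month name (abbreviated or full) and four-digit year,
# e.g. "3. Nov. 2020" or "1. Dezember 2020".
TEXTUAL_DATE = re.compile(r"(\d{1,2})\.\s*([A-Za-zÄÖÜäöü]+)\.?\s+(\d{4})")


def parse_german_date(text):
    """Return a datetime.date for the first textual German date in text, or None."""
    match = TEXTUAL_DATE.search(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = GERMAN_MONTHS.get(month_name[:3].lower())
    if month is None:
        # Unknown month word: report "no date" instead of guessing.
        return None
    return date(int(year), month, int(day))
```

For example, parse_german_date("Datum der Neuigkeit 3. Nov. 2020") returns date(2020, 11, 3), which can then be formatted with strftime('%d-%m-%Y') as in the snippets above. Purely numeric dates such as "22.10.2020" are not handled here and yield None.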
Update 12-04-2020
I looked at the source code for the two Python libraries that you mentioned, htmldate and date_guesser. Neither of them can currently extract the date from the three sources that you listed in your question. The primary reason is the date formats and the language (German) of these target sites.
I had some free time, so I put this together for you. The answer below can easily be modified to extract from any website and can be refined as needed based on the format of your target sources. It currently extracts dates from all the links contained in urls.
all urls
import requests
import re as regex
from bs4 import BeautifulSoup
def extract_date(can_of_soup):
    page_body = can_of_soup.find('body')
    # flatten the body to one line so that newlines do not break the patterns
    clean_body = str(page_body).replace('\n', '')
    if 'Datum der Neuigkeit' in clean_body or 'Veröffentlicht' in clean_body:
        # labelled dates; raw string so that \s and \d reach the regex engine unchanged
        date_formats = r'(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})|(Veröffentlicht am: \d{2}:\d{2}:\d{2} )(\d{1,2}\.\d{1,2}\.\d{4})'
        find_date = regex.search(date_formats, clean_body, regex.IGNORECASE)
        if find_date:
            # only one alternative matches; drop the empty groups and keep the date
            clean_tuples = [i for i in find_date.groups() if i]
            return clean_tuples[1]
    else:
        # no known label: look for a div whose class is known to hold the date
        tags = ['extra', 'elementStandard elementText', 'icms-block icms-information-date icms-text-gemeinde-color']
        for tag in tags:
            date_tag = page_body.find('div', {'class': tag})
            if date_tag is not None:
                children = date_tag.findChildren()
                if children:
                    find_date = regex.search(r'(\d{1,2}\.\d{1,2}\.\d{4})', str(children))
                    # skip this div if it contains no dd.mm.yyyy date
                    if find_date:
                        return find_date.group(1)
                else:
                    return ''.join(date_tag.contents)
def get_soup(target_url):
    response = requests.get(target_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup

urls = {'http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226',
        'http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0'
        '&sq=&kategorie_id=&date_from=&date_to=',
        'http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905',
        'https://www.steffisburg.ch/de/aktuelles/meldungen/Hochwasserschutz-und-Laengsvernetzung-Zulg.php',
        'https://www.wallisellen.ch/aktuellesinformationen/924227',
        'http://www.winkel.ch/de/aktuellesre/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id'
        '=1093910&ls=0&sq=&kategorie_id=&date_from=&date_to=',
        'https://www.aeschi.ch/de/aktuelles/mitteilungen/artikel/?tx_news_pi1%5Bnews%5D=87&tx_news_pi1%5Bcontroller%5D=News&tx_news_pi1%5Baction%5D=detail&cHash=ab4d329e2f1529d6e3343094b416baed'}

for url in urls:
    html = get_soup(url)
    article_date = extract_date(html)
    print(article_date)
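When refining the patterns for new sites, it can help to try them against short strings before fetching any pages. The sketch below does that for the two labelled formats used above. DATE_FORMATS and labelled_date are hypothetical names, and the sample snippets are made up for illustration (including the time of day), not copied from the live pages.

```python
import re

# The two labelled patterns from the answer, written as raw strings with the
# dots in the numeric date escaped.
DATE_FORMATS = re.compile(
    r"(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})"
    r"|(Veröffentlicht am: \d{2}:\d{2}:\d{2} )(\d{1,2}\.\d{1,2}\.\d{4})",
    re.IGNORECASE,
)


def labelled_date(text):
    """Return the date string that follows a known label, or None."""
    match = DATE_FORMATS.search(text)
    if match is None:
        return None
    # Only one alternative matches, so exactly two groups are non-empty:
    # the label and the date that follows it.
    _label, date_string = [group for group in match.groups() if group]
    return date_string


# Hypothetical snippets, assumed to resemble the target pages.
samples = [
    "<p>Datum der Neuigkeit 3. Nov. 2020</p>",
    "<span>Veröffentlicht am: 10:15:00 22.10.2020</span>",
    "<p>Kein Datum vorhanden</p>",
]
results = [labelled_date(sample) for sample in samples]
for found in results:
    print(found)
```

A sample that matches neither label yields None, which is the same signal extract_date gives when it falls through without returning a date, so the calling loop should be prepared to handle None.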
Source: https://stackoverflow.com/questions/65095206/extract-date-from-multiple-webpages-with-python