I want to automatically extract the section "1A. Risk Factors" from around 10,000 files and write it to txt files. A sample URL with a file can be found here.
The desired section lies between "Item 1a Risk Factors" and "Item 1b". The problem is that 'item', '1a' and '1b' may be written differently across these files and may appear in multiple places, not only around the longest, proper section that interests me. So some regular expressions are needed, so that:
The longest part between "1a" and "1b" is extracted (otherwise the table of contents will appear and other useless elements)
Different variants of the expressions are taken into consideration
I tried to implement these two goals in the script, but as it's my first project in Python, I just arranged expressions that I thought might work more or less at random, and apparently they are in the wrong order. (I'm fairly sure I should iterate over the <a> elements, add each extracted "section" to a list, then choose the longest one and write it to a file, but I don't know how to implement this idea.) EDIT: Currently my method returns very little data between 1a and 1b (I think it's a page number from the table of contents) and then it stops.
My code:
import requests
import re
import csv
from bs4 import BeautifulSoup as bs

with open('indexes.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        fn1 = line[0]
        fn2 = re.sub(r'[/\\]', '', line[1])
        fn3 = re.sub(r'[/\\]', '', line[2])
        fn4 = line[3]
        saveas = '-'.join([fn1, fn2, fn3, fn4])
        f = open(saveas + ".txt", "w+", encoding="utf-8")
        url = 'https://www.sec.gov/Archives/' + line[4].strip()
        response = requests.get(url)
        soup = bs(response.content, 'html.parser')
        risks = soup.find_all('a')
        regexTxt = 'item[^a-zA-Z\n]*1a.*item[^a-zA-Z\n]*1b'
        for risk in risks:
            for i in risk.findAllNext():
                sections = re.findall(regexTxt, str(i), re.IGNORECASE | re.DOTALL)
                for section in sections:
                    clean = re.compile('<.*?>')
                    # section = re.sub(r'table of contents', '', section, flags=re.IGNORECASE)
                    # section = section.strip()
                    # section = re.sub('\s+', '', section).strip()
                    print(re.sub(clean, '', section))
The goal is to find the longest part between "1a" and "1b" (regardless of how they exactly look) in the current URL and write it to a file.
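One way to approach the "collect every candidate, keep the longest" idea is sketched below. It is only a sketch under assumptions: it expects the page to have been converted to plain text already (for example with soup.get_text()), the two heading patterns are guesses at common variants rather than a complete list, and the function name longest_risk_section is made up for illustration.

```python
import re

# Assumed heading variants: "Item 1A", "ITEM 1A.", "Item 1-A" and similar.
# \W* allows spaces, dots, dashes or non-breaking spaces between the parts.
START_RE = re.compile(r'item\W*1a\b', re.IGNORECASE)
END_RE = re.compile(r'item\W*1b\b', re.IGNORECASE)


def longest_risk_section(text):
    """Return the longest span from an 'Item 1A' heading to the next 'Item 1B'."""
    candidates = []
    for start in START_RE.finditer(text):
        # Pair every "Item 1A" with the first "Item 1B" that follows it.
        end = END_RE.search(text, start.end())
        if end is None:
            continue
        candidates.append(text[start.start():end.start()].strip())
    # The table-of-contents entry is short, so the real section wins on length.
    return max(candidates, key=len, default='')
```

Calling longest_risk_section(soup.get_text()) returns an empty string when no pair of headings is found, which makes it easy to skip such filings. A filing that mentions "Item 1A" in running text before the real heading would still produce an overly long candidate, so the patterns may need tightening for real data.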
In the end I used a CSV file that contains a column HTMURL, which is the link to the htm-format 10-K. I got it from Kai Chen, who created this website. I wrote a simple script that writes the plain text into files. Processing it will be a simple task now.
import requests
import csv
from pathlib import Path
from bs4 import BeautifulSoup

with open('index.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        url = line[9]
        html_doc = requests.get(url).text
        soup = BeautifulSoup(html_doc, 'html.parser')
        # Remove the state suffixes first; once the bare slashes are gone
        # "/PA/" and "/DE/" can no longer match.
        name = line[1]
        name = name.replace("/PA/", "")
        name = name.replace("/DE/", "")
        name = name.replace('/', '')
        path = Path(name + line[4] + ".txt")
        # Skip the row if the target is an existing directory; otherwise write the text.
        if not path.is_dir():
            with open(path, "w", encoding="utf-8") as f:
                f.write(soup.get_text())
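For the later processing step, a minimal sketch of a batch loop over the saved txt files could look like this. The function name process_folder, the folder names, and the extract callable are all hypothetical; extract stands for whatever function pulls the wanted section out of one file's text and returns an empty string when nothing is found.

```python
from pathlib import Path


def process_folder(src_dir, dst_dir, extract):
    """Apply extract() to every .txt file in src_dir; save non-empty results in dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob('*.txt')):
        text = path.read_text(encoding='utf-8')
        section = extract(text)
        if section:  # skip filings where nothing was found
            out = dst / path.name
            out.write_text(section, encoding='utf-8')
            written.append(out.name)
    return written
```

Keeping the extraction logic in a separate callable means the loop does not need to change when the regular expressions do. The returned list of file names gives a quick count of how many filings produced a section.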