Unknown encoding of files in a resulting Beautiful Soup txt file

Question

I downloaded 13 000 files (10-K reports from different companies) and need to extract a specific part of each one (section 1A, Risk Factors). The files open fine in Word and look perfect, but when I open them in a plain text editor, each document appears to be HTML with a large block of encoded text at the end (EDIT: I suspect this is due to the XBRL format of these files). The same block shows up in the output when I parse the files with BeautifulSoup.

I tried online decoders because I thought this might be Base64, but none of the encodings I tried worked. Some files begin with something like "created with Certent Disclosure Management 6.31.0.1" (or the name of another program), so I thought that software might be responsible for the encoding. Still, Word can open these files, so I assume the encoding is a known one. Here is a sample of the encoded data:

M1G2RBE@MN)T='1,SC4,]%$$Q71T3<XU#[AHMB9@*E1=E_U5CKG&(77/*(LY9
ME$N9MY/U9DC,- ZY:4Z0EWF95RMQY#J!ZIB8:9RWF;\"S+1%Z*;VZPV#(MO
MUCHFYAJ'V#6O8*[R9L<VI8[I8KYQB7WSC#DMFGR[E6+;7=2R)N)1Q\24XQ(K
MYQDS$>UJ65%MV4+(KBRHJ3HFIAR76#G/F$%=*9FOU*DM-6TSTC$Q\[C$YC$/

And a sample file from the 13 000 that I downloaded.

Below is the BeautifulSoup code I use to extract the text. It does its job, but I need to work out what this encoded block is and decode it in the Python code below.

from bs4 import BeautifulSoup

# Read the raw filing and parse it as HTML.
with open("98752-TOROTEL INC-10-K-2019-07-23", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'html.parser')
text = soup.get_text()
print(text)

# Write the extracted text; the with block closes the file automatically.
with open("extracted_test.txt", "w", encoding="utf-8") as out:
    out.write(text)

What I want to achieve is to decode the block of text at the end of the file.

Answer 1:

OK, this is going to be somewhat messy, but it will get you close to what you are looking for without using regex (which is notoriously problematic with HTML). The fundamental problem you'll be facing is that EDGAR filings are VERY inconsistent in their formatting, so what works for one 10-Q (or 10-K or 8-K) filing may not work for a similar filing, even from the same filer. For example, the word 'item' may appear in lowercase, uppercase, or mixed case, hence the use of the str.lower() method. So expect some cleanup in all cases.

Having said that, the code below should get you the RISK FACTORS sections from both filings (including the one which has none):

url = [one of these two]

import requests
from bs4 import BeautifulSoup as bs

response = requests.get(url)
soup = bs(response.content, 'html.parser')

# Look for an anchor whose attributes mention both 'item' and '1a'.
risks = soup.find_all('a')
for risk in risks:
    attrs = str(risk.attrs).lower()
    if 'item' in attrs and '1a' in attrs:
        # Print every following element until one whose attributes mention 'item'.
        for i in risk.find_all_next():
            if 'item' in str(i.attrs).lower():
                break
            print(i.text.strip())
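
The attribute check in the loop above is a plain substring test, so '1a' can also match unrelated values. As a hedged variation, not part of the original approach, the check could be pulled into small helpers that use a case-insensitive pattern requiring '1a' to directly follow 'item'. The names `is_item_1a`, `is_any_item`, and `attrs_text` are hypothetical, and unlike `str(risk.attrs)` this sketch looks only at attribute values, not attribute names.

```python
import re

# "item", optional separators, then "1a" not followed by another letter or digit.
ITEM_1A = re.compile(r"item[\s_\-.]*1a(?![0-9a-z])", re.IGNORECASE)
ANY_ITEM = re.compile(r"item", re.IGNORECASE)


def attrs_text(attrs):
    """Flatten a tag's attribute dict (values may be str or list) into one string."""
    parts = []
    for value in attrs.values():
        if isinstance(value, list):
            parts.extend(str(v) for v in value)
        else:
            parts.append(str(value))
    return " ".join(parts)


def is_item_1a(attrs):
    """True if any attribute value looks like an 'Item 1A' marker."""
    return bool(ITEM_1A.search(attrs_text(attrs)))


def is_any_item(attrs):
    """True if any attribute value mentions 'item' in any letter case."""
    return bool(ANY_ITEM.search(attrs_text(attrs)))
```

In the loop, `is_item_1a(risk.attrs)` would take the place of the two substring tests and `is_any_item(i.attrs)` the place of the stop condition. Given how inconsistent the filings are, the stricter pattern may miss anchors that the looser test catches, so it is worth checking against a few real files first.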

Good luck with your project!
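
One note on the encoded block from the question, which the code above does not touch: every sample line is 61 characters long and starts with `M`, which is what a full line of uuencoded data looks like (the `M` is the length character for 45 bytes). That suggests uuencoding rather than Base64. EDGAR full-submission text files typically carry binary attachments such as images or PDFs this way, wrapped in `begin <mode> <filename>` and `end` lines. Assuming that is what these files contain, here is a minimal sketch that removes such blocks before parsing and decodes their body lines when the attachment is wanted. The function names are hypothetical.

```python
import binascii
import re

# A uuencoded block: a "begin <mode> <name>" line through the next "end" line.
UU_BLOCK = re.compile(
    r"^begin [0-7]{3,4} \S.*?^end[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def strip_uu_blocks(text):
    """Return text with every begin...end uuencoded block removed."""
    return UU_BLOCK.sub("", text)


def decode_uu_lines(lines):
    """Decode the body lines of a uuencoded block (no begin/end lines) to bytes."""
    return b"".join(binascii.a2b_uu(line) for line in lines if line.strip())
```

Calling `strip_uu_blocks(contents)` before handing `contents` to BeautifulSoup should keep the attachment data out of the extracted text. If the files turn out not to use `begin`/`end` markers, the pattern will match nothing and the text is returned unchanged.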

Source: https://stackoverflow.com/questions/57286580/unknown-encoding-of-files-in-a-resulting-beautiful-soup-txt-file

Tags: text, encoding, beautifulsoup, lxml, xbrl