Parsing EDGAR filings

耶瑟儿~ 2020-12-30 10:03

I would like to use Python 2.7 to remove anything that isn't the documents' text from EDGAR filings (which are available online as .txt files). An example of what the file

3 answers
  • 2020-12-30 10:55

    The link below points to a library that parses EDGAR filings into a SQLite DB. It can pull Form10k and Form8Qk filings from the EDGAR FTP site for the years you specify and load them into normalized SQLite tables. Because filers adhere poorly to the filing standard, writing your own parsing script would be a significant undertaking. With that library, code similar to the following loads the filings for the quarters you want; from there you can simply query the tables for the data you are seeking.

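    import edgar.database  # assumes the library linked below is on the Python path

    edgar.database.create()
    # Load quarterly master index files into the local SQLite DB.
    # Each quarter is passed as a (year, quarter) tuple (assumed; check what load() expects).
    quarters = []
    quarters.append((2009, 3))  # Q3 2009
    quarters.append((2008, 3))  # Q3 2008
    edgar.database.load(quarters)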

    http://rf-contrib.googlecode.com/svn/trunk/ha/src/main/python/edgar/
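
Once the load has finished, it may help to check which tables the library actually created before writing queries. The following is a minimal sketch using only the standard `sqlite3` module. `list_tables` is a hypothetical helper, not part of the linked library, and the answer does not say where the database file is written, so the path is left to the caller.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: [column names]} for a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        schema = {}
        for name in names:
            quoted = '"%s"' % name.replace('"', '""')
            # PRAGMA table_info yields one row per column; index 1 is the column name.
            schema[name] = [col[1] for col in
                            conn.execute("PRAGMA table_info(%s)" % quoted)]
        return schema
    finally:
        conn.close()
```

Calling `list_tables` with the path of the database the library produced shows the table and column names to use in your own `SELECT` statements.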

  • 2020-12-30 10:56

    Look at the OpenSP toolkit, which has programs to process SGML files. Your simplest option is probably to use the osx program to get an XML version of the input file, after which you can use XML processing tools.

    There may be some setup to do first, as the OpenSP package doesn't come with the EDGAR DTD or its SGML declaration (the first part of the stuff in your reference at page 48, starting with <!SGML "ISO 8879-1986"). You will have to get these as text files and add them to the catalogs where the SP parser can find them.

    UPDATE: This document seems to be a more up-to-date version. A casual Google search doesn't turn up any readily machine-processable versions, though, so you may have to copy and paste from the PDF.

    However, if you do so, there will be some extraneous formatting you'll have to remove: there seem to be page break indicators, labelled "C-1", "C-2", and so on. They are not part of SGML and need to be deleted.

    You can either add the SGML declaration and the EDGAR DTD to the catalog (in which case the DTD file should contain only the part between the [ after <!DOCTYPE submission and the matching ] at the end), or create a "prolog" file consisting of both parts together as is (i.e. including the <!DOCTYPE submission [ and ]>). In the second case, run any program in the toolkit on the prolog and your SGML file: put both names on the command line, with the prolog file first, so that the parser reads both files in the correct order.

    To understand what's happening, you need to know that an SGML parser needs three pieces of information for a parse: an SGML declaration to set some environmental and processing parameters, then a DTD to describe the structural constraints on a document, and finally the document itself.
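
The prolog route described above could be scripted roughly as follows. This is a sketch under several assumptions, not a tested recipe: OpenSP's `osx` is installed and on the PATH, the SGML declaration and DTD were pasted from the PDF into a single text file, and the page-break markers sit alone on their own lines as `C-1`, `C-2`, and so on. The helper names and file names are hypothetical.

```python
import re
import subprocess

# A line holding nothing but a page-break marker such as "C-1" or "C-12"
# (assumed layout of text copied from the PDF).
PAGE_MARKER = re.compile(r"^\s*C-\d+\s*$")


def strip_page_markers(text):
    """Drop lines that consist only of a page-break marker."""
    kept = [line for line in text.splitlines(True)
            if not PAGE_MARKER.match(line)]
    return "".join(kept)


def build_osx_command(prolog_path, filing_path, osx="osx"):
    """Command line with the prolog first, then the filing, so the parser
    reads the SGML declaration and DTD before the document itself."""
    return [osx, prolog_path, filing_path]


def filing_to_xml(prolog_path, filing_path):
    """Run osx and return the XML it writes to standard output.

    osx may exit with a nonzero status when it reports parse errors,
    so the exit status is not treated as fatal here.
    """
    proc = subprocess.Popen(build_osx_command(prolog_path, filing_path),
                            stdout=subprocess.PIPE)
    xml, _ = proc.communicate()
    return xml
```

A typical run would clean the pasted text with `strip_page_markers`, save the result as the prolog file, and then call `filing_to_xml("prolog.sgml", "filing.txt")`; both file names are placeholders.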

  • 2020-12-30 11:06

    The pysec project looks promising. It's a basic Django app that downloads the EDGAR index and then lets you download specific filings and extract financial parameters from the XBRL.
