Programmatically convert pandas dataframe to markdown table

梦毁少年i 2020-12-13 03:44

I have a pandas DataFrame generated from a database, which contains data with mixed encodings. For example:

+----+-------------------------+----------+-----------         


        
13 answers
  • 2020-12-13 04:33

    pandas has merged a PR that adds a df.to_markdown() method; see that PR for more details. It should be available soon.

  • 2020-12-13 04:36

    pandas 1.0 was released on 29 January 2020 and supports Markdown conversion, so you can now do this directly. Note that to_markdown() requires the tabulate package to be installed.

    Example taken from the docs:

    import pandas as pd

    df = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=['a', 'a', 'b'])
    print(df.to_markdown())
    
    |    |   A |   B |
    |:---|----:|----:|
    | a  |   1 |   1 |
    | a  |   2 |   2 |
    | b  |   3 |   3 |
    

    Or without the index:

    print(df.to_markdown(index=False))  # for pandas < 1.1, use showindex=False instead
    
    |   A |   B |
    |----:|----:|
    |   1 |   1 |
    |   2 |   2 |
    |   3 |   3 |
    
  • 2020-12-13 04:36

    Right, so I've taken a leaf from a question suggested by Rohit (Python - Encoding string - Swedish Letters), extended his answer, and come up with the following:

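    # Enforce UTF-8 as the default encoding.
    # Python 2 only: reload() is not a builtin and sys.setdefaultencoding()
    # does not exist in Python 3, where str is already Unicode.
    import sys
    stdin, stdout = sys.stdin, sys.stdout  # keep the current streams; reload(sys) would reset them
    reload(sys)
    sys.stdin, sys.stdout = stdin, stdout
    sys.setdefaultencoding('UTF-8')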
    
    # SQLite3 database
    import sqlite3
    # Pandas: Data structures and data analysis tools
    import pandas as pd
    
    # Read database, attach as Pandas dataframe
    db = sqlite3.connect("Applications.db")
    df = pd.read_sql_query("SELECT path, language, date, shortest_sentence, longest_sentence, number_words, readability_consensus FROM applications ORDER BY date(date) DESC", db)
    db.close()
    df.columns = ['Path', 'Language', 'Date', 'Shortest Sentence', 'Longest Sentence', 'Words', 'Readability Consensus']
    
    # Prepend a Markdown separator row ('---' per column), then save as 'table.md'
    cols = df.columns
    df2 = pd.DataFrame([['---'] * len(cols)], columns=cols)
    df3 = pd.concat([df2, df])
    df3.to_csv("table.md", sep="|", index=False)
    

    An important prerequisite is that the shortest_sentence and longest_sentence columns contain no line breaks. I removed them by applying .replace('\n', ' ').replace('\r', '') to the text before inserting it into the SQLite database. It appears that the solution is not to enforce a language-specific encoding (ISO-8859-1 for Norwegian), but to use UTF-8 instead of the default ASCII.

    I ran this through my IPython notebook (Python 2.7.10) and got a table like the following (fixed spacing for appearance here):

    | Path                    | Language | Date       | Shortest Sentence                                                                            | Longest Sentence                                                                                                                                                                                                                                         | Words | Readability Consensus |
    |-------------------------|----------|------------|----------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|-----------------------|
    | data/Eng/Something1.txt | Eng      | 2015-09-17 | I am able to relocate to London on short notice.                                             | With my administrative experience in the preparation of the structure and content of seminars in various courses, and critiquing academic papers on various levels, I am confident that I can execute the work required as an editorial assistant.       | 306   | 11th and 12th grade   |
    | data/Nor/NoeNorrønt.txt | Nor      | 2015-09-17 | Jeg har grundig kjennskap til Microsoft Office og Adobe.                                     | I løpet av studiene har jeg vært salgsmedarbeider for et større konsern, hvor jeg solgte forsikring til studentene og de faglige ansatte ved universitetet i Trønderlag, samt renholdsarbeider i et annet, hvor jeg i en periode var avdelingsansvarlig. | 205   | 18th and 19th grade   |
    | data/Nor/Ørret.txt.txt  | Nor      | 2015-09-17 | Jeg håper på positiv tilbakemelding, og møter naturligvis til intervju hvis det er ønskelig. | I løpet av studiene har jeg vært salgsmedarbeider for et større konsern, hvor jeg solgte forsikring til studentene og de faglige ansatte ved universitetet i Trønderlag, samt renholdsarbeider i et annet, hvor jeg i en periode var avdelingsansvarlig. | 160   | 18th and 19th grade   |
    

    The result is a Markdown table with no encoding problems.
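
The same idea can be sketched in Python 3 without the default-encoding workaround, since strings there are already Unicode. This is only a minimal illustration, not the code used above: rows_to_markdown and write_markdown are hypothetical helper names, and escaping literal pipes is an extra precaution that the original answer does not take.

```python
def rows_to_markdown(headers, rows):
    """Build a pipe-delimited Markdown table from headers and rows."""

    def clean(value):
        # Remove line breaks the same way the answer does before storing text,
        # because a line break inside a cell would split the table row.
        text = str(value).replace("\n", " ").replace("\r", "")
        # Assumption beyond the original answer: escape literal pipes so they
        # are not read as column separators.
        return text.replace("|", "\\|")

    lines = [
        "| " + " | ".join(clean(h) for h in headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(clean(cell) for cell in row) + " |")
    return "\n".join(lines) + "\n"


def write_markdown(path, headers, rows):
    """Write the table as UTF-8, naming the encoding explicitly."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(rows_to_markdown(headers, rows))
```

With a DataFrame, one could pass list(df.columns) as the headers and df.values.tolist() as the rows.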

  • 2020-12-13 04:36

    sqlite3 returns Unicode objects by default for TEXT fields. Everything was set up to work before you introduced the table() function from an external source (which you did not provide in your question).

    The table() function contains str() calls that do not specify an encoding, so Python falls back to ASCII and raises an error on non-ASCII characters instead of guessing.

    You need to rewrite table() so that it does not do this, especially as you have Unicode objects. You may have some success by simply replacing str() with unicode().
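
The failure mode can be illustrated in Python 3, where the conversion has to be written out explicitly. This is a rough analogy rather than the questioner's table() code (which is not shown): encoding to ASCII stands in for what Python 2's str() did implicitly on a Unicode object, and the sample string is made up for the example.

```python
text = "Ørret"  # hypothetical non-ASCII sample string

try:
    # Roughly what Python 2's str() attempted on a unicode object.
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Naming UTF-8 explicitly works for this text and round-trips cleanly.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(ascii_ok)            # False
print(round_trip == text)  # True
```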

  • 2020-12-13 04:37

    Here's an example function using pytablewriter and some regular expressions to make the markdown table more similar to how a dataframe looks on Jupyter (with the row headers bold).

    import io
    import re
    import pandas as pd
    import pytablewriter
    
    def df_to_markdown(df):
        """
        Converts Pandas DataFrame to markdown table,
        making the index bold (as in Jupyter) unless it's a
        pd.RangeIndex, in which case the index is completely dropped.
        Returns a string containing markdown table.
        """
        isRangeIndex = isinstance(df.index, pd.RangeIndex)
        if not isRangeIndex:
            df = df.reset_index()
        writer = pytablewriter.MarkdownTableWriter()
        writer.stream = io.StringIO()
        writer.header_list = df.columns
        writer.value_matrix = df.values
        writer.write_table()
        writer.stream.seek(0)
        table = writer.stream.readlines()
    
        if isRangeIndex:
            return ''.join(table)
        else:
            # Make the index cells bold. This assumes each row starts with the
            # index cell rather than a leading pipe.
            new_table = table[:2]
            for line in table[2:]:
                new_table.append(re.sub(r'^(.*?)\|', r'**\1**|', line))
    
            return ''.join(new_table)
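
For tables whose rows do begin with a pipe, such as the df.to_markdown() output shown in an earlier answer, the bolding step could be sketched as follows. bold_first_column is a hypothetical helper, not part of pandas or pytablewriter, and it assumes cells contain no escaped pipes.

```python
def bold_first_column(markdown_table):
    """Wrap the first cell of every body row in ** ** (bold)."""
    lines = markdown_table.splitlines()
    out = lines[:2]  # header row and separator row stay unchanged
    for line in lines[2:]:
        cells = line.split("|")
        # cells[0] is the empty string before the leading pipe, so the
        # first real cell is cells[1].
        if len(cells) >= 3 and cells[1].strip():
            cells[1] = " **{}** ".format(cells[1].strip())
        out.append("|".join(cells))
    return "\n".join(out)
```

Applied to a row such as `| a  |   1 |   1 |`, this yields `| **a** |   1 |   1 |`.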
    
  • 2020-12-13 04:40

    Using the external tool pandoc through a pipe:

    def to_markdown(df):
        from subprocess import Popen, PIPE
        s = df.to_latex()
        p = Popen('pandoc -f latex -t markdown',
                  stdin=PIPE, stdout=PIPE, shell=True)
        stdoutdata, _ = p.communicate(input=s.encode("utf-8"))
        return stdoutdata.decode("utf-8")
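
A variant of the same pipe, sketched with subprocess.run and an argument list so that no shell is involved. It assumes Python 3.7 or later and a pandoc executable on the PATH; latex_to_markdown is a hypothetical name chosen for the example.

```python
import shutil
import subprocess


def latex_to_markdown(latex, pandoc="pandoc"):
    """Convert a LaTeX fragment to Markdown by piping it through pandoc."""
    if shutil.which(pandoc) is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        [pandoc, "-f", "latex", "-t", "markdown"],  # argument list, no shell
        input=latex,
        capture_output=True,
        encoding="utf-8",  # encode stdin and decode stdout as UTF-8
        check=True,        # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

With a DataFrame, the call would be latex_to_markdown(df.to_latex()).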
    