I have a Pandas Dataframe generated from a database, which has data with mixed encodings. For example:
+----+-------------------------+----------+-----------
Pandas has merged a PR to support a df.to_markdown() method. You can find more details here. It should be available soon.
Pandas 1.0 was released on 29 January 2020 and supports Markdown conversion, so you can now do this directly!
Example taken from the docs:
df = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=['a', 'a', 'b'])
print(df.to_markdown())
| | A | B |
|:---|----:|----:|
| a | 1 | 1 |
| a | 2 | 2 |
| b | 3 | 3 |
Or without the index:
print(df.to_markdown(index=False)) # use 'showindex' for pandas < 1.1
| A | B |
|----:|----:|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
Right, so I've taken a leaf from the question Rohit suggested (Python - Encoding string - Swedish Letters), extended his answer, and come up with the following:
# Enforce UTF-8 encoding (Python 2 only: reload(sys) re-exposes
# the setdefaultencoding function that site.py removes at startup)
import sys
stdin, stdout = sys.stdin, sys.stdout  # keep references; reload resets them
reload(sys)
sys.stdin, sys.stdout = stdin, stdout
sys.setdefaultencoding('UTF-8')
# SQLite3 database
import sqlite3
# Pandas: Data structures and data analysis tools
import pandas as pd
# Read database, attach as Pandas dataframe
db = sqlite3.connect("Applications.db")
df = pd.read_sql_query("SELECT path, language, date, shortest_sentence, longest_sentence, number_words, readability_consensus FROM applications ORDER BY date(date) DESC", db)
db.close()
df.columns = ['Path', 'Language', 'Date', 'Shortest Sentence', 'Longest Sentence', 'Words', 'Readability Consensus']
# Parse Dataframe and apply Markdown, then save as 'table.md'
cols = df.columns
df2 = pd.DataFrame([['---','---','---','---','---','---','---']], columns=cols)
df3 = pd.concat([df2, df])
df3.to_csv("table.md", sep="|", index=False)
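The separator-row trick above can be sketched without the database; the column names and values here are made up:

```python
import pandas as pd

df = pd.DataFrame({"Language": ["Eng", "Nor"], "Words": [306, 205]})

# Build a one-row frame holding the Markdown separator cells,
# prepend it, and write the result as pipe-separated text
sep = pd.DataFrame([["---"] * len(df.columns)], columns=df.columns)
out = pd.concat([sep, df]).to_csv(sep="|", index=False)

print(out)
```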
An important precursor to this is that the shortest_sentence and longest_sentence columns do not contain unnecessary line breaks; these are removed by applying .replace('\n', ' ').replace('\r', '') to them before inserting into the SQLite database. It appears that the solution is not to enforce the language-specific encoding (ISO-8859-1 for Norwegian), but rather to use UTF-8 instead of the default ASCII.
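That line-break cleanup is a one-liner; a minimal sketch (the helper name and sample sentence are made up):

```python
# Collapse Unix and Windows line endings before inserting into SQLite
def strip_breaks(sentence):
    return sentence.replace('\n', ' ').replace('\r', '')

cleaned = strip_breaks("Jeg har grundig kjennskap\r\ntil Microsoft Office.")
print(cleaned)
```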
I ran this through my IPython notebook (Python 2.7.10) and got a table like the following (fixed spacing for appearance here):
| Path | Language | Date | Shortest Sentence | Longest Sentence | Words | Readability Consensus |
|-------------------------|----------|------------|----------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|-----------------------|
| data/Eng/Something1.txt | Eng | 2015-09-17 | I am able to relocate to London on short notice. | With my administrative experience in the preparation of the structure and content of seminars in various courses, and critiquing academic papers on various levels, I am confident that I can execute the work required as an editorial assistant. | 306 | 11th and 12th grade |
| data/Nor/NoeNorrønt.txt | Nor | 2015-09-17 | Jeg har grundig kjennskap til Microsoft Office og Adobe. | I løpet av studiene har jeg vært salgsmedarbeider for et større konsern, hvor jeg solgte forsikring til studentene og de faglige ansatte ved universitetet i Trønderlag, samt renholdsarbeider i et annet, hvor jeg i en periode var avdelingsansvarlig. | 205 | 18th and 19th grade |
| data/Nor/Ørret.txt.txt | Nor | 2015-09-17 | Jeg håper på positiv tilbakemelding, og møter naturligvis til intervju hvis det er ønskelig. | I løpet av studiene har jeg vært salgsmedarbeider for et større konsern, hvor jeg solgte forsikring til studentene og de faglige ansatte ved universitetet i Trønderlag, samt renholdsarbeider i et annet, hvor jeg i en periode var avdelingsansvarlig. | 160 | 18th and 19th grade |
Thus, a Markdown table without problems with encoding.
sqlite3 returns Unicode objects by default for TEXT fields. Everything was set up to work before you introduced the table() function from an external source (which you did not provide in your question). The table() function has str() calls which do not provide an encoding, so ASCII is used to protect you. You need to rewrite table() not to do this, especially as you've got Unicode objects. You may have some success by simply replacing str() with unicode().
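In Python 3 the str/unicode split is gone, but the underlying point still holds: decode bytes with an explicit codec rather than relying on a default. A minimal sketch (the sample string is made up):

```python
# A Norwegian string stored as raw UTF-8 bytes
raw = "Ørret på fjellet".encode("utf-8")

# An explicit decode with the right codec round-trips cleanly;
# raw.decode("ascii") would raise UnicodeDecodeError here
text = raw.decode("utf-8")
print(text)
```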
Here's an example function using pytablewriter and a regular expression to make the Markdown table look more like how a DataFrame is displayed in Jupyter (with the row index in bold).
import io
import re
import pandas as pd
import pytablewriter

def df_to_markdown(df):
    """
    Converts a Pandas DataFrame to a Markdown table,
    making the index bold (as in Jupyter) unless it's a
    pd.RangeIndex, in which case the index is dropped entirely.
    Returns a string containing the Markdown table.
    """
    isRangeIndex = isinstance(df.index, pd.RangeIndex)
    if not isRangeIndex:
        df = df.reset_index()

    writer = pytablewriter.MarkdownTableWriter()
    writer.stream = io.StringIO()
    writer.header_list = df.columns
    writer.value_matrix = df.values
    writer.write_table()

    writer.stream.seek(0)
    table = writer.stream.readlines()
    if isRangeIndex:
        return ''.join(table)
    else:
        # Make the index cells bold
        new_table = table[:2]
        for line in table[2:]:
            new_table.append(re.sub(r'^(.*?)\|', r'**\1**|', line))
        return ''.join(new_table)
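The bolding step can be exercised on its own against a single table row; the sample line here is made up:

```python
import re

line = "a| 1| 1|"
# Wrap everything before the first pipe in ** ** to bold the index cell
bold = re.sub(r'^(.*?)\|', r'**\1**|', line)
print(bold)
```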
Using the external tool pandoc and a pipe:
def to_markdown(df):
    # Requires the pandoc executable to be installed and on PATH
    from subprocess import Popen, PIPE
    s = df.to_latex()
    p = Popen('pandoc -f latex -t markdown',
              stdin=PIPE, stdout=PIPE, shell=True)
    stdoutdata, _ = p.communicate(input=s.encode("utf-8"))
    return stdoutdata.decode("utf-8")