Question
I'm working on a Python script using pypff to open Outlook PST files and extract useful information. I'm following the code posted in here.
I'm trying to get the names of the attachments for each email, but the only methods for type 'attachment' are get_size(), read_buffer() and seek_offset(), which aren't useful to me.
The read_buffer method gives a long string, something like x00\x11\x00\x02\x01\x02\x02\x01\x03\x04\x07\x05\...
How can I decode it?
Answer 1:
You can try decoding with ascii first.
attach_size = msg.get_attachment(0).get_size()
print(msg.get_attachment(0).read_buffer(attach_size).decode('ascii', errors="ignore"))
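As a small illustration of what errors="ignore" does in the line above: bytes that the codec cannot decode are dropped silently, so the result can be shorter than the input. The sample bytes below are made up for this sketch and do not come from a real PST file.

```python
# Hypothetical sample bytes, NOT taken from a real attachment.
raw = b"\x00\x11report\x93\xfa.txt\x00"

# With errors="ignore", bytes outside the ASCII range are discarded.
text = raw.decode("ascii", errors="ignore")

print(repr(text))
print(len(raw), "bytes in,", len(text), "characters out")
```

Control bytes such as \x00 are valid ASCII and survive the decode, so the output can still contain unprintable characters.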
I think Microsoft uses more than one way to encode different parts of an attachment, so no single codec decodes everything perfectly. If ascii cannot decode enough of the content, you can try them all. The set of available codecs differs between Python versions; check it out here.
# 98 encodings in python3.5/6/7
decode = ['ascii','big5','big5hkscs','cp037','cp273',
'cp424','cp437','cp500','cp720','cp737',
'cp775','cp850','cp852','cp855','cp856',
'cp857','cp858','cp860','cp861','cp862',
'cp863','cp864','cp865','cp866','cp869',
'cp874','cp875','cp932','cp949','cp950',
'cp1006','cp1026','cp1125','cp1140','cp1250',
'cp1251','cp1252','cp1253','cp1254','cp1255',
'cp1256','cp1257','cp1258','cp65001','euc_jp',
'euc_jis_2004','euc_jisx0213','euc_kr','gb2312','gbk',
'gb18030','hz','iso2022_jp','iso2022_jp_1','iso2022_jp_2',
'iso2022_jp_2004','iso2022_jp_3','iso2022_jp_ext','iso2022_kr','latin_1',
'iso8859_2','iso8859_3','iso8859_4','iso8859_5','iso8859_6',
'iso8859_7','iso8859_8','iso8859_9','iso8859_10','iso8859_11',
'iso8859_13','iso8859_14','iso8859_15','iso8859_16','johab',
'koi8_r','koi8_t','koi8_u','kz1048','mac_cyrillic',
'mac_greek','mac_iceland','mac_latin2','mac_roman','mac_turkish',
'ptcp154','shift_jis','shift_jis_2004','shift_jisx0213','utf_32',
'utf_32_be','utf_32_le','utf_16','utf_16_be','utf_16_le',
'utf_7','utf_8','utf_8_sig']
# Select the best decoder
# Read the raw attachment bytes once, then try every codec on them.
attach_size = msg.get_attachment(0).get_size()
raw = msg.get_attachment(0).read_buffer(attach_size)
items = []
for item in decode:
    content = raw.decode(item, errors="ignore")
    # I know 'sample_content' is in the attachment, so it's easy to see which ones can decode it.
    if 'sample_content' in content:
        items.append(item)
print(items)
If you don't know what's in the content, you can try workarounds. For instance, in the loop you can pick the decoding that leaves the fewest "\x" escapes, since before decoding your content looks like this: "\x93\x93\xfa\x8c\xd3\x1a\xc6".
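The "fewest \x escapes" idea could be sketched roughly as below. This is only one possible reading of that heuristic, not a tested recipe for PST attachments. The function names printable_ratio and rank_encodings are made up for this example, and the sample bytes stand in for what read_buffer() would return. It uses errors="replace" instead of errors="ignore" so that undecodable bytes show up as U+FFFD and lower the score instead of disappearing.

```python
def printable_ratio(text):
    """Fraction of characters that are printable (or common whitespace)
    and are not the U+FFFD replacement character."""
    if not text:
        return 0.0
    good = sum(
        1 for ch in text
        if ch != "\ufffd" and (ch.isprintable() or ch in "\r\n\t")
    )
    return good / len(text)


def rank_encodings(data, encodings):
    """Return (score, codec_name) pairs, best score first."""
    scores = []
    for name in encodings:
        try:
            text = data.decode(name, errors="replace")
        except (LookupError, UnicodeError):
            continue  # codec missing in this Python version, or it refused the data
        scores.append((printable_ratio(text), name))
    return sorted(scores, reverse=True)


# Stand-in for the bytes returned by read_buffer(); hypothetical sample data.
sample = "hello, 世界\n".encode("utf_8")
ranking = rank_encodings(sample, ["ascii", "utf_8", "latin_1", "no_such_codec"])
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```

Treat the ranking as a hint only: wide encodings such as utf_16 or the CJK codecs can turn arbitrary bytes into "printable" characters and score well on data they do not actually match.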
If anyone has better ways of decoding attachments, please leave a comment here, thank you.
Source: https://stackoverflow.com/questions/55847711/is-there-a-way-to-get-the-attachment-names-from-a-pst-file