windows-1252

Python - dealing with mixed-encoding files

余生长醉 posted on 2019-11-27 19:08:27
I have a file that is mostly UTF-8, but some Windows-1252 characters have also found their way in. I created a table to map the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.

    cp1252_to_unicode = {
        "\x85": u'\u2026',  # …
        "\x91": u'\u2018',  # ‘
        "\x92": u'\u2019',  # ’
        "\x93": u'\u201c',  # “
        "\x94": u'\u201d',  # ”
        "\x97": u'\u2014'   # —
    }

    for l in open('file.txt'):
        for c, u in cp1252_to_unicode.items():
            l = l.replace(c, u)

But attempting to do the replace this way results in a UnicodeDecodeError being raised, e
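One possible way to approach a mixed UTF-8 / Windows-1252 file is to work on the raw bytes instead of on decoded text: decode as UTF-8 and, whenever the decoder hits bytes that are not valid UTF-8, decode just those bytes as cp1252. The sketch below assumes Python 3; the handler name `cp1252_fallback` and the helper `decode_mixed` are hypothetical names chosen for illustration, not part of the original question.

```python
import codecs


def cp1252_fallback(error):
    """Error handler: decode the bytes UTF-8 rejected as Windows-1252."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad_bytes = error.object[error.start:error.end]
    # Bytes that cp1252 leaves undefined (e.g. 0x81) become U+FFFD.
    replacement = bad_bytes.decode("cp1252", errors="replace")
    # Resume decoding right after the offending bytes.
    return replacement, error.end


# Register the handler under a hypothetical name so decode() can use it.
codecs.register_error("cp1252_fallback", cp1252_fallback)


def decode_mixed(data: bytes) -> str:
    """Decode bytes that are mostly UTF-8 with stray cp1252 bytes."""
    return data.decode("utf-8", errors="cp1252_fallback")
```

A file would be read in binary mode, for example `decode_mixed(open('file.txt', 'rb').read())`. One limitation to keep in mind: a run of cp1252 bytes that happens to form a valid UTF-8 sequence is decoded as UTF-8, so this heuristic cannot be correct in every case.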

Windows-1252 to UTF-8 encoding

自作多情 posted on 2019-11-27 17:43:01
I've copied certain files from a Windows machine to a Linux machine, so all the Windows-encoded (windows-1252) files need to be converted to UTF-8. The files that are already in UTF-8 should not be changed. I'm planning to use the recode utility for that. How can I tell recode to convert only the windows-1252 files and leave the UTF-8 files alone? Example usage of recode:

    recode windows-1252.. myfile.txt

This would convert myfile.txt from windows-1252 to UTF-8. Before doing this, I would like to confirm that myfile.txt is actually windows-1252 encoded and not UTF-8 encoded.
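A check of this kind can be sketched in Python rather than with recode itself: a file whose bytes decode cleanly as UTF-8 is left alone, and anything else is assumed to be windows-1252 and converted. The function names `is_valid_utf8`, `to_utf8_bytes` and `convert_if_not_utf8` are hypothetical, and the "not valid UTF-8 means windows-1252" rule is an assumption that only holds when those are the two encodings in play.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (pure ASCII included)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_utf8_bytes(data: bytes, fallback: str = "cp1252"):
    """Return re-encoded UTF-8 bytes, or None if data is already valid UTF-8."""
    if is_valid_utf8(data):
        return None
    # Raises UnicodeDecodeError for bytes undefined in cp1252 (e.g. 0x81),
    # in which case the caller's file is left untouched.
    return data.decode(fallback).encode("utf-8")


def convert_if_not_utf8(path) -> bool:
    """Rewrite the file as UTF-8 only if it is not valid UTF-8 already."""
    target = Path(path)
    converted = to_utf8_bytes(target.read_bytes())
    if converted is None:
        return False
    target.write_bytes(converted)
    return True
```

Calling `convert_if_not_utf8("myfile.txt")` returns `True` when the file was rewritten and `False` when it was already UTF-8. Because the check is a heuristic, keeping a backup of the originals before a bulk conversion seems prudent.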

How to read a file in Java with specific character encoding?

眉间皱痕 posted on 2019-11-27 02:36:06
Question: I am trying to read a file in as either UTF-8 or Windows-1252 depending on the output of this method:

    public Charset getCorrectCharsetToApply() {
        // Returns a Charset for either UTF-8 or Windows-1252.
    }

So far, I have:

    String fileName = getFileNameToReadFromUserInput();
    InputStream is = new ByteArrayInputStream(fileName.getBytes());
    InputStreamReader isr = new InputStreamReader(is, getCorrectCharsetToApply());
    BufferedReader buffReader = new BufferedReader(isr);

The problem I'm having is

.NET Core doesn't know about Windows 1252, how to fix?

故事扮演 posted on 2019-11-26 14:31:33
This program works just fine when compiled for .NET 4 but does not when compiled for .NET Core. I understand the error about the encoding not being supported, but not how to fix it.

    Public Class Program
        Public Shared Function Main(ByVal args As String()) As Integer
            System.Text.Encoding.GetEncoding(1252)
        End Function
    End Class

To do this, you need to register the CodePagesEncodingProvider instance from the System.Text.Encoding.CodePages package. To do that, install the System.Text.Encoding.CodePages package:

    dotnet add package System.Text.Encoding.CodePages

Then (after implicitly or explicitly running dotnet
