How can I convert characters like these, “a³ a¡ a´a§”, in Unicode, using Python?

Submitted by 为君一笑 on 2019-12-05 09:29:31

First, make sure your crawler returns the HTML as Unicode text (e.g. Scrapy has a method response.body_as_unicode() that does exactly this).

Once you have Unicode text that you can't make sense of, the step of going from Unicode text to equivalent ASCII text is handled here: http://pypi.python.org/pypi/Unidecode/0.04.1

from unidecode import unidecode
print(unidecode(u"\u5317\u4EB0"))

The output is "Bei Jing"

You have byte data. You need Unicode data. Isn’t the library supposed to decode it for you? It has to be the one to do it, because you don’t have the HTTP headers yourself and so lack the encoding.

EDIT

Bizarre though this sounds, it appears that Python does not support content decoding in its web library. If you run this program:

#!/usr/bin/env python    
import re
import urllib.request
import io
import sys

for s in ("stdin","stdout","stderr"):
    setattr(sys, s, io.TextIOWrapper(getattr(sys, s).detach(), encoding="utf8"))

print("Seeking r\xe9sum\xe9s")

response = urllib.request.urlopen('http://nytimes.com/')
content  = response.read()

match    = re.search(".*r\xe9sum\xe9.*", content, re.I | re.U)
if match:
    print("success: " + match.group(0))
else:
    print("failure")

You get the following result:

Seeking résumés
Traceback (most recent call last):
  File "ur.py", line 16, in <module>
    match    = re.search(".*r\xe9sum\xe9.*", content, re.I | re.U)
  File "/usr/local/lib/python3.2/re.py", line 158, in search
    return _compile(pattern, flags).search(string)
TypeError: can't use a string pattern on a bytes-like object

That means .read() is returning raw bytes, not a real string. Maybe you can see something in the docs for the urllib.request classes that I can’t. I can’t believe they actually expect you to root around in the .info() return and the <meta> tags, figure out the encoding on your own, and then decode it yourself so you have a real string. That would be utterly lame! I hope I’m wrong, but I spent a good while looking and couldn’t find anything useful here.
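If you do stay in Python, the manual route alluded to above (digging the declared charset out of the response headers and decoding yourself) looks roughly like this; the fallback to UTF-8 when no charset is declared is my assumption, not anything the standard guarantees:

import urllib.request

def read_text(url):
    # urlopen() hands back bytes; the response headers (an
    # http.client.HTTPMessage) carry the declared charset, if any.
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)

# The decode step itself, on a canned UTF-8 byte string:
raw = b"Seeking r\xc3\xa9sum\xc3\xa9s"
print(raw.decode("utf-8"))  # Seeking résumés

This still won't consult any <meta charset=...> tag in the HTML body, only the HTTP header, so it is at best a partial answer to the complaint.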

Compare how easy doing the equivalent is in Perl:

#!/usr/bin/env perl    
use strict;
use warnings;    
use LWP::UserAgent;

binmode(STDOUT, "utf8");    
print("Seeking r\xe9sum\xe9s\n");

my $agent = LWP::UserAgent->new();
my $response = $agent->get("http://nytimes.com/");

if ($response->is_success) {
    my $content = $response->decoded_content;
    if ($content =~ /.*r\xe9sum\xe9.*/i) {
        print("search success: $&\n");
    } else {
        print("search failure\n");
    } 
} else {
    print "request failed: ", $response->status_line, "\n";
} 

Which when run dutifully produces:

Seeking résumés
search success: <li><a href="http://hiring.nytimes.monster.com/products/resumeproducts.aspx">Search Résumés</a></li>

Are you sure you have to do this in Python? Look at how much richer and more user-friendly the Perl LWP::UserAgent and HTTP::Response classes are than the equivalent Python classes, and you’ll see what I mean.

Plus with Perl you get better Unicode support all around, such as full grapheme support, something which Python currently lacks. Given that you were trying to strip out diacritics, this seems like it would be another plus.

 use Unicode::Normalize;
 (my $unaccented = NFD($original)) =~ s/\pM//g;
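For the record, the same NFD-then-strip-marks idiom is available in Python’s standard library via the unicodedata module; unlike Unidecode it only removes diacritics rather than transliterating, so it won’t touch CJK text. A minimal sketch:

import unicodedata

def strip_marks(s):
    # Decompose to NFD, then drop the combining marks (accents).
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_marks("r\u00e9sum\u00e9"))  # resume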

Just a thought.
