I'm making a crawler to get the text inside HTML, and I'm using BeautifulSoup.
When I open the URL using urllib2, the HTML that used Portuguese accented characters like "ã ó é õ" automatically comes back converted into other characters like these: "a³ a¡ a´a§".
What I want is just to get the words without the accents:
contrã¡rio -> contrario
I tried to use this algorithm, but it only works when the text uses words like "olá coração contrário":
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
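As a quick sanity check, the function does behave as expected on correctly decoded text; the trouble is that the stray characters in the garbled text are ordinary punctuation rather than combining marks (category 'Mn'), so the NFD stripping leaves them in place:

print(strip_accents(u'olá coração contrário'))  # -> ola coracao contrario
print(strip_accents(u'contrã¡rio'))             # -> contra¡rio ('¡' is not a combining mark)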
Firstly, you have to ensure that your crawler returns HTML that is Unicode text (e.g. Scrapy has a method response.body_as_unicode() that does exactly this).
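If you are using plain urllib2 with BeautifulSoup rather than Scrapy, the equivalent step could look roughly like this (a sketch, assuming the server declares a charset in its Content-Type header; the UTF-8 fallback is an assumption):

import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen('http://example.com/')
# Take the charset from the Content-Type header, falling back to UTF-8.
charset = response.headers.getparam('charset') or 'utf-8'
html = response.read().decode(charset, 'replace')  # now real unicode text

text = BeautifulSoup(html).get_text()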
Once you have Unicode text that you can't make sense of, the step of going from Unicode text to equivalent ASCII text lies here: http://pypi.python.org/pypi/Unidecode/0.04.1
from unidecode import unidecode
print unidecode(u"\u5317\u4EB0")
The output is "Bei Jing"
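Applied to the Portuguese words from the question (assuming the text has already been decoded to proper Unicode first), it gives:

from unidecode import unidecode
print unidecode(u'olá coração contrário')

The output is "ola coracao contrario".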
You have byte data. You need Unicode data. Isn’t the library supposed to decode it for you? It has to, because you don’t have the HTTP headers and so lack the encoding.
EDIT
Bizarre though this sounds, it appears that Python does not support content decoding in its web library. If you run this program:
#!/usr/bin/env python
import re
import urllib.request
import io
import sys

for s in ("stdin", "stdout", "stderr"):
    setattr(sys, s, io.TextIOWrapper(getattr(sys, s).detach(), encoding="utf8"))

print("Seeking r\xe9sum\xe9s")

response = urllib.request.urlopen('http://nytimes.com/')
content = response.read()

match = re.search(".*r\xe9sum\xe9.*", content, re.I | re.U)
if match:
    print("success: " + match.group(0))
else:
    print("failure")
You get the following result:
Seeking résumés
Traceback (most recent call last):
  File "ur.py", line 16, in <module>
    match = re.search(".*r\xe9sum\xe9.*", content, re.I | re.U)
  File "/usr/local/lib/python3.2/re.py", line 158, in search
    return _compile(pattern, flags).search(string)
TypeError: can't use a string pattern on a bytes-like object
That means .read() is returning raw bytes, not a real string. Maybe you can see something in the docs for urllib.request that I can't see. I can't believe they actually expect you to root around in the .info() return and the <meta> tags and figure out the stupid encoding on your own, and then decode it so you have a real string. That would be utterly lame! I hope I'm wrong, but I spent a good time looking and couldn't find anything useful here.
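For what it's worth, the manual route I'm complaining about would look something like this (a sketch; get_content_charset() reads the charset out of the Content-Type header, and the UTF-8 fallback is an assumption):

import re
import urllib.request

response = urllib.request.urlopen('http://nytimes.com/')
# Decode the bytes yourself, using the charset the server declared.
charset = response.headers.get_content_charset() or "utf-8"
content = response.read().decode(charset, "replace")

match = re.search(".*r\xe9sum\xe9.*", content, re.I)
if match:
    print("success: " + match.group(0))
else:
    print("failure")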
Compare how easy doing the equivalent is in Perl:
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent;

binmode(STDOUT, "utf8");
print("Seeking r\xe9sum\xe9s\n");

my $agent = LWP::UserAgent->new();
my $response = $agent->get("http://nytimes.com/");

if ($response->is_success) {
    my $content = $response->decoded_content;
    if ($content =~ /.*r\xe9sum\xe9.*/i) {
        print("search success: $&\n");
    } else {
        print("search failure\n");
    }
} else {
    print "request failed: ", $response->status_line, "\n";
}
Which when run dutifully produces:
Seeking résumés
search success: <li><a href="http://hiring.nytimes.monster.com/products/resumeproducts.aspx">Search Résumés</a></li>
Are you sure you have to do this in Python? Check out how much richer and more user-friendly the Perl LWP::UserAgent and HTTP::Response classes are than the equivalent Python classes, and you'll see what I mean.
Plus with Perl you get better Unicode support all around, such as full grapheme support, something which Python currently lacks. Given that you were trying to strip out diacritics, this seems like it would be another plus.
use Unicode::Normalize;

# Decompose to NFD, then strip every combining mark (\pM).
(my $unaccented = NFD($original)) =~ s/\pM//g;
Just a thought.
Source: https://stackoverflow.com/questions/7237241/how-to-convert-characters-like-these-a%c2%b3-a-a%c2%b4a-in-unicode-using-python