Question
I have a pandas DataFrame with a column of Wikipedia URLs that I want to load. However, some strings fail to load because they contain percent-encoded Unicode characters. For example, 'Kruskal %E2%80%93Wallis_one-way_analysis_of_variance' raises an error like the following:
PageError: Page id "Cauchy%E2%80%93Schwarz_inequality" does not match any pages. Try another id!
Is there a way to convert these percent-encoded sequences back into plain characters? In this case, I need a function that creates a new column:
old column                           new column
Cauchy%E2%80%93Schwarz_inequality    Cauchy–Schwarz_inequality
Markov%27s_inequality                Markov's_inequality
Answer 1:
urllib.parse.unquote should do the trick. It replaces percent escapes such as %27 and %E2%80%93 with the characters they encode.
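In [1]: import urllib.parse  # import the submodule explicitly; a bare `import urllib` does not guarantee it is loaded
   ...:
   ...: import pandas as pd
   ...:
   ...: df = pd.DataFrame({'url': ['Markov%27s_inequality', 'Cauchy%E2%80%93Schwarz_inequality']})
   ...: # Decode every percent-encoded title into a new column
   ...: df['clean_url'] = df['url'].apply(urllib.parse.unquote)
   ...: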
In [2]: df
Out[2]:
                                 url                  clean_url
0              Markov%27s_inequality        Markov's_inequality
1  Cauchy%E2%80%93Schwarz_inequality  Cauchy–Schwarz_inequality
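If the column can contain missing values, passing them straight to unquote would likely raise a TypeError. The following is a minimal sketch of one way around that, assuming a column named url as in the answer above. The None row is hypothetical sample data, and using Series.map with na_action="ignore" is a swapped-in alternative to the apply call, not part of the original answer. Note that the decoded en dash is still a non-ASCII Unicode character, so the result is decoded text rather than ASCII.

```python
from urllib.parse import unquote

import pandas as pd

# Hypothetical sample data; the None row stands in for a missing title.
df = pd.DataFrame({
    "url": [
        "Markov%27s_inequality",
        "Cauchy%E2%80%93Schwarz_inequality",
        None,
    ]
})

# na_action="ignore" skips missing values instead of passing them to unquote.
df["clean_url"] = df["url"].map(unquote, na_action="ignore")

print(df["clean_url"].tolist())
```

The decoded strings in clean_url can then be passed to whatever page loader is in use, while rows with missing titles stay missing and can be filtered out with dropna.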
Source: https://stackoverflow.com/questions/50837619/transform-pandas-string-column-containing-unicodes-to-ascii-to-load-urls