Question
I am trying to build a Python model that can classify account names as either legitimate or gibberish. Capitalization is not important in this particular case, since some legitimate account names may consist of all upper-case or all lower-case letters.
Disclaimer: this is just an internal research experiment, and no real action will be taken based on the classifier's outcome.
In my particular case, there are 2 characteristics that can reveal an account name as suspicious, gibberish, or both:
Weird/random spelling in the name, or a name that consists purely or mostly of numbers. Examples of account names that fit this criterion are: 128, 127, h4rugz4sx383a6n64hpo, tt, t66, t65, asdfds.
The name has 2 components (let's assume that no name will ever have more than 2 components) and the spelling and pronunciation of the 2 components are very similar. Examples of account names that fit this criterion are: Jala Haja, Hata Yaha, Faja Kaja.
If an account name meets both of the above criteria (e.g. 'asdfs lsdfs', '332 333') it should also be considered suspicious.
On the other hand, a legitimate account name doesn't need to have both a first name and a last name. Legitimate names usually come from popular languages such as Latin-script European languages (e.g. Spanish, German, Portuguese, French, English), Chinese, and Japanese.
Examples of legitimate account names include (these names are made up, but they reflect the styles of real-world legitimate account names): Michael, sara, jose colmenares, Dimitar, Jose Rafael, Morgan, Eduardo Medina, Luis R. Mendez, Hikaru, SELENIA, Zhang Ming, Xuting Liu, Chen Zheng.
I've seen some slightly similar questions on Stack Overflow that ask for ways to detect gibberish text. But those don't fit my situation, because legitimate text and words actually have meanings, whereas human names usually don't. I also want to be able to do this based on the account names alone and nothing else.
Right now my script handles the 2nd characteristic of suspicious account names (similar components in the name) using Python's FuzzyWuzzy package, with 50% as the similarity threshold. The script is listed below:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import numpy as np

accounts = pd.read_csv('dataset_with_names.csv', encoding='ISO-8859-1', sep=None, engine='python').replace(np.nan, 'blank', regex=True)
pd.options.mode.chained_assignment = None
accounts.columns = ['name', 'email', 'akon_id', 'acct_creation_date', 'first_time_city', 'first_time_ip', 'label']

# strip punctuation and lower-case the names
accounts['name_simplified'] = accounts['name'].str.replace(r'[^\w\s]', '', regex=True)
accounts['name_simplified'] = accounts['name_simplified'].str.lower()

sim_name = []
for index, row in accounts.iterrows():
    if ' ' in row['name_simplified']:
        components = row['name_simplified'].split()
        if len(components) > 1:
            # flag the name if its two components are at least 50% similar
            if fuzz.ratio(components[0], components[1]) >= 50:
                sim_name.append('True')
            else:
                sim_name.append('False')
        else:
            sim_name.append('False')
    else:
        sim_name.append('False')

accounts['are_name_components_similar'] = sim_name
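For reference, this is roughly how the 50% fuzz.ratio threshold behaves on the two-component examples above (a minimal check using the same fuzz.ratio call; the name pairs are taken from the examples earlier in this question):
from fuzzywuzzy import fuzz

# two-component examples from above; 50 is the threshold used in the script
for full_name in ['Jala Haja', 'Hata Yaha', 'Faja Kaja', 'Jose Rafael', 'Eduardo Medina']:
    a, b = full_name.lower().split()
    print(full_name, fuzz.ratio(a, b), fuzz.ratio(a, b) >= 50)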
The results have been reliable for what the script was designed to do, but I also want to be able to surface gibberish account names with the 1st characteristic (weird/random spelling, or a name that consists purely or mostly of numbers). So far I have not found a solution for that.
Can anyone help? Any feedback/suggestions will be greatly appreciated!
Answer 1:
For the 1st characteristic, you can train a character-based n-gram language model and treat all names with a low average per-character probability as suspicious.
A quick-and-dirty example of such a language model is below. It is a mixture of character-level 1-gram, 2-gram and 3-gram models, trained on the Brown corpus. I am sure you can find more relevant training data (e.g. a list of names of actors).
from nltk.corpus import brown
from collections import Counter
import numpy as np

# one long string: space-separated words, one sentence per line
text = '\n'.join([' '.join([w for w in s]) for s in brown.sents()])

# character-level n-gram counts
unigrams = Counter(text)
bigrams = Counter(text[i:(i+2)] for i in range(len(text)-2))
trigrams = Counter(text[i:(i+3)] for i in range(len(text)-3))

# interpolation weights for the 1-gram, 2-gram and 3-gram models
weights = [0.001, 0.01, 0.989]
def strangeness(text):
    r = 0
    text = ' ' + text + '\n'  # add start and end markers
    for i in range(2, len(text)):
        char = text[i]
        context1 = text[(i-1):i]
        context2 = text[(i-2):i]
        # interpolated count-based estimate of P(char | context)
        num = unigrams[char] * weights[0] + bigrams[context1+char] * weights[1] + trigrams[context2+char] * weights[2]
        den = sum(unigrams.values()) * weights[0] + unigrams[context1] * weights[1] + bigrams[context2] * weights[2]
        r -= np.log(num / den)
    return r / (len(text) - 2)  # average negative log-probability per character
Now you can apply this strangeness measure to your examples.
t1 = '128, 127, h4rugz4sx383a6n64hpo, tt, t66, t65, asdfds'.split(', ')
t2 = 'Michael, sara, jose colmenares, Dimitar, Jose Rafael, Morgan, Eduardo Medina, Luis R. Mendez, Hikaru, SELENIA, Zhang Ming, Xuting Liu, Chen Zheng'.split(', ')
for t in t1 + t2:
    print('{:20} -> {:9.5}'.format(t, strangeness(t)))
You can see that gibberish names are in most cases more "strange" than normal ones. You could use, for example, a threshold of 5.9 here.
128 -> 5.9073
127 -> 6.0044
h4rugz4sx383a6n64hpo -> 7.4261
tt -> 6.3916
t66 -> 7.3553
t65 -> 7.2584
asdfds -> 6.1796
Michael -> 5.6694
sara -> 5.5734
jose colmenares -> 4.9489
Dimitar -> 5.7058
Jose Rafael -> 5.8184
Morgan -> 5.5766
Eduardo Medina -> 5.5703
Luis R. Mendez -> 5.5337
Hikaru -> 6.439
SELENIA -> 7.1125
Zhang Ming -> 5.1594
Xuting Liu -> 5.5975
Chen Zheng -> 5.3341
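Applied to the data frame from your question, this could look roughly like the sketch below; the 5.9 cutoff is just the example threshold from above, and the strangeness_score / is_gibberish_spelling column names are placeholders:
# assumes `accounts`, `strangeness` and the `name_simplified` column are defined as above;
# the new column names are just placeholders
accounts['strangeness_score'] = accounts['name_simplified'].apply(strangeness)
accounts['is_gibberish_spelling'] = accounts['strangeness_score'] > 5.9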
Of course, a simpler solution is to collect a list of popular names in all your target languages and use no machine learning at all - just lookups.
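A minimal sketch of that lookup approach, assuming NLTK's names corpus as one possible whitelist (in practice you would merge name lists covering all of your target languages):
from nltk.corpus import names  # English first names shipped with NLTK; just one possible whitelist

# build a lower-cased lookup set of known first names
known_names = {n.lower() for n in names.words()}

def looks_legitimate(account_name):
    # accept the account name if at least one of its components is a known name
    parts = account_name.lower().split()
    return any(p in known_names for p in parts)

print(looks_legitimate('Jose Rafael'))           # True if 'jose' or 'rafael' is in the list
print(looks_legitimate('h4rugz4sx383a6n64hpo'))  # False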
Source: https://stackoverflow.com/questions/50659889/unable-to-detect-gibberish-names-using-python