Question
Here's some bar code data from a pandas DataFrame:
737318 Sikat Botol Pigeon 4902508045506 75170
737379 Natur Manual Breast Pump 8850851860016 75170
738753 Sunlight 1232131321313 75261
739287 Bodymist bodyshop 1122334455667 75296
739677 Bodymist ale 1234567890123 75367
I want to remove data that is suspicious (i.e. has too many repeated or successive digits), like 1232131321313, 1122334455667, 1234567890123, etc. I am very tolerant of false negatives, but want to avoid false positives (bad bar codes) as much as possible.
Answer 1:
If you're worried about repeated and successive digits, you can take np.diff of the digits and then compare against a triangular distribution using a Kolmogorov-Smirnov test. The differences between successive digits of a random number range from -9 to 9 and should approximately follow a triangular distribution between -10 and 10, with a maximum at 0:
import numpy as np
import scipy.stats as stat
t = stat.triang(.5, loc = -10, scale = 20)  # symmetric triangle on [-10, 10], peak at 0
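The triangular shape can be checked by enumerating all 100 ordered pairs of digits. The following is a minimal sketch using only the standard library; the name `diff_counts` is illustrative and not part of the original answer.

```python
from collections import Counter
from itertools import product

# Count the difference (second - first) over all 100 ordered digit pairs.
diff_counts = Counter(b - a for a, b in product(range(10), repeat=2))

# Each difference d occurs (10 - |d|) times out of 100: a discrete triangle
# on -9..9, which the continuous triang(.5, loc=-10, scale=20) approximates.
for d in range(-9, 10):
    print(f"{d:+d}: {diff_counts[d] / 100:.2f}")
```

Because the true distribution is discrete, the KS test against the continuous triangle is only an approximation, which is one reason to treat the p-value threshold as a tuning knob rather than an exact significance level.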
Turning the bar codes into an array:
a = np.array(list(map(list, map(str, a))), dtype = int) # however you get `a` out of your dataframe
Then build a mask with:
mask = np.array([stat.kstest(i, t.cdf).pvalue > .5 for i in np.diff(a, axis = 1)])
testing:
np.array([stat.kstest(j, t.cdf).pvalue > .5 for j in np.diff(np.random.randint(0, 10, (1000, 13)), axis = 1)]).sum()
Out: 720
You'll have about a 30% false negative rate (random codes that get rejected), but a p-value threshold of .5 should pretty much guarantee that the values you keep don't have too many successive or repeated digits. If you want to be really sure you've eliminated anything suspicious, you may also want to KS test the actual digits against stat.uniform(scale = 10) (to eliminate 1213141516171 and similar).
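As a rough sketch of how the two tests might be combined on the five bar codes from the question: the helper `ks_mask` and the variable names below are assumptions made for illustration, and the .5 threshold is simply the one used above.

```python
import numpy as np
import scipy.stats as stat

# Bar codes from the question, kept as strings so no digits are lost.
codes = ["4902508045506", "8850851860016", "1232131321313",
         "1122334455667", "1234567890123"]
digits = np.array([[int(ch) for ch in c] for c in codes])  # shape (5, 13)

tri = stat.triang(.5, loc=-10, scale=20)  # reference for digit differences
uni = stat.uniform(scale=10)              # reference for the digits themselves

def ks_mask(rows, dist, threshold=.5):
    """True where the KS p-value of a row against `dist` exceeds `threshold`."""
    return np.array([stat.kstest(r, dist.cdf).pvalue > threshold for r in rows])

# Keep a code only if both its differences and its digits look random.
keep = ks_mask(np.diff(digits, axis=1), tri) & ks_mask(digits, uni)
print(dict(zip(codes, keep.tolist())))
```

With only 12 or 13 values per code the test has little power, so genuine bar codes can fall below the threshold as well; that is the false negative rate mentioned above.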
Answer 2:
As a first step I would use the barcode's built-in validation mechanism, the checksum. As your barcodes appear to be GTIN barcodes (specifically GTIN-13), you can use this method:
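>>> def CheckBarcode(s):
...     # Weight the first 12 digits 1, 3, 1, 3, ... and add them up.
...     total = 0
...     for i, digit in enumerate(s[:-1]):
...         total += int(digit) * ((i % 2) * 2 + 1)
...     # The check digit brings the total up to the next multiple of 10.
...     return (10 - total % 10) % 10 == int(s[-1])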
>>> CheckBarcode("4902508045506")
True
>>> CheckBarcode("8850851860016")
True
>>> CheckBarcode("1232131321313")
True
>>> CheckBarcode("1122334455667")
False
>>> CheckBarcode("1234567890123")
False
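To apply the same check to whole records, something like the following could work. It is a sketch that assumes the bar code is the third field of each row and is stored as a string; `gtin13_is_valid` and `rows` are illustrative names, not from the original answer.

```python
def gtin13_is_valid(code: str) -> bool:
    """Return True if `code` is 13 digits with a correct GTIN-13 check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(code[:12]))
    return (10 - total % 10) % 10 == int(code[12])

# Rows copied from the question: (first column, name, bar code, last column).
rows = [
    (737318, "Sikat Botol Pigeon", "4902508045506", 75170),
    (737379, "Natur Manual Breast Pump", "8850851860016", 75170),
    (738753, "Sunlight", "1232131321313", 75261),
    (739287, "Bodymist bodyshop", "1122334455667", 75296),
    (739677, "Bodymist ale", "1234567890123", 75367),
]

kept = [row for row in rows if gtin13_is_valid(row[2])]
for row in kept:
    print(row)
```

Note that 1232131321313 happens to carry a valid check digit, so the checksum alone keeps that row; it works as a first filter rather than a complete one.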
Source: https://stackoverflow.com/questions/46556587/how-to-eliminate-suspicious-barcode-like-123456-data