How to eliminate suspicious barcode (like 123456) data [closed]

走远了吗. Submitted on 2020-01-06 06:58:12

Question


Here's some bar code data from a pandas DataFrame:

737318  Sikat Botol Pigeon          4902508045506   75170
737379  Natur Manual Breast Pump    8850851860016   75170
738753  Sunlight                    1232131321313   75261
739287  Bodymist bodyshop           1122334455667   75296
739677  Bodymist ale                1234567890123   75367

I want to remove data that is suspicious (i.e. has too many repeated or successive digits), such as 1232131321313, 1122334455667, 1234567890123, etc. I am very tolerant of false negatives, but want to avoid false positives (bad bar codes) as much as possible.


Answer 1:


If you're worried about repeated and successive digits, you can take np.diff of the digits and then compare the result against a triangular distribution using a Kolmogorov-Smirnov test. The difference between successive digits of a random number should roughly follow a triangular distribution between -10 and 10, with a maximum at 0. Strictly speaking the differences are integers from -9 to 9, so the continuous distribution is only an approximation.

import numpy as np
import scipy.stats as stat
t = stat.triang(.5, loc = -10, scale = 20)
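As a quick sanity check on the triangular assumption, the exact distribution of the difference between two independent, uniformly random digits can be enumerated. This is a minimal sketch using only the standard library; the name `diff_pmf` is made up for illustration. It shows that the differences are discrete, span -9 to 9, and peak at 0, which is why the continuous triangular distribution on [-10, 10] is a reasonable but inexact model.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 100 ordered digit pairs and tally b - a
# (np.diff computes "next minus previous").
counts = Counter(b - a for a, b in product(range(10), repeat=2))

# Exact probability of each possible difference (hypothetical name).
diff_pmf = {d: Fraction(n, 100) for d, n in sorted(counts.items())}

for d, p in diff_pmf.items():
    print(f"{d:+d}: {float(p):.2f}")
```

The probabilities fall off linearly on both sides of 0, i.e. P(d) = (10 - |d|) / 100.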

Turning the bar codes into an array:

a = np.array(list(map(list, map(str, a))), dtype = int)  # however you get `a` out of your dataframe

then build a mask with

np.array([stat.kstest(i, t.cdf).pvalue > .5 for i in np.diff(a, axis = 1)])

testing:

np.array([stat.kstest(j, t.cdf).pvalue > .5 for j in np.diff(np.random.randint(0, 10, (1000, 13)), axis = 1)]).sum()

Out: 720

You'll have about a 30% false negative rate (roughly 280 of the 1000 random codes above are rejected), but a p-value threshold of .5 should pretty much guarantee that the values you keep don't have too many successive or repeated digits. If you want to be really sure you've eliminated anything suspicious, you may also want to KS test the actual digits against stat.uniform(scale = 10) (to eliminate 1213141516171 and similar).
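The two checks described above could be combined into a single mask along the following lines. This is a rough sketch, assuming the bar codes are available as 13-character digit strings; the function name `looks_random` and the sample list are made up for illustration, and the .5 threshold is simply carried over from above. With only 13 discrete digits per code, the KS p-values are approximate at best.

```python
import numpy as np
import scipy.stats as stat

# Reference distributions: triangular for digit differences, uniform for digits.
tri = stat.triang(.5, loc=-10, scale=20)
uni = stat.uniform(scale=10)

def looks_random(codes, threshold=.5):
    """Return a boolean mask: True where a code passes both KS tests.

    Hypothetical helper; `codes` is assumed to be a sequence of
    equal-length digit strings.
    """
    digits = np.array([list(c) for c in codes], dtype=int)
    diffs = np.diff(digits, axis=1)
    diff_ok = np.array([stat.kstest(d, tri.cdf).pvalue > threshold for d in diffs])
    digit_ok = np.array([stat.kstest(d, uni.cdf).pvalue > threshold for d in digits])
    return diff_ok & digit_ok

codes = ["4902508045506", "1234567890123", "1111111111111"]
mask = looks_random(codes)
print(dict(zip(codes, mask)))
```

If the codes live in a DataFrame column, the resulting mask could then be used for boolean indexing on that frame.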




Answer 2:


As a first step I would use the barcode's built-in validation mechanism, the checksum. As your barcodes appear to be GTIN barcodes (specifically GTIN-13), you can use this method:

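>>> def CheckBarcode(s):
...     # GTIN-13: weight the first 12 digits 1, 3, 1, 3, ... from the left
...     total = 0
...     for i, ch in enumerate(s[:-1]):
...         total += int(ch) * ((i % 2) * 2 + 1)
...     # the check digit brings the weighted sum up to the next multiple of 10
...     return (10 - total % 10) % 10 == int(s[-1])
...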

>>> CheckBarcode("4902508045506")
True
>>> CheckBarcode("8850851860016")
True
>>> CheckBarcode("1232131321313")
True
>>> CheckBarcode("1122334455667")
False
>>> CheckBarcode("1234567890123")
False
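To show how the checksum might serve as a first filter, here is a small sketch that applies it to the five bar codes from the question. It uses only the standard library; `gtin13_is_valid` and `rows` are names made up for this example, and the function assumes each code is a 13-digit string.

```python
def gtin13_is_valid(code):
    """Check the GTIN-13 check digit of a 13-digit string (hypothetical helper)."""
    if len(code) != 13 or not code.isdecimal():
        return False
    # Weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(code[:-1]))
    # The check digit tops the weighted sum up to the next multiple of 10.
    return (10 - total % 10) % 10 == int(code[-1])

# (name, bar code) pairs taken from the question's sample data.
rows = [
    ("Sikat Botol Pigeon", "4902508045506"),
    ("Natur Manual Breast Pump", "8850851860016"),
    ("Sunlight", "1232131321313"),
    ("Bodymist bodyshop", "1122334455667"),
    ("Bodymist ale", "1234567890123"),
]

# Keep only the rows whose check digit is consistent.
kept = [(name, code) for name, code in rows if gtin13_is_valid(code)]
for name, code in kept:
    print(name, code)
```

Note that 1232131321313 happens to have a valid check digit, so the checksum on its own will not catch every suspicious code; it seems best suited as a first pass before a pattern-based test.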


Source: https://stackoverflow.com/questions/46556587/how-to-eliminate-suspicious-barcode-like-123456-data
