I am working on trying to weed out a certain customer from our database. I\'ve noticed a trend where people fill out their first name with the same name that is partial to how t
You can employ similarity based approach
Try code at bottom of answer
It produces result like below
business_name partial_business_name first_name similarity
locksmith taylorsville locksmith locksmith 1.0
locksmith farmington locksmith locksmith 1.0
locksmith roy locksmith locksmi 0.7777777777777778
locksmith clinton locksmith locks 0.5555555555555556
So, you will be able to control what to filter out based on similarity value
** Code **
SELECT business_name, partial_business_name, first_name, similarity FROM
JS( // input table
(
SELECT business_name, REGEXP_EXTRACT(business_name, r'^(\w+)') AS partial_business_name, first_name AS first_name FROM
(SELECT 'locksmith taylorsville' AS business_name, 'locksmith' AS first_name),
(SELECT 'locksmith roy' AS business_name, 'locksmi' AS first_name),
(SELECT 'locksmith clinton' AS business_name, 'locks' AS first_name),
(SELECT 'locksmith farmington' AS business_name, 'locksmith' AS first_name),
) ,
// input columns
business_name, partial_business_name, first_name,
// output schema
"[{name: 'business_name', type:'string'},
{name: 'partial_business_name', type:'string'},
{name: 'first_name', type:'string'},
{name: 'similarity', type:'float'}]
",
// function
"function(r, emit) {
var _extend = function(dst) {
var sources = Array.prototype.slice.call(arguments, 1);
for (var i=0; i<sources.length; ++i) {
var src = sources[i];
for (var p in src) {
if (src.hasOwnProperty(p)) dst[p] = src[p];
}
}
return dst;
};
var Levenshtein = {
/**
* Calculate levenshtein distance of the two strings.
*
* @param str1 String the first string.
* @param str2 String the second string.
* @return Integer the levenshtein distance (0 and above).
*/
get: function(str1, str2) {
// base cases
if (str1 === str2) return 0;
if (str1.length === 0) return str2.length;
if (str2.length === 0) return str1.length;
// two rows
var prevRow = new Array(str2.length + 1),
curCol, nextCol, i, j, tmp;
// initialise previous row
for (i=0; i<prevRow.length; ++i) {
prevRow[i] = i;
}
// calculate current row distance from previous row
for (i=0; i<str1.length; ++i) {
nextCol = i + 1;
for (j=0; j<str2.length; ++j) {
curCol = nextCol;
// substution
nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
// insertion
tmp = curCol + 1;
if (nextCol > tmp) {
nextCol = tmp;
}
// deletion
tmp = prevRow[j + 1] + 1;
if (nextCol > tmp) {
nextCol = tmp;
}
// copy current col value into previous (in preparation for next iteration)
prevRow[j] = curCol;
}
// copy last col value into previous (in preparation for next iteration)
prevRow[j] = nextCol;
}
return nextCol;
}
};
var the_partial_business_name;
try {
the_partial_business_name = decodeURI(r.partial_business_name).toLowerCase();
} catch (ex) {
the_partial_business_name = r.partial_business_name.toLowerCase();
}
try {
the_first_name = decodeURI(r.first_name).toLowerCase();
} catch (ex) {
the_first_name = r.first_name.toLowerCase();
}
emit({business_name: r.business_name, partial_business_name: the_partial_business_name, first_name: the_first_name,
similarity: 1 - Levenshtein.get(the_partial_business_name, the_first_name) / the_partial_business_name.length});
}"
)
ORDER BY similarity DESC
Was used in How to perform trigram operations in Google BigQuery? and based on https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js by @thomaspark where Levenshtein's distance is used to measure similarity
this will do the trick,
select * from TableName where lower(business_name) contains lower(first_name)
use lower() just in case they have upper case letters. Hope it helps.
You can use LIKE operator:
SELECT * FROM table WHERE business_name NOT LIKE CONCAT(first_name, '%')
% stands for anything.