Question
I would like to calculate the Jaro-Winkler string distance in a database. If I bring the data into R (with collect) I can easily use the stringdist function from the stringdist package.
But my data is very large and I'd like to filter on Jaro-Winkler distances before pulling the data into R.
There is SQL code for Jaro-Winkler (https://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/ and a version for T-SQL), but I'm not sure how best to get that SQL code to work with dbplyr. I'm happy to try to map the stringdist function to the Jaro-Winkler SQL code, but I don't know where to start. Even something simpler, like executing the SQL code directly from R on the remote data, would be great.
I had hoped the SQL translation section of the dbplyr documentation might help, but I don't think it does.
Answer 1:
You can build your own SQL functions in R. They just have to produce a string that is a valid SQL query. I don't know the Jaro-Winkler distance, but I can provide an example for you to build from:
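library(dplyr)
library(dbplyr)

union_all = function(table_a, table_b, list_of_columns){
  # UNION ALL needs the same columns in the same order on both sides
  table_a = select(table_a, all_of(list_of_columns))
  table_b = select(table_b, all_of(list_of_columns))
  # extract database connection
  connection = table_a$src$con
  # render each table to SQL and join the two queries with UNION ALL
  sql_query = build_sql(con = connection,
                        sql_render(table_a),
                        "\nUNION ALL\n",
                        sql_render(table_b)
  )
  # return a lazy remote table defined by the combined query
  return(tbl(connection, sql(sql_query)))
}

unioned_table = union_all(table_1, table_2, c("who", "where", "when"))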
Two key commands here are:
sql_render, which takes a dbplyr table and returns the SQL code that produces it
build_sql, which assembles a query from strings.
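To make the roles of sql_render and build_sql concrete, here is a minimal sketch. It assumes the RSQLite package is installed and uses an in-memory SQLite database; the people table and its contents are made up for illustration. Behaviour may differ slightly between dbplyr versions.

```r
library(dplyr)
library(dbplyr)
library(DBI)

# In-memory SQLite database, used only for illustration (an assumption)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "people",
                  data.frame(name = c("martha", "marhta", "dixon"),
                             stringsAsFactors = FALSE))

people <- tbl(con, "people")

# sql_render() returns the SQL that would produce this lazy table
inner_sql <- sql_render(filter(people, name != "dixon"))

# build_sql() pastes SQL fragments together into a single query;
# literal strings are inserted as-is and sql objects are not re-escaped
query <- build_sql("SELECT COUNT(*) AS n FROM (", inner_sql, ") AS sub",
                   con = con)

# Run the assembled query on the database and fetch the one-row result
result <- collect(tbl(con, sql(query)))

DBI::dbDisconnect(con)
```

The same pattern applies to any hand-written SQL: render the dbplyr part, then wrap or combine it with your own SQL text.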
You have choices for your execution command:
tbl(connection, sql(sql_query)) will return the resulting table as a lazy remote table
dbExecute(connection, as.character(sql_query)) will execute a query without returning the result (useful for dropping tables, creating indexes, etc.)
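The two execution options can be sketched as follows. This again assumes RSQLite and an in-memory database; the pairs table, its columns, and the index name are hypothetical. LENGTH() stands in for whatever database-side function you want to call. A Jaro-Winkler function would go in the same place, provided it has already been defined in your database.

```r
library(dplyr)
library(dbplyr)
library(DBI)

# In-memory SQLite database, used only for illustration (an assumption)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "pairs",
                  data.frame(a = c("martha", "dixon"),
                             b = c("marhta", "dicksonx"),
                             stringsAsFactors = FALSE))

# Option 1: wrap a hand-written SELECT in tbl(). The result stays lazy,
# so further dplyr verbs are still translated and run in the database.
query <- sql("SELECT a, b, LENGTH(a) AS len_a FROM pairs")
lazy_result <- tbl(con, query)
short_names <- lazy_result %>%
  filter(len_a < 6) %>%
  collect()

# Option 2: dbExecute() runs a statement for its side effect only
# and returns the number of rows affected, not a result set.
rows_affected <- DBI::dbExecute(con, "CREATE INDEX idx_pairs_a ON pairs (a)")

DBI::dbDisconnect(con)
```

Filtering on the lazy table before collect() keeps the heavy work in the database, which is the point when the data is too large to pull into R first.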
Source: https://stackoverflow.com/questions/50661862/how-to-use-custom-sql-function-in-dbplyr