Question
I posted a question a few days ago. The solution I was given seems to work in RStudio on Windows (although it takes forever and sometimes returns no results), but when I run the same code with 30 CPUs on an HPC I keep getting a "long vectors not supported" error. Any ideas why?
Here is a sample of the data:
> head(forfuzzy)
# A tibble: 6 x 3
  grantee_name                 grantee_city grantee_state
  <chr>                        <chr>        <chr>
1 (ICS)2 MAINE CHAPTER         CLEARWATER   FL
2 (SUFFOLK COUNTY) VANDERBILT~ CENTERPORT   NY
3 1 VOICE TREKKING A FUND OF ~ WESTMINSTER  MD
4 10 CAN                       NEWBERRY     FL
5 10 THOUSAND WINDOWS          LIVERMORE    CA
6 100 BLACK MEN IN CHICAGO INC CHICAGO      IL
... 7 - 97000 rows to go
> head(filings)
# A tibble: 6 x 2
  grantee_name                       ein
  <chr>                            <dbl>
1 ICS-2 MAINE CHAPTER             123456
2 SUFFOLK COUNTY VANDERBILT       654321
3 VOICE TREKKING A FUND OF VOICES 789456
4 10 CAN                          654987
5 10 THOUSAND MUSKETEERS INC      789123
6 100 BLACK MEN IN HOUSTON INC    987321
rows 7-1200000 omitted for brevity
And here is the code, followed by the error message I get after 20 or so minutes of runtime:
n=10
lst=split(forfuzzy, cumsum(1:nrow(forfuzzy)-1)%%n==0)
knitr::opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, cache.lazy = FALSE) # This was added and didn't change anything
df=purrr::map_dfr(lst, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.25, max_dist=0.1, distance_col="distance"))
Error in do_dist(a = b, b = a, method = method, weight = weight, q = q, :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:535
Calls: <Anonymous> ... list2 -> lapply -> FUN -> mf -> <Anonymous> -> do_dist
Execution halted
Any idea how I can get this to work? As mentioned, the Windows run sometimes crashes as well, but I think that is for a different reason: not enough free space on my C drive.
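One thing that may be worth checking is how many groups the split() call actually produces. This is a minimal sketch, not a confirmed diagnosis. It assumes the expression shown above is exactly what was run, and it uses a small hypothetical data frame (toy) in place of forfuzzy. It also assumes that the join compares every row of a chunk with every row of filings, so that the number of comparisons is roughly chunk rows times filings rows, and that the error appears once that product exceeds the largest ordinary vector length R allows (.Machine$integer.max).

```r
# Hypothetical stand-in for forfuzzy: 25 rows instead of ~97,000.
toy <- data.frame(grantee_name = paste("ORG", 1:25), stringsAsFactors = FALSE)
n <- 10

# Expression as it appears in the question: cumsum() runs first, then %% and ==,
# so the grouping vector is logical and split() yields at most two groups.
as_written <- split(toy, cumsum(1:nrow(toy) - 1) %% n == 0)

# Comparison moved inside cumsum(): the counter increases every n rows,
# giving consecutive chunks of at most n rows.
chunked <- split(toy, cumsum((1:nrow(toy) - 1) %% n == 0))

n_groups_as_written <- length(as_written)
n_groups_chunked    <- length(chunked)
largest_chunk       <- max(vapply(chunked, nrow, integer(1)))

# Rough size check, using the filings row count quoted in the question.
# Assumption: the join compares every chunk row with every filings row.
n_filings      <- 1200000
max_chunk_rows <- floor(.Machine$integer.max / n_filings)
```

On the toy data the expression as written gives 2 groups, while the parenthesised version gives 3 chunks of at most 10 rows. Under the assumptions above, with 1.2 million rows in filings a chunk would need to stay below about 1,789 rows, so two groups of tens of thousands of rows each would be far over that limit.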
Source: https://stackoverflow.com/questions/64549055/long-vectors-stringdist-package-r