Question
Does anyone know how to make all the random numbers different in the following code, e.g. with the doRNG package? I don't care about reproducibility.
Edit: Duplicates that arise by pure chance are acceptable.
rm(list = ls())
set.seed(666)
cat("\014")
library(plyr)
library(dplyr)
library(doRNG)
# ====== Data Preparation ======
dt = data.frame(id = 1:10,
part = rep("dt",10),
HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),
random = NA)
# ====== Set Parallel Computing ======
library(foreach)
library(doParallel)
cl = makeCluster(3, outfile = "")
registerDoParallel(cl)
# ====== SIMULATION ======
nsim = 1000 # number of simulations
iterChunk = 100 # number of simulations per chunk (nsim is split into nsim/iterChunk chunks)
out = data.frame() # prepare output DF
for(iter in 1:ceiling(nsim/iterChunk)){
strt = Sys.time()
out_iter =
foreach(i = 1:iterChunk, .combine = rbind, .multicombine = TRUE, .maxcombine = 100000, .inorder = FALSE, .verbose = FALSE,
.packages = c("plyr", "dplyr")) %dopar% {
# simulation number
id_sim = iterChunk * (iter - 1) + i
## Generate random numbers
tmp_sim = is.na(dt$HG) # no results yet
dt$random[tmp_sim] = runif(sum(tmp_sim))
dt$HG[tmp_sim] = 3
# Save Results
dt$id_sim = id_sim
dt$iter = iter
dt$i = i
print(Sys.time())
return(dt)
}#i;sim_forcycle
out = rbind.data.frame(out,subset(out_iter, !is.na(random)))
fnsh = Sys.time()
cat(" [",iter,"] ",fnsh - strt, sep = "")
}#iter
# ====== Stop Parallel Computing ======
stopCluster(cl)
# ====== Distinct Random Numbers ======
length(unique(out$random)) # expectation: 6000
I have been struggling with this for 2 days. I asked this question earlier and only got a general response about random numbers.
Here I would like to ask for a solution (if anybody knows one): how do I set the options of the doRNG package (or a similar package) so that all the random numbers are different across all the loops?
I have tried tons of doRNG settings and I still can't get it to work. I tried R versions 3.5.3 and 3.6.3 on two different computers.
UPDATE, following discussion with @Limey:
The purpose of the code is to simulate football matches. As the simulation is large, I use iterChunk to split the simulation into manageable parts, and after each iter I send the data to a PostgreSQL database so the simulation doesn't overload RAM. Some matches already have real-world results and have HG (home goals) filled in. I want to simulate the rest.
When iterChunk is set to 1, everything is fine. Increasing iterChunk leads to the same numbers being generated within an iter. For example, take nsim set to 100 and iterChunk set to 10 (all matches simulated 100 times, 10 times in each of 10 loops). I expect 600 random numbers (each match simulated independently across all the loops). However, I only get 180, following the logic 3 cores * 6 matches * 10 chunks. Using 2 workers I get 120 distinct random numbers (2 * 6 * 10).
Furthermore, if I exclude the line dt$HG[tmp_sim] = 3, I get all different random numbers with any setting.
To understand the problem, I suggest:
- Run the code as is (possibly setting nsim to 100 and iterChunk to 10). You will get 180 different random numbers. With lower values of nsim and iterChunk, things may work as expected.
- Comment out dt$HG[tmp_sim] = 3. You will get 6000 different random numbers (600 if you changed nsim and iterChunk).
The line commented out in the second step assigns the goals scored by the home team. It looks like some kind of bug that I can't get past. Even hearing that someone else gets the same result and doesn't know why would be helpful: it would lift the weight of my own stupidity off me.
Thank you, I highly appreciate any effort.
Answer 1:
I realised what the problem with OP's code was whilst I was in the shower. It's simple, and obvious in retrospect: the tasks are all working on the same object, the dt data frame, so changes made by one task are still in place when a later task runs. Once the first task on a worker has filled in HG, there are no NA values left, so later tasks on that worker draw no new random numbers and simply return the earlier ones. That is consistent with the 180 = 3 workers * 6 matches * 10 chunks reported in the question. The solution is equally simple: work on a copy of the dt data frame.
To minimise the changes, I renamed dt to baseDT:
# ====== Data Preparation ======
baseDT = data.frame(id = 1:10,
part = rep("dt",10),
HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),
random = NA)
and then took a copy of it at the top of the foreach loop:
out_iter = foreach(i = 1:iterChunk,
.combine = rbind, .multicombine = TRUE, .maxcombine = 100000,
.inorder = FALSE, .verbose = FALSE,
.packages = c("plyr", "dplyr")) %dopar% {
dt <- baseDT
This gives
> length(unique(out$random)) # expectation: 6000
[1] 6000
as expected.
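The effect can be reproduced without a cluster. The sketch below is only an illustration, under the assumption that each worker reuses one environment for all the tasks it runs (which would match the 3 * 6 * 10 = 180 count in the question). The names task_in_place, task_on_copy, run_tasks and worker_env are hypothetical, and eval() in a shared environment stands in for a worker.

```r
set.seed(1)  # only so the illustration is repeatable

baseDT <- data.frame(id = 1:10,
                     HG = c(1, 3, 6, NA, NA, 2, NA, NA, NA, NA),
                     random = NA_real_)

# Body of the original loop: it modifies `dt` in place.
task_in_place <- quote({
  todo <- is.na(dt$HG)
  dt$random[todo] <- runif(sum(todo))
  dt$HG[todo] <- 3
  dt$random[!is.na(dt$random)]
})

# Same body, but starting from a fresh copy of the untouched baseDT.
task_on_copy <- quote({
  dt <- baseDT
  todo <- is.na(dt$HG)
  dt$random[todo] <- runif(sum(todo))
  dt$HG[todo] <- 3
  dt$random[!is.na(dt$random)]
})

# Hypothetical stand-in for one worker: a single environment reused by every task.
run_tasks <- function(task, n_tasks) {
  worker_env <- new.env()
  assign("dt", baseDT, envir = worker_env)
  assign("baseDT", baseDT, envir = worker_env)
  unlist(lapply(seq_len(n_tasks), function(i) eval(task, envir = worker_env)))
}

n_in_place <- length(unique(run_tasks(task_in_place, 5)))
n_on_copy  <- length(unique(run_tasks(task_on_copy, 5)))
c(in_place = n_in_place, on_copy = n_on_copy)
```

With the in-place version only the first task finds any NA values in HG, so the five tasks share one set of six numbers; with the copy, each task draws its own six.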
Answer 2:
Modifying the "Hello World" example in the "getting started with doParallel" vignette to generate random numbers, I came up with:
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
myFunc <- function(n) {runif(n)}
foreach(i=1:3) %dopar% myFunc(10)
[[1]]
[1] 0.18492375 0.13388278 0.65455450 0.93093066 0.41157625 0.89479764 0.14736529 0.47935995 0.03062963 0.16110714
[[2]]
[1] 0.89245145 0.20980791 0.83828019 0.04411547 0.38184303 0.48110619 0.51509058 0.93732055 0.40159834 0.81414140
[[3]]
[1] 0.74393129 0.66999730 0.44411989 0.85040773 0.80224527 0.72483644 0.64566262 0.22546420 0.14526819 0.05931329
This suggests that getting different random numbers across threads is straightforward. Indeed, the examples on pages 2 and 3 of the doRNG reference manual say the same thing.
In fact, if I understand you correctly, the purpose of doRNG is to do precisely the opposite of what you want: to make random processes reproducible across threads.
Of course, this doesn't guarantee that all numbers are different across all threads, but it makes duplication very unlikely. A guarantee of no duplicates would require some degree of determinism in the process: a completely random process might produce duplicates by chance.
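To put a rough number on "very unlikely": R's documentation for its random number generators notes that most of the uniform generators return 32-bit values, so runif() can take at most 2^32 distinct values. Assuming that figure, and treating the draws as independent, the usual birthday approximation gives the chance of at least one duplicate among the 6000 draws in the question. This is a back-of-the-envelope sketch, not an exact result.

```r
n <- 6000   # number of random draws in the question
m <- 2^32   # assumed number of distinct values runif() can return

# Birthday approximation: P(at least one duplicate) ~ 1 - exp(-n(n-1) / (2m))
p_duplicate <- 1 - exp(-n * (n - 1) / (2 * m))
p_duplicate
```

Under these assumptions the probability is well below one percent, so a large shortfall in distinct values points to the program logic rather than to chance.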
Update, following on from our conversation in the comments...
We've established that the problem is in your program logic, not the parallelisation per se. So we need to refocus the question: what are you trying to do? I'm afraid it's not at all clear to me, so we need to simplify.
I set nsim to 5 and iterChunk to 1. I get 5 data frames which look like
id part HG random id_sim iter i
1 1 dt 1 NA 1 1 1
2 2 dt 3 NA 1 1 1
3 3 dt 6 NA 1 1 1
4 4 dt 3 0.6919744 1 1 1
5 5 dt 3 0.5413398 1 1 1
6 6 dt 2 NA 1 1 1
7 7 dt 3 0.3983175 1 1 1
8 8 dt 3 0.3342174 1 1 1
9 9 dt 3 0.6126020 1 1 1
10 10 dt 3 0.4185468 1 1 1
Within each data frame, id_sim and iter are constant; both run from 1 in the first data frame to 5 in the fifth. i is 1 for all rows in all data frames. The values in random do appear to be random, and differ between data frames. But the NAs are in the same positions in every data frame: the 1st, 2nd, 3rd and 6th rows. The values of HG are as shown above for all five data frames.
Is that what you would expect? If not, what do you expect? Given that we know the problem is not the parallelisation, you need to give us more information.
Update 2
Do you know Arduan? They posted a related question over the weekend...
I'm not going to tell you what's wrong with your code. I'll show you how I would approach your problem. I hope you'll agree it's more readable, if nothing else.
So, we're simulating some football matches. I'll assume it's a league format and use the English Premier League as an example. Start by generating the fixture list for a single season.
library(tidyverse)
teams <- c("Arsenal", "Aston Villa", "Bournemouth", "Brighton & Hove Albion",
"Burnley", "Chelsea", "Crystal Palace", "Everton", "Leicester City",
"Liverpool", "Manchester City", "Manchester United", "Newcastle United",
"Norwich City", "Sheffield United", "Southampton", "Tottenham Hotspur",
"Watford", "West Ham United", "Wolverhampton Wanderers")
fixtures <- tibble(HomeTeam=teams, AwayTeam=teams) %>%
complete(HomeTeam, AwayTeam) %>%
filter(HomeTeam != AwayTeam) # A team can't play itself
fixtures %>% head(5)
# A tibble: 5 x 2
HomeTeam AwayTeam
<chr> <chr>
1 Arsenal Aston Villa
2 Arsenal Bournemouth
3 Arsenal Brighton & Hove Albion
4 Arsenal Burnley
5 Arsenal Chelsea
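As a cross-check on the fixture logic, here is a base R sketch with a hypothetical four-team league (the team names are placeholders, and demoTeams and demoFixtures are illustrative names). Every ordered pair of distinct teams is one fixture, so n teams give n * (n - 1) matches, which is 380 for the 20 teams above.

```r
demoTeams <- c("A", "B", "C", "D")  # placeholder team names

# All ordered (home, away) pairs, then drop the pairs where a team meets itself.
demoFixtures <- expand.grid(HomeTeam = demoTeams, AwayTeam = demoTeams,
                            stringsAsFactors = FALSE)
demoFixtures <- demoFixtures[demoFixtures$HomeTeam != demoFixtures$AwayTeam, ]

nrow(demoFixtures) == length(demoTeams) * (length(demoTeams) - 1)
```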
Suppose we know some results. I'll use yesterday's matches as an illustration.
knownResults <- tribble(~HomeTeam, ~AwayTeam, ~HomeGoals, ~AwayGoals,
"Burnley", "Sheffield United", 1, 1,
"Newcastle United", "West Ham United", 2, 2,
"Liverpool", "Aston Villa", 2, 0,
"Southampton", "Manchester City", 1, 0)
resultsSoFar <- fixtures %>%
left_join(knownResults, by=c("HomeTeam", "AwayTeam"))
resultsSoFar %>% filter(!is.na(HomeGoals))
# A tibble: 4 x 4
HomeTeam AwayTeam HomeGoals AwayGoals
<chr> <chr> <dbl> <dbl>
1 Burnley Sheffield United 1 1
2 Liverpool Aston Villa 2 0
3 Newcastle United West Ham United 2 2
4 Southampton Manchester City 1 0
Now some utility functions. You could certainly combine them, but I think it's clearer to keep them separate so you can see exactly what each one is doing.
First, a function to simulate the results of all matches whose results are unknown. The details of how you simulate the scores are entirely arbitrary. I draw home goals from an exponential distribution with mean 1.5 and away goals from one with mean 1.2, rounding each draw down to a whole number of goals. Later on, I'm going to use this to simulate many seasons in one go, so I'll add a variable (Iteration) to index the season.
simulateResults <- function(i=NA, data) {
n <- nrow(data)
data %>%
add_column(Iteration=i, .before=1) %>%
mutate(
# Give the home team a slight advantage
HomeGoals=ifelse(is.na(HomeGoals), floor(rexp(n, rate=1/1.5)), HomeGoals),
AwayGoals=ifelse(is.na(AwayGoals), floor(rexp(n, rate=1/1.2)), AwayGoals)
)
}
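One side effect of rounding down is worth knowing about: floor() of an exponential draw follows a geometric distribution whose mean is 1 / (exp(rate) - 1), which is lower than the exponential mean 1 / rate. The sketch below compares that formula with a simulated average for the two rates used above; the variable names are illustrative and not part of the answer's code.

```r
set.seed(42)  # only to make the check repeatable

rates <- c(home = 1 / 1.5, away = 1 / 1.2)

# Mean of floor(X) when X is exponential with the given rate.
theoretical <- 1 / (exp(rates) - 1)

# Empirical means from a large sample.
simulated <- sapply(rates, function(r) mean(floor(rexp(1e5, rate = r))))

round(rbind(theoretical, simulated), 3)
```

The expected scores come out at roughly 1.06 and 0.77 goals per game. Since the scoring model is arbitrary anyway, this does not affect the structure of the simulation, only how the averages should be read.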
Use it, and check that we haven't overwritten known results:
simulateResults(1, resultsSoFar) %>% filter(HomeTeam=="Burnley", AwayTeam=="Sheffield United")
# A tibble: 1 x 5
Iteration HomeTeam AwayTeam HomeGoals AwayGoals
<dbl> <chr> <chr> <dbl> <dbl>
1 1 Burnley Sheffield United 1 1
I'm going to parallelise the overall simulation, so now let's have a function to simulate a chunk of simulations. Again, create an index column to identify the chunk.
simulateChunk <- function(chunkID=NA, n) {
bind_rows(lapply(1:n, simulateResults, data=resultsSoFar)) %>%
add_column(Chunk=chunkID, .before=1)
}
simulateChunk(chunkID=1, n=3)
# A tibble: 1,140 x 6
Chunk Iteration HomeTeam AwayTeam HomeGoals AwayGoals
<dbl> <int> <chr> <chr> <dbl> <dbl>
1 1 1 Arsenal Aston Villa 2 0
2 1 1 Arsenal Bournemouth 0 0
3 1 1 Arsenal Brighton & Hove Albion 2 0
4 1 1 Arsenal Burnley 2 0
5 1 1 Arsenal Chelsea 1 0
6 1 1 Arsenal Crystal Palace 0 0
7 1 1 Arsenal Everton 2 3
8 1 1 Arsenal Leicester City 2 0
9 1 1 Arsenal Liverpool 0 1
10 1 1 Arsenal Manchester City 4 0
OK. Now I'm ready to do the main simulation work. I'll run 10 chunks of 100 simulations each, to give 1000 simulated seasons in total, the same as you had.
library(doParallel)
cl <- makeCluster(3)
registerDoParallel(cl)
chunkSize <- 100
nChunks <- 10
startedAt <- Sys.time()
x <- bind_rows(foreach(i=1:nChunks, .packages=c("tidyverse")) %dopar% simulateChunk(i, n=chunkSize))
finishedAt <- Sys.time()
print(finishedAt - startedAt)
Time difference of 6.772928 secs
stopCluster(cl)
> x
# A tibble: 380,000 x 6
Chunk Iteration HomeTeam AwayTeam HomeGoals AwayGoals
<int> <int> <chr> <chr> <dbl> <dbl>
1 1 1 Arsenal Aston Villa 2 0
2 1 1 Arsenal Bournemouth 3 1
3 1 1 Arsenal Brighton & Hove Albion 0 1
4 1 1 Arsenal Burnley 3 0
5 1 1 Arsenal Chelsea 1 0
6 1 1 Arsenal Crystal Palace 0 0
7 1 1 Arsenal Everton 1 2
8 1 1 Arsenal Leicester City 0 0
9 1 1 Arsenal Liverpool 0 0
10 1 1 Arsenal Manchester City 0 0
Let's check I've got sensible results. As a basic check, I'll look at the results of Arsenal vs Aston Villa:
x %>%
filter(HomeTeam == "Arsenal", AwayTeam=="Aston Villa") %>%
group_by(HomeGoals, AwayGoals) %>%
summarise(N=n(), .groups="drop") %>%
pivot_wider(
values_from="N", names_prefix="AwayGoals",
names_sep="", names_from=AwayGoals
)
# A tibble: 8 x 10
HomeGoals AwayGoals0 AwayGoals1 AwayGoals2 AwayGoals3 AwayGoals4 AwayGoals5 AwayGoals6 AwayGoals8 AwayGoals7
<dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 299 129 57 19 12 7 NA NA NA
2 1 135 63 25 6 4 4 1 2 NA
3 2 75 21 12 9 4 1 NA NA 1
4 3 30 13 10 1 NA NA NA NA NA
5 4 21 7 1 1 NA NA NA NA NA
6 5 11 2 1 NA 2 NA NA NA NA
7 6 4 2 2 NA NA NA NA NA NA
8 7 4 1 1 NA NA NA NA NA NA
That looks reasonable. Now confirm that the matches with known results don't vary. For example:
x %>%
filter(HomeTeam == "Liverpool", AwayTeam=="Aston Villa") %>%
group_by(HomeGoals, AwayGoals) %>%
summarise(N=n(), .groups="drop") %>%
pivot_wider(values_from="N", names_prefix="AwayGoals", names_sep="", names_from=AwayGoals)
HomeGoals AwayGoals0
<dbl> <int>
1 2 1000
All good.
So, that's 23 statements to generate the fixtures, take account of known results, simulate the remainder of the matches and do some basic sanity checking. I could easily get that down to under 20 statements if I had to. That's about a third fewer than you were using just to try to simulate the unknown results. [The actual simulation takes fewer than 10 statements.] I think my approach is easier to understand: by using tidy verbs, the code is almost self-documenting.
Source: https://stackoverflow.com/questions/62758637/multithread-computation-with-r-how-to-get-all-different-random-numbers