Parallelizing R scripts from the command line

A tutorial on how to parallelize R scripts from the command line


If you have an existing R script that wasn’t written with parallelization in mind - here is a short tutorial on how you can parallelize it easily without changing your code.


The idea: take an existing R script and add the evaluation of command line arguments via commandArgs(), for example to set the tiling of your data. Then set up a bash script that calls multiple R sessions in the background, each computing one tile of the results in parallel.


Add evaluation of command line arguments to your R script

First you will add the commandArgs() function to your script. It allows you to pass command line arguments to the script when you call it from the command line. This is an example of how to use four command line arguments that determine the tiling of your data set.

# Read the command line arguments to determine the tiling
args = commandArgs(trailingOnly = TRUE)
if (length(args) != 4) {
  stop("Supply the first col, last col, first row, last row for tiling as command line arguments!", call. = FALSE)
}

# The arguments are passed to variables here for further use.
# commandArgs() returns strings, so convert them to integers.
first_col <- as.integer(args[1])
last_col <- as.integer(args[2])
first_row <- as.integer(args[3])
last_row <- as.integer(args[4])

If you want to try the script without the command line, assign the values here instead. This is only for testing; remove or comment out these lines before running the script via Rscript.

first_col <- 1
last_col <- 5
first_row <- 1
last_row <- 5

This is the part where you load your data: a matrix, a raster, or any other large object. This is just a small example.

# some data you load, e.g. a large matrix or a raster image
library(raster, quietly = TRUE)
r <- raster(ncol=10, nrow=10)
values(r) <- runif(ncell(r))

Here you subset your data according to your tiling strategy. In this example the first 25 values of the raster will be used for the calculations. If you have the possibility to lazily load your data - even better: then the whole dataset never has to be loaded completely. The stars package supports this for multidimensional rasters, for example, and from the data.table package you could use fread() to load a dataset only partially.
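As a sketch of the partial-loading idea with fread(): the file name data.csv is an assumption, and the skip/nrows arithmetic assumes one header line followed by one data row per raster row.

```r
# Sketch only: read just the tile's rows from a hypothetical data.csv
# instead of loading the whole file. fread()'s skip and nrows
# arguments do the partial read; select could restrict columns too.
library(data.table)
dt <- fread("data.csv",
            skip = first_row,                  # header line + preceding rows
            nrows = last_row - first_row + 1,  # only the tile's rows
            header = FALSE)
```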

# subset the raster to your tiling
r_sub <- crop(r, extent(r, first_row, last_row, first_col, last_col)) # e.g. rows 1 to 5 and columns 1 to 5

# do some calculations on the raster
r_sub <- r_sub * 2

Then save your results. You will combine them later in a separate script.

# save your result
save(r_sub, file = paste0("r_sub_", last_col, "_", last_row, ".Rdata"))
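As an aside: saveRDS() / readRDS() can be more convenient than save() / load() here, because readRDS() returns the object directly instead of restoring it under its saved name. A sketch with the same naming scheme:

```r
# Alternative: saveRDS() stores a single object that readRDS()
# later returns directly, which simplifies the combine script.
saveRDS(r_sub, file = paste0("r_sub_", last_col, "_", last_row, ".rds"))
# in the combine script: r_sub <- readRDS("r_sub_5_5.rds")
```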

Finally, combine your results in a separate script.
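A minimal sketch of such a combine script, assuming the four 5x5 tiles from this example were saved with the naming scheme above (the file names are therefore assumptions):

```r
library(raster, quietly = TRUE)

# load() restores each tile under its saved name (r_sub),
# so collect the tiles in a list right away
tiles <- list()
for (f in c("r_sub_5_5.Rdata", "r_sub_5_10.Rdata",
            "r_sub_10_5.Rdata", "r_sub_10_10.Rdata")) {
  load(f)                            # restores r_sub
  tiles[[length(tiles) + 1]] <- r_sub
}

# mosaic the tiles back into one raster
r_full <- do.call(merge, tiles)
```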

Note: Instead of passing indices as command line arguments for the tiling, you could also pass a bounding box, band numbers or an iterator for a loop.

Run the R script from the command line

You will run the R script from the command line like this. It will start a dedicated R session.

nohup Rscript --vanilla --verbose my_script.R 1 5 1 5 &> nohup_1.out &
  • nohup stands for "no hang up": the job keeps running even if you disconnect from the VM
  • 1 5 1 5 are the tiling limits you chose; they are read via commandArgs() in the R script
  • & pushes the job into the background so your console isn’t blocked
  • &> nohup_1.out redirects stdout and stderr to the specified log file
  • to see how many jobs you have running, type jobs into your console
  • to monitor the jobs, use htop
  • to kill a job: kill <pid> (you can get the pid from htop or jobs -l)
  • to check the log: cat nohup_1.out (or tail -f nohup_1.out to follow it live)

Run multiple R scripts from the command line in parallel

To parallelize the computation you can run the command above multiple times, changing only the tiling arguments. Multiple R sessions will process your jobs, and the operating system distributes them across the available cores. You can also create a shell script for this.

echo "Starting my_script.R scripts in the background"
nohup Rscript --vanilla --verbose my_script.R 1 5 1 5 &> nohup_1.out &
nohup Rscript --vanilla --verbose my_script.R 1 5 6 10  &> nohup_2.out &
nohup Rscript --vanilla --verbose my_script.R 6 10 1 5 &> nohup_3.out &
nohup Rscript --vanilla --verbose my_script.R 6 10 6 10 &> nohup_4.out &
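The four calls above can also be generated with a loop. This sketch assumes the 10x10 raster and 5x5 tiles from the example; the launch line is commented out so you can dry-run it first and check the generated tile bounds.

```shell
#!/bin/bash
# Sketch: generate 5x5 tiles of a 10x10 grid and build one
# Rscript call per tile. TILE and N are assumptions matching
# the example raster above.
TILE=5
N=10
i=1
for row in $(seq 1 $TILE $N); do
  for col in $(seq 1 $TILE $N); do
    echo "tile $i: my_script.R $col $((col + TILE - 1)) $row $((row + TILE - 1))"
    # Uncomment to actually launch the jobs:
    # nohup Rscript --vanilla --verbose my_script.R $col $((col + TILE - 1)) $row $((row + TILE - 1)) &> nohup_$i.out &
    i=$((i + 1))
  done
done
```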


Thanks to Alice Crespi for providing a use case where this strategy was necessary.

Peter Zellner
Institute for Earth Observation

Trying to provide, optimize and automate remote sensing and GIS workflows and tools.
