Parallelizing R scripts from the command line

A tutorial on how to parallelize R scripts from the command line

Introduction

If you have an existing R script that wasn’t written with parallelization in mind, here is a short tutorial on how to parallelize it easily with only minimal changes to your code.

Idea

Take an existing R script and let it read command line arguments via commandArgs(), for example to set the tiling. Then set up a bash script that starts multiple R sessions in the background, each of which computes one part of the result.

Tutorial

Add evaluation of command line arguments to your R script

First you add the commandArgs() function to your script. It lets you pass arguments to the script when you call it from the command line. This is an example of how to use four command line arguments that determine the tiling of your data set.

# Read the tiling limits from the command line
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 4) {
  stop("Supply the first col, last col, first row, last row for tiling as command line arguments!", call. = FALSE)
}

# Command line arguments arrive as character strings, so convert them to integers
first_col <- as.integer(args[1])
last_col <- as.integer(args[2])
first_row <- as.integer(args[3])
last_row <- as.integer(args[4])

If you want to try the script without the command line, skip the block above and assign the values directly. This is only for testing.

first_col <- 1
last_col <- 5
first_row <- 1
last_row <- 5

This is the part where you load your data: a matrix, a raster or whatever you work with. This is just an example.

# some data you load. a large matrix or a raster image or whatever.
library(raster, quietly = TRUE)
r <- raster(ncol=10, nrow=10)
set.seed(0)
values(r) <- runif(ncell(r))
plot(r)

Here you subset your data according to your tiling strategy. In this example, the block of rows 1 to 5 and columns 1 to 5 (25 cells) of the raster is used for the calculations. If you can load your data lazily, even better: then the whole dataset doesn’t have to be read into memory. The stars package supports this for multidimensional rasters, for example. With fread() from the data.table package you can load only part of a tabular dataset.

# subset the raster to your tiling. 
r_sub <- crop(r, extent(r, first_row, last_row, first_col, last_col)) # get row 1 to 5 and column 1 to 5
rm(r)

# do some calculations on the raster
r_sub <- r_sub * 2
plot(r_sub)

Then save your results.

# save your result
save(r_sub, file = paste0("r_sub_", last_col, "_", last_row, ".Rdata"))

Then combine your results in a separate script.
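Before combining, it can be worth confirming that every tile actually wrote its result file. The following is a minimal shell sketch, assuming the file naming from the save() call above (r_sub_<last_col>_<last_row>.Rdata). The function name missing_tiles and its arguments are made up for this illustration.

```shell
#!/bin/bash
# missing_tiles DIR SUFFIX...   (hypothetical helper, not part of any package)
# Each SUFFIX is "<last_col>_<last_row>", matching the file names written by
# the save() call in the R script. Prints every expected file that is absent.
missing_tiles() {
  local dir=$1
  shift
  local suffix
  for suffix in "$@"; do
    if [ ! -f "$dir/r_sub_${suffix}.Rdata" ]; then
      echo "r_sub_${suffix}.Rdata"
    fi
  done
}

# Example call for a 10 x 10 raster split into four 5 x 5 tiles:
# missing_tiles . 5_5 5_10 10_5 10_10
```

If the call prints nothing, all tiles are present and the combine script can be started.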

Note: Instead of passing indices to the command line argument for tiling you could also pass a bounding box, band numbers or an iterator for a loop.

Run the R script from the command line

You will run the R script from the command line like this. It will start a dedicated R session.

nohup Rscript --vanilla --verbose my_script.R 1 5 1 5 &> nohup_1.out &
  • nohup stands for "no hang up": the job keeps running even if you disconnect from the VM
  • 1 5 1 5 are the tiling limits you choose. They are read by the R script.
  • &> nohup_1.out writes the output and the error messages (the log) to the specified file
  • the trailing & pushes the job into the background so your console isn’t blocked
  • to see how many jobs you have running, type jobs into your console
  • to monitor the jobs, type htop into your console
  • to kill a job: kill <pid> (you can get the pid from htop)
  • to check the log: cat nohup_1.out
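Instead of looking up the pid in htop, the shell can also hand it to you directly via $!. The following is a small sketch of that idea, not a fixed recipe: the helper name run_logged and the variable LAST_PID are invented for this example, and the commented call assumes that my_script.R is in the working directory.

```shell
#!/bin/bash
# run_logged LOGFILE COMMAND [ARGS...]   (hypothetical helper)
# Start COMMAND in the background, protected against hang-ups, with stdout
# and stderr written to LOGFILE. The pid of the job is kept in LAST_PID.
run_logged() {
  local logfile=$1
  shift
  nohup "$@" > "$logfile" 2>&1 < /dev/null &
  LAST_PID=$!
}

# Example call:
# run_logged nohup_1.out Rscript --vanilla my_script.R 1 5 1 5
# echo "started job with pid $LAST_PID"
# wait "$LAST_PID"; echo "exit status: $?"
```

A non-zero exit status means the R session stopped with an error, so the log is the next place to look. Note that wait only works in the same shell session that started the job.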

Run multiple R scripts from the command line in parallel

To parallelize this, you can run the command mentioned above multiple times and change only the tiling arguments. Each call starts its own R session, and the operating system distributes these sessions across the available cores. You can also create a shell script for doing this.

#!/bin/bash
echo "Starting my_script.R jobs in the background"
nohup Rscript --vanilla --verbose my_script.R 1 5 1 5 &> nohup_1.out &
nohup Rscript --vanilla --verbose my_script.R 1 5 6 10 &> nohup_2.out &
nohup Rscript --vanilla --verbose my_script.R 6 10 1 5 &> nohup_3.out &
nohup Rscript --vanilla --verbose my_script.R 6 10 6 10 &> nohup_4.out &
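With more than a handful of tiles, writing every line by hand gets tedious. The launch lines can also be generated in a loop. This is a rough sketch under a few assumptions: square tiles of a fixed size, the same argument order as above (first col, last col, first row, last row), and the function names tile_bounds and launch_tiles, which are made up for this example.

```shell
#!/bin/bash
# tile_bounds NCOL NROW SIZE   (hypothetical helper)
# Print one line "first_col last_col first_row last_row" per tile.
tile_bounds() {
  local ncol=$1 nrow=$2 size=$3
  local c=1 r c_end r_end
  while [ "$c" -le "$ncol" ]; do
    c_end=$((c + size - 1))
    if [ "$c_end" -gt "$ncol" ]; then c_end=$ncol; fi
    r=1
    while [ "$r" -le "$nrow" ]; do
      r_end=$((r + size - 1))
      if [ "$r_end" -gt "$nrow" ]; then r_end=$nrow; fi
      echo "$c $c_end $r $r_end"
      r=$((r + size))
    done
    c=$((c + size))
  done
}

# launch_tiles SCRIPT NCOL NROW SIZE   (hypothetical helper)
# Start one background R session per tile, each with its own log file.
launch_tiles() {
  local script=$1 ncol=$2 nrow=$3 size=$4
  tile_bounds "$ncol" "$nrow" "$size" | {
    i=0
    while read -r fc lc fr lr; do
      i=$((i + 1))
      nohup Rscript --vanilla "$script" "$fc" "$lc" "$fr" "$lr" \
        > "nohup_${i}.out" 2>&1 < /dev/null &
    done
    wait  # return only after every background R session has finished
  }
}

# Example call, equivalent to the four lines above:
# launch_tiles my_script.R 10 10 5
```

The wait at the end is optional. Remove it if the launcher should return immediately, as the hand-written script does. Keep in mind that this starts all tiles at once, so the number of tiles should not exceed what your cores and memory can handle.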

Acknowledgement

Thanks to Alice Crespi for providing a use case where this strategy was necessary.

Peter Zellner
Institute for Earth Observation

Trying to provide, optimize and automate remote sensing and GIS workflows and tools.
