R is the tool of choice for many data scientists when it comes to statistical computing. So, when we had to develop a statistical computation service for our web application, it was only logical to do it in R. We decided that the computation service had to run independently of the application server. This would let us change the R layer easily and scale out by adding more instances of the service.
Once the R code had been written, we started to plug it into our application. That’s where we ran into some problems. This blog is about how we solved them. We hope that our experience will be useful for others trying to solve a similar problem.
First, some details about the setup:
Our first instinct was to use an R server for the computation service. We came across ‘DeployR’ by Revolution Analytics. It looked like a great option at first, but the lack of an open-source community and unclear configuration error messages prompted us not to go forward with ‘DeployR’.
When we were unable to find another robust solution in R, we decided to put a Flask server on top of the R module, calling the R functions through ‘rpy2’. The application server would post requests to the RESTful API of this computation layer. We started out by parsing JSON objects in Python to create an ‘rpy2’ object, which was then used to call the R functions. This turned out to be painfully slow, making it unsuitable for our web application: the conversion from JSON to an ‘rpy2’ object took far longer than the actual computation in R.
To solve this, we started parsing the JSON objects directly in R. The Flask API simply forwards the JSON received from the client to R. Converting JSON to an R data frame turned out to be much faster than converting JSON to an ‘rpy2’ data frame.
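As an illustration, here is a minimal sketch of what the R entry point might look like (the function name and structure are our assumption; the post does not show this code):

handle_request <- function(request_json) {
  # The Flask layer forwards the raw request body as a single string
  # through rpy2, so no Python-side data conversion is involved.
  object <- rjson::fromJSON(request_json)
  # "data" and "data_vector" arrive as nested JSON strings
  # (see the request format below) and are parsed in a second step.
  object
}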
Even after this, the service was not fast enough: parsing JSON in R was now the rate-determining step. The two popular R packages for handling JSON are ‘rjson’ and ‘jsonlite’. Usually one would pick one of them and move on, but we decided to experiment with the parsing functions of both. We discovered that the running times of the ‘toJSON()’ and ‘fromJSON()’ functions of the two libraries vary considerably with the type of object. We exploited this by combining functions from both libraries, which sped up our server response times. What follows is a summary of the comparisons we did.
JSON request object structure:
{
  "data": "data frame JSON object (as a string)",
  "data_vector": "data vector JSON object (as a string)"
}
The format of the data frame JSON object:
{
  "col1": {"row1": 1, "row2": 2, …, "row-n": 100},
  "col2": {"row1": 5, "row2": 10, …, "row-n": 500},
  "col3": {"row1": 10, "row2": 20, …, "row-n": 1000},
  …,
  "col-n": {"row1": 100, "row2": 200, …, "row-n": 10000}
}
The format of the data vector JSON object:
{"row1": "label1", "row2": "label1", "row3": "label2", …, "row-n": "label1"}
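To make the two-level structure concrete, here is a toy request built in R (illustrative values only). Note that ‘data’ and ‘data_vector’ are JSON strings nested inside the outer JSON object, which is why the benchmarks below parse them in two steps.

# Build a miniature request in the format described above.
data_json <- rjson::toJSON(list(
  col1 = list(row1 = 1, row2 = 2),
  col2 = list(row1 = 5, row2 = 10)
))
vector_json <- rjson::toJSON(list(row1 = "label1", row2 = "label2"))
# The inner JSON strings become string values of the outer object.
request_json <- rjson::toJSON(list(data = data_json, data_vector = vector_json))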
Benchmark Results
Loading a JSON file from the local disk
object <- rjson::fromJSON(file = "request.json")
Average running time: 120 ms
object <- jsonlite::fromJSON(txt = "request.json")
Average running time: 15 ms
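The figures reported here are averages over repeated runs. A sketch of how such a comparison can be reproduced, assuming the ‘microbenchmark’ package (the post does not say how the timings were taken):

library(microbenchmark)

# Time both parsers on the same file, 100 runs each.
microbenchmark(
  rjson = rjson::fromJSON(file = "request.json"),
  jsonlite = jsonlite::fromJSON(txt = "request.json"),
  times = 100
)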
Parsing a JSON object from a client request
object <- rjson::fromJSON(json_object)
Average running time: 5 ms
object <- jsonlite::fromJSON(json_object)
Average running time: 9 s
Extracting a data frame from a JSON object
data <- rjson::fromJSON(object$data)
Average running time: 26 s
data <- jsonlite::fromJSON(object$data)
Average running time: 1.4 s
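One caveat (our assumption; the post shows only the parsing call): with the column-oriented format above, ‘fromJSON()’ returns a named list of columns rather than a data frame, so a final conversion step is typically needed.

# Convert the parsed list of columns into an R data frame.
data <- as.data.frame(jsonlite::fromJSON(object$data))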
Extracting a data vector from a JSON object
data_vector <- as.factor(unlist(rjson::fromJSON(object$data_vector)))
Average running time: < 1 ms
data_vector <- as.factor(unlist(jsonlite::fromJSON(object$data_vector)))
Average running time: 1 ms
Creating a JSON object from an R data frame
return_object <- rjson::toJSON(data_frame)
return object: {"col1": [1, 5, 10, …, 100], "col2": [2, 10, 20, …, 200], …} (row names are lost)
Average running time: 62 ms
return_object <- jsonlite::toJSON(data_frame)
return object:
[{"col1": 1, "col2": 2, …, "col-n": 100, "_row": "row1"},
{"col1": 5, "col2": 10, …, "col-n": 500, "_row": "row2"},
{"col1": 10, "col2": 20, …, "col-n": 1000, "_row": "row3"},
…,
{"col1": 100, "col2": 200, …, "col-n": 10000, "_row": "row-n"}]
Average running time: 27 ms
In this case we go with ‘jsonlite::toJSON()’. It takes less than half the time of ‘rjson::toJSON()’ and also retains the data frame’s row names (via the ‘_row’ field) alongside the column names. This information is useful in recreating the data frame on the client side.
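A toy round trip shows what that looks like in practice (illustrative values):

# Row and column names survive jsonlite's row-wise encoding.
data_frame <- data.frame(col1 = c(1, 5), col2 = c(2, 10),
                         row.names = c("row1", "row2"))
jsonlite::toJSON(data_frame)
# [{"col1":1,"col2":2,"_row":"row1"},{"col1":5,"col2":10,"_row":"row2"}]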
Creating a JSON object from an R vector
return_object <- rjson::toJSON(data_vector)
return object: {"a": 5, "b": 10, …, "z": 130}
Average running time: 7 ms
return_object <- jsonlite::toJSON(data_vector)
return object: [5, 10, …, 130]
Average running time: 2 ms
We prefer ‘rjson::toJSON()’ here because it retains the key-value pairing inside the JSON object: ‘jsonlite::toJSON()’ does not convert a named R vector into a key-value paired JSON object. We accept the 5 ms overhead because we saved far more than that by parsing the incoming request with ‘rjson::fromJSON()’ in the step above.
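The difference is easy to see on a toy named vector:

v <- c(a = 5, b = 10, z = 130)
rjson::toJSON(v)     # {"a":5,"b":10,"z":130}  (names become keys)
jsonlite::toJSON(v)  # [5,10,130]  (names are dropped)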
Combining multiple JSON objects to return to the client
To send back a combination of data frames and data vectors to a client, we convert each to JSON individually, append the results to an R list, and then convert the whole list to JSON again.
# Serialize each result separately, then wrap both in one JSON response.
return_list <- list(
  data_vector = rjson::toJSON(data_vector),   # keeps names as keys
  data_frame = jsonlite::toJSON(data_frame)   # keeps row names via "_row"
)
return(jsonlite::toJSON(return_list))
‘jsonlite::toJSON()’ is used for the final conversion because it is faster than ‘rjson::toJSON()’ and retains the key-value pairing when converting an R list to a JSON object. As with the request, the two inner values end up as JSON strings nested inside the outer object, so the client parses them in a second step.
Average server response time: ~0.5 s
Benchmark Parameters
Request data frame: 12 x 5000, ~1 MB
Request data vector: 12 elements, 256 bytes
Response data frame: 5000 x 12, ~1 MB
Response data vector: 5000 elements, ~0.1 MB
Setup: Flask server running on a local machine, calling R scripts via ‘rpy2’. The “client” in the examples above refers to the application server, which queries the computation service.
Processor: 2.7 GHz Core i7 Quad Core
RAM: 16 GB DDR3 1600 MHz
Hard Disk: 512 GB SATA3 SSD
We would be happy to learn about even better solutions. Comments/emails are most welcome.
Happy analyzing!