By Jeff Heaton, Ph.D., Lead Data Scientist, Reinsurance Group of America (RGA)
This is a guest post from Jeff Heaton, Ph.D., Lead Data Scientist, Reinsurance Group of America (RGA), a Community sponsor of the StampedeCon Artificial Intelligence Conference 2017 in St. Louis on October 17.
Occasionally, the ultimate deliverable of a data science project is a report. In these cases, a visually appealing PDF that succinctly conveys the needed information will suffice. Most data science projects, however, require a conduit that allows the project to be integrated with a much larger system. An application program interface (API) provides that conduit.
Websites are often used as an interface for people to access a company’s internal data and services. A common example is your bank’s online portal. An API provides a similar experience, only for computer programs rather than human users. APIs allow other companies’ computer systems to access your data science models as easily as you can browse your stock portfolio.
Underlying an API is a data science model that accepts information and returns a prediction based on that incoming data. The prediction from the model might be a customer’s credit worthiness, an underwriting evaluation, or a consumer’s propensity to buy a particular product. The process by which a model generates this prediction is often called scoring.
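At its core, scoring is just applying an already-trained model to the incoming fields. As a sketch only — the model form, coefficients, and field names below are illustrative assumptions, not RGA's actual model — a propensity-to-buy score from a logistic regression might look like:

```javascript
// Hypothetical coefficients from an already-trained logistic regression.
// Illustrative values only; a real deployment would load these from the model.
const coefficients = { intercept: -2.0, age: 0.03, income: 0.00001 };

// Score one incoming transaction: a propensity-to-buy probability in [0, 1].
function score(transaction) {
  const z = coefficients.intercept +
    coefficients.age * transaction.age +
    coefficients.income * transaction.income;
  return 1 / (1 + Math.exp(-z)); // logistic link maps z to a probability
}
```

The API's job is to wrap a function like this in a secure, reliable conduit.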
It is important to consider the pattern that your customers will follow when accessing your API. The two most common access patterns are real-time and batch. Either pattern can technically serve any client; however, the best performance is realized when the access pattern most closely fits how clients will actually use the system.
The decision between a real-time API and a batch API is essentially governed by two considerations: how many transactions the client will send for scoring at a time, and how quickly a score is needed. If a large number of transactions will be scored together and a nearly instant response is not needed, then a batch access pattern is called for. If transactions will be scored one at a time, with a near-real-time response needed, then a real-time access pattern is in order.
Design Considerations for Real-Time Scoring
At RGA, we implemented an API as a web service running on an Express-based NodeJS server deployed as a Docker image. This web service handled all security and direct interactions with the client and communicated with a DeployR-based server that invoked the actual R scripts making up our models. These R scripts perform the scoring for all incoming transactions.
It is important that the data science team design the R scripts to be as robust and performant as possible. Because these scripts are used in real time, the actual scoring should execute quickly. Model building usually requires long runtimes and large amounts of memory, but once the models are built, scoring can typically happen in a relatively small footprint. It is also important that the scoring scripts do not crash unexpectedly when bad data is presented. Error checking and proper reporting are essential.
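In the same spirit, incoming records should be validated before they ever reach the model, so that bad data yields a clear error report rather than a crash. The field names and ranges in this sketch are hypothetical, not RGA's actual schema:

```javascript
// Required fields and their allowed numeric ranges.
// Illustrative only; a real schema comes from the model's training data.
const REQUIRED_FIELDS = { age: [0, 120], income: [0, Infinity] };

// Validate an incoming transaction before scoring. Returns a list of
// problems; an empty list means the record is safe to score.
function validateTransaction(transaction) {
  const errors = [];
  for (const [field, [min, max]] of Object.entries(REQUIRED_FIELDS)) {
    const value = transaction[field];
    if (typeof value !== 'number' || Number.isNaN(value)) {
      errors.push(`${field}: missing or not numeric`);
    } else if (value < min || value > max) {
      errors.push(`${field}: ${value} outside [${min}, ${max}]`);
    }
  }
  return errors;
}
```

Returning every problem at once, instead of failing on the first, gives the client an actionable report for each rejected record.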
Design Considerations for Batch Scoring
If the client will send a large volume of transactions at specific time intervals, a batch pattern is best. For example, consider a client sending 100,000 product records weekly to generate sales leads from a propensity-to-buy data science model. It does not make sense to force the client to send 100,000 separate transactions to an API. To achieve acceptable throughput against a real-time API, the client would have to send these transactions concurrently.
How many simultaneous connections should the client use? If the client chooses too few, performance will suffer. If the client chooses too many, it might overwhelm the API and degrade performance. It is much better for the client to simply send all needed transactions in a single batch. A similar batch transfer later returns the scored transactions to the client.
While there are many different patterns for batch processing, at RGA we typically deploy a Secure File Transfer Protocol (SFTP) approach. The client opens an SFTP connection to one of our servers and drops one or more Comma Separated Value (CSV) files containing the data to be scored. For additional security, these CSV files can be encrypted with a public-private key encryption scheme such as GNU Privacy Guard (GPG). In such a scheme, the client and API each have a private key used to decrypt the transferred file. The API and client company exchange the public keys used to encrypt the message, but the private keys are never exchanged. This ensures that only the API can decrypt the incoming files to be scored and that only the client can decrypt the resulting scores.
For the batch model, the same type of R scripts process incoming transactions as for the real-time API; however, these scripts process batches of transactions rather than single transactions. The resulting scores are written to a CSV file that is sent back to the client. Usually this resulting CSV file is simply a list of the IDs from the client's file alongside the resulting scores.
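The batch step then reduces to reading each CSV row, scoring it, and writing back only the ID and its score. A simplified sketch follows; the `id` column and the deliberately naive CSV handling (no quoted fields or escaping) are assumptions, and a real pipeline would use a proper CSV library:

```javascript
// Score a batch of records presented as CSV text with a header row,
// returning CSV text of id,score pairs. The scoring function is supplied
// by the model; CSV parsing here is intentionally simplistic.
function scoreBatch(csvText, scoreFn) {
  const [header, ...rows] = csvText.trim().split('\n');
  const fields = header.split(',');
  const out = ['id,score'];
  for (const row of rows) {
    const values = row.split(',');
    const record = {};
    fields.forEach((f, i) => { record[f] = values[i]; });
    out.push(`${record.id},${scoreFn(record)}`);
  }
  return out.join('\n');
}
```

Returning only IDs and scores keeps the response file small and avoids echoing the client's sensitive input data back across the wire.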
Total Team Effort
Deploying an effective data science model is a collaborative effort. The data science team must ensure that the scoring model produces reliable results in a robust manner. The Information Technology (IT) group must work directly with the data science team to ensure security, robustness, and compliance with the company’s technology standards. The model that we deployed at RGA was a collaboration between our data science and IT teams. The result is a powerful tool using both real-time and batch scoring that effectively meets the needs of the many clients we serve.
Acknowledgements: Edmond Deuser, Technical Architect, Reinsurance Group of America, and Larry Anderson, Principal Engineer, Ocelot Consulting (https://www.ocelotconsulting.com/), contributed to both the implementation of the API and this blog post.
RGA Data Science Deployment Models