A large North American insurance company is ramping up its analytics capabilities. An elite team of data scientists and actuaries has been assembled, and within a year they have already built sophisticated statistical models. Some of the models are expected to generate an annual ROI of over 20:1, and major transformational work is in the pipeline.
When it comes to insurance and math, the organization truly excels. However, two other aspects of the venture have proven difficult:
Given the size of the company, the data management division is quite large, and does not include the DBA group. Most teams rely on the usual enterprise stack for reporting and data warehousing: Oracle, Informatica, MicroStrategy.
The data science & analytics team is only a small part of the group and plays with more bleeding-edge technologies, such as Spark or MXNet.
There are three profiles in the data science & analytics team:
Full-time software developers were not part of the initial plan for the analytics team. It was assumed that data scientists and actuaries could take care of the required coding, and that developers from other teams could be borrowed for special projects as needed. However, it soon became obvious that customizing the software stack and preparing data munging processes are time-consuming tasks that require a specific skill set. This led to the creation of the data engineer role.
The organization used to have strong ties with large traditional vendors. What was seen as a safe bet turned out to be a costly computing environment with unsatisfactory performance. In recent years, the IT group has progressively diversified its vendor portfolio and has embraced smaller, more modern service providers, and the agility of the organization has greatly increased.
Lately, the organization has performed a successful transition from a Big Iron UNIX ecosystem to a more dynamic Linux (RHEL) environment relying on server virtualization. Containers are becoming popular in the organization, but the long-term strategy of the IT group is to leverage cloud computing rather than expand the local infrastructure.
While some aspects of the business still rely on the waterfall software development methodology, modern approaches such as Agile and DevOps are slowly becoming the norm. Cross-functional teams have started to appear, but the IT group is still mostly built around silos (systems, networks, databases, desktops, etc.), and communication between groups is ticket-driven.
When it comes to the adoption of new technology, the IT group usually finds itself in the early majority. Groupware has already been transitioned to the cloud, and for support functions such as helpdesk, systems monitoring or source control management, the IT group is moving towards SaaS solutions.
The organization operates in a highly regulated industry in multiple distinct jurisdictions, and must comply with a large number of third-party requirements (DMV, credit score agencies, etc.). The security and compliance team excels at enforcing those requirements and regulations, but the sheer number of restrictions severely limits the ability to build innovative analytics solutions.
For instance, the organization does business in more than one country, and some data sets cannot physically cross national boundaries. This poses a problem with cloud computing providers, which often only have partial offerings outside of the United States. This also prevents the data scientists from combining some data points at the leaf level and from analyzing larger trends.
The data scientists need high-performance computing equipment to build complex statistical models; the quality and value of their output depend largely on the number of iterations their machine learning processes can perform. Since the corporate infrastructure relies heavily on a shared virtualization environment, these kinds of compute-intensive processes directly impact the entire organization.
The data scientists also need to store and process terabytes (or even petabytes) of data. The corporate storage infrastructure, built on high-end SAN appliances for performance and data protection reasons, cannot easily accommodate this volume of data without massive upfront investments. Leveraging the cloud for these workloads presents additional challenges, such as increasing the bandwidth or setting up federated identity management tools.
The organization moved away from paper-driven business processes and from mainframe-based operations less than a decade ago. Historical data is scarce and difficult to process.
As part of the modernization of the computing environment, the IT group has selected a best-of-breed insurance software package and embarked on a lengthy customization journey. The database model is software-defined and changes constantly based on weekly code modifications, which makes the maintenance of ETL processes quite challenging.
At best, the data quality in the organization is sub-optimal. Many critical pieces of information are punched in manually by busy underwriters or brokers, and important fields are often left blank. To make matters worse, the data warehouse has been designed without following best practices, and often contains only the latest value of records, preventing even the simplest forms of time series analysis.
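To illustrate why a "latest value only" design blocks time series analysis, here is a minimal sketch in plain Python. The record layout, policy identifier, and premium figures are hypothetical and are not taken from the organization's warehouse. The sketch contrasts a store that overwrites each record with one that keeps every version alongside an effective date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PremiumVersion:
    """One version of a policy's premium (hypothetical record layout)."""
    policy_id: str
    premium: float
    valid_from: date


# Hypothetical change log for a single policy.
history = [
    PremiumVersion("P-001", 1200.0, date(2016, 1, 1)),
    PremiumVersion("P-001", 1350.0, date(2016, 7, 1)),
    PremiumVersion("P-001", 1280.0, date(2017, 1, 1)),
]


def latest_only(versions):
    """Mimic a warehouse that overwrites records: one row per policy."""
    latest = {}
    for version in sorted(versions, key=lambda v: v.valid_from):
        latest[version.policy_id] = version  # earlier values are lost
    return latest


def as_of(versions, policy_id, when):
    """Return the version in effect on `when`, or None if none existed yet."""
    candidates = [
        v for v in versions
        if v.policy_id == policy_id and v.valid_from <= when
    ]
    return max(candidates, key=lambda v: v.valid_from, default=None)


snapshot = latest_only(history)
mid_2016 = as_of(history, "P-001", date(2016, 8, 15))
```

With the overwriting store, only the most recent premium survives, so a question such as "what was the premium in August 2016?" cannot be answered. The versioned list can answer it through `as_of`.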
The first deliverable was to implement a robust computing infrastructure (on-premises and/or cloud-based) meeting the following criteria:
The combination of performance, stability and agility requirements made things challenging, especially in a corporate context where data science was still a new and unproven concept. The project was perceived as pure R&D, but had full-blown production expectations.
The first step in the project was to build a small Hadoop cluster using commodity hardware. This approach solved the storage problem since entry-level storage devices are cost-effective ($100/TB, compared to $2,500/TB for SAN storage). Since the computing requirements for plain storage are low, the data scientists were able to leverage the idle CPUs on those commodity servers to run machine learning tasks.
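The cost gap can be made concrete with a short calculation. This is a rough sketch only: the per-terabyte prices come from the figures above, while the 500 TB capacity target and the assumption that data is stored three times (the default HDFS replication factor) are illustrative and not taken from the project.

```python
# Per-terabyte prices quoted in the text.
COMMODITY_USD_PER_TB = 100
SAN_USD_PER_TB = 2_500


def storage_cost(usable_tb, usd_per_tb, replication=1):
    """Cost in USD of storing `usable_tb` terabytes, kept `replication` times."""
    return usable_tb * replication * usd_per_tb


usable_tb = 500  # hypothetical capacity target, not from the source

san = storage_cost(usable_tb, SAN_USD_PER_TB)
commodity_raw = storage_cost(usable_tb, COMMODITY_USD_PER_TB)
# Assumption: three copies of every block (the HDFS default replication factor).
commodity_replicated = storage_cost(usable_tb, COMMODITY_USD_PER_TB, replication=3)

print(f"SAN:                       ${san:,}")
print(f"Commodity (single copy):   ${commodity_raw:,}")
print(f"Commodity (three copies):  ${commodity_replicated:,}")
print(f"SAN vs. replicated ratio:  {san / commodity_replicated:.1f}x")
```

Under these assumptions, commodity storage stays roughly eight times cheaper per usable terabyte than SAN storage even after triple replication.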
This phase of the project quickly proved successful. Tasks executed on the small inexpensive Hadoop cluster using the Spark computing framework ran roughly twice as fast as regular Python tasks running on a vastly more powerful and more expensive platform (a dedicated, unpartitioned POWER7 frame). Provided with these performance metrics and with the encouraging results of the statistical models built by the data scientists, senior management approved a significant upgrade of the cluster.
Some stakeholders proposed at that point to invest heavily in a GPU farm, which was a strong trend in the market. With a project pipeline of over a year, however, the data science team determined that most tasks in the foreseeable future would leverage regressions and ensemble trees, not neural networks. This made GPUs valuable but not a high priority:
The cluster was upgraded accordingly, with a good balance of CPU, memory and GPU capabilities.
The hardware was only part of the equation, however. Experience on the smaller cluster showed that even among a team of senior data scientists, it was difficult to use a shared computing environment. Some processes would often utilize 100% of the cluster, leaving the rest of the team stranded.
The main issue with resource contention was not bad code or irresponsible users, but rather the growing complexity of the ecosystem. Most machine learning processes were running as Spark jobs leveraging the Hadoop resource manager (YARN), but a handful of other frameworks were also used, such as H2O or Dask, making it difficult to optimally allocate computing resources.
After some experimentation, the Mesos platform proved to be a good fit for the upgraded cluster. It allowed for a more granular and sophisticated allocation of resources, and supported most of the frameworks used by the team out of the box. With Mesos, the three requirements (performance, agility, stability) were easier to meet.
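The source does not describe how resources were actually divided, so the following is only a sketch of one common way to stop a single Spark job from taking a whole shared cluster: capping it with the standard Spark properties `spark.cores.max`, `spark.executor.cores`, and `spark.executor.memory`. The helper names, the 128-core cluster, the six-person team, and the equal-share policy are all hypothetical.

```python
def capped_spark_conf(cluster_cores, team_size, executor_cores=4,
                      executor_memory="8g"):
    """Build Spark properties that limit one job to an equal share of cores.

    Hypothetical policy: every team member gets cluster_cores / team_size
    cores, rounded down to a whole number of executors (at least one).
    """
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    share = max(executor_cores, cluster_cores // team_size)
    share -= share % executor_cores  # keep a whole number of executors
    return {
        "spark.cores.max": str(share),
        "spark.executor.cores": str(executor_cores),
        "spark.executor.memory": executor_memory,
    }


def to_submit_args(conf):
    """Render the properties as `--conf key=value` pairs for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical cluster: 128 cores shared by six data scientists.
conf = capped_spark_conf(cluster_cores=128, team_size=6)
print(" ".join(to_submit_args(conf)))
```

A static cap like this is deliberately simple. A resource manager such as Mesos can go further by arbitrating between several frameworks at once, which is the problem described above.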
The second deliverable was an ambitious training program designed to ramp up the skill sets of various teams in the context of analytics:
Although many excellent commercial Hadoop distributions were available, a decision was made to initially focus on open-source technologies rather than bring in a vendor. The goal was to help the team understand the ecosystem without getting shackled to a specific vendor implementation. This proved to be a great approach, as it allowed for much more flexibility in the selection and upgrade of software components.
For instance, the data scientists quickly embraced the Spark framework, and were thrilled to always have access to the latest version, rather than to be stuck with the release cycle of a vendor. The DBA and DevOps teams were also quite happy to have the freedom to choose a deployment mechanism they were comfortable with, rather than learn yet another vendor platform.
The training program went smoothly, but a few surprising lessons were learned along the way. More specifically, it became obvious that the web platform used to introduce new team members to Python coding (Jupyter notebooks) made it difficult for them to acquire good software development practices, such as writing modular code or using source control systems. For people with a strong background in math or insurance, figuring out the syntax of Python code was not difficult; what was less intuitive was writing code that was easy to maintain and share. The training platform was therefore migrated to a more traditional software development stack, with an IDE (PyCharm), a source control system (Git), and a continuous integration utility (Jenkins).
The third deliverable proved the most challenging: building a sustainable ecosystem.
In a heated market, attracting data scientists was difficult; retaining them proved even more challenging. Within the first year, once they had improved their skills and gained some real-life experience, three team members left the organization for big tech firms or medical research centers.
The organization was not in a position to compete with the glamour or the financial incentives offered by those big players, and becoming a data scientist incubator for Silicon Valley was not an enticing goal. A more tactical approach was needed.
The solution was to build an analytics platform that made it very easy to onboard new members and to put their existing skills to work immediately, without requiring extensive and expensive training. This new ecosystem became a textbook implementation of continuous delivery, with short delivery cycles and advanced automation.
This approach made it less costly to onboard new team members, and allowed the team to grow quickly. As a side effect of allowing team members to focus on their field of interest rather than learn a slew of new technologies, turnover slowed down significantly.