Making machine learning work harder in the (re)insurance sector

Timely access to information is crucial to making the right decisions on underwriting, claims, and every other aspect of the insurance process. But in an increasingly data-generative, data-retentive world, insurance decision-makers can sometimes find they have, quite literally, more information than they know what to do with.

Machine learning is one of the key tools helping insurers process, analyse and act on this increasingly abundant data. Putting machine learning tools to work can significantly boost efficiency and productivity. For example, in claims, where DOCOsoft specialises, machine learning can help identify what kind of claim has just been reported, how it should be handled, and what level of expertise it requires.

Training machine learning tools to make sense of insurance data, however, presents a range of challenges, and resolving them can be very time-consuming. The sheer quantity of data is one aspect of this. An added complication is that much of the data is a low-quality, high-noise mixture of categorical and numerical values, with many variables (or ‘features’) that simply get in the way and slow things down.

A solution to this challenge lies in applying algorithms capable of pre-selecting the most useful or influential features of incoming data and discarding the rest. This step, known as feature selection, plays a key role in creating effective machine learning models that aid decision making and optimise the claims handling process. This article examines how it works, why it matters, and how it can be made to work better.

The inclusion of irrelevant and redundant data features has long been recognised as having an adverse effect on the development and performance of machine learning models. Applying the right feature selection techniques can slim down ‘big data’ and draw out those features that will prove most useful in knowledge-generation and decision-making tasks.

Also known as variable selection or variable subset selection, feature selection identifies and strips out features that are redundant, irrelevant, or so closely correlated with other features that they provide little or no additional information. Removing such features means machine learning models can be trained far more rapidly, and their output is easier to interpret for those working with them.
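To make the idea of a feature that ‘correlates so closely with other features’ concrete, here is a minimal sketch in Python. It assumes purely numerical columns and uses plain Pearson correlation with an arbitrary 0.95 cut-off; the function names and the toy claims data are hypothetical and are not taken from DOCOsoft’s tooling.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / sqrt(vx * vy)

def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if it is not too correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical claims data: 'reserve_eur' is 'reserve_gbp' at a fixed rate,
# so it adds no information once 'reserve_gbp' is present.
claims = {
    "reserve_gbp": [100.0, 250.0, 80.0, 400.0, 150.0],
    "reserve_eur": [115.0, 287.5, 92.0, 460.0, 172.5],
    "days_open": [12.0, 3.0, 40.0, 7.0, 21.0],
}
selected = drop_redundant(claims)
```

On this toy data the converted reserve column is dropped and the other two features are kept. Real insurance data mixes categorical and numerical fields, which is where a simple correlation filter like this stops being enough.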

The practice of feature selection is much more advanced for purely numerical data than for the mixed categorical and numerical data the insurance industry typically uses. For this reason, DOCOsoft’s machine learning specialists place particular emphasis on removing ‘noisy’ features before training, so that algorithms can focus on the influential ones.

DOCOsoft data scientist Dr Bernard Cosgrave recently co-authored a paper published in a respected academic journal, entitled ‘A Multiple-Association-Based Unsupervised Feature Selection Algorithm for Mixed Data Sets’. The paper explains how its authors developed and comparatively tested an algorithm that is better able to separate significant features out of large mixed data sets than currently available alternatives.

Dr Cosgrave and his colleagues based the algorithm on the view that the most representative features are those that are most diverse and least interdependent. It treats feature selection as an optimisation problem: choose the set of features with the minimum association between them.

Published last year in the journal Expert Systems with Applications, their paper describes a generic multiple association measure and two feature selection algorithms built on it: a ‘naive’ algorithm (NFSA) and a ‘greedy’ algorithm (GFSA).
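The difference between an exhaustive and a greedy search for a minimum-association feature set can be sketched in a few lines of Python. This is an illustration only, not the paper’s NFSA or GFSA: as a stand-in for the authors’ multiple association measure it assumes Cramér’s V computed pairwise, with numerical columns cut into equal-width bins, and every function name and data value below is hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cramers_v(a, b):
    """Cramér's V between two categorical sequences (0 = none, 1 = full)."""
    n = len(a)
    rows, cols, joint = Counter(a), Counter(b), Counter(zip(a, b))
    k = min(len(rows), len(cols)) - 1
    if k == 0:
        return 0.0  # assumption: a constant column is treated as unassociated
    chi2 = 0.0
    for r, nr in rows.items():
        for c, nc in cols.items():
            expected = nr * nc / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))

def to_categories(values, bins=3):
    """Leave categorical columns alone; cut numeric ones into equal-width bins."""
    if not all(isinstance(v, (int, float)) for v in values):
        return list(values)
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]

def pairwise(data):
    """Association for every pair of features, keyed by sorted name pair."""
    cats = {name: to_categories(col) for name, col in data.items()}
    return {(a, b): cramers_v(cats[a], cats[b])
            for a, b in combinations(sorted(cats), 2)}

def set_association(names, assoc):
    """Mean pairwise association within a set of at least two features."""
    pairs = list(combinations(sorted(names), 2))
    return sum(assoc[p] for p in pairs) / len(pairs)

def naive_select(data, k):
    """Exhaustive search: score every k-subset, keep the least associated."""
    assoc = pairwise(data)
    return list(min(combinations(sorted(data), k),
                    key=lambda s: set_association(s, assoc)))

def greedy_select(data, k):
    """Greedy search: seed with the least associated pair, then grow by one."""
    assoc = pairwise(data)
    chosen = list(min(assoc, key=assoc.get))
    while len(chosen) < k:
        rest = [f for f in sorted(data) if f not in chosen]
        chosen.append(min(rest, key=lambda f: set_association(chosen + [f], assoc)))
    return sorted(chosen)

# Hypothetical mixed claims data; 'handler_team' is a relabelling of 'claim_type'.
claims = {
    "claim_type": ["motor", "motor", "property", "property",
                   "marine", "marine", "motor", "property"],
    "handler_team": ["A", "A", "B", "B", "C", "C", "A", "B"],
    "reserve": [100, 800, 450, 1000, 150, 500, 600, 300],
    "days_open": [5, 50, 61, 10, 30, 1, 25, 45],
}
naive_pick = naive_select(claims, 3)
greedy_pick = greedy_select(claims, 3)
```

On this toy data both searches keep `reserve` and `days_open` and only one of the two fully associated columns. The exhaustive version scores every possible subset, so its cost grows combinatorially with the number of features, while the greedy version adds one feature at a time and only rescans the remaining candidates at each step.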

The proposed GFSA was evaluated on 15 benchmark datasets and compared against four state-of-the-art feature selection techniques. Both the complexity analysis and the experimental results showed that it significantly reduced the processing time required for unsupervised feature selection on the types of data typically encountered in insurance.

An additional benefit of this approach is that it reveals the most and least important features in a dataset. In an insurance context, that insight can inform decision making and strategy development well beyond the immediate task.

As our world shifts rapidly from having too little to too much information, techniques like feature selection will prove more important than ever. Through active engagement at the forefront of developments in data science and machine learning, DOCOsoft is helping itself and its clients to compete and win in a data-saturated future.

The above article was originally posted on Insider Engage, written by Aidan O’Neill at DOCOsoft.