Optimising Machine Learning for Insurance Claims

Machine learning plays an increasingly significant role in the insurance sector today. It can greatly improve the speed with which insurance claims departments process and analyse the increasingly abundant data at their fingertips, and the quality of the decisions they base on it.

Feature selection – the process of separating out and retaining a subset of the most useful or informative features within the available data – plays a key role in the creation of effective machine learning models. This blog explains how this works, why it matters, and how it can be made to work better.

Machine learning tools can dramatically boost the productivity and efficiency of insurance claims departments. To take just one practical example, they can interpret what kind of claim has just been reported, how it should most appropriately be handled, and what level of expertise that task requires.

But training machine learning tools to make sense of insurance data presents a range of challenges and can prove significantly time-consuming. The sheer quantity of data insurance claims departments now have available to work with is one aspect of this. An added complication is that much of this data is a low-quality, high-noise mixture of categorical and numerical data – with multiple subsets of data (or features) that simply get in the way and slow things down.

A solution to this challenge lies in applying algorithms capable of pre-selecting the most useful or influential features of incoming data and discarding the chaff. This can greatly facilitate the development of machine learning tools that aid claims decision making and optimise the claims handling process.

In a recent news item, we revealed that DOCOsoft data scientist Dr Bernard Cosgrave had co-authored a paper, published in a top-tier academic journal, entitled ‘A Multiple-Association-Based Unsupervised Feature Selection Algorithm for Mixed Data Sets’. The paper explained how its authors had developed and comparatively tested an algorithm that is better able to separate out interesting and significant features from large mixed data sets than currently available alternatives.

The inclusion of irrelevant and redundant data features has long been recognised as having an adverse effect on the development and performance of machine learning models. Applying the right feature selection techniques can slim down unhelpfully big data and draw out those features that will prove most useful in knowledge generation and decision-making tasks.

Also known as variable selection or variable subset selection, feature selection identifies and strips out features that are redundant, irrelevant, or so closely correlated with other features that they provide little or no additional information. Removing such features means that models can be trained much faster and that the resulting models are easier to interpret for those working with them.
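To make the idea of stripping out closely correlated features concrete, here is a minimal sketch for purely numerical data. It is a simple pairwise correlation filter, not the algorithm described in the paper, and the function name, the 0.95 threshold and the column names are all assumptions chosen for illustration.

```python
import numpy as np

def drop_redundant_features(X, names, threshold=0.95):
    """Keep a feature only if its absolute Pearson correlation with every
    feature kept so far is at or below `threshold` (an assumed cut-off)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-by-feature matrix
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic, hypothetical claims data: the second column is just the first
# one rescaled, so it carries no additional information.
rng = np.random.default_rng(0)
claim_amount = rng.normal(1000, 200, size=500)
claim_amount_eur = claim_amount * 0.9
days_open = rng.normal(30, 5, size=500)

X = np.column_stack([claim_amount, claim_amount_eur, days_open])
selected = drop_redundant_features(
    X, ["claim_amount", "claim_amount_eur", "days_open"]
)
print(selected)
```

In this toy example the rescaled column is dropped and the two independent columns are retained. Pearson correlation only applies to numerical columns, which is one reason mixed categorical and numerical data needs a more general measure of association.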

The practice of feature selection is much more advanced for purely numerical data than for mixed categorical and numerical data of the kind the insurance industry typically uses. DOCOsoft’s machine learning specialists recognise the value of applying feature selection to remove ‘noisy’ features before applying machine learning techniques, so that algorithms can focus on the influential features.

The work on which Dr Cosgrave and his colleagues collaborated resulted in the development and testing of an algorithm based on their belief that the most representative features will be those that are most diverse and least dependent on one another. The algorithm they developed sees feature selection as an optimisation problem that can be addressed by selecting the set of features with minimal association between them.
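The principle of choosing features with minimal association between them can be sketched in a few lines. The code below is not the paper’s GFSA, and it does not use the paper’s multiple association measure; it assumes a precomputed matrix of pairwise association strengths between 0 and 1 (for mixed data this might come from a measure such as Cramér’s V or a correlation ratio) and applies a simple greedy heuristic. The function name, feature names and matrix values are hypothetical.

```python
import numpy as np

def greedy_min_association(assoc, k):
    """Greedily pick k feature indices with low mutual association.

    assoc: symmetric (n, n) matrix of pairwise association strengths,
           assumed here to lie in [0, 1].
    """
    off_diag = np.array(assoc, dtype=float)
    np.fill_diagonal(off_diag, 0.0)  # ignore self-association
    n = off_diag.shape[0]

    # Start from the feature least associated with all the others.
    selected = [int(np.argmin(off_diag.sum(axis=1)))]

    while len(selected) < k:
        remaining = [j for j in range(n) if j not in selected]
        # Cost of a candidate = its total association with the chosen set.
        costs = [off_diag[j, selected].sum() for j in remaining]
        selected.append(remaining[int(np.argmin(costs))])
    return selected

# Hypothetical association matrix for four claim features.
names = ["claim_type", "loss_code", "reserve", "region"]
assoc = [
    [1.00, 0.90, 0.20, 0.10],
    [0.90, 1.00, 0.30, 0.20],
    [0.20, 0.30, 1.00, 0.15],
    [0.10, 0.20, 0.15, 1.00],
]

chosen = greedy_min_association(assoc, k=3)
print([names[j] for j in chosen])
```

Here `loss_code` is left out because it is strongly associated with `claim_type`, which has already been chosen. A greedy pass of this kind trades optimality for speed: it never revisits earlier choices, so it is not guaranteed to find the subset with the lowest total association.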

Their paper, published recently in Expert Systems with Applications, describes a generic multiple association measure and two associated feature selection algorithms: ‘naive’ and ‘greedy’ feature selection algorithms (NFSA and GFSA, respectively).

The proposed GFSA algorithm was evaluated on 15 benchmark datasets and compared against four state-of-the-art feature selection techniques. Both the complexity analysis and the experimental results showed that the proposed algorithm significantly reduces the processing time required for unsupervised feature selection on the types of data typically encountered in the world of insurance.

An additional benefit of this approach is that it reveals the most and least important features in datasets. The insight this generates can prove useful for both decision making and strategy development in an insurance context extending beyond the immediate focus of attention.

As our world transitions rapidly from having too little information to having too much, techniques such as feature selection will prove more important than ever. Through active engagement at the forefront of developments in data science and machine learning, DOCOsoft equips itself and its clients to compete and win in a data-saturated future.