At the end of last year, I completed the Business Science University course “Data Science for Business with R (Advanced Machine Learning)”. Recently an interesting business case came up and gave me the opportunity to apply and deepen what I learned in the course.

This particular case concerns the process of developing trading strategies. The procedure has proven itself many times over, but going through all the steps is demanding and time-consuming. We would therefore like to know whether we could skip some development steps in order to accelerate the process and save time and resources.

For this purpose, we would like to use a machine learning model that predicts whether the current strategy candidate is going to be a highly robust one or not, i.e. a binary classification problem. If we are able to develop an accurate model, we could significantly streamline our development process.

##### Data Understanding

First, we start by taking a look at the data to get a better understanding. Then we prepare the data and perform a correlation analysis. By identifying key features that relate to the target variable we understand which features are most likely to provide further insights and are suitable for a machine learning model. It could also help us to gain a better understanding of the business case as a whole.

The target variable we want to predict is RL3. With RL3 we mark strategies that have completed the demanding test procedure with the highest robustness level. From the graph below we can see a difference in the distributions of the features between RL3 Yes (green) and RL3 No (red). This gives us confidence that we will find a suitable model and be able to classify the strategies correctly. When we take a closer look at the numeric variables, that is, all except block and consistency, we can observe that most of them are skewed. Skewness in the data can lead to misleading results, so we transform, center, and scale the data in order to get distributions that are closer to a bell shape.

The Pearson correlation is widely used and was developed to measure the linear dependence between two continuous variables. The two variables should be normally distributed and show similar variance. Consequently, we use the centered and scaled data for our analysis and obtain the following correlation structure. The variables that are highly correlated with our target feature are retretratio, sqn, and npmddratio. Somewhat surprisingly, the standard deviation of the average trade (StdDevAvgTrade) does not seem to be of great importance. In contrast, consistency seems to be an important feature that is negatively correlated with our target variable.
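As a rough illustration of what the correlation analysis computes, here is a minimal sketch of the Pearson coefficient in plain Python (the analysis itself was done in R). The feature name `sqn` is taken from the text, but the values and the 0/1 encoding of RL3 are made up for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Centered cross-product and centered sums of squares
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy data (NOT the project data): one feature and a 0/1 target
sqn = [0.4, 0.9, 1.2, 1.6, 2.1, 2.5]
rl3 = [0, 0, 0, 1, 1, 1]  # 1 = RL3 Yes, 0 = RL3 No (assumed encoding)
print(round(pearson(sqn, rl3), 3))
```

A value near +1 or -1 indicates a strong linear relationship with the target, while a value near 0 indicates little linear relationship.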

Correlation analysis has already yielded some interesting findings, and with further analysis we could derive performance indicators from which we could decide whether or not it is worthwhile to proceed with a strategy candidate. The downside of this method is that we have generalized information and may be screening out too many actually robust trading strategies.

Therefore, we would like to implement a machine learning model that helps us to predict the outcome for each strategy and hopefully not discard as many strategies as in the case of generalized performance indicators.

##### Model Metrics

Now that we have carried out the initial exploratory data analysis, we can start building our models. We fit several machine learning models to our training data. To determine their quality, we test the models on unseen data. The following performance visualization is based on test data, i.e. data the model has not seen during the model training. This gives us an indication of how the individual model behaves on unseen data. Since models with a low log-loss tend to perform well on unseen data, we sort the models by this metric. For performance visualization, we consider the top five models, and based on the log-loss metric, we prefer the first three models.
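For readers unfamiliar with the log-loss metric, the following Python sketch shows how it is usually defined for a binary target. It is only an illustration under the assumption of 0/1 labels and predicted probabilities for class 1. It is not the code used to produce the leaderboard, and the example numbers are invented.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log-loss.

    y_true: 0/1 labels
    p_pred: predicted probability that the label is 1
    """
    if len(y_true) != len(p_pred) or len(y_true) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical predictions for four strategies (made-up numbers)
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
hesitant = [0.6, 0.4, 0.55, 0.45]
print(log_loss(labels, confident) < log_loss(labels, hesitant))
```

The metric rewards predictions that are both correct and confident, and it penalizes confident mistakes heavily.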

In general, all models perform similarly to each other. An important diagram is the gain and lift chart. It shows how much the model improves the results compared to selecting strategies at random. If we rank the strategies by predicted probability, the first 30% of the ranking contain 53% of all strategies that are not RL3 strategies.
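How gain and lift are derived from a ranked list can be sketched in Python as follows. The function name, the toy labels, and the toy scores are assumptions for illustration only, and the rounding of the cut-off is one of several reasonable conventions.

```python
def gain_and_lift(labels, scores, fraction):
    """Cumulative gain and lift for the top `fraction` of cases ranked by score.

    labels: 1 for the class being targeted, 0 otherwise
    scores: model probability for the targeted class
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total_hits = sum(labels)
    if total_hits == 0:
        raise ValueError("no cases of the targeted class")
    # Rank cases from highest to lowest predicted probability
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    hits_in_top = sum(label for _, label in ranked[:top_n])
    gain = hits_in_top / total_hits      # share of all targets captured
    lift = gain / (top_n / len(ranked))  # improvement over random selection
    return gain, lift


# Hypothetical toy ranking of ten strategies (made-up numbers)
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
gain, lift = gain_and_lift(toy_labels, toy_scores, fraction=0.3)
print(f"gain={gain:.2f}, lift={lift:.2f}")
```

A lift of 1 means the model is no better than picking strategies at random, and anything above 1 is an improvement.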

Put differently, think of lift as a multiplier: what we gained with the model divided by what we would expect without it. For example, if we focus on the first 30% of the ranked strategies, we capture 53% of the non-RL3 strategies, whereas with random selection we would expect to capture only 30% of them. The lift is therefore roughly 1.76x (53/30), meaning the model targets non-RL3 strategies about 1.76 times better than random selection.

##### Confusion Matrix

The confusion matrix provides information on how the selected model has performed with unseen data. The model correctly predicted 119 strategies as RL3-No and 151 strategies as RL3-Yes. However, the model also predicted 40 strategies as RL3 strategies, which they are not. These are called false positives. On the other hand, the model predicted 9 strategies as non-RL3 strategies, which in reality are RL3 strategies. These are called the false negatives. Usually, in a business context, the false negatives are more important than the false positives because the associated costs are higher. If you think about it, we missed 9 robust trading strategies because the model was not able to correctly classify the strategy candidate.
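The four counts above are enough to derive the usual classification metrics. The short Python sketch below does this; the variable names are my own, and treating RL3-Yes as the positive class is an assumption that follows the wording of the text.

```python
# Counts from the confusion matrix in the text (RL3-Yes = positive class)
tp = 151  # predicted RL3, actually RL3
tn = 119  # predicted non-RL3, actually non-RL3
fp = 40   # predicted RL3, actually non-RL3
fn = 9    # predicted non-RL3, actually RL3

total = tp + tn + fp + fn

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted RL3 that really are RL3
recall = tp / (tp + fn)        # share of real RL3 strategies that were found
specificity = tn / (tn + fp)   # share of non-RL3 strategies correctly rejected

print(f"total={total}")
print(f"accuracy={accuracy:.3f}, precision={precision:.3f}")
print(f"recall={recall:.3f}, specificity={specificity:.3f}")
```

Recall is the metric most directly tied to the false negatives discussed above: the higher it is, the fewer robust strategies are missed.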

Another example: if an employee is likely to leave the company and you predict that he or she will stay, high costs are incurred. If, on the other hand, you predict that the employee will leave the company and he or she stays, the costs are not as high.

If this all sounds like gibberish to you, let’s look at an example. For the sake of simplicity, let’s assume that it takes us three hours to finish testing a strategy. If we were to use brute force testing, we would go through all 319 strategies, for a total of 957 hours of testing. According to the confusion matrix above, the model would have passed on a total of 191 strategies for further testing, resulting in 573 hours of testing. In other words, with the model we could have reduced our testing time by 384 hours, or 40%, and if we assume 30 dollars per hour, we could have saved almost 12,000 dollars.
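The arithmetic behind this estimate can be written out in a few lines of Python. The three hours per test and 30 dollars per hour are the simplifying assumptions stated above, and the variable names are my own.

```python
# Assumptions taken from the text
hours_per_test = 3
cost_per_hour = 30  # dollars

# Counts from the confusion matrix
all_strategies = 119 + 151 + 40 + 9  # every candidate in the test set
flagged_by_model = 151 + 40          # everything predicted as RL3

brute_force_hours = all_strategies * hours_per_test
model_hours = flagged_by_model * hours_per_test
hours_saved = brute_force_hours - model_hours
share_saved = hours_saved / brute_force_hours
dollars_saved = hours_saved * cost_per_hour

print(f"brute force: {brute_force_hours} h, with model: {model_hours} h")
print(f"saved: {hours_saved} h ({share_saved:.0%}), about {dollars_saved} dollars")
```

Note that this estimate does not put a price on the 9 robust strategies the model would have missed.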

##### Global and Local Interpretation

Well, you might say: I don’t trust the results. The model is a black box, and I do not understand its complex inner workings. In addition, more advanced machine learning models are usually more accurate, but this accuracy often comes at the expense of interpretability, and interpretability is crucial for business acceptance and for people’s trust. Fortunately, several advancements have been made to aid in interpreting machine learning models.

In general, we can distinguish between global and local interpretation. It is often important to understand the machine learning model on a global level, but it is also important to zoom into local regions of the predictions to derive local explanations. Global interpretations help us to understand the inputs and their overall modeled relationship to the prediction target, but global interpretations can be very approximate in some cases. Local interpretations help us to understand predictions for individual candidates.

The most common ways of obtaining a global interpretation are variable importance measures and partial dependence plots. Variable importance quantifies the global contribution of each input variable to the predictions of a machine learning model. For example, the GBM Grid 1 AutoML model identified consistency, sqn, and retretratio as the top three variables impacting the objective function.

Partial dependence plots (PDPs) represent the change in the predicted value as the selected feature varies across its marginal distribution. This gives us some insight into how the response changes across the range of a particular variable, but it remains an averaged, global view of the relationship across all observed data. For example, if we plot the PDP of the system quality number (SQN) variable, we see that the probability of a strategy becoming an RL3 increases, on average, as its value approaches 1.5 and beyond.

Local Interpretable Model-agnostic Explanations (LIME) is a technique that helps to explain individual predictions. LIME rests on the assumption that every complex model is linear on a local scale, so that it is possible to fit a simple model around a single observation that mimics the behavior of the global model at that location. The simple model can then be used to locally explain the predictions of the more complex model.
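The idea of a local linear approximation can be illustrated with a deliberately simplified Python sketch for a single feature. This is not the algorithm of the lime package, only the underlying intuition: perturb the input around one observation, weight the perturbed points by their proximity, and fit a weighted straight line. The `black_box` function, the kernel, and all parameter values are hypothetical.

```python
import math
import random


def black_box(x):
    """Stand-in for a complex model: a nonlinear probability curve (hypothetical)."""
    return 1.0 / (1.0 + math.exp(-4.0 * (x - 1.5)))


def local_linear_surrogate(model, x0, n_samples=500, spread=0.5,
                           kernel_width=0.25, seed=0):
    """Fit a proximity-weighted straight line to `model` around the point x0."""
    rng = random.Random(seed)
    # Perturb the observation and query the complex model
    xs = [rng.gauss(x0, spread) for _ in range(n_samples)]
    ys = [model(x) for x in xs]
    # Exponential kernel: perturbed points close to x0 get more weight
    ws = [math.exp(-((x - x0) ** 2) / kernel_width ** 2) for x in xs]
    w_sum = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / w_sum
    mean_y = sum(w * y for w, y in zip(ws, ys)) / w_sum
    # Weighted least squares for one feature (closed form)
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = local_linear_surrogate(black_box, x0=1.5)
print(f"local slope={slope:.2f}, surrogate prediction at x0={slope * 1.5 + intercept:.2f}")
```

The sign and size of the local slope play the role of the "supports" or "contradicts" weights in a LIME plot: a positive slope means that increasing the feature raises the predicted probability in this neighborhood.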

The graph below contains the local interpretation for strategy candidate 162. The label indicates whether or not the model believes it is an RL3 strategy, together with the predicted probability for that label. Since we have specified eight features, the eight most influential variables that best explain the linear model in this local region are shown, along with whether each variable causes an increase in probability (supports) or a decrease in probability (contradicts). Finally, the plot also provides the model fit, which allows us to see how well the simple model explains the local region. From this it can be concluded that strategy candidate 162 has a very high likelihood of being an RL3 strategy, and the three variables that seem to contribute most to this high probability are consistency, retretratio, and the system quality number.

Finally, we can create a heatmap that shows all variables and their influence for each strategy candidate. This representation is helpful when trying to find common features that affect all candidates or when performing this analysis over many observations, where several individual plots would be difficult to discern.

##### Conclusion

In summary, we looked at the data and developed a general understanding of it, then transformed the data and performed a correlation analysis with the scaled and centered data. We then fitted several machine learning models to the training data and compared the best five models on the test data. For the best model we analyzed the confusion matrix and, based on it, developed a practical example. Finally, we took a closer look at how difficult it is to explain machine learning models, and learned about the possibilities of global and local interpretation.

Overall, we can state that the machine learning model we analyzed seems to have a certain predictive power, with a lift of 1.76x, and that it would be possible to save time and resources with this model.

It was an interesting project, and I learned a lot by applying what I had been taught in the Business Science DS4B 201-R course. What I did not show here is the possibility of automated reporting with RMarkdown. An example of a pdf and html report can be found here.

 Visualizing ML Models with LIME: https://uc-r.github.io/lime