What are the methods to interpret the output of machine learning models?


Interpretability methods for machine learning models can be classified into the following groups:

Post-hoc Interpretability

Post-hoc interpretability refers to interpreting a model after it has been trained. Below is a list of approaches grouped by their overarching logic; short code sketches for several of them follow the list.

  • Feature Analysis

    Accumulated Local Effects (ALE) explains how features influence the prediction on average.

    Feature Interaction explains how interactions between features impact the model output.

    Permutation Feature Importance (PFI) evaluates the impact of a feature on the output by measuring how much performance drops when that feature's values are randomly permuted.

  • Model Inspection

    Scoped Rules (Anchors) finds decision rules that "anchor" a prediction: as long as the rule holds, changes in the other features do not change the prediction.

  • Saliency Analysis

    Partial Dependence Plot (PDP) visualizes the average dependence between the target and a set of features.

    Individual Conditional Expectation (ICE) visualizes this dependence for each individual sample.

    Shapley Additive Explanations (SHAP) distributes the prediction among the features as individual contributions, based on Shapley values from cooperative game theory.

  • Model Simplification and Surrogate Models

    Global Surrogate approximates the model globally with a simpler, interpretable model.

    Local Surrogate (LIME) approximates the model locally with a simpler, interpretable model in order to explain an individual prediction.

  • Mathematical Modeling

    These approaches, building on certain assumptions, formulate a mathematical framework to explain model outcomes.

  • Explanation by Example

    These methods illustrate the workings of the model by studying representative samples and their corresponding outputs.

  • Explanation Generation

    These methods aim to generate symbols or text explaining the inner workings of the model.
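
The sketches below illustrate several of the approaches above; the datasets, models, and libraries used are illustrative assumptions, not part of any canonical implementation. First, a minimal Permutation Feature Importance sketch with scikit-learn: each feature is shuffled in turn and the drop in validation score is reported as its importance.

    # Permutation Feature Importance: shuffle one feature at a time and
    # measure how much the validation score drops.
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

    # n_repeats controls how many times each feature is permuted.
    result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

    for name, mean, std in sorted(
        zip(X.columns, result.importances_mean, result.importances_std),
        key=lambda t: t[1], reverse=True,
    ):
        print(f"{name}: {mean:.3f} +/- {std:.3f}")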
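
A sketch of Scoped Rules (Anchors) using the AnchorTabular explainer from the alibi library; the class, argument, and attribute names are quoted from memory and should be checked against the library's documentation.

    # Anchors: find an if-then rule that "anchors" a single prediction,
    # i.e. the prediction stays the same as long as the rule holds.
    from alibi.explainers import AnchorTabular
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    data = load_iris()
    clf = RandomForestClassifier(random_state=0).fit(data.data, data.target)

    explainer = AnchorTabular(clf.predict, feature_names=data.feature_names)
    explainer.fit(data.data)

    explanation = explainer.explain(data.data[0], threshold=0.95)
    print("Anchor:   ", " AND ".join(explanation.anchor))
    print("Precision:", explanation.precision)
    print("Coverage: ", explanation.coverage)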
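
PDP and ICE can be produced together with scikit-learn's PartialDependenceDisplay; the gradient boosting model and the choice of features here are just an example.

    # PDP and ICE: average (PDP) and per-sample (ICE) dependence of the
    # prediction on selected features.
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.inspection import PartialDependenceDisplay

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = GradientBoostingRegressor(random_state=0).fit(X, y)

    # kind="both" overlays the individual ICE curves with their average, the PDP.
    PartialDependenceDisplay.from_estimator(model, X, features=["bmi", "bp"], kind="both")
    plt.show()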
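
A SHAP sketch, assuming the shap package is installed; TreeExplainer is the fast variant for tree ensembles, and exact helper names may differ slightly across package versions.

    # SHAP: decompose each prediction into additive feature contributions
    # (Shapley values).
    import shap
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = RandomForestRegressor(random_state=0).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    # Per-feature contributions for the first sample; together with the base
    # value they sum to that sample's prediction.
    print(dict(zip(X.columns, shap_values[0].round(2))))
    shap.summary_plot(shap_values, X)  # global overview of feature contributions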
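
A global surrogate can be built with nothing but scikit-learn: fit a small interpretable model to the black-box model's predictions and report how faithfully it mimics them.

    # Global surrogate: fit a shallow decision tree to the *predictions*
    # of the black-box model, then read the tree as the explanation.
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.tree import DecisionTreeRegressor, export_text

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    black_box = RandomForestRegressor(random_state=0).fit(X, y)

    surrogate = DecisionTreeRegressor(max_depth=3, random_state=0)
    surrogate.fit(X, black_box.predict(X))  # learn to mimic the black box

    # R^2 between surrogate and black-box predictions ("fidelity").
    print("Fidelity:", r2_score(black_box.predict(X), surrogate.predict(X)))
    print(export_text(surrogate, feature_names=list(X.columns)))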
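
A LIME sketch for a single prediction, assuming the lime package; argument names are quoted from memory and worth double-checking against its documentation.

    # LIME: fit a weighted linear model around one instance to explain
    # that single prediction locally.
    from lime.lime_tabular import LimeTabularExplainer
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    data = load_iris()
    clf = RandomForestClassifier(random_state=0).fit(data.data, data.target)

    explainer = LimeTabularExplainer(
        data.data,
        feature_names=data.feature_names,
        class_names=list(data.target_names),
        mode="classification",
    )

    # Explain the prediction for one flower (class 0) with the top 4 features.
    exp = explainer.explain_instance(data.data[0], clf.predict_proba,
                                     num_features=4, labels=[0])
    print(exp.as_list(label=0))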
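
One simple form of explanation by example is to show the training samples most similar to the instance being explained; the nearest-neighbour lookup below is just one illustrative way to pick those samples.

    # Explanation by example: retrieve similar training samples and show
    # the model's outputs for them alongside the query prediction.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import NearestNeighbors

    data = load_iris()
    clf = RandomForestClassifier(random_state=0).fit(data.data, data.target)

    nn = NearestNeighbors(n_neighbors=3).fit(data.data)
    query = data.data[[100]]                     # instance we want to explain
    _, idx = nn.kneighbors(query)

    print("Prediction for query:", clf.predict(query))
    for i in idx[0]:
        print("Similar training sample", i, "->", clf.predict(data.data[[i]]))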

Intrinsic Interpretability

Intrinsic interpretability refers to interpretability that is built into the model itself. Two major approaches have been employed; short sketches of both follow the list.

  • Interpretable Representation

    A set of favorable properties, such as monotonicity or sparsity, is imposed on the model through regularization or constraints to arrive at a more interpretable representation.

  • Interpretable Architecture

    The architecture of the model itself is designed to be interpretable, for example a linear model, a shallow decision tree, or a generalized additive model.
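
A sketch of interpretable representations in scikit-learn, assuming a recent version: the L1 penalty of Lasso enforces sparsity, and the monotonic_cst option of HistGradientBoostingRegressor enforces a monotone effect for a chosen feature.

    # Interpretable representation: impose sparsity (L1 penalty) and a
    # monotonicity constraint so the fitted model is easier to read.
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.linear_model import Lasso

    X, y = load_diabetes(return_X_y=True, as_frame=True)

    # Sparsity: the L1 penalty drives most coefficients to exactly zero.
    sparse_model = Lasso(alpha=1.0).fit(X, y)
    print({n: round(c, 2) for n, c in zip(X.columns, sparse_model.coef_) if c != 0})

    # Monotonicity: force the prediction to be non-decreasing in "bmi".
    mono = [1 if name == "bmi" else 0 for name in X.columns]
    mono_model = HistGradientBoostingRegressor(monotonic_cst=mono, random_state=0).fit(X, y)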
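
As an example of an interpretable architecture, a shallow decision tree can be read directly as a set of if-then rules; the depth limit and dataset below are illustrative choices.

    # Interpretable architecture: a depth-3 decision tree printed as rules.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_iris()
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)
    print(export_text(tree, feature_names=data.feature_names))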
