Isolation Forest with a simple illustration
In this article, you can read about the Isolation Forest algorithm and its drawbacks, and see a visualisation of its results.
Isolation Forest, like any tree ensemble method, is built on decision trees. In these trees, partitions are created by first randomly selecting a feature and then selecting a random split value between the minimum and maximum values of that feature. In principle, outliers are less frequent than regular observations and differ from them in value (they lie further away from the regular observations in feature space). With such random partitioning they should therefore be isolated closer to the root of the tree, i.e., with a shorter average path length (the number of edges an observation must pass in the tree going from the root to the terminal node) and fewer splits necessary.
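As a quick, hedged illustration of this idea (the sample data and parameter values below are made up for the example), scikit-learn's IsolationForest can be fitted on a small one-dimensional sample, and score_samples shows that an obvious outlier gets a noticeably lower score, reflecting its shorter average path length:
import numpy as np
from sklearn.ensemble import IsolationForest

# Regular points clustered around 10, plus one obvious outlier at 100.
rng = np.random.RandomState(42)
X = np.concatenate([rng.normal(10, 1, size=50), [100]]).reshape(-1, 1)

forest = IsolationForest(random_state=42).fit(X)
scores = forest.score_samples(X)  # the lower the score, the more abnormal

print(scores[:3])   # scores of a few regular points
print(scores[-1])   # score of the outlier: noticeably lower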
Formal description
From a mathematical point of view, recursive partitioning can be represented by a tree structure, and the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, from the root to the terminating node. For example, an anomalous point will have a much shorter path length and need fewer splits to be isolated than a regular point.
Let X(t) = {x(1),…,x(n)} be the set of points of some feature X(t), and let X1 be some subset of those points. We make splits in such a way that each internal node T holds a 'test' and has exactly two daughter nodes, T(l) and T(r). The test at node T consists of an attribute X(q) and a split value p, such that the test X(q) < p determines whether a data point traverses to T(l) (the left daughter) or to T(r) (the right daughter).
In order to build an isolation tree (iTree), the algorithm recursively divides X(t) by randomly selecting an attribute X(q) and a split value p, until either
- the node has only one instance, or
- all data at the node have the same values.
When the iTree is fully grown, each point in X(t) is isolated at one of the external nodes. Intuitively, the anomalous points are those (easier to isolate, hence) with the shorter path lengths in the tree, where the path length of a point x is defined as the number of splits made from the root node to reach an external node.
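To make the recursion concrete, here is a minimal one-dimensional sketch (not the library implementation; grow_itree and path_length are illustrative names, and with a single feature the random attribute choice is omitted) that grows a single isolation tree with random split values and counts how many splits it takes to isolate a point:
import random

def grow_itree(points, depth=0, max_depth=10):
    # Stop when the node has only one instance or all values are the same.
    if len(points) <= 1 or len(set(points)) == 1 or depth >= max_depth:
        return {'size': len(points)}
    p = random.uniform(min(points), max(points))  # random split value
    left = [v for v in points if v < p]
    right = [v for v in points if v >= p]
    return {'split': p,
            'left': grow_itree(left, depth + 1, max_depth),
            'right': grow_itree(right, depth + 1, max_depth)}

def path_length(tree, x, depth=0):
    # Number of splits made from the root until x reaches an external node.
    if 'split' not in tree:
        return depth
    branch = tree['left'] if x < tree['split'] else tree['right']
    return path_length(branch, x, depth + 1)

data = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 100.0]  # 100.0 is the outlier
tree = grow_itree(data)
print(path_length(tree, 100.0), path_length(tree, 10.0))  # the outlier is usually isolated sooner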
Main drawback
The main drawback of the Isolation Forest algorithm is that it must be given the outlier fraction, i.e., the proportion of anomalies inferred a priori for the data set. This implies that one needs prior knowledge of the phenomenon and the data, and some idea of which outliers or anomalies are present in the data set.
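In scikit-learn this proportion is the contamination parameter of IsolationForest. A short sketch of the two options (the toy data below is made up for illustration):
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[10.0], [10.2], [9.9], [10.1], [50.0]])

# If the anomaly proportion is (roughly) known, pass it explicitly.
model_known = IsolationForest(contamination=0.2, random_state=0)
print(model_known.fit_predict(X))  # -1 marks the points treated as anomalies

# Otherwise 'auto' (the default) derives the threshold from the offset
# used in the original paper instead of a user-supplied fraction.
model_auto = IsolationForest(contamination='auto', random_state=0)
print(model_auto.fit_predict(X))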
Example on the Boston data set
We will create a fake anomalous data point in the Boston house price data set and then apply a function that runs the Isolation Forest algorithm and plots the anomalous points:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

def iso_forest_detection(df, outliers_fraction, target_feature: str, plotting=True):
    # Reshape the target column into the 2-D array expected by scikit-learn.
    x = df[target_feature].to_numpy().reshape(len(df[target_feature]), -1)
    model = IsolationForest(contamination=outliers_fraction)
    model.fit(x)
    predictions = model.predict(x)  # anomalies are -1
    df[target_feature + '_anomalies'] = pd.Series(predictions, index=df.index)
    decision_function = model.decision_function(x)
    df['decision_function'] = pd.Series(decision_function, index=df.index)
    if plotting:
        a = df.loc[df[target_feature + '_anomalies'] == -1]  # anomalies
        plt.figure(figsize=(6, 6))
        plt.plot(df[target_feature], color='blue', label=f'original {target_feature}')
        plt.plot(a[target_feature], linestyle='none', marker='X', color='red', markersize=8, label='Anomaly')
        plt.plot(df['decision_function'], color='green', label='decision function', alpha=0.3)
        # The anomaly score of the input samples: the lower, the more abnormal.
        # Negative scores represent outliers, positive scores represent inliers.
        plt.xlabel('Rows')
        plt.ylabel(f'{target_feature} value')
        plt.title('Anomalies')
        plt.legend(loc='best')
        plt.show()
    return df
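The loading of df is not shown in the article; under the assumption that df holds the Boston house price data with a 'TAX' column, one way to obtain it as a DataFrame (load_boston has been removed from recent scikit-learn releases) is to fetch it from OpenML:
import pandas as pd
from sklearn.datasets import fetch_openml

# Assumption: df is the Boston house price data with a 'TAX' column.
boston = fetch_openml(name='boston', version=1, as_frame=True)
df = boston.frame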
Then we will add a fake anomalous data point:
df.loc[0, 'TAX'] = 3000
After that, we will see whether the algorithm detects it with an outlier fraction of 1%:
outliers_fraction=0.01
df = iso_forest_detection(df, outliers_fraction, 'TAX', plotting=True)
In the next plot we can see that the algorithm also captured one point that may not be an anomaly:
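One simple way to post-filter such borderline detections is to inspect the decision_function values of the flagged rows and keep only the strongly negative ones; the -0.05 cut-off below is an arbitrary illustration, not a recommended value:
# Rows flagged as anomalies, together with their anomaly scores.
flagged = df[df['TAX_anomalies'] == -1][['TAX', 'decision_function']]
print(flagged)

# Keep only strongly negative scores; the threshold should come from
# domain knowledge rather than this arbitrary example value.
strong_anomalies = flagged[flagged['decision_function'] < -0.05]
print(strong_anomalies)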
Conclusion
We saw that a carefully tuned outlier fraction can be of crucial importance for a good application of the Isolation Forest algorithm. Subsequent filtering of the preselected anomalous data can also help, but a good business or scientific understanding of the problem is crucial. Full code is available here.