Beyond Labels: Implementing Unsupervised Anomaly Detection with Isolation Forest and LightGBM
Explore a practical implementation of an anomaly detection scheme with IsolationForest and automated feedback loops
By Kuriko IWAI

Table of Contents
Introduction
What is Anomaly Detection?
Common Methods for Anomaly Detection
Comparing with Unsupervised Clustering
What is IsolationForest
Simulation
Wrapping Up

Introduction
Unsupervised anomaly detection is an important technique for detecting unusual events without labeled data.
In this article, I’ll give an overview of anomaly detection techniques, followed by a practical simulation of the fraud detection cycle using IsolationForest for anomaly detection and LightGBM as the primary classification model.
What is Anomaly Detection?
Anomaly detection is a technique for identifying anomalies: data points that differ significantly from what is considered normal, so-called “outliers,” relative to the majority of the data in a dataset.

Figure A. Anomaly found in the data points (source)
Anomaly detection is crucial for identifying threats and risks through irregular data patterns. Major examples include:
Fraud detection by flagging unusual spending or a sudden jump in transactions.
Manufacturing quality control by noticing a sudden rise or fall in quality metrics.
Healthcare monitoring by highlighting a sudden increase in a health measurement.
Essentially, it flags these unusual events for human review to help reduce risks.
Common Methods for Anomaly Detection
When structured data on anomalies is available, supervised and clean anomaly detection are valuable approaches in addition to unsupervised methods.
Here are key characteristics of each approach:
◼ Supervised Anomaly Detection
Involves training a model on a dataset where data points are explicitly labeled as either normal or anomalous.
Effective when reliable labeled data is available and anomalies are well-defined.
Examples of ML algorithms for structured data: Bayesian networks, k-nearest neighbors (KNN), and decision trees.
◼ Clean Anomaly Detection
Focuses on identifying significant deviations from normal patterns in data that is largely free from noise or errors.
Suitable for applications with well-structured and predictable data, such as fraud detection or manufacturing quality control.
◼ Unsupervised Anomaly Detection
Identifies anomalies by finding data points that significantly deviate from the majority of the data.
Useful when anomalies are rare or not well understood, and there are no labeled anomalies in the training data.
Examples of ML algorithms include k-means and one-class support vector machines (SVMs).
▫ Common Approaches
When we don’t have labeled anomaly data, unsupervised anomaly detection methods are effective.
Major methods include statistical, clustering, proximity-based, time-series, and machine learning approaches, and the choice depends on the specific data and anomaly types.
Below is an overview of each method with guidance on when to use it.
▫ 1. Statistical Method
Uses statistical properties of the data to find unusual observations. Major methods include:
Z-Score/Standard Score: Quantifies how many standard deviations a data point is from the mean, and flags data points significantly far from the mean as anomalies.
Percentiles: Identifies anomalies by setting thresholds based on percentiles or quantiles and flags data points falling outside these thresholds as anomalies.
When to Use:
- Best for data with well-understood distributions, where anomalies are clear deviations from statistical norms.
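For instance, a minimal sketch on synthetic, normally distributed data; both methods reduce to a simple threshold on a summary statistic:

import numpy as np

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0, 1, 995), rng.normal(8, 1, 5)])  # 5 injected outliers

# z-score: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
z_outliers = np.where(np.abs(z) > 3)[0]

# percentiles: flag points outside the 0.5th-99.5th percentile band
lo, hi = np.percentile(x, [0.5, 99.5])
pct_outliers = np.where((x < lo) | (x > hi))[0]

print(f"z-score flagged {len(z_outliers)}, percentiles flagged {len(pct_outliers)}")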
▫ 2. Clustering Method
Groups similar data points together, with anomalies often being points that don’t fit into any defined cluster.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clusters data points based on their density. Points in low-density regions that do not belong to any cluster are considered anomalies.
K-Means Clustering: Data points that are far from the center of any cluster or do not belong to any well-defined cluster may be identified as anomalies.
When to Use:
- Best when normal data forms distinct clusters.
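A minimal sketch using scikit-learn's DBSCAN on synthetic blob data; DBSCAN assigns the label -1 to noise points that belong to no cluster:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=42)
X = np.vstack([X, [[20, 20], [-20, 20]]])  # two points far from every cluster

# points in low-density regions receive the noise label -1
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
anomalies = X[labels == -1]
print(f"DBSCAN marked {len(anomalies)} points as noise/anomalies")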
▫ 3. Proximity-Based Method
Measures the distance or similarity between data points to identify those that are unusually far from their neighbors or the data’s central tendency.
Mahalanobis Distance: This measure calculates the distance of data points from the center of the data distribution, taking into account the correlations between different features.
Local Outlier Factor (LOF): LOF computes the local density deviation of a data point in comparison to its neighbors. This helps in identifying outliers in regions with varying densities.
When to Use:
Best when anomalies are defined by their isolation or distance from other data points.
Best when dealing with complex, multi-dimensional datasets where density variations are important.
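A minimal sketch with scikit-learn's LocalOutlierFactor on synthetic data containing two clusters of different densities:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

# two clusters with very different densities
X, _ = make_blobs(n_samples=500, centers=2, cluster_std=[0.5, 2.5], random_state=42)

# LOF compares each point's local density with that of its neighbors;
# fit_predict returns -1 for outliers and 1 for inliers
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(X)
print(f"LOF flagged {np.sum(labels == -1)} outliers")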
▫ 4. Time-Series Analysis
Specifically designed for sequential data, these methods identify anomalies based on temporal patterns.
Moving Averages: Detects anomalies when data points significantly deviate from the calculated moving average or exponential moving average over a specific period.
Seasonal Decomposition: A time series is broken down into its trend, seasonal, and residual components. Anomalies are often found within the residual component that cannot be explained by the trend or seasonality.
When to Use:
Best when dealing with sequential data where the order of observations matters.
Best when anomalies are deviations from expected trends, seasonality, or temporal patterns (e.g., sudden spikes or drops in data over time).
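A minimal sketch of moving-average detection on a synthetic sine wave with one injected spike:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ts = pd.Series(np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 0.1, 500))
ts.iloc[250] += 2.0  # inject a sudden spike

# flag points deviating from the rolling mean by more than 3 rolling stds
roll = ts.rolling(window=30, center=True)
resid = (ts - roll.mean()).abs()
anomalies = ts[resid > 3 * roll.std()]
print(f"flagged {len(anomalies)} point(s) at index: {list(anomalies.index)}")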
▫ 5. Machine Learning Algorithm
Uses machine learning algorithms to learn patterns from data.
Isolation Forest: An ensemble method that efficiently isolates anomalies by building a tree-like structure.
One-Class SVM: A Support Vector Machine (SVM) model trained to classify data points as either normal or outliers, effectively defining a boundary around normal data.
K-Nearest Neighbors (KNN): Assigns an anomaly score based on the distance to its ‘K’ closest neighbors.
Autoencoders: Neural networks that learn a compressed representation of the data; anomalies are detected by their high reconstruction error.
When to Use:
- Best when complex, non-linear patterns define normal behavior, and anomalies are very subtle.
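Isolation Forest and One-Class SVM are demonstrated in the Simulation section below. As a minimal sketch of the KNN-based approach, each point can be scored by its mean distance to its K nearest neighbors:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# score each point by its mean distance to its K nearest neighbors
k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: the nearest neighbor of a point is itself
dist, _ = nn.kneighbors(X)
scores = dist[:, 1:].mean(axis=1)

# flag, say, the top 1% most isolated points as anomalies
threshold = np.quantile(scores, 0.99)
print(f"flagged {np.sum(scores > threshold)} points as anomalies")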
Comparing with Unsupervised Clustering
Unsupervised clustering is another useful method to detect irregular patterns in the dataset. I’ll explore its key differences from anomaly detection.
Taking fraud detection as an example, clustering helps identify groups of potentially fraudulent transactions or different behavioral segments, some of which may carry higher risk than others. But a small cluster isn’t necessarily fraudulent; it’s just a distinct group.
On the other hand, anomaly detection directly aims to flag individual transactions that are unusual enough to warrant investigation as potential fraud, regardless of whether they form a “group.”
These methods have fundamentally different primary goals and hence generate different outputs.
◼ Primary Goals
Clustering: To discover natural groupings or segments within the unlabeled data.
Anomaly Detection: To identify individual data points that significantly deviate from what is considered “normal”.
◼ Outputs
Clustering: Assigns a cluster ID to each data point.
Anomaly Detection: Provides an anomaly score or a binary flag (-1 for outlier, 1 for inlier).
◼ Typical Use Cases
▫ Unsupervised Clustering
A group of fraudsters is working together, perhaps using multiple synthetic identities or compromised accounts. Their individual transactions might not seem like strong anomalies, but their collective patterns are suspicious.
Multiple customers who share the same phone number apply for loans.
Security managers want to identify inherently riskier types of merchants even though they aren’t actively engaged in fraud right now.
▫ Unsupervised Anomaly Detection
A new type of credit card scam emerges.
An employee seems to exploit the internal system and access customer accounts in an unauthorized manner.
A legitimate customer’s account is compromised, and fraudsters begin making transactions that are completely out of character for that customer.
A user logs in from New York, then 5 minutes later logs in from Tokyo.
A user creates many new accounts from the same IP range using disposable email addresses.
These cases typically involve very subtle, individual irregularities with no labeled data or prior examples; due to this lack of data, primary models are likely to overlook these irregular patterns.
In the next section, I’ll take Isolation Forest as an example to demonstrate how anomaly detection actually works.
What is IsolationForest
Isolation Forest is an algorithm for anomaly detection using binary trees.
It isolates anomalies by exploiting their distinct characteristics rather than modeling the normal data.
This intuitive and efficient approach makes it a popular choice for anomaly detection tasks.
Its key characteristics include:
Ensemble Method: Builds multiple isolation trees.
Anomaly Isolation: Identifies anomalies by how easily they are isolated (require fewer splits to be separated).
Efficiency: Relatively fast and scalable for high-dimensional data.
Unsupervised: Does not require labeled data for training.
Outlier Focus: Directly targets outliers instead of profiling normal points.
Effective for High Dimensions: Handles datasets with many features well.
◼ How Isolation Forest Works
The image illustrates how a data point (x) is evaluated across multiple isolation trees (Tree 1, Tree 2, …, Tree T) to determine if it’s an anomaly:

Figure B. Isolation Forest architecture and how the data point is processed (Created by Kuriko IWAI)
▫ Computing Anomaly Score
In the diagram, h_i(x) represents the path length of data point x in the i-th tree, counting the number of edges the data point crosses from the root node to a leaf node.
The shorter the path length, the more anomalous the data point.
The algorithm computes the average path length E[h(x)] of the data point across all isolation trees:

$$E[h(x)] = \frac{1}{T} \sum_{i=1}^{T} h_i(x)$$

where:
T: the total number of isolation trees in the forest, and
h_i(x): the path length of data point x in the i-th isolation tree.
Then, the algorithm converts the average path length into a normalized anomaly score s(x) such that:

$$s(x) = 2^{-\frac{E[h(x)]}{c(N)}}$$

where:
E[h(x)]: the average path length of x,
N: the number of data points in the training subset used to build a single tree (the subsampling size),
c(N): a normalization factor representing the average path length of an unsuccessful search in a Binary Search Tree of N points:

$$c(N) = 2H(N-1) - \frac{2(N-1)}{N}, \qquad H(m) \approx \ln(m) + 0.5772156649$$

(H(m): the harmonic number for sample size m, approximated using the Euler–Mascheroni constant)
The anomaly score ranges from 0 to 1, and a higher score indicates a higher likelihood of being an anomaly.
Lastly, the model compares the score with a threshold (γ: gamma).
If the score exceeds the threshold, it flags the data point as an anomaly.
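To make the formulas concrete: with a subsampling size of N = 256, c(N) ≈ 10.24, so a point with an average path length of E[h(x)] = 4 scores s(x) = 2^(−4/10.24) ≈ 0.76 (likely anomalous), while a point with E[h(x)] = c(N) scores exactly 0.5 (borderline). In scikit-learn, the paper's score can be recovered from score_samples, which returns its negation; a minimal sketch on synthetic data:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (500, 2)), [[6, 6]]])  # one obvious outlier

iso = IsolationForest(random_state=42).fit(X)
s = -iso.score_samples(X)  # score_samples returns the negated paper score
print(f"outlier score: {s[-1]:.3f}, median inlier score: {np.median(s[:-1]):.3f}")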
◼ Evaluation in Absence of Ground Truth
In reality, without labeled data or historical records, there's no ground truth for the model to confirm whether these anomalies genuinely need to be flagged.
So, the flags from the isolation forest are evaluated in several ways.
▫ Human-in-the-Loop / Domain Expert Validation
Human review is an important step in evaluating flags, including presenting flagged anomalies to fraud investigators.
Their feedback on how many are “true” fraud and how many are “false alarms” (false positives) provides valuable input to the anomaly detection system:
“Does the model’s output lead to actual investigations that uncover fraud, or does it primarily generate noise?”
“Does the model consistently flag similar types of events as anomalous over time?”
▫ Semi-Supervised Evaluation (if some labels exist)
If we have a small, held-out set of labeled data (even if it is too small to use for training), we can use it to calculate metrics like:
Precision@k: For the top k anomalies flagged by the model, what percentage are truly fraudulent?
Recall@k: Similar to Precision@k, but focusing on how many of the actual fraud cases were among the top k flagged.
ROC AUC / PR AUC on Anomaly Scores: Treat the anomaly score as a continuous value and plot ROC or Precision-Recall curves.
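A minimal sketch of Precision@k and Recall@k on hypothetical anomaly scores and a small labeled hold-out set (all data here is illustrative):

import numpy as np

def precision_recall_at_k(scores, y_true, k):
    """Precision@k and Recall@k for anomaly scores (higher = more anomalous)."""
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    tp = y_true[top_k].sum()              # true frauds among the top k
    return tp / k, tp / y_true.sum()

# illustrative data: 5% fraud, and frauds tend to score higher
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)
scores = rng.random(1000) + 0.5 * y_true
p, r = precision_recall_at_k(scores, y_true, k=50)
print(f"Precision@50: {p:.2f}, Recall@50: {r:.2f}")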
▫ Synthetic Anomaly Injection
We can inject a known number of synthetic anomalies into a clean dataset and see how many the model successfully identifies. This helps in benchmarking different unsupervised algorithms.
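A minimal sketch of this benchmarking idea, injecting obviously abnormal synthetic points into clean synthetic data and counting how many Isolation Forest recovers:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_clean = rng.normal(0, 1, (1000, 5))
X_anom = rng.uniform(5, 8, (20, 5))  # 20 synthetic anomalies far outside the normal range
X_all = np.vstack([X_clean, X_anom])

iso = IsolationForest(contamination=0.02, random_state=42).fit(X_all)
pred = iso.predict(X_all)            # -1 = outlier, 1 = inlier
detected = np.sum(pred[-20:] == -1)  # how many injected anomalies were caught
print(f"detected {detected}/20 injected anomalies")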
These evaluation methods help:
Generate new labeled data to retrain primary models,
Create new features or rules to tackle new threats, and
Adjust isolation forest hyperparameters (e.g., contamination) to refine detection accuracy.
In the next section, I’ll demonstrate the detection cycle using a credit card transaction dataset.
Simulation
In this section, I’ll first train the IsolationForest model for anomaly detection on the credit card transaction data, evaluating its performance with human-in-the-loop feedback.
Then, I’ll train LightGBM using the newly labeled data and assess performance on synthetically injected anomalies.
◼ Preprocessing Dataset
Just like other decision tree-based algorithms, IsolationForest does not require normalization of numerical features.
I simply loaded the original data from Financial Transactions Dataset: Analytics.
The Dataset:

Figure C. Original data (Created by Kuriko IWAI)
◼ Tuning
Without any clues about the frauds, I first relaxed the hyperparameters as much as possible to keep the model flexible:
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

isolation_forest = IsolationForest(
    n_estimators=500,       # number of trees in the ensemble
    contamination="auto",   # intentionally set to auto (adjusted later)
    max_samples='auto',
    max_features=1,         # each split considers only one feature at a time
    bootstrap=True,         # use bootstrap samples for robustness
    random_state=42,
    n_jobs=-1
)
y_pred_iso = isolation_forest.fit_predict(X_processed)
inliers_iso = X[y_pred_iso == 1]
outliers_iso = X[y_pred_iso == -1]
print(f"Isolation Forest detected {len(outliers_iso)} outliers.")
◼ Tuning One-Class SVM
For performance comparison, I also tuned the one-class SVM, setting a relaxed nu value:
one_class_svm = OneClassSVM(
    kernel='rbf',
    gamma='scale',
    tol=1e-7,
    nu=0.1,            # upper bound on the fraction of training errors (relaxed)
    shrinking=True,
    max_iter=5000,
)
y_pred_ocsvm = one_class_svm.fit_predict(X_processed)
inliers_ocsvm = X[y_pred_ocsvm == 1]
outliers_ocsvm = X[y_pred_ocsvm == -1]
print(f"One-Class SVM detected {len(outliers_ocsvm)} outliers.")
◼ Results
Isolation Forest detected 166 outliers.
One-Class SVM detected 101 outliers.

Figure D. Unsupervised anomaly detection by Isolation Forest (left) and One-Class SVM (right). Applied PCA to two dimensions for visualization. (Created by Kuriko IWAI)
IsolationForest is aggressive in finding points that are easily “isolated” and tends to be good at finding novel, truly anomalous points even if they are not extremely far from the main data cluster.
This can lead to a higher number of detected outliers if your definition of “outlier” is broad.
On the other hand, One-Class SVM defines a more structured boundary around the normal data and marks points outside this boundary as outliers.
It might be more conservative, requiring a more significant deviation from the “normal” manifold to be flagged.
◼ Evaluation - Human-in-the-loop Feedback
For a practical demonstration, I’ll categorize the flagged records into four classes by reviewing them one by one:
Class 1. True Positive (TP): The transaction flagged by the model was indeed fraudulent.
Class 2. False Positive (FP): The transaction flagged by the model was actually legitimate. This indicates a “false alarm.” Too many FPs can overwhelm analysts and lead to inefficiencies.
Class 3. New Fraud Pattern: The flag identifies a new type of fraud, or a previously missed fraud type, providing new insights into the evolving threat landscape.
Class 4. Legitimate but Unusual Behavior: The transaction seemed legitimate but genuinely unusual for that customer or segment, and it’s valuable to understand why the model flagged it.
▫ Important Notes:
I would also check False Negative (FN) records, where fraudulent transactions are not flagged as fraud. But these typically surface through other sources, like customer complaints or chargebacks, because the model never flags them. So, in this experiment, I excluded FNs.
In reality, it is crucial that fraud experts check the records to accurately categorize them.
Here are some of the flagged records with the human_review labels I added:

Figure E. List of flagged transaction records (synthetic) by IsolationForest and human review examples (Created by Kuriko IWAI)
Let's look at the first record: record #0. A refund of $77 isn't necessarily suspicious or fraudulent, but it does represent unusual behavior for that particular customer.
Record #67 (at the bottom) also shows numerous refunds. If we identify this as a new fraud scheme, we can flag it as a “3”.
For records #23, #25, and others, the customers are over 90 years old. It's generally unusual to see such a high volume of transactions from individuals in that age group.
The list continues. And out of 166 potential anomalies, the distribution by class was:
Class 1 accounted for 94,
Class 2 for 30,
Class 3 for 10, and
Class 4 for 23.
◼ Adjusting Contamination
Now, the review results tell us:
Total flagged anomalies: 166
True Positives (TP): 94
False Positives (FP): 72 (= 166 − 94)
Precision of the flagged set: TP / (TP + FP) = 94 / 166 ≈ 56.6%
This means that although the model flagged a 16.6% contamination rate (166 out of 1,000 samples), the actual observable contamination rate for this dataset was 9.4% (94 confirmed anomalies out of 1,000 total samples).
And crucially, we must also consider False Negatives (missed anomalies) that the model didn’t flag in this experiment.
So, the true contamination rate is likely ≥9.4%.
Given this fact, for the next iteration, I’ll start with a contamination rate of 0.1 (10%), for example:
from sklearn.ensemble import IsolationForest

refined_isolation_forest = IsolationForest(
    n_estimators=500,
    contamination=0.1,   # updated to 0.1
    max_samples='auto',
    max_features=1,
    bootstrap=True,
    random_state=42,
    n_jobs=-1
)
◼ Adding New Labels to Training Samples
Another important step is to retrain the primary model on the updated dataset.
I added three columns to the original DataFrame to record the human review results and derive the is_fraud target:
human_review (categorical: 0, 1, 2, 3, 4): stores zero if not an anomaly, otherwise the human review result (class number from 1 to 4).
is_fraud_iforest (binomial: 1, -1): stores the initial anomaly detection results from the IsolationForest model.
is_fraud (binomial: 0, 1): stores the final fraud decision (1 = fraud).
The is_fraud column is used as a target variable to train the primary model.
import pandas as pd

# merge the human review results (stored in a CSV) back into the dataset
df_human_review = pd.read_csv(csv_file_path, index_col=1)
df_merged = df_new.merge(
    df_human_review[['human_review']],
    left_index=True,
    right_index=True,
    how='left'
)
df_merged['human_review'] = df_merged['human_review'].fillna(0).astype(int)
df_merged['is_fraud_iforest'] = y_pred_iso  # initial IsolationForest flags (1 = inlier, -1 = outlier)
df_merged['is_fraud'] = (df_merged['human_review'] == 1).astype(int)  # class 1 (confirmed fraud) -> 1

Figure F. Updated data after adding the three columns (Created by Kuriko IWAI)
◼ Retrain the Primary Model (LightGBM)
I oversampled the minority class with SMOTE to address class imbalance:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from collections import Counter

X = df_merged.drop(columns='is_fraud')
y = df_merged['is_fraud'].copy()
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=300, shuffle=True, stratify=y, random_state=12)
X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=300, shuffle=True, stratify=y_tv, random_state=12)
print(Counter(y_train))

# oversample the fraud class from 30 to 75 training samples
smote = SMOTE(sampling_strategy={1: 75}, random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
print(Counter(y_train))
Counter({0: 370, 1: 30}) → Counter({0: 370, 1: 75})
Then, I retrained the primary model (LightGBM, here via scikit-learn's LightGBM-inspired HistGradientBoostingClassifier) with the updated training samples:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression


# primary model
lgbm = HistGradientBoostingClassifier(
    learning_rate=0.05,
    max_iter=500,
    max_leaf_nodes=20,
    max_depth=5,
    min_samples_leaf=32,
    l2_regularization=1.0,
    max_features=0.7,
    max_bins=255,
    early_stopping=True,
    n_iter_no_change=5,
    scoring="f1",
    validation_fraction=0.2,
    tol=1e-5,
    random_state=42,
    class_weight='balanced'
)

# baseline model
lr = LogisticRegression(
    penalty='l2',
    dual=False,
    tol=1e-5,
    C=1.0,
    class_weight='balanced',
    random_state=42,
    solver="lbfgs",
    max_iter=500,
    n_jobs=-1,
)
◼ Performance Evaluation
I evaluated the performance of the primary model against a Logistic Regression baseline using the train, validation, and test datasets.
To mitigate False Negatives, I set the F1 score as the primary metric:
Logistic Regression (L2 norm): Train 0.9886 / Validation 0.9790 → Generalization: 0.9719
LightGBM: Train 0.9133 / Validation 0.8873 → Generalization: 0.9040
Both models demonstrate good generalization capabilities, as their generalization scores are close to the training scores.
Logistic Regression with L2 regularization is the better-performing model in this comparison, achieving higher F1 scores on both the training data and, more importantly, on the unseen test data. Its generalization performance (0.9719) is superior to LightGBM’s (0.9040).
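For reference, here is a minimal sketch of the evaluation loop that could produce such scores, assuming X_train/X_val/X_test are the already-encoded splits created above:

from sklearn.metrics import f1_score

for name, model in [("LightGBM (HGBT)", lgbm), ("Logistic Regression (L2)", lr)]:
    model.fit(X_train, y_train)
    for split, X_s, y_s in [("train", X_train, y_train),
                            ("val", X_val, y_val),
                            ("test", X_test, y_test)]:
        print(f"{name} | {split} F1: {f1_score(y_s, model.predict(X_s)):.4f}")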
◼ Synthetic Anomaly Injection
Lastly, I created 50 synthetic data points to test the robustness of the models.
The results tell us that LightGBM achieved a perfect F1-score of 1.0000 in detecting synthetic fraud, significantly outperforming Logistic Regression (F1-score: 0.8764, with 0.90 precision and 0.84 recall).
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

num_synthetic_fraud = 50
# generate_synthetic_fraud is a custom helper (definition omitted) that creates
# fraud-like feature vectors based on the patterns observed in df_merged
synthetic_fraud_X = generate_synthetic_fraud(num_synthetic_fraud, X_test, df_merged)

# append the synthetic frauds to the test set with positive labels
X_test_with_synthetic = pd.concat([X_test, synthetic_fraud_X], ignore_index=True)
y_test_with_synthetic = np.concatenate([y_test, np.ones(num_synthetic_fraud)])

# pipeline: the fitted preprocessing + model pipeline from the training step
y_pred_test_with_synthetic = pipeline.predict(X_test_with_synthetic)
f1_test_with_synthetic = f1_score(y_test_with_synthetic, y_pred_test_with_synthetic, average='weighted')

Figure G. F1 score results (Created by Kuriko IWAI)
While LightGBM perfectly identifies injected fraudulent cases, Logistic Regression with an L2 norm might offer a better overall balance for fraud classification at this stage.
Wrapping Up
In this experiment, I demonstrated how anomaly flags can progressively refine both the detection scheme and the primary classification model for fraud detection.
We also found Isolation Forest to be a powerful anomaly detection model, effective without requiring the explicit modeling of normal patterns.
For practical application, automating the human review system and developing a structured approach to creating rules for fraudulent transactions would be crucial enhancements to the methods presented.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Beyond K-Means: A Deep Dive into Gaussian Mixture Models and the EM Algorithm
Related Books for Further Understanding
These books cover a wide range of theory and practice, from fundamentals to PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps
Share What You Learned
Kuriko IWAI, "Beyond Labels: Implementing Unsupervised Anomaly Detection with Isolation Forest and LightGBM" in Kernel Labs
https://kuriko-iwai.com/unsupervised-anomaly-detection-for-unseen-risk-events
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.
