Example with financial data #195

Merged
merged 70 commits into from
Nov 14, 2023
09ee61d
Slim vector (#32)
gcattan Oct 18, 2023
16fbb5a
Update Dockerfile
gcattan Oct 18, 2023
0b5d02a
Update financial_data.py
gcattan Oct 18, 2023
249f6c8
Update Dockerfile
gcattan Oct 18, 2023
2319233
Update examples/other_datasets/financial_data.py
gcattan Oct 19, 2023
3139261
Update examples/other_datasets/financial_data.py
gcattan Oct 19, 2023
d485d8d
- rename file to run on Ci
gcattan Oct 19, 2023
d1faa49
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 19, 2023
8cf752b
- print gridsearch results
gcattan Oct 19, 2023
bfad074
change location of a comment
gcattan Oct 19, 2023
8b314cc
plot a sample of the epochs
gcattan Oct 19, 2023
f99eff8
let's try to plot waveforms
gcattan Oct 19, 2023
f72d31b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 19, 2023
e51e7c8
- transpose missing
gcattan Oct 19, 2023
5072a13
correct warning in doc building
qbarthelemy Oct 19, 2023
0d1fd19
standardscaler
gcattan Oct 19, 2023
011f5ec
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 19, 2023
7817975
small modification
gcattan Oct 19, 2023
8e59ed1
fix doc building
qbarthelemy Oct 20, 2023
f97e437
test standardscaler fix on CI
gcattan Oct 21, 2023
6042b80
add randomforest
gcattan Oct 21, 2023
54d6ab3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 21, 2023
3ca1535
fix features not known
gcattan Oct 21, 2023
31bcaf1
ndstandardscaler
gcattan Oct 21, 2023
a0dd1b8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 21, 2023
2a66e5b
fix signature
gcattan Oct 21, 2023
58bc705
- add more variables
gcattan Oct 21, 2023
dceca35
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 21, 2023
2dea963
- add gitignore
gcattan Oct 23, 2023
8578f5a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 23, 2023
d474629
update protobuf
gcattan Oct 23, 2023
71d61e9
Update examples/other_datasets/plot_financial_data.py
gcattan Oct 24, 2023
593592f
declare rf inside pipeline
gcattan Oct 24, 2023
258ac24
Update requirements.txt
gcattan Oct 25, 2023
459d264
Merge branch 'main' into main
gcattan Oct 25, 2023
6f2c87c
new implementation of ndstandardscaler
gcattan Oct 26, 2023
eea2520
minor improvement of comments
gcattan Oct 26, 2023
84eafa5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 26, 2023
e91f3e2
- use stratify
gcattan Oct 27, 2023
b49f922
unsupervised classification with collusion
gcattan Oct 27, 2023
f30f257
improve comment
gcattan Oct 27, 2023
de2afa1
move print of ERP
gcattan Oct 27, 2023
d08be8f
fix label encoding
gcattan Oct 28, 2023
770d6db
plot the two erps on the same figure
gcattan Oct 28, 2023
1b6dda0
Plot ERPs, change some variables.
gcattan Oct 28, 2023
448b720
move ERP plotting to another location
gcattan Oct 28, 2023
9400bea
improve display
gcattan Oct 30, 2023
9e83112
improve graphics
gcattan Oct 30, 2023
a7a26c6
typo
gcattan Nov 8, 2023
29160ce
- Try Tomeklinks
gcattan Nov 9, 2023
14243c9
Select SALDO_ANTES_PRESTAMO
gcattan Nov 10, 2023
f5394d8
use balanced accuracy
gcattan Nov 11, 2023
496f3db
change balance ratio
gcattan Nov 12, 2023
50a0c90
small clean-up
gcattan Nov 12, 2023
77af437
Merge branch 'main' into main
gcattan Nov 12, 2023
e449957
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 12, 2023
8b78f8a
lint
gcattan Nov 12, 2023
62dae07
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 12, 2023
65ea77d
fix seaborn version
gcattan Nov 12, 2023
ad94160
fix Dockerfile
gcattan Nov 12, 2023
4d866a2
Update examples/other_datasets/plot_financial_data.py
gcattan Nov 14, 2023
d4bd2fa
Update examples/other_datasets/plot_financial_data.py
gcattan Nov 14, 2023
e052e92
Update examples/other_datasets/plot_financial_data.py
gcattan Nov 14, 2023
73f8004
Update examples/other_datasets/plot_financial_data.py
gcattan Nov 14, 2023
517d6b1
Update examples/other_datasets/plot_financial_data.py
gcattan Nov 14, 2023
216dcbd
Update examples/other_datasets/plot_financial_data.py
gcattan Nov 14, 2023
ed9ec59
Update examples/other_datasets/plot_financial_data.py
gcattan Nov 14, 2023
585d026
Update examples/other_datasets/plot_financial_data.py
gcattan Nov 14, 2023
e1810bf
Update examples/other_datasets/plot_financial_data.py
gcattan Nov 14, 2023
eb048b4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 14, 2023
standardscaler
gcattan authored Oct 19, 2023
commit 0d1fd197612c4757d5d407a7aa485b68a9adb581
23 changes: 13 additions & 10 deletions examples/other_datasets/plot_financial_data.py
@@ -22,6 +22,7 @@
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
+from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from imblearn.under_sampling import NearMiss
from pyriemann.preprocessing import Whitening
@@ -81,6 +82,8 @@
# of the `ToEpochs` transformer (see below)
features["index"] = features.index

+# Apply a StandardScaler to the feature
+features_scaled = StandardScaler().fit_transform(features.to_numpy())

##############################################################################
# Pipeline for binary classification
@@ -185,23 +188,19 @@ def transform(self, X):
# Note: at this stage `features` also contains the `index` column.
# So `NearMiss` we choose the closest 200 non-fraud epochs to the 200 fraud-epochs
# based also on this `index` column. This should be improved for real use cases.
-X, y = NearMiss().fit_resample(features.to_numpy(), target.to_numpy())
+X, y = NearMiss().fit_resample(features_scaled, target.to_numpy())

-X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

labels, counts = np.unique(y_train, return_counts=True)
print(f"Training set shape: {X_train.shape}, genuine: {counts[0]}, frauds: {counts[1]}")

labels, counts = np.unique(y_test, return_counts=True)
print(f"Testing set shape: {X_test.shape}, genuine: {counts[0]}, frauds: {counts[1]}")

-# before fitting the GridSearchCV, let's display a sample of the epochs:
+# before fitting the GridSearchCV, let's display the "ERP" (see [3]_)
epochs = ToEpochs(n=10).transform(X_train)
-print("Profile of an epoch:")
-print(epochs[0])

-# ...and the "ERP"
-# (see https://pyriemann.readthedocs.io/en/latest/auto_examples/ERP/plot_ERP.html)
plot_waveforms(epochs, "hist")
plt.show()
Member

The plot is OK now, but there seems to be no ERP...

Collaborator Author

OK, I think I found the problem: the epochs include loans from different clients. It is probably better to consider only each customer's own history. The bad news is that the dataset does not contain many loans per customer.
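
To make the idea concrete, here is a minimal sketch of building epochs from each customer's own loans only. It is not the `ToEpochs` transformer from this PR (its implementation is not shown here); the function name `epochs_per_customer`, the column names `customer_id`, `amount`, and `balance`, and the toy data are all hypothetical, and the rows are assumed to be in chronological order.

```python
import numpy as np
import pandas as pd


def epochs_per_customer(df, customer_col, feature_cols, n):
    """Build epochs of shape (n_features, n) from each customer's own loans.

    Hypothetical helper: rows are assumed to be in chronological order.
    Customers with fewer than `n` loans yield no epoch.
    """
    epochs, owners = [], []
    for customer, history in df.groupby(customer_col, sort=False):
        values = history[feature_cols].to_numpy()
        # Slide a window of n consecutive loans over this customer's history.
        for start in range(len(values) - n + 1):
            epochs.append(values[start:start + n].T)
            owners.append(customer)
    if not epochs:
        return np.empty((0, len(feature_cols), n)), owners
    return np.stack(epochs), owners


# Toy data (made up): customer "a" has 4 loans, "b" has 2, "c" has 1.
toy = pd.DataFrame({
    "customer_id": ["a", "b", "a", "a", "b", "c", "a"],
    "amount": [10.0, 5.0, 12.0, 11.0, 6.0, 3.0, 50.0],
    "balance": [100.0, 40.0, 90.0, 80.0, 35.0, 20.0, 30.0],
})
epochs, owners = epochs_per_customer(
    toy, "customer_id", ["amount", "balance"], n=3
)
print(epochs.shape, owners)
```

With few loans per customer, most customers produce no epoch at all, which is the limitation mentioned above.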

Collaborator Author

I will try to find something. If you have an idea, feel free :)

Collaborator Author

I spent a little time on this. The best I can obtain to date comes from creating a "fake" customer history out of the "closest" loans:

Figure_1

image

https://github.com/gcattan/pyRiemann-qiskit/blob/financial_data_with_KNN/examples/other_datasets/plot_financial_data.py


In the current example (without KNN), the score is lower if I remove the fraudulent loan itself from the epoch.

image

Interestingly, it still picks up something using only the past loans, even though they belong to different customers:

image

This might indicate, for example, that some scams involve collusion between different customers.

One possible practical implication is that we could raise a warning before a fraudulent transaction actually occurs, whereas the random forest can only say afterward whether a loan was fraudulent or genuine.
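
As a rough illustration of "using only the past loans", the sketch below builds, for each loan, an epoch from the `n` rows strictly before it, so the loan being classified never appears in its own epoch. This is an assumption about how such epochs could be built, not the code used in this PR; `past_only_epochs` is a hypothetical name and the rows are assumed to be in chronological order.

```python
import numpy as np


def past_only_epochs(X, n):
    """For each row i >= n, stack the n rows strictly before row i.

    Returns (epochs, targets): epochs has shape (len(X) - n, n_features, n)
    and targets[k] is the index of the row that epoch k precedes.
    """
    X = np.asarray(X, dtype=float)
    targets = np.arange(n, len(X))
    if len(targets) == 0:
        return np.empty((0, X.shape[1], n)), targets
    # Row i itself is excluded: the slice stops at i - 1.
    epochs = np.stack([X[i - n:i].T for i in targets])
    return epochs, targets


# Toy data (made up): 6 loans with 2 features each.
X = np.arange(12).reshape(6, 2)
epochs, targets = past_only_epochs(X, n=3)
print(epochs.shape, targets)
```

The label for epoch `k` would then be taken from row `targets[k]`, which is what would allow a warning to be raised before that loan is processed.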

Member

For the ERP, can you check that StandardScaler is applied on the correct dimension?
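
One way to check this, sketched under the assumption that epochs are arranged as `(n_epochs, n_channels, n_times)`: flatten epochs and time into rows so that `StandardScaler` computes one mean and one standard deviation per channel. The helper name `scale_epochs_per_channel` is hypothetical, and this is not the `ndstandardscaler` mentioned in the commit list, whose implementation is not shown here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def scale_epochs_per_channel(epochs):
    """Standardize each channel over all epochs and time samples."""
    n_epochs, n_channels, n_times = epochs.shape
    # Move channels last, then flatten epochs and time into rows so that
    # StandardScaler sees one column per channel.
    flat = epochs.transpose(0, 2, 1).reshape(-1, n_channels)
    scaled = StandardScaler().fit_transform(flat)
    # Undo the reshape and the transpose to recover the original layout.
    return scaled.reshape(n_epochs, n_times, n_channels).transpose(0, 2, 1)


rng = np.random.default_rng(0)
# Toy data (made up): two channels with very different scales.
epochs = rng.normal(
    loc=[[5.0], [-300.0]], scale=[[1.0], [50.0]], size=(20, 2, 10)
)
scaled = scale_epochs_per_channel(epochs)
# Per-channel mean should be close to 0 and standard deviation close to 1.
print(scaled.mean(axis=(0, 2)), scaled.std(axis=(0, 2)))
```

If the scaler were instead fitted along the time axis of each epoch, every epoch would be centered individually, which could flatten exactly the waveform one hopes to see.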

Collaborator Author

@gcattan gcattan Oct 27, 2023

Yes!

With standard scaler:

Figure_1

With robust scaler:

image

Ok, the ownership may just be bad.

Member

@qbarthelemy qbarthelemy Oct 27, 2023

  1. Visually, we don't see an ERP.
  2. The results show that a method that is not ERP-aware (Random Forest) achieves an excellent classification score.

So, I am starting to doubt the presence of an ERP in the data...
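
A simple numerical complement to the visual check is to compare the class-averaged waveforms. The sketch below assumes labels are encoded as 0 for genuine and 1 for fraud and that epochs are arranged as `(n_epochs, n_channels, n_times)`; the function name and the synthetic data are hypothetical and only illustrate the check.

```python
import numpy as np


def class_average_difference(epochs, y):
    """Return mean(fraud epochs) - mean(genuine epochs).

    Assumes y == 1 marks fraud and y == 0 marks genuine.
    The result has shape (n_channels, n_times).
    """
    epochs = np.asarray(epochs, dtype=float)
    y = np.asarray(y)
    return epochs[y == 1].mean(axis=0) - epochs[y == 0].mean(axis=0)


rng = np.random.default_rng(0)
# Synthetic data: unit-variance noise, 200 epochs, 3 channels, 10 samples.
epochs = rng.normal(size=(200, 3, 10))
y = np.repeat([0, 1], 100)
# Inject an artificial "evoked" bump into the fraud class only.
epochs[y == 1, 0, 7] += 2.0

diff = class_average_difference(epochs, y)
peak = np.unravel_index(np.abs(diff).argmax(), diff.shape)
print(peak, diff[peak])
```

If the difference stays within the noise level everywhere, that would support the doubt expressed above; a clear peak at a consistent position would suggest an evoked pattern.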

Collaborator Author

Yes, but I don't think they detect the same kind of fraud.
The ERP method can potentially detect fraudulent behavior over time, which RF cannot do.

Collaborator Author

We can put this example on hold to give us time to think of a better way (or better data) to show this.

Member

Yes, it would be very interesting to have an example on a type of data other than biosignals.


@@ -213,8 +212,9 @@ def transform(self, X):
# Let's fit our GridSearchCV, to find the best hyper parameters
gs.fit(X_train, y_train)

-# Print cross-validation results
-print(gs.cv_results_)
+# Print best parameters
+print("Best parameters are:")
+print(gs.best_params_)

# This is the best score with the classical SVM.
# (with this train/test split at least)
@@ -234,5 +234,8 @@ def transform(self, X):
# ----------
# .. [1] 'SUSPICIOUS ACTIVITY DETECTION USING QUANTUM COMPUTER',
# Patent application number: 18/380799
-# .. [2] 'Synthetic Data of Transactions for Inmediate Loans Fraud'
+# .. [2] 'Synthetic Data of Transactions for Inmediate Loans Fraud'
+# https://zenodo.org/records/7418458
+# .. [3] https://pyriemann.readthedocs.io/en/latest/auto_examples/ERP/plot_ERP.html
#
#