FEA add secure persistence #128
Conversation
What do you think about storing a metainfo object in the schema? So basically something like:

{
    "metainfo": {...},
    "obj": <actual obj>
}

We could put stuff like the protocol version, sklearn/skops versions, etc. into metainfo. We could also add a hash/fingerprint of the object there to verify it.
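For illustration, a minimal sketch of what such a fingerprint could look like (purely hypothetical, not the skops implementation; the field names, version values, and hashing scheme are all assumptions):

```python
import hashlib
import json

def build_metainfo(obj_bytes: bytes) -> dict:
    # Hypothetical metainfo block: version info plus a fingerprint of the payload.
    return {
        "protocol": 0,                 # assumed protocol version field
        "skops_version": "0.1.0",      # assumed; would come from skops.__version__
        "sha256": hashlib.sha256(obj_bytes).hexdigest(),
    }

def verify(metainfo: dict, obj_bytes: bytes) -> bool:
    # Recompute the hash on load and compare it with the stored fingerprint.
    return hashlib.sha256(obj_bytes).hexdigest() == metainfo["sha256"]

obj_bytes = json.dumps({"example": "state"}).encode("utf-8")
metainfo = build_metainfo(obj_bytes)
assert verify(metainfo, obj_bytes)
```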
I added the common tests here, and here's the summary:
It kind of makes sense, but there isn't a clear distinction between the two. For estimators it makes total sense, but for a numpy array, is the file name metainfo or part of the object itself? Or for a numpy function, is there anything in the object then?
Of course we fail on Python 3.7 🤦🏼
- Add more test cases: nested pipeline, FunctionTransformer.
- Make the list of failing estimators more fine-grained: instead of ignoring a class completely, ignore a specific instance of the class, because some instances of the same class may fail while others don't.
- Mark estimators that fail as xfail with strict=True; this way we can quickly discover if a change makes an estimator pass (see the sketch below).
- Fix a bug in the testing function that did not correctly compare values if they were nan (because nan != nan).
- Check the predict_log_proba method.
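To make the fine-grained, per-instance xfail idea concrete, here is a rough sketch of how such a list could be expressed with pytest (the concrete instances, the list name, and the reasons are illustrative, not necessarily what the PR uses; the parametrized round-trip test itself is elided):

```python
import numpy as np
import pytest
from scipy.special import expit
from sklearn.preprocessing import FunctionTransformer

# Hypothetical per-instance list: specific instantiations (not whole classes)
# are marked xfail, with strict=True so an unexpected pass is flagged as soon
# as a change fixes them.
ESTIMATORS_TO_TEST = [
    FunctionTransformer(np.sqrt),  # numpy function: expected to round-trip
    pytest.param(
        FunctionTransformer(expit),  # scipy function: expected to fail for now
        marks=pytest.mark.xfail(strict=True, reason="scipy functions not yet supported"),
    ),
]

# This list would then feed a parametrized round-trip test, e.g.
# @pytest.mark.parametrize("estimator", ESTIMATORS_TO_TEST).
```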
@adrinjalali Hey, I accidentally pushed directly on your remote instead of using my own branch, sorry for that. Please let me know if my recent changes make sense to you or if I should fix anything. Here is the description:
I also wanted to add
skops/tests/test_persist.py (Outdated)

    ESTIMATORS_EXPECTED_TO_FAIL_NON_FITTED = {
The idea is to have one list where we have the estimators, and when we remove an estimator, all tests for that estimator should pass. So I'll revert to the previous list.
Hmm, I don't quite understand. What you describe should work with the current approach, no?
One problem with the previous approach is that it does not differentiate between different instantiations of the same estimator. E.g. FunctionTransformer with numpy functions works, but FunctionTransformer with scipy functions doesn't. If only the class is checked, we can't differentiate.
Or is the problem that we have two lists, one for non-fitted and one for fitted?
This is useful e.g. for clustering models that don't have a predict method. While adding these tests, a few estimators started to fail. Most of them failed because of the discussed issue of numpy scalars being loaded as 0-dim arrays; that was fixed via explicit type casting. However, 3 estimators are still failing and had to be added back to the list of failing estimators. I added a comment for each of them explaining why they fail and how the issue could potentially be addressed.
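To illustrate the scalar issue and the cast that fixes it (a self-contained sketch, not the actual skops code):

```python
import io
import numpy as np

scalar = np.float64(3.14)

# np.save wraps its input in an array, so loading gives back a 0-dim ndarray
# instead of the original numpy scalar.
buf = io.BytesIO()
np.save(buf, scalar)
buf.seek(0)
loaded = np.load(buf)
assert isinstance(loaded, np.ndarray) and loaded.shape == ()

# An explicit type cast restores the expected scalar type and value.
restored = np.float64(loaded)
assert restored == scalar and isinstance(restored, np.float64)
```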
Works locally but not on CI...
Co-authored-by: Adrin Jalali <[email protected]>
On some systems, during conversion, there is a loss of precision that makes the tests fail. This change loosens the tolerance, making the tests pass.
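Roughly, the loosened comparison corresponds to something like the following (the concrete tolerance values are illustrative, not the ones used in the PR):

```python
import numpy as np

expected = np.array([0.123456789, 1.0, 2.5])
actual = expected + 1e-8  # small precision loss during conversion on some systems

# A strict equality check fails on the affected systems:
# np.testing.assert_array_equal(actual, expected)  # would raise AssertionError

# Comparing with a (slightly loosened) tolerance passes everywhere:
np.testing.assert_allclose(actual, expected, rtol=1e-6, atol=1e-7)
```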
Windows complains that files don't have permissions, even though each test gets its own file. Maybe this helps.
From my understanding, pytest will (eventually) clean up the temporary files it creates. Therefore, use the pytest tmp_path fixture instead of the built-in tempfile module and don't clean up explicitly. What's strange is that this wasn't necessary for other tests; I'm not sure why it is here. It may have to do with how we save and load here, or with the use of zip files. Ideally, someone with access to Windows could test it.
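A minimal sketch of the switch described above (a generic example, not the actual test from the PR):

```python
# Before: manual temporary-file handling with the tempfile module, including
# explicit cleanup, which ran into permission errors on Windows:
#
#   with tempfile.NamedTemporaryFile(suffix=".skops", delete=False) as f:
#       path = f.name
#   ... save/load using `path`, then os.remove(path) ...

# After: let pytest manage the file's lifetime via the tmp_path fixture.
def test_roundtrip(tmp_path):
    path = tmp_path / "estimator.skops"  # pathlib.Path provided by pytest
    path.write_bytes(b"stand-in for the persisted estimator")
    assert path.read_bytes()  # no explicit cleanup needed; pytest prunes tmp_path
```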
Not pretty but let's see if it works.
Great job getting this merged! 🤗
Very exciting 🚀 Looking forward to trying it out!
This PR adds secure persistence for sklearn models. You can test it with:
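The exact snippet that accompanied this sentence isn't preserved here; as a rough stand-in, usage along these lines should work with the public skops.io API (the function names and signatures may have differed at the time of this PR):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import skops.io as sio

X, y = load_iris(return_X_y=True)
est = LogisticRegression().fit(X, y)

sio.dump(est, "estimator.skops")      # writes a zip archive
loaded = sio.load("estimator.skops")  # newer versions may require a `trusted` argument
print(loaded.score(X, y))
```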
The file it creates is a zip file which you can investigate. It creates a schema.json inside that zip file which includes all the info needed to reconstruct the object.

Things to add:
Things to do in a separate PR:
We basically go through the attributes of the object, and we persist them, very similar to what https://github.com/pytorch/torchsnapshot does; except that the objects we deal with, unlike pytorch objects, don't expose state_dict and load_state_dict. Therefore we implement the equivalent of those methods here ourselves. For third party libraries, they would need to implement the equivalent methods and we'll have a way for them to register those methods for their objects with us.

This is a very early prototype, and very open to discussions regarding the format and the design.
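To see the format for yourself, the archive can be opened with the standard library (assuming a file produced as in the example above, e.g. estimator.skops):

```python
import json
import zipfile

# The persisted file is a regular zip archive containing a schema.json that
# describes how to reconstruct the object, plus any binary payloads.
with zipfile.ZipFile("estimator.skops") as zf:
    print(zf.namelist())
    schema = json.loads(zf.read("schema.json"))

print(json.dumps(schema, indent=2)[:500])  # peek at the beginning of the schema
```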
cc @skops-dev/maintainers @osanseviero @LysandreJik @julien-c