Add lint and mypy CI jobs #3

afrittoli · 2024-02-07T14:27:19Z

Add CI jobs for:

linting, using black
type checking, using mypy

For CI jobs to pass, implement corresponding fixes:

Automatic code reformat by black
Added ibm-fms to requirements
Ran no_implicit_optional (PEP 484 prohibits implicit Optional)
Exclude files that depend on untyped packages
Fix typing issues in dataset_util

Add CI jobs for: - linting, using black - type checking, using mypy These jobs are similar to those used for the fms main repo. Signed-off-by: Andrea Frittoli <[email protected]>

main_training.py

nairbv · 2024-02-07T14:40:11Z

pretraining/utils/dataset_utils.py

+            if reader is not None:
+                del reader
+            if shard_file is not None:
+                shard_file.close()


this looks like more than just black reformatting changes? these are all required for mypy?

I added this, because open_file was used with a path which according to the stubs does not work and also open_file may return None.

I added the shard_file file descriptor and added the logic to close files.

without testing, I think we need to be cautious about making any changes that aren't just mypy type hints in this PR. reading closer... this change is opening a file and creating a new file object, and introduces new logic about where it gets closed.

I just tried locally doing f=open(some_arrow_file_path); reader = pa.ipc.open_file(f) and got an error of TypeError: binary file expected, got text file. I don't see any other issues on quick glance, but it's easy to miss things like double-closing a file that would cause failures.

To make this work, I think it would have needed to be shard_file = open(path, mode='rb'). Without actually testing it, to be on the safe side, I think it'll be better to just use a targeted type ignore hint with a comment, or fix it some simpler way.

.github/workflows/lint.yml

fms_to_hf.py

pretraining/config/training.py

pretraining/utils/dataset_utils.py

Signed-off-by: Andrea Frittoli <[email protected]>

.github/workflows/mypy.yml

pretraining/utils/dataset_utils.py

.github/workflows/mypy.yml

README.md

pretraining/utils/dataset_utils.py

nairbv · 2024-02-08T22:07:29Z

pretraining/utils/train_utils.py

@@ -115,7 +110,7 @@ def get_policies(cfg, rank):
    verify_bfloat_support = (
        torch.version.cuda
        and torch.cuda.is_bf16_supported()
-        and packaging.version.parse(torch.version.cuda).release >= (11, 0)


Since this was using packaging not version, it makes me think that the import packaging.version was failing and we're actually only importing pkg_resources (as packaging). I think the changes to the imports are changing what happens here such taht I'm not sure this will work.

The original code could have been either of them, and the fallback import suggests they would be identical in signature.

However the second one according to mypy is broken (there is no packaging module in pkg_resources).

I'll test it now in case return values changed.

I don't have cuda installed on my laptop so I cannot verify completely.
From what I tested both behave the same, but to be on the safe side I added an ignore comment there too.

pretraining/utils/dataset_utils.py

Based on the output of mypy, fixed the following: - Added ibm-fms to requirements - Ran no_implicit_optional (PEP 484 prohibits implicit Optional) - Exclude files that depend on untyped packages - Fix typing issues in dataset_util Signed-off-by: Andrea Frittoli <[email protected]>

pretraining/utils/train_utils.py

Add isort config based on the fms setup. Fix imports where required. Include the root folder. Signed-off-by: Andrea Frittoli <[email protected]>

requirements.txt

lchu6 · 2024-02-16T22:09:33Z

@afrittoli Is this change what we expected?

52235e4#diff-060f427d7e3e19de03067dd05b75e3758ba40f69745b2afa58a6ffe63024f81eL844

Doesn't look right to me.

There are two issues I am seeing:

we use same class name type check inside the same class, shouldn't this change trigger a syntax failure? I pulled this commit and my IDE didn't like it.
even with above fixed, I assume we want to use the list of subclass to replace the parent class, but inside this module there seems to be much more subclasses. e.g. _Stateful_Dataset -> _Wrapper_Dataset -> Preprocess_Dataset/Skip_Batch_Dataset/Preload_Buffer_Dataset/PLM_Dataset

@daviswer can you take a look at it?
cc @nairbv

nairbv · 2024-02-16T22:37:36Z

@lchu-ibm the way this dataset parameter is used is it's called as a constructor (classname) where the class constructor has to accept another datasets kwarg. Specifically on line 923 it's called like dataset(**passthrough_args, datasets=[d]), so we can't make it completely generic. IIRC there were only a couple of classes that support the expected contract (though the code comments for that contract could probably be clearer).

Also some of the classes listed as options in the documentation were in the process of being removed in another PR, and weren't used anywhere else in this repo, so made sense to exclude from the type hint just to avoid conflicts.

If you need more types as options though (and they support the call mechanism) feel free to add them.

lchu6 · 2024-02-16T22:48:59Z

@nairbv I didn't make my question clear, my questions are:

I know what a constructor is, but this PR introduces "class name type check inside the SAME class". keyword is SAME here so I assume that's not expected and maybe a typo/mistake.
I don't think you can do things like:

class A:
    def __init__(self, x: A)

for type hint.

If this is truly what we want, then there are ways to do it, but not like this.

after the removal in the other PR you mentioned, I still see other subclasses missing. maybe those are not being used but current is not a full list. Anyway, for this one I will let @daviswer confirm as I didn't read carefully on the hierarchy.

nairbv · 2024-02-16T23:01:43Z

Ah I guess I misunderstood.

I think you can do class A:; def __init__(self, x:A), otherwise you wouldn't be able to do things like class TreeNode(self, value: int, left: Optional[TreeNode], right[TreeNode]). I'm not looking closely at the details of the particular class hierarchy here though, we were just making sure mypy tests pass.

If there are subclasses that need to be passed here, I think it's probably fine to add them, we weren't being very particular in deciding what to put there. If you're getting test failures locally though in your IDE on code that passes CI, there might just be some mismatch in mypy version or config.

lchu6 · 2024-02-16T23:08:43Z

class TreeNode(self, value: int, left: Optional[TreeNode], right[TreeNode])

Can this be done? I didn't check it but aren't we using a variable before declaring it? not from a class perspective but just python...

I think there are ways to use Self (not self), or "TreeNode" (as string), but definitely not TreeNode as how the compiler would know this var.

afrittoli · 2024-02-16T23:42:30Z

@nairbv I didn't make my question clear, my questions are:

I know what a constructor is, but this PR introduces "class name type check inside the SAME class". keyword is SAME here so I assume that's not expected and maybe a typo/mistake.
I don't think you can do things like:
class A:
    def __init__(self, x: A)
for type hint.

That was in the original code, my change restricted the list of types to a subset because the generic _Stateful_Dataset does not meet the contract of how it's used in the code, and mypy detects that.

If this is truly what we want, then there are ways to do it, but not like this.

after the removal in the other PR you mentioned, I still see other subclasses missing. maybe those are not being used but current is not a full list. Anyway, for this one I will let @daviswer confirm as I didn't read carefully on the hierarchy.

We can add more classes to the list or alternatively add a common parent that meets the contract and use its type instead.

lchu6 · 2024-02-16T23:50:26Z

@afrittoli

I think it is definitely my original question that confused both you and Brian. so let me explain -

the original code uses

class A:
    def __init__(self, x: Z)

where Z is the "to be unpacked parent class", and let's say Z has subset [A, B]

this PR changed it to

class A:
    def __init__(self, x: [A,B])

which isn't correct based on my understanding of python as you put an undeclared variable A there. you need something like, e.g. (see my quote around A)

class A:
    def __init__(self, x: ["A",B])

afrittoli · 2024-02-18T11:43:38Z

which isn't correct based on my understanding of python as you put an undeclared variable A there. you need something like, e.g. (see my quote around A)
class A:
    def __init__(self, x: ["A",B])

Got it, thank you for the clarification. Even though mypy does not detect the problem, the recursive definition does indeed require quotes. Fix: #16

Add lint and mypy CI jobs

4bde276

Add CI jobs for: - linting, using black - type checking, using mypy These jobs are similar to those used for the fms main repo. Signed-off-by: Andrea Frittoli <[email protected]>

afrittoli commented Feb 7, 2024

View reviewed changes

main_training.py Show resolved Hide resolved