Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lint and mypy CI jobs #3

Merged
merged 4 commits into from
Feb 9, 2024
Merged

Conversation

afrittoli
Copy link
Collaborator

Add CI jobs for:

  • linting, using black
  • type checking, using mypy

For CI jobs to pass, implement corresponding fixes:

  • Automatic code reformat by black
  • Added ibm-fms to requirements
  • Ran no_implicit_optional (PEP 484 prohibits implicit Optional)
  • Exclude files that depend on untyped packages
  • Fix typing issues in dataset_util

Add CI jobs for:
- linting, using black
- type checking, using mypy

These jobs are similar to those used for the fms main repo.

Signed-off-by: Andrea Frittoli <[email protected]>
if reader is not None:
del reader
if shard_file is not None:
shard_file.close()
Copy link
Contributor

@nairbv nairbv Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like more than just black reformatting changes? these are all required for mypy?

Copy link
Collaborator Author

@afrittoli afrittoli Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this, because open_file was used with a path which according to the stubs does not work and also open_file may return None.

I added the shard_file file descriptor and added the logic to close files.

Copy link
Contributor

@nairbv nairbv Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without testing, I think we need to be cautious about making any changes that aren't just mypy type hints in this PR. reading closer... this change is opening a file and creating a new file object, and introduces new logic about where it gets closed.

I just tried locally doing f=open(some_arrow_file_path); reader = pa.ipc.open_file(f) and got an error of TypeError: binary file expected, got text file. I don't see any other issues on quick glance, but it's easy to miss things like double-closing a file that would cause failures.

To make this work, I think it would have needed to be shard_file = open(path, mode='rb'). Without actually testing it, to be on the safe side, I think it'll be better to just use a targeted type ignore hint with a comment, or fix it some simpler way.

.github/workflows/lint.yml Outdated Show resolved Hide resolved
fms_to_hf.py Show resolved Hide resolved
pretraining/config/training.py Outdated Show resolved Hide resolved
pretraining/utils/dataset_utils.py Show resolved Hide resolved
pretraining/utils/dataset_utils.py Show resolved Hide resolved
pretraining/utils/dataset_utils.py Show resolved Hide resolved
pretraining/utils/dataset_utils.py Show resolved Hide resolved
pretraining/utils/dataset_utils.py Show resolved Hide resolved
pretraining/utils/dataset_utils.py Show resolved Hide resolved
pretraining/utils/dataset_utils.py Show resolved Hide resolved
pretraining/utils/dataset_utils.py Outdated Show resolved Hide resolved
@afrittoli afrittoli force-pushed the ci branch 2 times, most recently from d81a3ce to 72d61ab Compare February 7, 2024 14:56
Signed-off-by: Andrea Frittoli <[email protected]>
.github/workflows/mypy.yml Outdated Show resolved Hide resolved
@afrittoli afrittoli force-pushed the ci branch 3 times, most recently from a7d22ee to d3ae276 Compare February 7, 2024 20:18
README.md Outdated Show resolved Hide resolved
@afrittoli afrittoli force-pushed the ci branch 3 times, most recently from 79849df to a0f55be Compare February 8, 2024 22:01
@@ -115,7 +110,7 @@ def get_policies(cfg, rank):
verify_bfloat_support = (
torch.version.cuda
and torch.cuda.is_bf16_supported()
and packaging.version.parse(torch.version.cuda).release >= (11, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this was using packaging not version, it makes me think that the import packaging.version was failing and we're actually only importing pkg_resources (as packaging). I think the changes to the imports are changing what happens here such taht I'm not sure this will work.

Copy link
Collaborator Author

@afrittoli afrittoli Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code could have been either of them, and the fallback import suggests they would be identical in signature.

However the second one according to mypy is broken (there is no packaging module in pkg_resources).

I'll test it now in case return values changed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have cuda installed on my laptop so I cannot verify completely.
From what I tested both behave the same, but to be on the safe side I added an ignore comment there too.

Based on the output of mypy, fixed the following:
- Added ibm-fms to requirements
- Ran no_implicit_optional (PEP 484 prohibits implicit Optional)
- Exclude files that depend on untyped packages
- Fix typing issues in dataset_util

Signed-off-by: Andrea Frittoli <[email protected]>
Add isort config based on the fms setup.
Fix imports where required. Include the root folder.

Signed-off-by: Andrea Frittoli <[email protected]>
@afrittoli afrittoli merged commit 27e655d into foundation-model-stack:main Feb 9, 2024
2 checks passed
@lchu6
Copy link
Contributor

lchu6 commented Feb 16, 2024

@afrittoli Is this change what we expected?

52235e4#diff-060f427d7e3e19de03067dd05b75e3758ba40f69745b2afa58a6ffe63024f81eL844

Doesn't look right to me.

There are two issues I am seeing:

  1. we use same class name type check inside the same class, shouldn't this change trigger a syntax failure? I pulled this commit and my IDE didn't like it.
  2. even with above fixed, I assume we want to use the list of subclass to replace the parent class, but inside this module there seems to be much more subclasses. e.g. _Stateful_Dataset -> _Wrapper_Dataset -> Preprocess_Dataset/Skip_Batch_Dataset/Preload_Buffer_Dataset/PLM_Dataset

@daviswer can you take a look at it?
cc @nairbv

@nairbv
Copy link
Contributor

nairbv commented Feb 16, 2024

@lchu-ibm the way this dataset parameter is used is it's called as a constructor (classname) where the class constructor has to accept another datasets kwarg. Specifically on line 923 it's called like dataset(**passthrough_args, datasets=[d]), so we can't make it completely generic. IIRC there were only a couple of classes that support the expected contract (though the code comments for that contract could probably be clearer).

Also some of the classes listed as options in the documentation were in the process of being removed in another PR, and weren't used anywhere else in this repo, so made sense to exclude from the type hint just to avoid conflicts.

If you need more types as options though (and they support the call mechanism) feel free to add them.

@lchu6
Copy link
Contributor

lchu6 commented Feb 16, 2024

@nairbv I didn't make my question clear, my questions are:

  1. I know what a constructor is, but this PR introduces "class name type check inside the SAME class". keyword is SAME here so I assume that's not expected and maybe a typo/mistake.
    I don't think you can do things like:
class A:
    def __init__(self, x: A)

for type hint.

If this is truly what we want, then there are ways to do it, but not like this.

  1. after the removal in the other PR you mentioned, I still see other subclasses missing. maybe those are not being used but current is not a full list. Anyway, for this one I will let @daviswer confirm as I didn't read carefully on the hierarchy.

@nairbv
Copy link
Contributor

nairbv commented Feb 16, 2024

Ah I guess I misunderstood.

I think you can do class A:; def __init__(self, x:A), otherwise you wouldn't be able to do things like class TreeNode(self, value: int, left: Optional[TreeNode], right[TreeNode]). I'm not looking closely at the details of the particular class hierarchy here though, we were just making sure mypy tests pass.

If there are subclasses that need to be passed here, I think it's probably fine to add them, we weren't being very particular in deciding what to put there. If you're getting test failures locally though in your IDE on code that passes CI, there might just be some mismatch in mypy version or config.

@lchu6
Copy link
Contributor

lchu6 commented Feb 16, 2024

class TreeNode(self, value: int, left: Optional[TreeNode], right[TreeNode])

Can this be done? I didn't check it but aren't we using a variable before declaring it? not from a class perspective but just python...

I think there are ways to use Self (not self), or "TreeNode" (as string), but definitely not TreeNode as how the compiler would know this var.

@afrittoli
Copy link
Collaborator Author

@nairbv I didn't make my question clear, my questions are:

  1. I know what a constructor is, but this PR introduces "class name type check inside the SAME class". keyword is SAME here so I assume that's not expected and maybe a typo/mistake.
    I don't think you can do things like:
class A:
    def __init__(self, x: A)

for type hint.

That was in the original code, my change restricted the list of types to a subset because the generic _Stateful_Dataset does not meet the contract of how it's used in the code, and mypy detects that.

If this is truly what we want, then there are ways to do it, but not like this.

  1. after the removal in the other PR you mentioned, I still see other subclasses missing. maybe those are not being used but current is not a full list. Anyway, for this one I will let @daviswer confirm as I didn't read carefully on the hierarchy.

We can add more classes to the list or alternatively add a common parent that meets the contract and use its type instead.

@lchu6
Copy link
Contributor

lchu6 commented Feb 16, 2024

@afrittoli

I think it is definitely my original question that confused both you and Brian. so let me explain -

the original code uses

class A:
    def __init__(self, x: Z)

where Z is the "to be unpacked parent class", and let's say Z has subset [A, B]

this PR changed it to

class A:
    def __init__(self, x: [A,B])

which isn't correct based on my understanding of python as you put an undeclared variable A there. you need something like, e.g. (see my quote around A)

class A:
    def __init__(self, x: ["A",B])

@afrittoli
Copy link
Collaborator Author

afrittoli commented Feb 18, 2024

which isn't correct based on my understanding of python as you put an undeclared variable A there. you need something like, e.g. (see my quote around A)

class A:
    def __init__(self, x: ["A",B])

Got it, thank you for the clarification. Even though mypy does not detect the problem, the recursive definition does indeed require quotes. Fix: #16

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants