Thank you for considering contributing to Kedro-Datasets! Kedro-Datasets is a collection of Kedro's data connectors. We welcome contributions in the form of pull requests, issues or code reviews. You can contribute new datasets, fix bugs in existing datasets, or simply send us spelling and grammar fixes or extra tests. Contribute anything that you think improves the community for us all!
The following sections describe our vision and the contribution process.
The Kedro team pledges to foster and maintain a welcoming and friendly community in all of our spaces. All members of our community are expected to follow our Code of Conduct, and we will do our best to enforce those principles and build a happy environment where everyone is treated with respect and dignity.
We use GitHub Issues to keep track of known bugs. We keep a close eye on them and try to make it clear when we have an internal fix in progress. Before reporting a new issue, please do your best to ensure your problem hasn't already been reported. If it has, it's often better to leave a comment on the existing issue rather than create a new one. Old issues often include helpful tips and solutions to common problems.
If you are looking for help with your code, please consider posting a question on our Slack organisation. You can post your questions to the `#questions` channel. Past questions and discussions from our Slack organisation are accessible on Linen. In the interest of community engagement, we also believe that help is much more valuable if it's shared publicly, so that more people can benefit from it.
If you have already checked the existing issues on GitHub and are still convinced that you have found odd or erroneous behaviour, then please file a new issue. We have a template that helps you provide the information we'll need to address your query.
If you have new ideas for Kedro-Datasets, then please open a GitHub issue with the label `enhancement`. Please describe, in your own words, the feature you would like to see, why you need it, and how it should work.
If you're unsure where to begin contributing to Kedro-Datasets, please start by looking through the issues labelled `good first issue` and `help wanted` on GitHub.
If you want to contribute a new dataset, read the tutorial to create and contribute a custom dataset in the Kedro documentation.
Make sure to add the new dataset to `kedro_datasets.rst` so that it shows up in the API documentation, and to `kedro-datasets/static/jsonschema/kedro-catalog-X.json` for IDE validation.
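To give a flavour of what the tutorial covers, here is a minimal sketch of a custom dataset. The `ImageDataset` name and the Pillow-based load/save follow the illustrative example in the Kedro documentation and are not an existing kedro-datasets implementation:

```python
# A minimal sketch of a custom dataset, following the pattern from the
# Kedro custom-dataset tutorial. Names and the file format are illustrative.
from pathlib import PurePosixPath
from typing import Any

import fsspec
import numpy as np
from kedro.io import AbstractDataset  # kedro>=0.19; AbstractDataSet in older versions
from kedro.io.core import get_filepath_str, get_protocol_and_path
from PIL import Image


class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
    """Loads and saves image data as a NumPy array, on any fsspec-supported filesystem."""

    def __init__(self, filepath: str):
        # Split "s3://bucket/path.png" into protocol and path
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(self._protocol)

    def _load(self) -> np.ndarray:
        load_path = get_filepath_str(self._filepath, self._protocol)
        with self._fs.open(load_path, mode="rb") as f:
            return np.asarray(Image.open(f))

    def _save(self, data: np.ndarray) -> None:
        save_path = get_filepath_str(self._filepath, self._protocol)
        with self._fs.open(save_path, mode="wb") as f:
            # format must be given explicitly when saving to a file object
            Image.fromarray(data).save(f, format="PNG")

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath, "protocol": self._protocol}
```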
Below is a guide to help you understand the process of contributing a new dataset, whether it falls under the category of core or experimental datasets.
Core datasets are maintained by the Kedro Technical Steering Committee (TSC) and must adhere to the following requirements:
- Must be something that the Kedro TSC is willing to maintain.
- Must be fully documented.
- Must have working doctests, unless a complex cloud/DB setup is required, which can be discussed in the review (see the sketch after this list).
- Must run as part of the regular CI/CD jobs.
- Must have 100% test coverage.
- Should support all Python versions under NEP 29 (3.10+ currently).
- Should work on Linux, macOS, and Windows.
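For illustration, a working doctest in a dataset's class docstring might look like the sketch below. Here `MyDataset` is hypothetical, and `tmp_path` is assumed to be injected into the doctest namespace (e.g. by a pytest fixture):

```python
class MyDataset(AbstractDataset):
    """``MyDataset`` loads/saves data from/to a CSV file using pandas.

    Example:

    >>> import pandas as pd
    >>>
    >>> data = pd.DataFrame({"col1": [1, 2], "col2": [4, 5]})
    >>> dataset = MyDataset(filepath=str(tmp_path / "test.csv"))
    >>> dataset.save(data)
    >>> reloaded = dataset.load()
    >>> assert data.equals(reloaded)
    """
```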
The requirements for experimental datasets are more flexible and these datasets are not maintained by the Kedro TSC. Experimental datasets:
- Do not need to be fully documented but must have docstrings explaining their use.
- Do not need to run as part of regular CI/CD jobs.
- Can be in the early stages of development and do not have to meet the criteria for core Kedro datasets.
If your dataset is initially considered experimental but matures over time, it may qualify for graduation to a core dataset.
- Anyone, including TSC members and users, can trigger the graduation process.
- An experimental dataset requires approval from at least half of the TSC to graduate to the core datasets space.
- Your dataset can graduate when it meets all requirements of a core dataset.
A dataset initially considered core might be demoted if it no longer meets the required standards.
- The demotion process will be initiated by someone from the TSC.
- A core dataset requires approval from at least half of the TSC to be demoted to the experimental datasets space.
Working on your first pull request? Keep the following guidelines in mind:
- Aim for cross-platform compatibility on Windows, macOS and Linux
- We use Anaconda as our preferred virtual environment manager
- We use SemVer for versioning
Our code is designed to be compatible with Python 3.6 onwards, and our style guidelines are (in cascading order):
- PEP 8 conventions for all Python code
- Google docstrings for code comments
- PEP 484 type hints for all user-facing functions and class methods, e.g.

```python
from typing import Any, List

def count_truthy(elements: List[Any]) -> int:
    return sum(1 for elem in elements if elem)
```
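For reference, a Google-style docstring on the same function would look like this:

```python
def count_truthy(elements: List[Any]) -> int:
    """Count the truthy values in a list.

    Args:
        elements: The list whose elements are checked for truthiness.

    Returns:
        The number of elements that evaluate to true.
    """
    return sum(1 for elem in elements if elem)
```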
Note: We only accept contributions under the Apache 2.0 license, and you should have permission to share the submitted code.
We use a branching model that helps us keep track of branches in a logical, consistent way. All branches should follow the hyphen-separated convention `<type-of-change>/<short-description-of-change>`, e.g. `feature/awesome-new-feature`.
| Types of changes | Description |
| ---------------- | ----------- |
| `docs` | Changes to the documentation of the plugin |
| `feature` | Non-breaking change which adds functionality |
| `fix` | Non-breaking change which fixes an issue |
| `tests` | Changes to project unit (`tests/`) and/or integration (`features/`) tests |
- Fork the project.
- Develop your contribution in a new branch.
- Add your dataset to `kedro_datasets_experimental`.
- Make sure all your commits are signed off by using the `-s` flag with `git commit`.
- Open a PR against the `main` branch and make sure that the PR title follows the Conventional Commits spec with the scope `(datasets)` (see the example after this list).
- The TSC will review your contribution and decide whether they want to maintain the dataset, and thus whether it is contributed as a core or experimental dataset.
- Make sure the CI builds are green (have a look at the section Running checks locally below).
- Update the PR according to the reviewer's comments.
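For example, a signed-off commit whose message follows the Conventional Commits convention might look like this (the dataset name is hypothetical):

```bash
# The -s flag adds a "Signed-off-by:" trailer to the commit message
git commit -s -m "feat(datasets): add MyCustomDataset"
```

Your PR title would follow the same pattern, e.g. `feat(datasets): add MyCustomDataset`.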
To run the tests, you first need to install the test requirements and the pre-commit hooks:

```bash
make plugin=kedro-datasets install-test-requirements
make install-pre-commit
```
All checks run by our CI/CD pipeline can be run locally on your computer.

To run linting:

```bash
make plugin=kedro-datasets lint
```

To run the tests:

```bash
make plugin=kedro-datasets test
```
If the tests in `kedro-datasets/kedro_datasets/spark` are failing, and you are not planning to work on Spark-related features, then you can run the reduced test suite that excludes them with this command:

```bash
make test-no-spark
```