DRAFT add Data Health Checker #1574
Merged
github-actions (bot) added the "waiting for triage" label (Cannot auto-triage, wait for triage.) on Jun 28, 2023
@microsoft-github-policy-service agree
…r check, reformatted summary
benheckmann force-pushed the data-health-checker branch from 73bf59e to 1c7c5d0 on July 17, 2023 11:10
Add unit tests in qlib.test
github-actions (bot) added the "documentation" label (Improvements or additions to documentation) on Jan 8, 2025
SunsetWolf added the "enhancement" label (New feature or request) and removed the "waiting for triage" label (Cannot auto-triage, wait for triage.) on Jan 9, 2025
Hi, @benheckmann
Description
First draft for a data health checker as discussed in #854. The checker receives a path to the data in CSV or qlib format (not implemented yet). It will convert the data to a DataFrame and perform basic checks for data completeness and correctness.
I am not too familiar with the qlib data handling yet, so I am hoping to get some first feedback on whether this goes in the right direction.
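To make the intended completeness checks concrete, here is a minimal sketch of two of them on a pandas DataFrame: a required-column check and a missing-value count. The function names and the exact column list are assumptions for illustration, not the code in this PR; the description only confirms that OHLCV data with a "volume" column is expected.

```python
import pandas as pd

# Hypothetical set of required columns (assumed, not taken from the PR code).
REQUIRED_COLUMNS = ("open", "high", "low", "close", "volume")


def check_required_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


def check_missing_data(df):
    """Return {column: NaN count} for every column containing at least one NaN."""
    counts = df.isna().sum()
    return {col: int(n) for col, n in counts.items() if n > 0}


# Example frame that mimics the situation in #854: "volume" is named "vol".
df = pd.DataFrame(
    {
        "open": [1.0, 1.1],
        "high": [1.2, 1.3],
        "low": [0.9, None],
        "close": [1.1, 1.2],
        "vol": [100, 120],
    }
)

missing_columns = check_required_columns(df)
nan_counts = check_missing_data(df)
print("Missing required columns:", missing_columns)
print("Columns with missing values:", nan_counts)
```

Returning plain lists and dicts rather than raising keeps each check independent, so a summary can report all problems at once.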
Motivation and Context
See #854. In that issue, a user got an unhelpful error message because their data did not adhere to the expected format (specifically, the "volume" column was named "vol"). When checking the data of #854 with this checker, the user would get:
Note: the large step change uses two configurable thresholds (one for price and one for volume) and checks only step changes in OHLCV columns.
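The two-threshold step check in the note can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the function name, the default threshold values, and the use of relative row-to-row change are all hypothetical choices.

```python
import pandas as pd

PRICE_COLUMNS = ("open", "high", "low", "close")


def find_large_step_changes(df, price_threshold=0.5, volume_threshold=3.0):
    """Return {column: [row labels]} where the absolute relative change
    from the previous row exceeds the threshold for that column.

    Threshold defaults are assumed values chosen for illustration only.
    """
    flagged = {}
    for col in PRICE_COLUMNS + ("volume",):
        if col not in df.columns:
            continue  # only OHLCV columns that are present are checked
        threshold = volume_threshold if col == "volume" else price_threshold
        # Relative change against the previous row; the first row is NaN.
        rel_change = (df[col] / df[col].shift(1) - 1).abs()
        hits = rel_change[rel_change > threshold]
        if not hits.empty:
            flagged[col] = hits.index.tolist()
    return flagged


df = pd.DataFrame(
    {
        "close": [10.0, 10.2, 20.0, 20.1],
        "volume": [100, 110, 120, 1000],
    }
)
flagged = find_large_step_changes(df)
print(flagged)
```

Volume usually fluctuates far more than price from one bar to the next, so a separate, looser threshold for it avoids flooding the report with false positives.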
How Has This Been Tested?
No tests yet, as this is only a first draft.
pytest qlib/tests/test_all_pipeline.py under upper directory of qlib.
Screenshots of Test Results (if appropriate):
Types of changes