-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Onboard The General Index Dataset #342
Feat: Onboard The General Index Dataset #342
Conversation
".csv", "-" + str(chunk_number) + ".csv" | ||
) | ||
df = pd.DataFrame() | ||
df = pd.concat([df, chunk]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you are concatenating an empty dataframe to chunk each time here. I think the result of lines 209 and 210 is the same as just df = chunk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@happyhuman Gowtham and myself are working on this together. It turns out that data in some of the source files exceeds pandas csv row string length maximum and pandas does not provide a method to be able to extend that maximum, so for this implementation we need to rewrite the file reading process to use "import csv" module instead. As a result we are not yet ready to push to production.
) -> pd.DataFrame: | ||
for column in null_string_list: | ||
logging.info(f"Removing null strings from column {column}") | ||
df[column] = df[column].str.replace("\\N", "", regex=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering if it should be "\n"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@happyhuman The \N is actual text in the data "\N" is put in place of null when the source data was dumped; so we are not actually looking for newlines. Therefore, the "\N" is correct here.
|
||
|
||
def convert_dt_format(dt_str: str) -> str: | ||
if not dt_str or str(dt_str).lower() == "nan" or str(dt_str).lower() == "nat": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If dt_str
can be None, or nan or nat, a good way to check them all is to use pd.notnull(dt_str)
.
Also, if dt_str
is expected to be of type str, then using str(dt_str)
seems redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@happyhuman The nan/nat issue has been identified before in previous PR's. Use of notnull has not been successful before. However, we will try both issues again to see if we can get your suggestion/s to work. Thanks for the reminder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@happyhuman pd.notnull will not work for NaN or NaT. Please see https://pandas.pydata.org/docs/reference/api/pandas.notnull.html
the only way to access lower is through the "str" object so the only other alternative would be str.lower(dt_str) which is not much different so I think this is moot. Let me know if you still want me to change to str.lower(dt_str) or not.
…re available for running the pipelines.
@gkodukula this is failing in dev. please review and fix. thanks! |
Checklist
Note: If an item applies to you, all of its sub-items must be fulfilled
datasets/the_general_index>
and nothing outside of that directory