Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Onboard The General Index Dataset #342

Merged
merged 22 commits into from
Jun 8, 2022

Conversation

gkodukula
Copy link
Contributor

@gkodukula gkodukula commented Apr 13, 2022

Checklist

Note: If an item applies to you, all of its sub-items must be fulfilled

  • (Required) This pull request is appropriately labeled
  • Please merge this pull request after it's approved
  • I'm adding or editing a dataset
    • The Google Cloud Datasets team is aware of the proposed dataset
    • I put all my code inside datasets/the_general_index> and nothing outside of that directory

".csv", "-" + str(chunk_number) + ".csv"
)
df = pd.DataFrame()
df = pd.concat([df, chunk])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you are concatenating an empty dataframe to chunk each time here. I think the result of lines 209 and 210 is the same as just df = chunk

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@happyhuman Gowtham and myself are working on this together. It turns out that data in some of the source files exceeds pandas csv row string length maximum and pandas does not provide a method to be able to extend that maximum, so for this implementation we need to rewrite the file reading process to use "import csv" module instead. As a result we are not yet ready to push to production.

) -> pd.DataFrame:
for column in null_string_list:
logging.info(f"Removing null strings from column {column}")
df[column] = df[column].str.replace("\\N", "", regex=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering if it should be "\n"?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@happyhuman The \N is actual text in the data "\N" is put in place of null when the source data was dumped; so we are not actually looking for newlines. Therefore, the "\N" is correct here.



def convert_dt_format(dt_str: str) -> str:
if not dt_str or str(dt_str).lower() == "nan" or str(dt_str).lower() == "nat":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If dt_str can be None, or nan or nat, a good way to check them all is to use pd.notnull(dt_str).

Also, if dt_str is expected to be of type str, then using str(dt_str) seems redundant.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@happyhuman The nan/nat issue has been identified before in previous PR's. Use of notnull has not been successful before. However, we will try both issues again to see if we can get your suggestion/s to work. Thanks for the reminder.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@happyhuman pd.notnull will not work for NaN or NaT. Please see https://pandas.pydata.org/docs/reference/api/pandas.notnull.html

the only way to access lower is through the "str" object so the only other alternative would be str.lower(dt_str) which is not much different so I think this is moot. Let me know if you still want me to change to str.lower(dt_str) or not.

@nlarge-google nlarge-google changed the title Feat: Onboard The General index Dataset Feat: Onboard The General Index Dataset Apr 15, 2022
@nlarge-google
Copy link
Collaborator

@gkodukula this is failing in dev. please review and fix. thanks!

@nlarge-google nlarge-google merged commit 67d7216 into GoogleCloudPlatform:main Jun 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants