Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create generalised extract_data function #117

Merged
merged 46 commits into from
Nov 12, 2024

Conversation

ehwenk
Copy link
Collaborator

@ehwenk ehwenk commented Nov 11, 2024

  • Main item is a new, generalised extract_data function that makes it possible to subset a traits.build database using any column in any table.
    - The function works with single values or vectors of values
    - Does not return an error for empty searches
    - For context property values, searches across all context property categories simultaneously
  • The function from traits.build that is used to bind together datasets has been moved to austraits, allowing one to extract multiple pieces of a database, then bind them back together into a single database (bind_datasets)
    - There is code in this function to rename the individual datasets before they are bound together - it is not clear if this is necessary, because the output is identical with and without this step, but it is being maintained for now, but is a bit cumbersome since databases to be bound aren't always datasets with a dataset_id. (Tried building all of austraits.build with and without this too and identical output.)
  • Various documentation fixes
  • Add tests

ehwenk and others added 30 commits November 1, 2024 13:28
Write new extract_data function that works across all columns for all datasets.

Still need to write tests for it.
Move function `bind_databases` from traits.build (where it is `build_combine`. And also the helper function `util_df_to_list`.
* rename `database_create_combined_table` to `flatten_database`
* update documentation
* it is intentional that the two util functions are exported, but can be discussed
*fix bug in `flatten_database` - contributors cannot return many columns
* rename file
- Wasn't separating context property `link_vals` that were concatenated on a single line
A start on tests for the new extract function.
- Probably need a few more tests to make sure complicated context searches give correct output
continued edits to tests

undercovered a problem when using multiple "extracts" in sequence... still working on that
Specific context tables might be empty (null) if a series of extract functions are used in tandem.

Was fixing tests for austraits_lite5.0.0 and came across another issue... Unlike the old extract functions the new one doesn't work with vectors of values - line 123 in `extract_data`
The tests to do with extract_data are generally working, but there is a weird, oscillating bug with the order of tables in the database list that I can't work out.

And also there is a test on line 198 (expect_equal(trait_subset$traits %>% dplyr::distinct(dataset_id) %>% nrow(), 5)) alternates between giving 5 & 8 as output.

But bugs fixed:
- works with empty tibbles
- works with vectors of values to match
series of small errors causing tests to fail:
- needed to reorder tables in some of the austraits_lite versions
- recreated the austraits_lite for versions 3 & 4 that are empty, other than having the structure required to test compatibility
- plot_data needed an `as.numeric`
- the function `summarise_trait_means` was still being tests on v3.0 and indeed looking at the code, isn't going to summarise, because it is only searching for replicates within a single observation_id
The function `join_context_properties` was not being exported
@ehwenk ehwenk requested review from fontikar and dfalster November 11, 2024 09:12
R/bind_databases.R Outdated Show resolved Hide resolved
Copy link
Member

@dfalster dfalster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @ehwenk @fontikar. This is complex work, well executed. Some minor coding suggestions.

NAMESPACE Outdated Show resolved Hide resolved
R/bind_databases.R Outdated Show resolved Hide resolved
R/extract_data.R Show resolved Hide resolved
R/extract_data.R Outdated Show resolved Hide resolved
codecov.yml Show resolved Hide resolved
R/extract_data.R Outdated Show resolved Hide resolved
R/extract_data.R Show resolved Hide resolved
R/extract_data.R Outdated Show resolved Hide resolved
R/extract_data.R Show resolved Hide resolved
tests/testthat/test-database_create_combined_table.R Outdated Show resolved Hide resolved
Copy link
Collaborator

@fontikar fontikar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work Lizzy! The extract_data function is a beast but one that we've tamed :D
It bugs me I don't understand how he back tick works but that is for another day!

Well done!!

- rename all with `convert_` instead of `util_`
- add `convert_list_to_df1` from traits.build, so all 3 functions in one location
- implemented or tried to implement all suggestions in review
- for things that didn't work I've included the suggestion but commented it out - I suspect many are minor syntax fixes that i am not working out.
- remove naming of individual databases from `bind_databases`. Once we use this version of the functions in traits.build we will have to add in the creation of the `dataset_id` column in `taxonomic_updates` from elsewhere. That is the only use of the database name and it introduces an error when use in generic situations
- added lots of loops, more compact coding to `extract_data`
@ehwenk ehwenk merged commit fdaf6bb into develop Nov 12, 2024
5 checks passed
@ehwenk ehwenk deleted the generalise_extract_off_develop branch November 12, 2024 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants