Calculation of frequencies during ingest #5236
Comments
Hi @lubitchv, we should catch up about Scholar's Portal's schedule for the Data Curation Tool development and what the dependencies are. I'll get in touch with @amberleahey to schedule. |
ping @landreev - we should talk about this one. Are we OK with ingest adding frequencies to the DB? |
Thanks @lubitchv for the PR with the proposed changes. Moving to code review. |
@lubitchv Looking at the PR, couple of things:
So a really clean implementation should probably include separate cases for strings, floats, doubles and integers - just like elsewhere in IngestService, where we calculate summary stats - and maybe use Hashtable<Object, Double> to count. |
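A minimal sketch of what type-specific counting could look like, assuming the column has already been read from the tab file as an array of strings. The class and method names are hypothetical and not taken from IngestService, and the sketch uses a `Map` with `Long` counts rather than the `Hashtable<Object, Double>` suggested above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of IngestService. */
final class FrequencyCountSketch {

    /** Counts occurrences of each distinct string value; empty cells are treated as missing. */
    static Map<String, Long> countStrings(String[] column) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String cell : column) {
            if (cell == null || cell.isEmpty()) {
                continue;
            }
            counts.merge(cell, 1L, Long::sum);
        }
        return counts;
    }

    /** Counts integer values; cells are parsed so the key is the number, not its formatting. */
    static Map<Long, Long> countIntegers(String[] column) {
        Map<Long, Long> counts = new LinkedHashMap<>();
        for (String cell : column) {
            if (cell == null || cell.isEmpty()) {
                continue;
            }
            counts.merge(Long.valueOf(cell), 1L, Long::sum);
        }
        return counts;
    }

    /** Counts floating-point values; "0" and "0.0" parse to the same Double key. */
    static Map<Double, Long> countDoubles(String[] column) {
        Map<Double, Long> counts = new LinkedHashMap<>();
        for (String cell : column) {
            if (cell == null || cell.isEmpty()) {
                continue;
            }
            counts.merge(Double.valueOf(cell), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countStrings(new String[] {"a", "b", "a"}));
        System.out.println(countIntegers(new String[] {"0", "1", "1"}));
        System.out.println(countDoubles(new String[] {"0", "0.0", "1.0"}));
    }
}
```

Keeping one method per type means each one decides for itself how a cell is parsed and what counts as "the same value"; a float variant would follow the same pattern as `countDoubles`.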
oh, and there may be |
BTW, one thing I'm a little confused about: why is VariableCategory.frequency a float, and not an integer?? That class file is very old, legacy code... I'm guessing maybe the original intent back in the day was to actually store frequencies, and not the numbers of occurrences? (as in, number of occurrences/total number of cases) |
Hi,
Regarding frequencies as Double, it is indeed confusing. My guess is that frequency is a Double in Dataverse because of weighted frequencies. There is a flag that says whether a variable is weighted, and weighted frequencies are real numbers. There is a problem here, since we will need both frequencies and weighted frequencies, and the weighted frequency field does not exist in Dataverse at the moment. The Data Curation Tool will allow users to weight variables and calculate weighted frequencies; an API should then write them back to the Dataverse database. But anyway, weighted frequencies are for another issue, later. |
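To illustrate why a weighted frequency is a real number rather than an integer count, here is a minimal sketch that sums a per-case weight for each category value. The class, the method and the idea of passing weights as a parallel array are assumptions made for illustration; nothing here reflects an existing Dataverse or Data Curation Tool API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only. */
final class WeightedFrequencySketch {

    /**
     * Sums the weight of every case that falls into each category value.
     * values[i] is the category value of case i, weights[i] its weight.
     */
    static Map<String, Double> weightedFrequencies(String[] values, double[] weights) {
        if (values.length != weights.length) {
            throw new IllegalArgumentException("values and weights must have the same length");
        }
        Map<String, Double> sums = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            sums.merge(values[i], weights[i], Double::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[] values = {"0", "1", "0"};
        double[] weights = {1.5, 2.0, 0.5};
        System.out.println(weightedFrequencies(values, weights));
    }
}
```

With all weights equal to 1.0 this reduces to the plain number of occurrences, which is why a single floating-point field could in principle hold either quantity, though not both at once.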
@lubitchv |
@landreev |
@lubitchv |
@lubitchv The way I first described this problem was: what if we have a tab file where the same value is represented differently on different lines... like "0" and "0.0". On second thought, let's not even worry about this use case. The tab files that our ingest produces should have all the numeric values formatted the same way. If we discover a tab file where this is not the case - we'll fix it.

What we should worry about is the case where the formatting of the numeric value, as it appears in the tab file, is different from the way it's represented in the category value, as stored in the database (this is VariableCategory.getValue()). Because our ingest does not actually guarantee that. Does that make sense?

A normal example of a variable category: a researcher has a numeric, integer column in their data file. She assigns categorical labels: "Urban" to 0 and "Rural" to 1. In this trivial example, when the data file is ingested into Dataverse, "0" and "1" are saved as category values, and "Urban" and "Rural" as category labels. And the tabular file has a column of "0"s and "1"s. The current code in the PR works just fine for this example. When you open the tab file and subset this variable column as a vector of Strings, you get back an array of "0"s and "1"s, you compare them to the .getValue() strings and you get the correct counts of the values...

But here's another example: in Stata you can assign value labels to integers, and also to floats (but only to values with zero fractional part). So I created this Stata file (attached; I had to change the extension .dta to .txt in order for GitHub to accept it), with a single variable column of type float, made of values 0.0 and 1.0. I assigned the labels "Urban" and "Rural", as above. When ingested into Dataverse, the entries in the tab file are formatted as floats:
) This is because Stata treats all these categorical values as integers internally; and because we never thought it was important to format the values exactly the same way as the entries in the tab file... So the code currently in the PR will produce frequency = 0 for each of the 2 categories. |
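One possible way around the mismatch described above, sketched under assumptions: before comparing, bring both the category value and the tab-file entry to a canonical form, so that a category stored as "0" matches a cell written as "0.0". The class and method names are hypothetical, the category values are passed in as plain strings standing in for what VariableCategory.getValue() returns, and this is not the approach taken in the PR.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not the code from the PR. */
final class CategoryMatchSketch {

    /**
     * Returns a canonical key for a cell or category value.
     * Numeric strings lose trailing zeros ("0.0" and "0" both become "0");
     * anything that does not parse as a number is returned unchanged.
     * Assumes Java 8 or later, where stripTrailingZeros() also normalizes zero.
     */
    static String canonical(String raw) {
        if (raw == null) {
            return null;
        }
        try {
            return new BigDecimal(raw.trim()).stripTrailingZeros().toPlainString();
        } catch (NumberFormatException e) {
            return raw;
        }
    }

    /** Counts, for each category value, the cells in the column that match it canonically. */
    static Map<String, Long> frequencies(List<String> categoryValues, String[] column) {
        Map<String, Long> byCanonical = new HashMap<>();
        for (String cell : column) {
            if (cell == null || cell.isEmpty()) {
                continue; // treat empty cells as missing
            }
            byCanonical.merge(canonical(cell), 1L, Long::sum);
        }
        Map<String, Long> result = new LinkedHashMap<>();
        for (String category : categoryValues) {
            result.put(category, byCanonical.getOrDefault(canonical(category), 0L));
        }
        return result;
    }

    public static void main(String[] args) {
        // Category values stored as integers, tab-file entries formatted as floats.
        List<String> categories = List.of("0", "1");
        String[] column = {"0.0", "1.0", "1.0"};
        System.out.println(frequencies(categories, column));
    }
}
```

A plain String.equals() comparison on the same input would find no matches for either category, which is the frequency = 0 outcome described in the comment.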
@landreev |
@landreev |
@landreev |
@lubitchv |
Frequencies are not calculated during upload and ingest of a file into Dataverse. As a result, the exported DDI metadata has no frequency field in catStat in the exported DDI XML file. This information is needed for the new Data Curation Tool, which is discussed in #4174, #4448 and #3604.