Biochem-relevant repos are astronomical #194
Comments
Looking at one of these - https://oss-tracker-dev.eto.tech/project/?name=jobovy/galpy - it seems it is also cited a ton in operations management, which is equally odd.
And then the list at https://oss-tracker-dev.eto.tech/?field_of_study=Economics---Operations+management&order_by=relevance&summary_order=open_issues&compare_graph=commit_dates seems extremely similar, if not identical, to the biochem one ^
My first thought was to see if e.g. biochem and operations management just tended to have low scores on the papers we include. Spot-checking doesn't show a clear relationship. Percentage of papers with the given field in their top three where that field scored > 0.5:

- Good-looking fields: artificial intelligence, 3032/5909 = 51.3%
- Unexpected-looking fields: biochemistry, 7159/10364 = 69.1%

Then I thought I'd just see what the numbers would look like if I assigned a field to a paper when its score exceeded a certain threshold, rather than when it was in the top three fields for that paper. Given 344,402 papers where we've found at least one repo mention, and using a query like this to threshold:
And then a query to count papers with a given field in that set:
I get, for Topology:
At 0.8, I see the following counts for the other fields:
I then thought to check the number of papers in our dataset associated with any field under this method. Somewhat worryingly, it's 42,545, or ~34% of the papers we find repo mentions in.

(To be continued. Next step is to try it anyway and see how it looks, then do something else if it looks bad.)
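For anyone following along, here is a minimal Python sketch of the two assignment rules compared above: top-three fields per paper versus a fixed score threshold. It is an illustration only, not the query used in this thread; the sample papers, scores, and helper names (`top_k_fields`, `fields_above`, `count_papers_per_field`) are all made up.

```python
from collections import Counter

# Hypothetical sample: paper id -> {field name: score}. Not real data.
paper_scores = {
    "p1": {"Astronomy": 0.92, "Biochemistry": 0.55, "Topology": 0.10, "Physics": 0.40},
    "p2": {"Artificial intelligence": 0.85, "Biochemistry": 0.30, "Topology": 0.81, "Physics": 0.05},
    "p3": {"Astronomy": 0.45, "Biochemistry": 0.35, "Operations management": 0.33, "Physics": 0.20},
}

def top_k_fields(scores, k=3):
    """Rule 1: a paper gets its k highest-scoring fields, however low the scores are."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def fields_above(scores, threshold=0.8):
    """Rule 2: a paper gets only the fields whose score exceeds the threshold."""
    return {field for field, score in scores.items() if score > threshold}

def count_papers_per_field(papers, assign):
    """Count how many papers are assigned to each field under the given rule."""
    counts = Counter()
    for scores in papers.values():
        counts.update(assign(scores))
    return counts

top3_counts = count_papers_per_field(paper_scores, top_k_fields)
thresh_counts = count_papers_per_field(paper_scores, fields_above)

# Papers that keep at least one field under the threshold rule.
covered = sum(1 for scores in paper_scores.values() if fields_above(scores))
```

In this toy sample, the top-three rule attaches Biochemistry to every paper because it is always a mid-ranked field, while the threshold rule drops it entirely and leaves one paper with no field at all. That is the same coverage trade-off as the count above.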
@jamesdunham @atoney-CSET documenting a bit of a journey of discovery 🚢 on the v2 field scores here. No action needed from you - I'm tight on time and am just going to brute-force my way to some kind of solution here - but thought you might be interested!
Thanks for sharing. We found in our first trial application of the ZH model to analysis, for Cole's CAS project, that the ZH model didn't perform well enough when applied to CAS publications. These results seem consistent with that. I'd recommend not using the scores for ZH-only papers here.
Tracking ^ under georgetown-cset/fields-of-study-pipeline#9
Ok. I normalized the scores like this:
Just eyeballing it: the field list is significantly reduced. Some fields which had several quite good repos, like Topology, are gone. Some L0 fields have nothing under them. Math has only one subfield now. I'm going to futz just a little more to see if I can get some of the good subfields back, but I think actually we're going to want to normalize. For many of the remaining fields, things are looking really good now.
Oops, pasted wrong query above, fixed. I can't just min(score, 1) for normalization purposes.
I wonder if ranking by field score percentile (over papers) would work.
It's a good thought. I tried
but got resources exceeded. I still get resources exceeded for just
Much as I hate to default to curating the fields manually, I think it's time, but I'll come back to this later!
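For reference, here is a rough in-memory Python sketch of the percentile idea: each paper's score for a field is replaced by the fraction of papers scoring at or below it for that same field. The sample data and the `percentile_ranks` helper are hypothetical. This only illustrates the ranking logic and does nothing about the resources-exceeded problem at full scale.

```python
from bisect import bisect_right
from collections import defaultdict

def percentile_ranks(paper_scores):
    """Map each (paper, field) score to the fraction of papers whose
    score for that same field is less than or equal to it."""
    by_field = defaultdict(list)
    for scores in paper_scores.values():
        for field, score in scores.items():
            by_field[field].append(score)
    for values in by_field.values():
        values.sort()
    return {
        paper: {
            field: bisect_right(by_field[field], score) / len(by_field[field])
            for field, score in scores.items()
        }
        for paper, scores in paper_scores.items()
    }

# Hypothetical sample where Biochemistry scores run high for every paper.
sample = {
    "p1": {"Biochemistry": 0.9, "Topology": 0.2},
    "p2": {"Biochemistry": 0.8, "Topology": 0.1},
    "p3": {"Biochemistry": 0.7, "Topology": 0.3},
    "p4": {"Biochemistry": 0.6, "Topology": 0.05},
}

ranks = percentile_ranks(sample)
best_raw = max(sample["p3"], key=sample["p3"].get)
best_by_percentile = max(ranks["p3"], key=ranks["p3"].get)
```

Here `p3` has Biochemistry as its top raw field, but Topology as its top field by percentile, since 0.3 is the highest Topology score in the sample. A field that scores uniformly high across papers stops dominating once scores are ranked within the field.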
Closing this for my tracking (I'm going to just remove biochemistry), with further work in #201 |
When I browse the list at https://oss-tracker-dev.eto.tech/?field_of_study=Chemistry---Biochemistry&order_by=relevance&summary_order=open_issues&compare_graph=commit_dates, it appears that many, if not most, of the repos are actually related to astronomy, not biochemistry. This seems weird.