
Biochem-relevant repos are astronomical #194

Closed
za158 opened this issue Jun 22, 2023 · 17 comments

@za158
Member

za158 commented Jun 22, 2023

When I browse the list at https://oss-tracker-dev.eto.tech/?field_of_study=Chemistry---Biochemistry&order_by=relevance&summary_order=open_issues&compare_graph=commit_dates, it appears that many, if not most, of the repos are actually related to astronomy, not biochemistry. This seems weird.

@za158
Member Author

za158 commented Jun 22, 2023

Looking at one of these - https://oss-tracker-dev.eto.tech/project/?name=jobovy/galpy - it seems it is also cited a ton in operations management, which is equally odd.

@jmelot jmelot self-assigned this Jun 23, 2023
@jmelot
Contributor

jmelot commented Jun 27, 2023

My first thought was to see if e.g. biochem and operations management just tended to have low scores on the papers we include. Spot-checking doesn't show a clear relationship:

Percentage of papers with the given field in their top three where that field scored > 0.5:

Good-looking fields

3032/5909 = 51.3% > 0.5 - artificial intelligence
1608/1966 = 81.8% > 0.5 - topology

Unexpected-looking fields

7159/10364 = 69.1% > 0.5 - biochemistry
5451/7123 = 76.5% > 0.5 - operations management

Then I thought I'd just see what the numbers would look like if I assigned a field to a paper if its score exceeded a certain threshold rather than if it was in the top three fields for that paper. Given 344,402 papers where we've found at least one repo mention, and using a query like this to threshold:

SELECT
  merged_id,
  name,
  field.score AS score
FROM
  fields_of_study_v2.field_scores
CROSS JOIN
  UNNEST(fields) AS field
LEFT JOIN
  fields_of_study_v2.field_meta
  ON
    field_id = field.id
WHERE
  (level = 1) and (score > <threshold>)

And then a query to count papers with a given field in that set:

SELECT
  COUNT(DISTINCT(merged_id))
FROM
  tmp.thresholded_scores -- this is where I wrote the output of the above query
WHERE
  merged_id IN (
  SELECT
    merged_id
  FROM
    staging_github_metrics.repos_in_papers
  CROSS JOIN
    UNNEST(merged_ids) AS merged_id)
  AND (name = "Topology")

I get, for Topology:

  • 125102 papers at threshold = 0.5 - so, like, ~1/3 of the papers with repo mentions
  • 86581 at 0.6
  • 9752 at 0.7
  • 2734 at 0.8

At 0.8, I see the following counts for the other fields:

  • Artificial intelligence - 4543
  • Biochemistry - 5273
  • Operations management - 4078

I then thought to check the number of papers in our dataset associated with any field under this method. Somewhat worryingly, it's 42,545, or ~12% of the papers we find repo mentions in.

(to be continued, next step is to try it anyway and see how it looks, then do something else if it looks bad)
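The two assignment strategies being compared here (top three fields per paper vs. a hard score threshold) can be sketched in plain Python; the field names and scores below are invented for illustration:

```python
# Sketch of the two field-assignment strategies under discussion:
# (1) take each paper's top-k level-1 fields by score, or
# (2) take every field whose score exceeds a threshold.
# Field names and scores are invented for illustration.

def top_k_fields(scores, k=3):
    """Assign the k highest-scoring fields to a paper, regardless of how low
    the scores get."""
    return [name for name, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

def thresholded_fields(scores, threshold=0.8):
    """Assign only fields scoring above the threshold; a paper may get none."""
    return [name for name, score in scores.items() if score > threshold]

paper = {"Astronomy": 0.92, "Biochemistry": 0.41,
         "Operations management": 0.38, "Topology": 0.05}
print(top_k_fields(paper))        # includes low-confidence fields
print(thresholded_fields(paper))  # only the high-confidence field(s)
```

The trade-off in the counts above is exactly this: top-3 always assigns something to every paper, including weak fields like the 0.38 here, while thresholding can leave a large share of papers with no field at all.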

@jmelot
Contributor

jmelot commented Jun 27, 2023

Ok, I'm actually not going to bother testing with the threshold at 0.8; just eyeballing the first page of the paper-fields from

-- get all level 1 fields of study scoring > 0.8 per paper
WITH field_order AS (
SELECT
  merged_id,
  name,
  field.score AS score
FROM
  fields_of_study_v2.field_scores
CROSS JOIN
  UNNEST(fields) AS field
LEFT JOIN
  fields_of_study_v2.field_meta
  ON
    field_id = field.id
WHERE
  (level = 1) and (score > 0.8))

SELECT
  merged_id,
  ARRAY_AGG(STRUCT(name, score)) AS top_lvl1_fields
FROM
  field_order
GROUP BY merged_id

I see stuff like this:

[screenshots]

I think we need to stick with top 3 for now. I'll first try adding a score threshold of > 0.5, and if that looks worse I'll just manually drop some fields.

@jmelot
Contributor

jmelot commented Jun 27, 2023

I thought I found the problem; here's metadata for three of the papers with a bazillion high-scoring fields:

[screenshot]

I filtered out rows without both an English title and abstract but unfortunately still see weird stuff, although it is less prevalent. Looking at three papers that still have a billion fields, they have ZH titles and abstracts. I wonder if there is some kind of issue with the ZH model.

[screenshot]

@jmelot
Contributor

jmelot commented Jun 27, 2023

I also tried filtering to only papers that have an en title and abstract and no foreign title or abstract. This seems to have drastically reduced the number of papers with a ton of fields. Here are three that remain; it looks like in these cases the abstracts are just super short/non-specific.

[screenshot]

I can't spend forever futzing with this, but I might try lowering my threshold somewhat, filtering to only papers with an en title+abstract and no foreign title or abstract, and setting a minimum length on the abstract. If that doesn't work I'll stop thresholding and go back to top three, but maybe keep the new filters and see if they reduce the weird false positives.
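The combined filter described here can be sketched as a simple predicate; the record shape and the foreign title/abstract key names are hypothetical stand-ins for the actual corpus columns:

```python
def keep_for_scoring(paper, min_abstract_len=500):
    """Keep only papers with an English title and abstract, no foreign-language
    title or abstract, and a reasonably long abstract. The dict keys here are
    hypothetical stand-ins for the corpus columns."""
    return (
        paper.get("title_english") is not None
        and paper.get("abstract_english") is not None
        and paper.get("title_foreign") is None
        and paper.get("abstract_foreign") is None
        and len(paper["abstract_english"]) > min_abstract_len
    )

ok = {"title_english": "galpy: A python Library for Galactic Dynamics",
      "abstract_english": "x" * 600}
print(keep_for_scoring(ok))                                       # True
print(keep_for_scoring({**ok, "abstract_english": "too short"}))  # False
print(keep_for_scoring({**ok, "title_foreign": "..."}))           # False
```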

@jmelot
Contributor

jmelot commented Jun 27, 2023

@jamesdunham @atoney-CSET documenting a bit of a journey of discovery 🚢 on the v2 field scores here. No action needed from you - I'm tight on time and am just going to brute-force my way to some kind of solution here - but thought you might be interested!

@jamesdunham
Member

Thanks for sharing. We found in our first trial application of the ZH model to analysis, for Cole's CAS project, that the ZH model didn't perform well enough when applied to CAS publications. These results seem consistent with that. I'd recommend not using the scores for ZH-only papers here.

@jmelot
Contributor

jmelot commented Jun 27, 2023

Duly noted!

I did a sanity check of the papers that mention galpy just to make sure there isn't any issue with the repo-paper associations somehow. The first few I checked do mention galpy, and the first 50 I skimmed all looked like they would plausibly mention that repo and (although I'm not particularly well-read in this domain) seemed obviously relevant to Astronomy/Astrophysics.

[screenshot]

I then looked at the top fields for these papers, which we're now getting from

-- get top 3 level 1 fields of study per paper
WITH
field_name_scores AS (
  SELECT
    merged_id,
    name,
    field.score AS score
  FROM
    fields_of_study_v2.field_scores
  CROSS JOIN
    UNNEST(fields) AS field
  LEFT JOIN
    fields_of_study_v2.field_meta
    ON
      field_id = field.id
  WHERE
    (level = 1)),

field_order AS (
  SELECT
    merged_id,
    name,
    score,
    ROW_NUMBER() OVER(PARTITION BY merged_id ORDER BY score DESC) AS row_num
  FROM
    field_name_scores)

SELECT
  merged_id,
  ARRAY_AGG(STRUCT(name, score)) AS top_lvl1_fields
FROM
  field_order
WHERE
  (
    row_num < 4
  ) AND (
    merged_id IN (
      SELECT merged_id
      FROM
        gcp_cset_links_v2.corpus_merged
      WHERE (title_english IS NOT NULL) AND (abstract_english IS NOT NULL) AND (LENGTH(abstract_english) > 500)
    )
  )
GROUP BY merged_id

Using the query

with s2_v3 as (
  select distinct
    merged_id as merged_id_v3,
    content.text as text
  from semantic_scholar.fulltext
  left join gcp_cset_links_v3.article_links
    on orig_id = cast(corpusid as string)
  where merged_id in (
    select merged_id_v3
    from gcp_cset_links_v3.merged_id_crosswalk
    where merged_id_v2 is not null
  )
)

select top_level1_fields.*
from staging_github_metrics.top_level1_fields
inner join (
  select distinct merged_id_v2, merged_id_v3
  from gcp_cset_links_v3.merged_id_crosswalk
) on merged_id = merged_id_v2
inner join s2_v3 using(merged_id_v3)
inner join (
  select distinct merged_id
  from staging_github_metrics.repos_in_papers
  cross join unnest(merged_ids) as merged_id
  where repo = "jobovy/galpy"
) as repos on repos.merged_id = merged_id_v2

Operations management, in particular, is indeed just super common here for some reason:

[screenshot]

As I kept scrolling, I noticed some papers where Astronomy or similar was the top field but the scores were just really low, like these:

[screenshot]

It's enough to make me want to try normalizing scores per-field, although we've talked about the potential pitfalls there. I think I'm going to do it, then, assuming it doesn't work, give up and just manually throw out fields that look bad (and say I'm doing so in the documentation).

jmelot added a commit that referenced this issue Jun 27, 2023
@jmelot
Contributor

jmelot commented Jun 27, 2023

Ok, never mind, I'm not normalizing, I think. For some reason I thought these scores were numbers between 0 and 1:

[screenshots]

@jmelot
Contributor

jmelot commented Jun 27, 2023

Tracking ^ under georgetown-cset/fields-of-study-pipeline#9

@jmelot
Contributor

jmelot commented Jun 27, 2023

Ok. I normalized the scores like this:

-- get top 3 level 1 fields of study per paper
WITH
field_name_scores AS (
  SELECT
    merged_id,
    name,
    field.score AS score
  FROM
    fields_of_study_v2.field_scores
  CROSS JOIN
    UNNEST(fields) AS field
  LEFT JOIN
    fields_of_study_v2.field_meta
    ON
      field_id = field.id
  WHERE
    (level = 1) and (field.score <= 1)),

max_field_scores AS (
  SELECT
    name,
    max(score) as max_score
  FROM
    field_name_scores
  GROUP BY
    name
),

norm_field_names_scores AS (
  SELECT
    merged_id,
    name,
    score/max_score as score
  FROM
    field_name_scores
  LEFT JOIN
    max_field_scores
  USING(name)
),

field_order AS (
  SELECT
    merged_id,
    name,
    score,
    ROW_NUMBER() OVER(PARTITION BY merged_id ORDER BY score DESC) AS row_num
  FROM
    norm_field_names_scores)

SELECT
  merged_id,
  ARRAY_AGG(STRUCT(name, score)) AS top_lvl1_fields
FROM
  field_order
WHERE
  (
    row_num < 4
  ) AND (
    merged_id IN (
      SELECT merged_id
      FROM
        gcp_cset_links_v2.corpus_merged
      WHERE (title_english IS NOT NULL) AND (abstract_english IS NOT NULL) AND (LENGTH(abstract_english) > 500)
    )
  )
GROUP BY merged_id

Just eyeballing it:

1.) The field list is significantly reduced. Some fields which had several quite good repos, like Topology, are gone. Some L0 fields have nothing under them. Math has only one subfield now.
2.) BUT -- the prevalence of fields with a bunch of irrelevant repos is drastically reduced.

I'm going to futz just a little more to see if I can get some of the good subfields back, but I think we're actually going to want to normalize. For many of the remaining fields, things are looking really good now.
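The per-field normalization in the query above (divide each score by the maximum observed score for that field) amounts to roughly this sketch:

```python
from collections import defaultdict

def normalize_per_field(rows):
    """rows: (merged_id, field_name, score) triples. Divide each score by the
    maximum observed score for that field, putting every field on a comparable
    0-1 scale (mirrors the max_field_scores / norm_field_names_scores CTEs)."""
    max_score = defaultdict(float)
    for _, name, score in rows:
        max_score[name] = max(max_score[name], score)
    return [(mid, name, score / max_score[name]) for mid, name, score in rows]

rows = [("p1", "Topology", 0.8), ("p2", "Topology", 0.4),
        ("p1", "Biochemistry", 0.2)]
print(normalize_per_field(rows))
```

One caveat this implies: every field's single best-scoring paper gets inflated to exactly 1.0, even if its raw score was weak, which may be part of why some fields behave oddly after normalizing.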

@jmelot
Contributor

jmelot commented Jun 27, 2023

Oops, pasted the wrong query above; fixed. I can't just min(score, 1) for normalization purposes.

@jmelot
Contributor

jmelot commented Jun 28, 2023

Never mind; now forensic engineering is extremely prevalent:

[screenshot]

Probably down to manual field removal now.

@jamesdunham
Member

I wonder if ranking by field score percentile (over papers) would work.

@jmelot
Contributor

jmelot commented Jun 28, 2023

It's a good thought. I tried

-- get top 3 level 1 fields of study per paper
WITH
field_name_scores AS (
  SELECT
    merged_id,
    name,
    field.score AS score
  FROM
    fields_of_study_v2.field_scores
  CROSS JOIN
    UNNEST(fields) AS field
  LEFT JOIN
    fields_of_study_v2.field_meta
    ON
      field_id = field.id
  WHERE
    (level = 1) and (field.score <= 1)),

field_percentiles AS (
  SELECT
    merged_id,
    name,
    PERCENT_RANK() OVER(partition by name order by score desc) as score
  FROM
    field_name_scores
),

field_order AS (
  SELECT
    merged_id,
    name,
    score,
    ROW_NUMBER() OVER(PARTITION BY merged_id ORDER BY score DESC) AS row_num
  FROM
    field_percentiles)

SELECT
  merged_id,
  ARRAY_AGG(STRUCT(name, score)) AS top_lvl1_fields
FROM
  field_order
WHERE
  (
    row_num < 4
  ) AND (
    merged_id IN (
      SELECT merged_id
      FROM
        gcp_cset_links_v2.corpus_merged
      WHERE (title_english IS NOT NULL) AND (abstract_english IS NOT NULL) AND (LENGTH(abstract_english) > 500)
    )
  )
GROUP BY merged_id

but got a resources-exceeded error. I still get resources exceeded even for just

-- get top 3 level 1 fields of study per paper
WITH
field_name_scores AS (
  SELECT
    merged_id,
    name,
    field.score AS score
  FROM
    fields_of_study_v2.field_scores
  CROSS JOIN
    UNNEST(fields) AS field
  LEFT JOIN
    fields_of_study_v2.field_meta
    ON
      field_id = field.id
  WHERE
    (level = 1) and (field.score <= 1) and (
    merged_id IN (
      SELECT merged_id
      FROM
        gcp_cset_links_v2.corpus_merged
      WHERE (title_english IS NOT NULL) AND (abstract_english IS NOT NULL) AND (LENGTH(abstract_english) > 500)
    )))

  SELECT
    merged_id,
    name,
    PERCENT_RANK() OVER(partition by name order by score desc) as score
  FROM
    field_name_scores

Much as I hate to default to curating the fields manually, I think it's time, but I'll come back to this later!
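For the record, SQL's PERCENT_RANK is (rank - 1) / (n - 1) within each partition, with ties sharing a rank; a minimal sketch of that definition (one thing worth double-checking in the query above: with ORDER BY score DESC inside the window, the top-scoring paper in each field gets 0.0, so the later ROW_NUMBER ordering by that value DESC would surface the lowest-percentile fields first):

```python
def percent_rank(values):
    """SQL PERCENT_RANK() over an ascending ordering: (rank - 1) / (n - 1),
    where rank is the 1-based position of a value's first peer (ties share a
    rank). Returns one percentile per input value, in input order."""
    n = len(values)
    if n == 1:
        return [0.0]
    return [sum(1 for w in values if w < v) / (n - 1) for v in values]

print(percent_rank([10, 20, 30]))  # [0.0, 0.5, 1.0]
print(percent_rank([10, 10, 30]))  # [0.0, 0.0, 1.0]
```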

@jmelot
Contributor

jmelot commented Jun 28, 2023

Closing this for my tracking (I'm going to just remove biochemistry), with further work in #201.

@jmelot jmelot closed this as completed Jun 28, 2023