
Biochem-relevant repos are astronomical #194

Closed
za158 opened this issue Jun 22, 2023 · 17 comments

@za158
Member

za158 commented Jun 22, 2023

When I browse the list at https://oss-tracker-dev.eto.tech/?field_of_study=Chemistry---Biochemistry&order_by=relevance&summary_order=open_issues&compare_graph=commit_dates, it appears that many, if not most, of the repos are actually related to astronomy, not biochemistry. This seems weird.

@za158
Member Author

za158 commented Jun 22, 2023

Looking at one of these - https://oss-tracker-dev.eto.tech/project/?name=jobovy/galpy - it seems it is also cited a ton in operations management, which is equally odd.

@jmelot jmelot self-assigned this Jun 23, 2023
@jmelot
Contributor

jmelot commented Jun 27, 2023

My first thought was to see if e.g. biochem and operations management just tended to have low scores on the papers we include. Spot-checking doesn't show a clear relationship:

Percentage of papers with the given field in their top three where that field scored > 0.5:

Good-looking fields

3032/5909 = 51.3% > 0.5 - artificial intelligence
1608/1966 = 81.8% > 0.5 - topology

Unexpected-looking fields

7159/10364 = 69.1% > 0.5 - biochemistry
5451/7123 = 76.5% > 0.5 - operations management

Then I thought I'd just see what the numbers would look like if I assigned a field to a paper if its score exceeded a certain threshold rather than if it was in the top three fields for that paper. Given 344,402 papers where we've found at least one repo mention, and using a query like this to threshold:

SELECT
  merged_id,
  name,
  field.score AS score
FROM
  fields_of_study_v2.field_scores
CROSS JOIN
  UNNEST(fields) AS field
LEFT JOIN
  fields_of_study_v2.field_meta
  ON
    field_id = field.id
WHERE
  (level = 1) and (score > <threshold>)

And then a query to count papers with a given field in that set:

SELECT
  COUNT(DISTINCT(merged_id))
FROM
  tmp.thresholded_scores -- this is where I wrote the output of the above query
WHERE
  merged_id IN (
  SELECT
    merged_id
  FROM
    staging_github_metrics.repos_in_papers
  CROSS JOIN
    UNNEST(merged_ids) AS merged_id)
  AND (name = "Topology")

I get, for Topology:

  • 125102 papers at threshold = 0.5 - so, like, ~1/3 of the papers with repo mentions
  • 86581 at 0.6
  • 9752 at 0.7
  • 2734 at 0.8

At 0.8, I see the following counts for the other fields:

  • Artificial intelligence - 4543
  • Biochemistry - 5273
  • Operations management - 4078

I then thought to check the number of papers in our dataset associated with any field under this method. Somewhat worryingly, it's 42,545, or ~12% of the papers we find repo mentions in.

(to be continued, next step is to try it anyway and see how it looks, then do something else if it looks bad)
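The two assignment strategies being compared here (top three fields per paper vs. a hard score threshold) can be sketched in plain Python; the field names and scores below are invented for illustration:

```python
# Sketch of the two field-assignment strategies under discussion:
# (1) take each paper's top-k level-1 fields by score, or
# (2) take every field whose score exceeds a threshold.
# Field names and scores are invented for illustration.

def top_k_fields(scores, k=3):
    """Assign the k highest-scoring fields to a paper, regardless of how low
    the scores get."""
    return [name for name, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

def thresholded_fields(scores, threshold=0.8):
    """Assign only fields scoring above the threshold; a paper may get none."""
    return [name for name, score in scores.items() if score > threshold]

paper = {"Astronomy": 0.92, "Biochemistry": 0.41,
         "Operations management": 0.38, "Topology": 0.05}
print(top_k_fields(paper))        # includes low-confidence fields
print(thresholded_fields(paper))  # only the high-confidence field(s)
```

The trade-off in the counts above is exactly this: top-3 always assigns something to every paper, including weak fields like the 0.38 here, while thresholding can leave a large share of papers with no field at all.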

@jmelot
Contributor

jmelot commented Jun 27, 2023

Ok, I'm actually not going to bother testing with the threshold at 0.8; just eyeballing the first page of the paper-fields from

-- get all level 1 fields of study scoring > 0.8 per paper
WITH field_order AS (
SELECT
  merged_id,
  name,
  field.score AS score
FROM
  fields_of_study_v2.field_scores
CROSS JOIN
  UNNEST(fields) AS field
LEFT JOIN
  fields_of_study_v2.field_meta
  ON
    field_id = field.id
WHERE
  (level = 1) and (score > 0.8))

SELECT
  merged_id,
  ARRAY_AGG(STRUCT(name, score)) AS top_lvl1_fields
FROM
  field_order
GROUP BY merged_id

I see stuff like this:

[screenshots]

I think we need to stick with top 3 for now. I'll first try adding a score threshold of > 0.5, and if that looks worse I'll just manually drop some fields.

@jmelot
Contributor

jmelot commented Jun 27, 2023

I thought I found the problem; here's metadata for three of the papers with a bazillion high-scoring fields:

[screenshot]

I filtered out rows without both an English title and abstract but unfortunately still see weird stuff, although it is less prevalent. Looking at three papers that still have a billion fields, they have ZH titles and abstracts. I wonder if there is some kind of issue with the ZH model.

[screenshot]

@jmelot
Contributor

jmelot commented Jun 27, 2023

I also tried filtering to only papers that have an en title and abstract and no foreign title or abstract. This seems to have drastically reduced the number of papers with a ton of fields. Here are three that remain; it looks like in these cases the abstracts are just super short/non-specific.

[screenshot]

I can't spend forever futzing with this, but I might try lowering my threshold somewhat, filtering to only papers with an en title+abstract and no foreign title or abstract, and setting a minimum length on the abstract. If that doesn't work I'll stop thresholding and go back to top three, but maybe keep the new filters and see if they reduce the weird false positives.
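The combined filter described here can be sketched as a simple predicate; the record shape and the foreign title/abstract key names are hypothetical stand-ins for the actual corpus columns:

```python
def keep_for_scoring(paper, min_abstract_len=500):
    """Keep only papers with an English title and abstract, no foreign-language
    title or abstract, and a reasonably long abstract. The dict keys here are
    hypothetical stand-ins for the corpus columns."""
    return (
        paper.get("title_english") is not None
        and paper.get("abstract_english") is not None
        and paper.get("title_foreign") is None
        and paper.get("abstract_foreign") is None
        and len(paper["abstract_english"]) > min_abstract_len
    )

ok = {"title_english": "galpy: A python Library for Galactic Dynamics",
      "abstract_english": "x" * 600}
print(keep_for_scoring(ok))                                       # True
print(keep_for_scoring({**ok, "abstract_english": "too short"}))  # False
print(keep_for_scoring({**ok, "title_foreign": "..."}))           # False
```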

@jmelot
Contributor

jmelot commented Jun 27, 2023

@jamesdunham @atoney-CSET documenting a bit of a journey of discovery 🚢 on the v2 field scores here. No action needed from you - I'm tight on time and am just going to brute-force my way to some kind of solution here - but thought you might be interested!

@jamesdunham
Member

Thanks for sharing. We found in our first trial application of the ZH model to analysis, for Cole's CAS project, that the ZH model didn't perform well enough when applied to CAS publications. These results seem consistent with that. I'd recommend not using the scores for ZH-only papers here.

@jmelot
Contributor

jmelot commented Jun 27, 2023

Duly noted!

I did a sanity check of the papers that mention galpy just to make sure there isn't any issue with the repo-paper associations somehow. The first few I checked do mention galpy, and the first 50 I skimmed all looked like they would plausibly mention that repo and (although I'm not particularly well-read in this domain) seemed obviously relevant to Astronomy/Astrophysics.

[screenshot]

I then looked at the top fields for these papers, which we're now getting from

-- get top 3 level 1 fields of study per paper
WITH
field_name_scores AS (
  SELECT
    merged_id,
    name,
    field.score AS score
  FROM
    fields_of_study_v2.field_scores
  CROSS JOIN
    UNNEST(fields) AS field
  LEFT JOIN
    fields_of_study_v2.field_meta
    ON
      field_id = field.id
  WHERE
    (level = 1)),

field_order AS (
  SELECT
    merged_id,
    name,
    score,
    ROW_NUMBER() OVER(PARTITION BY merged_id ORDER BY score DESC) AS row_num
  FROM
    field_name_scores)

SELECT
  merged_id,
  ARRAY_AGG(STRUCT(name, score)) AS top_lvl1_fields
FROM
  field_order
WHERE
  (
    row_num < 4
  ) AND (
    merged_id IN (
      SELECT merged_id
      FROM
        gcp_cset_links_v2.corpus_merged
      WHERE (title_english IS NOT NULL) AND (abstract_english IS NOT NULL) AND (LENGTH(abstract_english) > 500)
    )
  )
GROUP BY merged_id

Using the query

with s2_v3 as (
  select distinct
    merged_id as merged_id_v3,
    content.text as text
  from semantic_scholar.fulltext
  left join gcp_cset_links_v3.article_links
    on orig_id = cast(corpusid as string)
  where merged_id in (
    select merged_id_v3
    from gcp_cset_links_v3.merged_id_crosswalk
    where merged_id_v2 is not null
  )
)

select top_level1_fields.*
from staging_github_metrics.top_level1_fields
inner join (
  select distinct merged_id_v2, merged_id_v3
  from gcp_cset_links_v3.merged_id_crosswalk
) on merged_id = merged_id_v2
inner join s2_v3 using(merged_id_v3)
inner join (
  select distinct merged_id
  from staging_github_metrics.repos_in_papers
  cross join unnest(merged_ids) as merged_id
  where repo = "jobovy/galpy"
) as repos on repos.merged_id = merged_id_v2

Operations management, in particular, is indeed just super common here for some reason:

[screenshot]

As I kept scrolling, I noticed some papers where Astronomy or similar was the top field but the scores were just really low, like these:

[screenshot]

It's enough to make me want to try normalizing scores per-field, although we've talked about the potential pitfalls there. I think I'm going to do it, then, assuming it doesn't work, give up and just manually throw out fields that look bad (and say I'm doing so in the documentation).

jmelot added a commit that referenced this issue Jun 27, 2023
@jmelot
Contributor

jmelot commented Jun 27, 2023

Ok, never mind, I'm not normalizing, I think. For some reason I thought these scores were numbers between 0 and 1:

[screenshots]

@jmelot
Contributor

jmelot commented Jun 27, 2023

Tracking ^ under georgetown-cset/fields-of-study-pipeline#9

@jmelot
Contributor

jmelot commented Jun 27, 2023

Ok. I normalized the scores like this:

-- get top 3 level 1 fields of study per paper
WITH
field_name_scores AS (
  SELECT
    merged_id,
    name,
    field.score AS score
  FROM
    fields_of_study_v2.field_scores
  CROSS JOIN
    UNNEST(fields) AS field
  LEFT JOIN
    fields_of_study_v2.field_meta
    ON
      field_id = field.id
  WHERE
    (level = 1) and (field.score <= 1)),

max_field_scores AS (
  SELECT
    name,
    max(score) as max_score
  FROM
    field_name_scores
  GROUP BY
    name
),

norm_field_names_scores AS (
  SELECT
    merged_id,
    name,
    score/max_score as score
  FROM
    field_name_scores
  LEFT JOIN
    max_field_scores
  USING(name)
),

field_order AS (
  SELECT
    merged_id,
    name,
    score,
    ROW_NUMBER() OVER(PARTITION BY merged_id ORDER BY score DESC) AS row_num
  FROM
    norm_field_names_scores)

SELECT
  merged_id,
  ARRAY_AGG(STRUCT(name, score)) AS top_lvl1_fields
FROM
  field_order
WHERE
  (
    row_num < 4
  ) AND (
    merged_id IN (
      SELECT merged_id
      FROM
        gcp_cset_links_v2.corpus_merged
      WHERE (title_english IS NOT NULL) AND (abstract_english IS NOT NULL) AND (LENGTH(abstract_english) > 500)
    )
  )
GROUP BY merged_id

Just eyeballing it:

1.) The field list is significantly reduced. Some fields which had several quite good repos, like Topology, are gone. Some L0 fields have nothing under them. Math has only one subfield now.
2.) BUT -- the prevalence of fields with a bunch of irrelevant repos is drastically reduced.

I'm going to futz just a little more to see if I can get some of the good subfields back, but I think we're actually going to want to normalize. For many of the remaining fields, things are looking really good now.
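The per-field normalization in the query above (divide each score by the maximum observed score for that field) amounts to roughly this sketch:

```python
from collections import defaultdict

def normalize_per_field(rows):
    """rows: (merged_id, field_name, score) triples. Divide each score by the
    maximum observed score for that field, putting every field on a comparable
    0-1 scale (mirrors the max_field_scores / norm_field_names_scores CTEs)."""
    max_score = defaultdict(float)
    for _, name, score in rows:
        max_score[name] = max(max_score[name], score)
    return [(mid, name, score / max_score[name]) for mid, name, score in rows]

rows = [("p1", "Topology", 0.8), ("p2", "Topology", 0.4),
        ("p1", "Biochemistry", 0.2)]
print(normalize_per_field(rows))
```

One caveat this implies: every field's single best-scoring paper gets inflated to exactly 1.0, even if its raw score was weak, which may be part of why some fields behave oddly after normalizing.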

@jmelot
Contributor

jmelot commented Jun 27, 2023

Oops, pasted the wrong query above; fixed. I can't just min(score, 1) for normalization purposes.

@jmelot
Contributor

jmelot commented Jun 28, 2023

Never mind; now forensic engineering is extremely prevalent:

[screenshot]

Probably down to manual field removal now.

@jamesdunham
Member

I wonder if ranking by field score percentile (over papers) would work.

@jmelot
Contributor

jmelot commented Jun 28, 2023

It's a good thought. I tried

-- get top 3 level 1 fields of study per paper
WITH
field_name_scores AS (
  SELECT
    merged_id,
    name,
    field.score AS score
  FROM
    fields_of_study_v2.field_scores
  CROSS JOIN
    UNNEST(fields) AS field
  LEFT JOIN
    fields_of_study_v2.field_meta
    ON
      field_id = field.id
  WHERE
    (level = 1) and (field.score <= 1)),

field_percentiles AS (
  SELECT
    merged_id,
    name,
    PERCENT_RANK() OVER(partition by name order by score desc) as score
  FROM
    field_name_scores
),

field_order AS (
  SELECT
    merged_id,
    name,
    score,
    ROW_NUMBER() OVER(PARTITION BY merged_id ORDER BY score DESC) AS row_num
  FROM
    field_percentiles)

SELECT
  merged_id,
  ARRAY_AGG(STRUCT(name, score)) AS top_lvl1_fields
FROM
  field_order
WHERE
  (
    row_num < 4
  ) AND (
    merged_id IN (
      SELECT merged_id
      FROM
        gcp_cset_links_v2.corpus_merged
      WHERE (title_english IS NOT NULL) AND (abstract_english IS NOT NULL) AND (LENGTH(abstract_english) > 500)
    )
  )
GROUP BY merged_id

but got a resources-exceeded error. I still get resources exceeded even for just

-- get top 3 level 1 fields of study per paper
WITH
field_name_scores AS (
  SELECT
    merged_id,
    name,
    field.score AS score
  FROM
    fields_of_study_v2.field_scores
  CROSS JOIN
    UNNEST(fields) AS field
  LEFT JOIN
    fields_of_study_v2.field_meta
    ON
      field_id = field.id
  WHERE
    (level = 1) and (field.score <= 1) and (
    merged_id IN (
      SELECT merged_id
      FROM
        gcp_cset_links_v2.corpus_merged
      WHERE (title_english IS NOT NULL) AND (abstract_english IS NOT NULL) AND (LENGTH(abstract_english) > 500)
    )))

  SELECT
    merged_id,
    name,
    PERCENT_RANK() OVER(partition by name order by score desc) as score
  FROM
    field_name_scores

Much as I hate to default to curating the fields manually, I think it's time, but I'll come back to this later!
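For the record, SQL's PERCENT_RANK is (rank - 1) / (n - 1) within each partition, with ties sharing a rank; a minimal sketch of that definition (one thing worth double-checking in the query above: with ORDER BY score DESC inside the window, the top-scoring paper in each field gets 0.0, so the later ROW_NUMBER ordering by that value DESC would surface the lowest-percentile fields first):

```python
def percent_rank(values):
    """SQL PERCENT_RANK() over an ascending ordering: (rank - 1) / (n - 1),
    where rank is the 1-based position of a value's first peer (ties share a
    rank). Returns one percentile per input value, in input order."""
    n = len(values)
    if n == 1:
        return [0.0]
    return [sum(1 for w in values if w < v) / (n - 1) for v in values]

print(percent_rank([10, 20, 30]))  # [0.0, 0.5, 1.0]
print(percent_rank([10, 10, 30]))  # [0.0, 0.0, 1.0]
```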

@jmelot
Contributor

jmelot commented Jun 28, 2023

Closing this for my tracking (I'm going to just remove biochemistry), with further work in #201.

@jmelot jmelot closed this as completed Jun 28, 2023