clinicaltrials.gov #40

Open
Jiros opened this issue Mar 31, 2020 · 24 comments
Assignees
Labels
Status: In Dev (this issue has been moved to the Dev environment for testing); Type: Data Source (identifies an issue as a data source)

Comments

@Jiros

Jiros commented Mar 31, 2020

For more information and comprehensive guidance see the excellent article from Kirsten Langendorf -
https://www.s-cubed-global.com/news/covidgraph-nerds-response-to-the-pandemic

Repo

https://github.com/covidgraph/data_clinical-trials-gov

Description

Suggested by - lynnehansen

Add data about clinical trials. There are a few databases where the results of clinical trials are published. The most relevant general purpose databases are clinicaltrials.gov and clinicaltrialsregister.eu

There might be two more data sources:

  • collections of clinical trials relevant for specific areas such as Covid-19
  • monitoring of ongoing clinical trials by the competent authorities (such as https://www.pei.de/EN)

Data Sources

https://clinicaltrials.gov/
https://www.clinicaltrialsregister.eu/

Note

All clinical studies registered on https://clinicaltrials.gov/ related to covid19.

Dependencies

None

@motey
Member

motey commented Apr 28, 2020

With https://clinicaltrials.gov/ct2/download_studies?down_chunk=1
https://clinicaltrials.gov/ct2/download_studies?down_chunk=2 and so on, one can download all the raw data in XML format.
The only problem is that I can't find any information about how many chunks there are :)

Source: https://clinicaltrials.gov/ct2/resources/download#DownloadMultipleRecords
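Since the number of chunks is unknown, one option is to keep requesting chunks until one comes back empty or fails. The following is a minimal sketch under that assumption; the stopping condition (an HTTP error or an empty body marks the end) has not been confirmed against the site, and `iter_chunks` and `http_fetch` are hypothetical helper names.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterator, Optional

# URL pattern taken from the comment above; only the chunk number varies.
CHUNK_URL = "https://clinicaltrials.gov/ct2/download_studies?down_chunk={n}"


def http_fetch(url: str) -> Optional[bytes]:
    """Download one chunk; return None when the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            return resp.read()
    except urllib.error.HTTPError:
        return None


def iter_chunks(
    fetch: Callable[[str], Optional[bytes]],
    max_chunks: int = 10_000,
) -> Iterator[bytes]:
    """Yield chunks 1, 2, 3, ... until fetch() returns nothing.

    Assumption: None or an empty body means there are no more chunks.
    max_chunks is a safety cap so the loop always terminates.
    """
    for n in range(1, max_chunks + 1):
        payload = fetch(CHUNK_URL.format(n=n))
        if not payload:
            return
        yield payload
```

Calling `iter_chunks(http_fetch)` would perform the real downloads; passing a different `fetch` function makes the loop easy to try out offline.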

@paltusplintus

paltusplintus commented Apr 28, 2020

I suggest starting by loading the basic trial description data (only for trials relevant to COVID) from the clinicaltrials.gov API endpoint (see the attached Cypher query).

load_covid_trials_clintrials_gov.txt

Unfortunately, max_rnk for this query is limited to 1000, so when there are more than 1000 trials in total, the query should be split into several more specific queries (e.g. one per trial phase); the 'expr=covid' part would need to be updated accordingly.
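As an alternative to splitting by trial phase, the result set could be requested in rank windows of at most 1000. This is only a sketch: it assumes the endpoint accepts a `min_rnk` parameter next to `max_rnk` (worth checking against the API description), and `rank_windows` and `build_url` are hypothetical helper names. The `base_url` argument stands for whichever endpoint the attached query uses.

```python
from typing import List, Tuple
from urllib.parse import urlencode


def rank_windows(total: int, size: int = 1000) -> List[Tuple[int, int]]:
    """Split ranks 1..total into consecutive (min_rnk, max_rnk) windows."""
    return [
        (low, min(low + size - 1, total))
        for low in range(1, total + 1, size)
    ]


def build_url(base_url: str, expr: str, min_rnk: int, max_rnk: int) -> str:
    """Build one request URL for a single rank window.

    Assumption: min_rnk is supported alongside max_rnk by the endpoint.
    """
    query = urlencode(
        {"expr": expr, "min_rnk": min_rnk, "max_rnk": max_rnk, "fmt": "json"}
    )
    return f"{base_url}?{query}"
```

For a hypothetical total of 2500 matching trials, `rank_windows(2500)` gives three windows, each of which becomes one request.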

After the basic info is loaded as nodes, some parts of it (e.g. PrimaryOutcomeMeasure) can already be parsed and linked to other data in the graph.
Additional data from clinicaltrials.gov can then be loaded per trial (NCTId) with the following query:

MATCH (ct:ClinicalTrial)
// NCTId is stored as a list, hence the [0]
CALL apoc.load.json('https://clinicaltrials.gov/api/query/full_studies?expr=' + ct.NCTId[0] + '&fmt=json') YIELD value
WITH value.FullStudiesResponse.FullStudies AS studies
UNWIND studies AS study
// add code to store in Neo
RETURN study

Description of the API: https://clinicaltrials.gov/api/gui/ref/api_urls

@KirstenLangendorf

I have started to write the missing code (// add code to store in Neo), unwinding the relevant info from the JSON being returned. Do you suggest adding this additional info as nodes or as properties of the ClinicalTrial nodes? From your text, I guess it should be new nodes linked to ClinicalTrial nodes?

@paltusplintus

> Do you suggest to add this additional info as nodes or as properties to the ClinicalTrial type nodes? I guess from your text it should be new nodes linked to ClinicalTrial nodes?

Yes, I suggest separate nodes linked to ClinicalTrial, especially for data that could be linked to other data in the graph; what comes to mind is endpoints and inclusion/exclusion criteria. If you feel that some of the data is not relevant for linking, we could leave it as properties for now and refactor the graph later if that data needs to be linked.
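To illustrate the "separate nodes linked to ClinicalTrial" option, here is a rough sketch of how outcome measures could be prepared in Python and handed to a parameterised Cypher statement. The `PrimaryOutcome` label, the `HAS_PRIMARY_OUTCOME` relationship type, and the `outcome_rows` helper are made-up names for illustration, not part of the existing graph schema.

```python
from typing import Dict, Iterable, List

# Hypothetical Cypher: one node per distinct measure, linked to its trial.
# NCTId is treated as a list property, matching the ct.NCTId[0] usage above.
LINK_OUTCOMES = """
UNWIND $rows AS row
MATCH (ct:ClinicalTrial) WHERE row.nct_id IN ct.NCTId
MERGE (o:PrimaryOutcome {measure: row.measure})
MERGE (ct)-[:HAS_PRIMARY_OUTCOME]->(o)
"""


def outcome_rows(nct_id: str, measures: Iterable[str]) -> List[Dict[str, str]]:
    """Normalise whitespace, drop empty and duplicate measures, build $rows."""
    seen = set()
    rows = []
    for measure in measures:
        cleaned = " ".join(measure.split())
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            rows.append({"nct_id": nct_id, "measure": cleaned})
    return rows
```

The returned list would be passed as the `$rows` parameter when the statement is executed; using MERGE keeps the load repeatable without creating duplicate outcome nodes.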

@motey
Member

motey commented Apr 29, 2020

> I have started to write the missing code (// add code to store Neo) unwinding the relevant info from JSON being returned.

Awesome!
Hint: To integrate the data into the main graph later, a Docker image would be great. See https://github.com/covidgraph/data_template and https://github.com/covidgraph/motherlode for more information. If you have any questions, ping me (@tim.bleimehl:meet.dzd-ev.de).

@KirstenLangendorf

OK, thanks. BTW, there seem to be 1095 studies containing COVID. I have downloaded the JSON and will use that instead of the URL, which has the limit of 1000.
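Working from a downloaded file could look roughly like the sketch below. It assumes the file has the same envelope as the API response used in the query earlier in this thread (`FullStudiesResponse.FullStudies`); the function names and the file name in the usage note are placeholders.

```python
import json
from pathlib import Path
from typing import Iterator, List, Union


def iter_studies(payload: dict) -> Iterator[dict]:
    """Yield each study entry from a full_studies-style payload.

    Assumption: studies live under FullStudiesResponse -> FullStudies.
    Missing keys simply produce no studies instead of raising.
    """
    studies = payload.get("FullStudiesResponse", {}).get("FullStudies", [])
    for entry in studies:
        yield entry


def load_studies(path: Union[str, Path]) -> List[dict]:
    """Read a downloaded JSON file and return its study entries as a list."""
    with Path(path).open(encoding="utf-8") as handle:
        return list(iter_studies(json.load(handle)))
```

For example, `load_studies("covid_studies.json")` (a hypothetical file name) would return one dictionary per study, ready to be passed on to the loading queries.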

@KirstenLangendorf

Hi, sorry, but I have been busy with daily work and needed to get my head around the JSON input data. I have made a first attempt. For COVID studies I could not find any results yet. PrimaryOutcomeMeasure values are created as nodes, but the data is a bit messy. I have written my script in a Jupyter notebook (attached), using my own local graph for testing (that can be changed). Comments/feedback are more than welcome. I am happy to do more scripting to extend or change what I have made so far. EligibilityCriteria could be added as a property; the inclusion/exclusion criteria tend to be non-standard too. I would also appreciate feedback on the scripting :-) (it is not part of my daily work).
@tim.bleimehl:meet.dzd-ev.de I think I need a bit of help if you need the stuff done differently.
clinicaltrials.ipynb.zip

@motey
Member

motey commented May 7, 2020

@KirstenLangendorf Great work! The JSON is a mess (why is every single attribute value wrapped in a list :D ?) but it looks like you tamed it 😎
I could set up a repository with a bit of boilerplate code (Python/Docker setup) where you can then paste your queries in. Would that help you? I would try to do it this afternoon or tomorrow morning.
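On the list-wrapped values: if the JSON is pre-processed in Python before loading, a small recursive helper can collapse single-element lists. This is a sketch of one possible approach, not what the attached notebook does, and `unwrap` is a hypothetical name.

```python
from typing import Any


def unwrap(value: Any) -> Any:
    """Recursively replace single-element lists by their only element.

    Dictionaries are processed key by key; lists with zero or several
    elements are kept as lists (with their items unwrapped).
    """
    if isinstance(value, dict):
        return {key: unwrap(item) for key, item in value.items()}
    if isinstance(value, list):
        items = [unwrap(item) for item in value]
        return items[0] if len(items) == 1 else items
    return value
```

One caveat: a field that is genuinely multi-valued but happens to hold a single value is collapsed too, so fields that should always stay lists would need to be excluded explicitly.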

@KirstenLangendorf

> @KirstenLangendorf Great work! The json is a mess (why is every single attribute value wrapped in list :D ? ) but looks like you tamed it 😎
> I could setup a repository with a bit of boilerplate code (python/docker setup), where you can then paste your queries in. If that would help you? I would try to make it today in the afternoon or tomorrow morning.

Let me try it out. No rush, tomorrow is a Danish bank holiday. I think I will add Eligibility as nodes too. I saw the presentation by Martin Preusse, and it seems that you are using machine-learning-type tools to combine messy data. Which tools are you using?

There is more data in ClinicalTrials.gov, and hopefully also some study results at some point. What is the best way for me to find out which data is important for the rest of the graph? Would that be by reading the use cases?

@motey
Member

motey commented May 7, 2020

> Saw the presentation by Martin Preusse and it seems that you are using Machine Learning type tools to combine messy data. Which tools are you using?

The ML/NLP team is still in an experimentation/PoC phase (as far as I can keep track of that at the moment). If you are interested, I can invite you to the chat group.

> There is more data in the ClinicalTrials.gov - and hopefully also some study results at some point. What is the best way for me to get information about important data needed for the rest of the graph? will that be reading the use cases?

AFAIK there is no standardized process to determine that at the moment. A discussion in the CovidGraph chat group would be the most useful way for now.

@KirstenLangendorf

>> Saw the presentation by Martin Preusse and it seems that you are using Machine Learning type tools to combine messy data. Which tools are you using?
>
> The ML/NLP Team is still in a experimentation/poc phase (as far as i can keep track of that atm). if you are interested in can invite you in the chat group.

Yes please, thank you :-)

>> There is more data in the ClinicalTrials.gov - and hopefully also some study results at some point. What is the best way for me to get information about important data needed for the rest of the graph? will that be reading the use cases?
>
> afaik atm there is no standardized process to determine that. A discussion in the CovidGraph chat group would be the most purposeful way atm.

Ok will look out there.

@motey
Member

motey commented May 8, 2020

>>> Saw the presentation by Martin Preusse and it seems that you are using Machine Learning type tools to combine messy data. Which tools are you using?
>>
>> The ML/NLP Team is still in a experimentation/poc phase (as far as i can keep track of that atm). if you are interested in can invite you in the chat group.
>
> Yes please, thank you :-)

Just saw you are already in the group :) (CovidGraph Data Analysis)

@KirstenLangendorf

> @KirstenLangendorf Great work! The json is a mess (why is every single attribute value wrapped in list :D ? ) but looks like you tamed it 😎
> I could setup a repository with a bit of boilerplate code (python/docker setup), where you can then paste your queries in. If that would help you? I would try to make it today in the afternoon or tomorrow morning.

Hi Tim,
I have time this weekend to work on covidgraph in case I should try out the python/docker setup.

@motey
Member

motey commented May 12, 2020

Hi Kirsten,

You can start with https://github.com/covidgraph/data_template by clicking "Use this template" in the GitHub web interface.
Basically, you have to copy your queries into https://github.com/covidgraph/data_template/blob/master/dataloader/main.py

If you need any further help with Git, Docker or Python, just ping me in the chat.

@mpreusse
Member

@KirstenLangendorf I can also help with the data loading template!

@KirstenLangendorf

KirstenLangendorf commented May 15, 2020

> @KirstenLangendorf I can also help with the data loading template!

Thanks :-) I will start looking at the loading tomorrow. I am at work today.

OK, I couldn't help it. I had to look :-)
The documentation is in https://github.com/covidgraph/data_template. I have created a repo from the template: https://github.com/KirstenLangendorf/load_clinical_trials_gov and will fill it in tomorrow.

Do I just paste my queries in after line 22 (and delete the rest) in
https://github.com/KirstenLangendorf/load_clinical_trials_gov/blob/master/dataloader/main.py

...OK, I will read through the instructions and get back to you once I have everything in my GitHub template.

@KirstenLangendorf

KirstenLangendorf commented May 17, 2020

@tim and @mpreusse I have now put the scripts in the dataloader folder: load_data for loading and data_profile for the stats queries.
I have written a bit in the README.

https://github.com/KirstenLangendorf/load_clinical_trials_gov

I need help with the rest, since I am not quite sure how to make it execute and publish in the right way.

@motey
Member

motey commented May 17, 2020

@KirstenLangendorf Cool! I will have a deeper look at it tomorrow, fork it, and try to bring it into an executable state.

@mpreusse
Member

@KirstenLangendorf That looks great! @motey tell me if I can help. It looks similar to e.g. the text fragger: there are no downloads, only Cypher queries. Pretty long ones though 😄

@KirstenLangendorf

> @KirstenLangendorf that looks great! @motey tell me if I can help. Looks similar to e.g. the text fragger, there are no downloads but only Cypher queries. Pretty long ones though 😄

I know the queries are long, but that was to avoid loading the ClinicalTrials.gov JSON several times.

@motey
Member

motey commented May 18, 2020

@KirstenLangendorf Hi Kirsten. i have done following things today:

  • Renamed data_profile and load_data to data_profile.cypher and load_data.cypher
  • Created a function in main.py to read in your queries from the file data_profile.cypher
  • Created a main function in main.py to run your queries
  • Created a pipeline to build a Docker image when there is a new release of the repository (aka git tag) and push the container to Docker Hub at covidgraph/data-clinical_trials_gov (see .github/workflows/build_container_prd.yml)
  • Updated the readme.md
  • Forked your whole repo to covidgraph/data_clinical-trials-gov and made you an admin (full rights). this was needed to allow me to add docker hub credentials and to have the repo in the same scheme as the others. if that is an issue for you, just let me know and we can find another solution

Could you test the repo against a Neo4j DB? I am too lazy to set up a local Neo4j instance with APOC :)

If the tests are successful i can integrate your script in the covidgraph dataloader pipeline 🚀
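For readers following along, the "read the queries from a file and run them" step could be sketched as below. This is not the actual code in main.py; it assumes the statements in the .cypher file are separated by semicolons (and that no string literal contains one), and `read_queries` and `run_all` are hypothetical names. The `run` argument stands for whatever executes a single statement against the database.

```python
from pathlib import Path
from typing import Callable, List, Union


def read_queries(path: Union[str, Path]) -> List[str]:
    """Split a .cypher file into individual statements.

    Assumption: statements are separated by ';'. Empty fragments
    (blank lines, trailing separator) are dropped.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [part.strip() for part in text.split(";") if part.strip()]


def run_all(path: Union[str, Path], run: Callable[[str], object]) -> int:
    """Execute every statement in file order; return how many were run."""
    queries = read_queries(path)
    for query in queries:
        run(query)
    return len(queries)
```

Passing the executor in as a function keeps the file handling independent of the Neo4j client library, so it can be tried without a running database.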

@KirstenLangendorf

> @KirstenLangendorf Hi Kirsten. i have done following things today:
>
>   • renamed data_profile and load_data to data_profile.cypher and load_data.cypher
>   • Created a function in main.py to read in your queries from the file data_profile.cypher
>   • created a main function in main.py to run your queries
>   • created a pipeline to build a docker image when there is a new release of the repository (aka git tag) and push the container to docker hub at covidgraph/data-clinical_trials_gov (see .github/workflows/build_container_prd.yml)
>   • Updated the readme.md
>   • Forked your whole repo to covidgraph/data_clinical-trials-gov and made you an admin (full rights). this was needed to allow me to add docker hub credentials and to have the repo in the same scheme as the others. if that is an issue for you, just let me know and we can find another solution
>
> If you could test the repo against a neo4j db? I am too lazy to setup a local neo4j instance with apoc :)
>
> If the tests are successful i can integrate your script in the covidgraph dataloader pipeline 🚀

I installed the Docker app.
In my terminal I ran
docker pull covidgraph/data-clinical_trials_gov
and then
docker build -t data-clinical_trials_gov .
which returns this error:
error checking context: 'can't stat '/Users/Kirsten/.Trash''.
I tried to google it but couldn't find a fix. @motey Do you know what to do?

@motey
Member

motey commented May 19, 2020

stupid question, but did you try it with sudo :=) ?

@KirstenLangendorf

KirstenLangendorf commented May 19, 2020

> stupid question, but did you try it with sudo :=) ?

Nope, I can try. It reported the same error :-(

I couldn't see your message on Riot; it was encrypted.

@Jiros transferred this issue from covidgraph/documentation on Dec 7, 2020
@Jiros added the labels "Type: Data Source" and "Status: In Dev" on Dec 7, 2020

5 participants