A Python tool that streamlines the creation of scientific network visualizations (aka science maps) using data from Dimensions on Google BigQuery. Currently the only supported output visualization is VOSviewer. More visualizations might be added in the future.
It is possible to generate network analyses on the whole COVID-19 database or on a selected subset of data. This is achieved by letting users supply any SQL query that defines a COVID-19 document subset of interest (e.g. a group of journals, or a group of countries).
The Python application deals with the extraction of data from BigQuery and calculation of the network representation. Currently we have included two possible network calculations:
- Concept co-occurrence network. This query generates pairs of concepts and counts how many publications each pair shares (note: concepts in Dimensions are publication-level keywords, normalised and weighted by a relevance score).
- Organisation network. This query generates pairs of organisations (from the authors' affiliations) and counts how many publications each pair shares.
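Both network types come down to the same counting step: for every pair of items, count the publications that contain both. The tool does this in SQL on BigQuery; the sketch below only illustrates the idea in plain Python, and the function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(items_per_publication):
    """Count, for every unordered pair of items, how many publications contain both."""
    pair_counts = Counter()
    for items in items_per_publication:
        # De-duplicate and sort so that (a, b) and (b, a) map to the same key.
        unique_items = sorted(set(items))
        pair_counts.update(combinations(unique_items, 2))
    return pair_counts


# Hypothetical toy corpus: each inner list holds the concepts of one publication.
publications = [
    ["vaccine", "immunity", "antibody"],
    ["vaccine", "immunity"],
    ["antibody", "vaccine"],
]

pairs = cooccurrence_counts(publications)
print(pairs[("immunity", "vaccine")])  # 2
```

The same function works for organisations if each inner list holds the affiliations of one publication.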
Finally, the data extracted from BigQuery gets converted into a VOSviewer JSON file and packaged up into an HTML application that can be viewed in a browser. The Python library also includes a local server component that can be used to view the files locally on a computer.
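As a rough illustration of the conversion step, the sketch below turns weighted pairs into a minimal VOSviewer-style JSON structure. The field names (network, items, links, source_id, target_id, strength) follow the VOSviewer JSON file format as commonly documented; the function name is hypothetical and the file this tool actually writes may carry more attributes, so check the VOSviewer docs before relying on it.

```python
import json


def to_vosviewer_json(edge_weights):
    """Build a minimal VOSviewer-style network dict from {(a, b): weight} pairs."""
    labels = sorted({label for pair in edge_weights for label in pair})
    # Assign a numeric id to every node label.
    ids = {label: index for index, label in enumerate(labels, start=1)}
    items = [{"id": ids[label], "label": label} for label in labels]
    links = [
        {"source_id": ids[a], "target_id": ids[b], "strength": weight}
        for (a, b), weight in sorted(edge_weights.items())
    ]
    return {"network": {"items": items, "links": links}}


# Hypothetical input: pair counts such as those produced by a co-occurrence query.
edges = {("antibody", "vaccine"): 2, ("immunity", "vaccine"): 2}
network = to_vosviewer_json(edges)
print(json.dumps(network, indent=2))
```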
By default the tool uses the Dimensions COVID-19 dataset. The dataset is openly available on the Google Cloud Marketplace and contains all published articles and preprints, grants, clinical trials, and research datasets from Dimensions.ai that are related to COVID-19.
At time of writing (May 2022), the dataset contains:
- 1M+ Publications and preprints
- 16k+ Grants
- 41k+ Patents
- 14k+ Clinical Trials
- 32k+ Research Datasets
- 36k+ Research Organizations
Data model: see the official documentation.
Users with an active subscription to the full Dimensions on Google BigQuery dataset can perform network analyses using all data in Dimensions, not just the COVID19 subset.
In order to do so, pass the --fulldimensions (or -f) option when invoking the script. E.g.
$ dimensions-networks {SQL_QUERY_FILE} --fulldimensions
In order to access the Dimensions datasets, you need to be able to connect to Google BigQuery using Python. This means:
- Installing the SDK. Install and authorize the Google Cloud SDK ("gcloud"), available directly from Google. If you can open a terminal and the gcloud command is recognized, it has been sufficiently configured.
- Setting up a GCP project. Each time you interact with BigQuery, you need to specify which GCP project you are using. This is generally used for resource access management. More info here.
Note: newly created projects with no associated billing account run in a sandbox, which gives initial access to Google's free tier of BigQuery. The free tier is more than enough for using this library.
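To check that the connection works from Python, a query can be run with the google-cloud-bigquery client library. This is a minimal sketch, not the tool's own code: the function name is hypothetical, "my-gcp-project" is a placeholder for your project ID, and the optional client argument exists only so the function can be tried with a stand-in object.

```python
def fetch_publication_ids(sql, project_id, client=None):
    """Run a query and return the values of its `id` column as a list."""
    if client is None:
        # Imported lazily so the function can be exercised without the package installed.
        from google.cloud import bigquery

        client = bigquery.Client(project=project_id)
    rows = client.query(sql).result()  # Blocks until the query job finishes.
    return [row["id"] for row in rows]


# Example call (requires gcloud authentication and a real project ID):
# ids = fetch_publication_ids(
#     "select id from `covid-19-dimensions-ai.data.publications` limit 5",
#     "my-gcp-project",
# )
```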
With Python 3.9 and virtualenvwrapper
$ git clone git@github.com:digital-science/dimensions-network-gen.git
$ mkvirtualenv dimensions-networks
$ pip install -r requirements.txt
$ pip install -e .
After installation, you can run the application by calling dimensions-networks.
$ dimensions-networks
Usage: dimensions-networks [OPTIONS] [FILENAME]...

  dimensions-networks: a tool for creating network visualizations powered by
  data from Dimensions on Google BigQuery. Example:

  dimensions-networks {QUERY_FILE}

  QUERY_FILE. File name containing the GBQ query to be converted into a
  network. If a folder is passed, all files in the folder will be processed.

Options:
  -i, --buildindex      Just build the index page listing out previously
                        created networks.
  -f, --fulldimensions  Query using the full Dimensions dataset, instead of
                        the COVID19 subset (note: requires subscription).
  -r, --runserver       Run the webserver.
  -p, --port INTEGER    Specify the port on which the webserver should listen
                        for connections (default: 8009).
  --verbose             Verbose mode
  --help                Show this message and exit.
Visualizations are generated based on a subset of publications that you can define using SQL.
- Each visualization is triggered by a corresponding SQL file containing a query and, optionally, some configuration directives.
- SQL queries are stored in the topics directory.
- Each file should contain a single SQL query that returns a list of Dimensions publication IDs in a field called id.
- File names should be of the format $title.sql.
  - For example, a file called archaeology.sql will create a network listed under the title "archaeology".
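The naming convention can be sketched in a few lines of Python: the title is the file name without its extension, and a folder expands to the query files it contains. The function name is hypothetical, and restricting a folder to *.sql files is an assumption of this sketch rather than documented behaviour of the tool.

```python
from pathlib import Path


def collect_query_files(path):
    """Return {title: Path} for one query file, or for every .sql file in a folder."""
    path = Path(path)
    # Assumption: only *.sql files inside a folder are treated as queries.
    files = sorted(path.glob("*.sql")) if path.is_dir() else [path]
    # The title is the file name without its .sql extension.
    return {sql_file.stem: sql_file for sql_file in files}


print(list(collect_query_files("topics/archaeology.sql")))  # ['archaeology']
```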
E.g. these are the contents of last_30_days.sql:
select id
from `covid-19-dimensions-ai.data.publications`
where
EXTRACT(DATE FROM date_inserted) >= DATE_ADD(CURRENT_DATE(), INTERVAL -30 DAY)
And this is how you generate a network from that query:
$ dimensions-networks topics/last_30_days.sql
Network configuration options can optionally be defined in the SQL file before your query, as a series of comment lines, each starting with a predefined keyword. For example:
-- network_types: concepts, organizations
-- max_nodes: 400
-- min_edge_weight: 3
-- min_concept_relevance: 0.5
-- min_concept_frequency: 4
select id
from `covid-19-dimensions-ai.data.publications`
where
EXTRACT(DATE FROM date_inserted) >= DATE_ADD(CURRENT_DATE(), INTERVAL -30 DAY)
and altmetrics.score > 1
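One way such comment directives could be read is sketched below. This is not the tool's actual parser: the regular expression, the function name, and the choice to ignore unknown keys are assumptions made for illustration, with fallback values copied from the options table in this README.

```python
import re

# Defaults as listed in this README's options table.
DEFAULTS = {
    "network_types": ["concepts", "organizations"],
    "max_nodes": 500,
    "min_edge_weight": 3,
    "min_concept_relevance": 0.5,
    "min_concept_frequency": 5,
}

# Matches comment lines of the form "-- key: value".
DIRECTIVE = re.compile(r"^--\s*(\w+)\s*:\s*(.+?)\s*$")


def parse_directives(sql_text):
    """Read `-- key: value` comment lines into a config dict, falling back to DEFAULTS."""
    config = dict(DEFAULTS)
    for line in sql_text.splitlines():
        match = DIRECTIVE.match(line.strip())
        if not match or match.group(1) not in DEFAULTS:
            continue  # Not a recognised directive; ignore the line.
        key, raw = match.groups()
        if key == "network_types":
            config[key] = [part.strip() for part in raw.split(",") if part.strip()]
        else:
            # Reuse the default's type (int or float) to convert the value.
            config[key] = type(DEFAULTS[key])(raw)
    return config


example = """-- network_types: concepts
-- max_nodes: 400
-- min_concept_relevance: 0.7
select id from `covid-19-dimensions-ai.data.publications`
"""
config = parse_directives(example)
print(config["max_nodes"], config["min_edge_weight"])  # 400 3
```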
If omitted, the default configuration values are used. These are all the available options and their meanings:
Option | Default | Notes |
---|---|---|
network_types | concepts, organizations | Currently two network types are supported. |
max_nodes | 500 | How many nodes should be displayed, at maximum? |
min_edge_weight | 3 | How many publications should two nodes share before they are linked in the network? |
min_concept_relevance | 0.5 | Each concept tagged to a publication is assigned a relevance score between 0 and 1. What is the threshold that must be cleared before we consider a concept as relevant? |
min_concept_frequency | 5 | How many times should a concept appear in the corpus overall before it's included in the network? |
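To make the effect of min_edge_weight and max_nodes concrete, here is a small sketch that prunes a weighted edge list. It rests on two assumptions that this README does not confirm: an edge is kept when its weight is greater than or equal to min_edge_weight, and nodes are ranked by the total weight of their remaining edges when applying max_nodes. The function name is hypothetical.

```python
from collections import Counter


def prune_network(edge_weights, min_edge_weight=3, max_nodes=500):
    """Drop weak edges, then keep only the max_nodes best-connected nodes."""
    strong = {pair: w for pair, w in edge_weights.items() if w >= min_edge_weight}
    # Rank nodes by the total weight of their remaining edges (an assumption).
    node_strength = Counter()
    for (a, b), w in strong.items():
        node_strength[a] += w
        node_strength[b] += w
    keep = {node for node, _ in node_strength.most_common(max_nodes)}
    # Keep an edge only if both of its endpoints survived the node cap.
    return {pair: w for pair, w in strong.items() if pair[0] in keep and pair[1] in keep}


# Hypothetical toy network: weights are counts of shared publications.
edges = {("a", "b"): 5, ("a", "c"): 3, ("b", "c"): 1, ("c", "d"): 4}
pruned = prune_network(edges, min_edge_weight=3, max_nodes=3)
print(pruned)  # {('a', 'b'): 5, ('a', 'c'): 3}
```

In the toy data, the edge between b and c falls below the weight threshold, and node d is dropped because it has the lowest total weight once only three nodes are allowed.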
Generated visualizations are added to the build folder, which is created automatically after running an extraction.
The folder contains a static website consisting of HTML, JS and JSON assets. The website uses relative links, so it can be published on a web server as is. For example, see the /docs folder in this repository, which is viewable at https://digital-science.github.io/dimensions-network-gen/.
To browse the output folder locally, run the server utility: dimensions-networks -r. That will start a server on http://127.0.0.1:8009/.
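Because the output is a plain static site, any static file server can display it. The sketch below uses only the Python standard library and is not the tool's bundled server; the function name is hypothetical, and the folder name and port mirror the defaults mentioned in this README.

```python
import functools
import http.server


def make_server(directory="build", port=8009):
    """Create an HTTP server that serves static files from `directory` on localhost."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


# To serve the build folder until interrupted with Ctrl+C:
# make_server().serve_forever()
```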
An organization collaboration network:
- VOSviewer docs https://app.vosviewer.com/docs/
- Dimensions BigQuery docs https://docs.dimensions.ai/bigquery/