Build process
Setting up a complete installation of ConceptNet requires some Python code, some associated technology such as PostgreSQL, and various dependencies. This guide attempts to walk you through how to set it up.
This is no longer our recommended way to run ConceptNet. We would rather automate the dependencies, instead of having to describe all the steps here. The conceptnet-deployment repository describes how to set up ConceptNet using either Packer or Puppet, which will take care of almost all of these steps for you.
If you are running this on an existing computer, you will need:
- A Unix system with command-line tools like `sort` and `grep`
- Python 3.7 or later, with development headers (`python3-dev`)
- A Python environment where you can install packages without `sudo` (for example, using virtualenv)
- PostgreSQL 10 or later, and the ability to create databases
  - Set up PostgreSQL's permissions so that you can run "createdb conceptnet5" as your current user, without sudo.
- Git
- 300 GB of free disk space
- At least 30 GB of available RAM
- The time and bandwidth to download 24 GB of raw data
- The `numpy` and `scipy` libraries
- The `libhdf5-dev` library for reading and writing HDF5 tables
- The `libmecab-dev` library for tokenizing Japanese, and its dictionary, `mecab-ipadic-utf8`
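Some of the requirements above can be checked before starting. The following is a minimal sketch, not part of ConceptNet: the function name `preflight` and the choice of which tools to look for are assumptions made for illustration. It checks only the Python version, the presence of a few commands on the `PATH`, and free disk space; it does not check RAM or PostgreSQL permissions.

```python
import shutil
import sys

# Thresholds taken from the requirements list above.
MIN_PYTHON = (3, 7)
MIN_FREE_GB = 300
# Hypothetical selection of commands to look for; adjust to your system.
REQUIRED_TOOLS = ("sort", "grep", "git", "createdb")


def preflight(path=".", tools=REQUIRED_TOOLS):
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python %d.%d or later is required" % MIN_PYTHON)
    for tool in tools:
        # shutil.which returns None when the command is not on the PATH.
        if shutil.which(tool) is None:
            problems.append("command not found on PATH: " + tool)
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < MIN_FREE_GB:
        problems.append(
            "only %.0f GB free at %s, %d GB needed" % (free_gb, path, MIN_FREE_GB)
        )
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

Pass the directory that will hold the data as `path` if it is on a different drive from the current directory.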
Check out the source code of ConceptNet from Git:
git clone git@github.com:commonsense/conceptnet5
cd conceptnet5
Make sure that the development libraries that ConceptNet needs are available. For example, on Ubuntu:
sudo apt install build-essential python3-pip python3-dev libhdf5-dev libmecab-dev mecab-ipadic-utf8
`mecab-ipadic-utf8` is the Japanese dictionary needed by MeCab to tokenize Japanese text. If you're on a non-Ubuntu system, the package may be called something else. Be sure to get the UTF-8 version: ConceptNet uses UTF-8 consistently, and the default EUC-JP version of IPADic will not work.
If you are installing a version of ConceptNet 5 prior to 5.5.5, such as to reproduce a published result, you should run `pip install xmltodict==0.10.2` to satisfy its dependency on a library that has made breaking changes since then.
Install PostgreSQL 10 or later. This command, for example, will install PostgreSQL 10 on Ubuntu:
sudo apt install postgresql-10
You'll need to configure PostgreSQL's permissions so that you can create and write to a database as your current user. The details of this are outside the scope of this tutorial. See How to install and use PostgreSQL on Ubuntu, though this article is dated.
Your PostgreSQL user account has to be able to access the database by connecting to a local address, not just using the "Unix domain socket" that the `psql` command uses. You'll either need to set a password on your PostgreSQL account and store that in the `CONCEPTNET_DB_PASSWORD` environment variable, or follow a guide such as this one to not require a password when connecting locally.
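To make the difference between a socket connection and a local-address connection concrete, here is a small sketch that builds a PostgreSQL connection URI pointing at `localhost`. The function name `local_db_uri`, the default user, and port 5432 are assumptions for illustration; how ConceptNet itself reads `CONCEPTNET_DB_PASSWORD` is not shown here.

```python
import os
from urllib.parse import quote


def local_db_uri(dbname="conceptnet5", user=None, host="localhost", port=5432):
    """Build a libpq-style URI that forces a TCP connection to a local address."""
    # Assumption: fall back to the current shell user, then to "postgres".
    user = user or os.environ.get("USER", "postgres")
    password = os.environ.get("CONCEPTNET_DB_PASSWORD", "")
    # Percent-encode credentials so characters like "@" or ":" stay unambiguous.
    auth = quote(user, safe="")
    if password:
        auth += ":" + quote(password, safe="")
    return "postgresql://%s@%s:%d/%s" % (auth, host, port, dbname)


if __name__ == "__main__":
    # Note: this prints the password if one is set.
    print(local_db_uri())
```

`psql` accepts such a URI in place of a database name, so passing the printed value to `psql` is one way to confirm that the local-address connection works. Running `psql -h localhost conceptnet5` tests the same thing interactively.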
Create a PostgreSQL database named `conceptnet5` that you have the ability to write to:
createdb conceptnet5
Create a `data` directory within `conceptnet5` that will contain ConceptNet's data. If necessary, make it a symbolic link to a hard drive with more space on it.
mkdir data
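If you want `data` to be a symbolic link instead of a plain directory, the step can be sketched as follows. The function name `ensure_data_dir` and both paths in the usage note are hypothetical; this is one way to do it, not something the ConceptNet build provides.

```python
from pathlib import Path


def ensure_data_dir(repo_dir, storage_dir=None):
    """Create repo_dir/data, optionally as a symlink to storage_dir."""
    data = Path(repo_dir) / "data"
    if data.exists() or data.is_symlink():
        return data  # leave an existing directory or link alone
    if storage_dir is None:
        # Equivalent to `mkdir data` inside the repository.
        data.mkdir()
    else:
        # Create the real directory on the larger drive, then link to it.
        target = Path(storage_dir).resolve()
        target.mkdir(parents=True, exist_ok=True)
        data.symlink_to(target, target_is_directory=True)
    return data
```

For example, `ensure_data_dir(".", "/mnt/bigdisk/conceptnet-data")`, run from the `conceptnet5` directory, would make `data` point at a directory on another drive (the mount point here is a placeholder).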
Install ConceptNet as a Python package in your environment, including the optional "vectors" dependencies:
pip install -e '.[vectors]'
Now that you've either done the manual installation described in the section above, or used Puppet to automate it, you can run the build process which creates the ConceptNet graph from raw data. This process uses a build tool for reproducible data science called Snakemake.
Start the build by running:
./build.sh
You can test that the ConceptNet code and build process work as expected by running the test suite using pytest. The actual database doesn't necessarily have to be built, because the tests run a small example build as part of their setup.
First install the test dependencies:
pip install pytest PyLD
Then you can run the test suite:
pytest
If you have built the full ConceptNet database, you can also run the tests that are skipped by default, which check that the database is working correctly:
pytest --fulldb
Here are some useful outputs of the build process:
- The `conceptnet5` PostgreSQL database, containing an index of all the edges
- `assertions/assertions.csv`: A CSV file of all the assertions in ConceptNet
- `assertions/assertions.msgpack`: The same data in the more efficient (and less readable) msgpack format
- `edges/`: The edges from individual sources that these assertions were built from
- `stats/`: Some text files that count the distribution of different languages, relations, and datasets in the built data
- `assoc/reduced.csv`: A tabular text file of just the concept-to-concept associations (plus additional 'negated concept' nodes that represent negative relations), filtered for concepts that are referred to frequently enough
- `vectors/mini.h5`: A vector space of high-quality word embeddings built from an ensemble of ConceptNet, word2vec, and GloVe, stored as a Pandas data frame in HDF5 format
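As an illustration of consuming one of these outputs, the sketch below reads assertions line by line. It rests on an assumption not stated in this guide: that `assertions.csv` is tab-separated with five columns (edge URI, relation, start node, end node, and a JSON object of metadata). Check a few lines of your own file before relying on it. The function name `read_assertions` is hypothetical.

```python
import csv
import json


def read_assertions(lines):
    """Yield one dict per assertion from an iterable of text lines."""
    # Assumed layout: tab-separated; edge URI, relation, start, end, JSON metadata.
    # QUOTE_NONE keeps the double quotes inside the JSON column untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for uri, rel, start, end, info in reader:
        yield {
            "uri": uri,
            "rel": rel,
            "start": start,
            "end": end,
            "info": json.loads(info),
        }
```

To use it on a real build, open the file with `open(path, encoding="utf-8", newline="")` and pass the file object to `read_assertions`. Because it is a generator, the full file never has to fit in memory.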
Some other files you can build by request (type `snakemake` followed by the file name):
- `data/vectors/numberbatch.h5`: the full ConceptNet Numberbatch matrix, with a larger vocabulary and more precision than `vectors/mini.h5`
- `data/stats/evaluation.h5`: evaluation results comparing `numberbatch.h5` to other pre-computed word embeddings
If you ran the Puppet installation, then the Web server that serves the API will be running for you, and all you need to do is restart the process:
sudo systemctl restart conceptnet
Otherwise, you've got more installation steps. Install the sub-package for the Web server:
cd web
pip install -e .
You can serve the API by running it as a Python script. You have to be in the `web` subdirectory of the repository (the one we just `cd`ed to above), or else it won't be able to find its files:
python conceptnet_web/api.py
This will run the API inside Flask's simple Web server. The Puppet version of the setup actually sets up a more efficient web server, using Nginx and uWSGI. You could configure these yourself, but at this point you're probably better off using conceptnet-deployment.
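Once the server is running, you can check it with a small client. The sketch below uses only the standard library. It assumes that concepts are served at paths of the form `/c/<language>/<term>` and that the response is JSON; the function names and the base URL in the usage note are placeholders, so use whatever address Flask prints when it starts.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def concept_url(base_url, language, term):
    """Build the URL for one concept, e.g. <base_url>/c/en/example."""
    # Assumption: multi-word terms are written with underscores in the path.
    path_term = quote(term.replace(" ", "_"))
    return "%s/c/%s/%s" % (base_url.rstrip("/"), quote(language), path_term)


def lookup(base_url, language, term):
    """Fetch one concept from a running API server and decode the JSON response."""
    with urlopen(concept_url(base_url, language, term)) as response:
        return json.load(response)
```

For example, `lookup("http://localhost:8084", "en", "example")` would query a server listening on that (assumed) local address and return the decoded response.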