Changelog
Data changes:
- Added new Japanese crowdsourced knowledge, collected by Naoki Otani, with Kyoto University & Yahoo Research Japan
- Updated external data sources such as Wiktionary and CLDR
Code changes:
- We retain fine-grained information about word senses, when it's available, that we were previously truncating to just a part of speech. This shows up as a ConceptNet URI such as `/c/en/cat/n/wn/animal` for the WordNet sense of "cat" that is an animal.
- The Web interface and the `"sense_label"` key in the API show a human-readable name for different word senses when it is available, such as "cat (n, animal)".
- Revised how DB queries work to use PostgreSQL's `jsonb_path_ops` index. This simplifies the query logic and should allow us to avoid some unacceptably slow queries that the old system created.
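As a rough illustration of how a sense-level URI relates to its human-readable label, here is a minimal sketch. The function name `sense_label_sketch` is hypothetical, and the layout it assumes (`/c/<language>/<term>/<part of speech>/<source>/<sense>`) is inferred only from the single example above; the real API may build its labels differently.

```python
def sense_label_sketch(uri):
    """Build a label such as 'cat (n, animal)' from a concept URI.

    Assumption: the URI components are c / language / term / part of
    speech / sense source / sense name, as in /c/en/cat/n/wn/animal.
    """
    parts = uri.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "c":
        raise ValueError(f"not a concept URI: {uri!r}")

    term = parts[2].replace("_", " ")
    details = []
    if len(parts) > 3:
        details.append(parts[3])  # part of speech, e.g. 'n'
    if len(parts) > 5:
        # parts[4] is the sense source (e.g. 'wn'); parts[5] names the sense
        details.append(parts[5].replace("_", " "))

    if details:
        return f"{term} ({', '.join(details)})"
    return term


print(sense_label_sketch("/c/en/cat/n/wn/animal"))
```

A URI with no sense information, such as `/c/en/cat`, simply comes back as the bare term in this sketch.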
This release replaces an attempted release of ConceptNet 5.7 in February 2019, which had performance problems with its queries.
- Added an API for directly looking up the relatedness between two terms. For example: http://api.conceptnet.io/relatedness?node1=/c/en/apple&node2=/c/en/orange
- Replaced the `pg8000` dependency with `psycopg2-binary`, fixing a denial-of-service vulnerability where a malformed request could make the Web server's database connection stop working.
There are some pending updates to the build process, but we haven't yet updated the data to use them, in the interest of getting the above fix deployed quickly and maintaining the availability of ConceptNet.
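The relatedness lookup mentioned above is an ordinary GET request with two query parameters. The sketch below only builds the URL and does not send a request; the helper name `relatedness_url` is hypothetical, and nothing is assumed here about the shape of the response.

```python
from urllib.parse import urlencode

API_ROOT = "http://api.conceptnet.io"


def relatedness_url(node1, node2):
    """Return the /relatedness URL for two ConceptNet term URIs."""
    # safe="/" keeps the slashes in the term URIs readable, matching the
    # example URL given in the changelog entry.
    query = urlencode({"node1": node1, "node2": node2}, safe="/")
    return f"{API_ROOT}/relatedness?{query}"


print(relatedness_url("/c/en/apple", "/c/en/orange"))
```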
- Fixed deprecated uses of `DataFrame.loc` with missing keys in `conceptnet5.vectors`. This fixes warnings that certain steps of the build process will soon raise KeyErrors.
- Fixed a server error when querying the API for terms related to the empty string.
- Sped up the `cn5-vectors evaluate` command.
- Added the SimLex evaluation to `cn5-vectors evaluate`.
- The main ConceptNet package is now responsible for constructing API responses (but not serving them): the module `conceptnet_web.responses` has been renamed to `conceptnet5.api`. This fixes a circular dependency that appeared when testing `conceptnet5` and `conceptnet_web`.
There is no version 5.6.1; that version number was eaten by a PyPI bug.
Data changes:
- Added data from CEDict, an open-data Chinese translation dictionary.
- Added data from the Unicode CLDR describing emoji in a number of languages. This helps to create a multilingual alignment for a number of words, especially words for emotions. The emoji themselves appear with the language code `mul`, as in /c/mul/😂.
- Uses the `preprocess_text` function from wordfreq 2.0 to perform stronger standardization (normalization) of text. Arabic, Hebrew, Persian, and Urdu terms, for example, will appear consistently without vowel points as they do in natural text, even when dictionary-like sources use vowel points. Serbian/Croatian/Bosnian terms are consistently in the Latin alphabet.
- The term vectors (word embeddings) for looking up related terms have been updated to account for the new data, which will become a new release of ConceptNet Numberbatch.
- Fixed bugs in wikiparsec, our Wiktionary parser, and updated the data. This removes some spurious assertions that terms were "derived from" a vaguely-related term in another language.
- Updated the source data from English Wiktionary.
API changes:
- All keys in the API responses should resolve as JSON-LD properties.
- Objects returned by the API now contain type information.
Build process changes:
- The raw data is hosted on Zenodo, providing a long-term URL for the data needed to start the build process.
- The build uses `wget` to download data instead of `curl`, so that it's more resilient to network interruptions.
- We're giving up on Docker, whose design isn't very good for working with moderate amounts of static data. We're instead improving the instructions for running the ConceptNet code on an arbitrary Linux computer, and providing an Amazon Web Services AMI where these instructions have been run.
- Added the `/x/` namespace for sub-words, which we intend to use for better lookup of term vectors. The code exists for a "morph-fitting" process that uses these when building the vector space, but it's disabled for now.
Data changes:
- Uses ConceptNet Numberbatch 17.06, which incorporates de-biasing to avoid harmful stereotypes being encoded in its word representations.
- Fixed a glitch in retrofitting, where terms in ConceptNet that were two steps removed from any term that existed in one of the existing word-embedding data sources were all being assigned the same meaningless vector. They now get vectors that are propagated (after multiple steps) from terms that do have existing word embeddings, as intended.
- Filtered some harmful assertions that came from disruptive or confused Open Mind Common Sense contributors. (Some of them had been filtered before, but changes to the term representation had defeated the filters.)
- Added a new source of input word embeddings, created at Luminoso by running a multilingual variant of fastText over OpenSubtitles 2016.
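To make the retrofitting fix above more concrete, the sketch below shows only the propagation idea: terms with no embedding of their own receive a vector from neighbors that already have one, and repeating the step lets vectors reach terms that are two or more steps away. This is a deliberately simplified stand-in, not ConceptNet's retrofitting code; the function name `propagate` and the unweighted averaging are assumptions made for illustration.

```python
def propagate(vectors, neighbors, steps):
    """Spread known vectors to unknown terms, one graph step at a time.

    vectors:   dict mapping term -> list of floats (terms with embeddings)
    neighbors: dict mapping term -> list of adjacent terms
    steps:     maximum number of propagation rounds
    """
    known = dict(vectors)
    for _ in range(steps):
        updates = {}
        for term, adjacent in neighbors.items():
            if term in known:
                continue
            sources = [known[n] for n in adjacent if n in known]
            if sources:
                dim = len(sources[0])
                # Unweighted mean of the neighbors' vectors (an assumption).
                updates[term] = [
                    sum(vec[i] for vec in sources) / len(sources)
                    for i in range(dim)
                ]
        if not updates:
            break
        known.update(updates)
    return known


# 'c' is two steps away from 'a', the only term with an embedding.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
seed = {"a": [1.0, 0.0]}
print(sorted(propagate(seed, graph, steps=1)))
print(sorted(propagate(seed, graph, steps=2)))
```

With a single step only the direct neighbor gets a vector; the second step is what reaches the term two hops away, which is the case the fix addresses.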
Build process changes:
- We measured the amount of RAM the build process requires at its peak to be 30 GB, and tested that it completes on a machine with 32 GB of RAM. We updated the Snakefile to reflect these requirements and to use them to better plan which tasks to run in parallel.
- The build process starts by checking for some requirements (having enough RAM, enough disk space, and a usable PostgreSQL database), and exits early if they aren't met, instead of crashing many hours later.
- The tests have been organized into tests that can be run before building ConceptNet, tests that can be run after a small example build, and tests that require the full ConceptNet. The first two kinds of tests are run automatically, in the right sequence, by the `test.sh` script.
- `test.sh` and `build.sh` have been moved into the top-level directory, where they are more visible.
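The early requirements check described above can be pictured as a small function that collects unmet requirements before any long-running step starts. This is a sketch under stated assumptions, not the project's Snakefile logic: the 30 GB RAM figure comes from the entry above, while the disk threshold and the names `unmet_requirements` and `free_disk_bytes` are hypothetical placeholders.

```python
import shutil

GIB = 1024 ** 3


def unmet_requirements(total_ram_bytes, free_disk_bytes,
                       min_ram_gib=30, min_disk_gib=100):
    """Return a list of human-readable problems; empty means 'go ahead'.

    min_disk_gib=100 is a placeholder value, not a documented requirement.
    """
    problems = []
    if total_ram_bytes < min_ram_gib * GIB:
        problems.append(f"need at least {min_ram_gib} GiB of RAM")
    if free_disk_bytes < min_disk_gib * GIB:
        problems.append(f"need at least {min_disk_gib} GiB of free disk space")
    return problems


def free_disk_bytes(path="."):
    """Free space on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free


# Example: a 16 GiB machine with plenty of disk fails the RAM check only.
for problem in unmet_requirements(16 * GIB, 500 * GIB):
    print("Cannot start build:", problem)
```

Reporting every problem at once, rather than stopping at the first, saves a second failed attempt when several requirements are missing.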
Library changes:
- Uses the `marisa-trie` library to speed up inferring vectors for out-of-vocabulary words.
- Uses the `annoy` library to suggest nearest neighbors that map a larger vocabulary into a smaller one.
- Depends on a specific version of `xmltodict`, because a breaking change to `xmltodict` managed to break the build process of many previous versions of ConceptNet.
- The `cn5-vectors evaluate` command can evaluate whether a word vector space contains gender biases or ethnic biases.
API changes:
- Queries to the `/related` endpoint use our SemEval-winning strategy for handling out-of-vocabulary words.
- The default server that runs in the Docker container is now the API server. You no longer have to configure the name "api.localhost" to point to your own machine to get to the API. (To get the Web-page interface instead, you do need to get to it as "www.localhost".)
- The API is supposed to return JSON wrapped in HTML when a client sends the `Accept: text/html` header, but it was also doing it when the `Accept:` header was missing entirely. Fixed this so that the default really is JSON.
- In case the `Accept:` header isn't enough, you may now add the `?format=json` parameter to force a pure-JSON response.
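The response-format rules in the last two items can be summarized in a few lines. The function below is a hypothetical sketch of that decision, not the server's actual code; in particular it checks for `text/html` with a plain substring test, where a real implementation would parse the header's media types and quality values.

```python
def choose_response_format(accept_header=None, format_param=None):
    """Decide between pure JSON and JSON wrapped in HTML.

    accept_header: value of the Accept: header, or None if it is missing
    format_param:  value of the ?format= query parameter, or None
    """
    # An explicit ?format=json always wins.
    if format_param == "json":
        return "json"
    # HTML only when the client actually asks for it.
    if accept_header and "text/html" in accept_header.lower():
        return "html"
    # Missing or unrelated Accept header: the default is JSON.
    return "json"


print(choose_response_format())
print(choose_response_format("text/html,application/xhtml+xml"))
print(choose_response_format("text/html", format_param="json"))
```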
Build process changes:
- Fixed the build process so fewer memory-hungry processes should run simultaneously. It should once again be possible to build ConceptNet on a machine with 16 GB of RAM.
- Failed downloads should no longer leave junk files around that you have to manually clear out.
Library changes:
- The `cn5-vectors evaluate` command can now evaluate ConceptNet Numberbatch on Chinese words (PKU-500), Japanese words (TMU-RW), and on SemEval 2017's Multilingual and Cross-Lingual Word Similarity task.
Updated the Docker build scripts.
An upstream repository, "postgres", changed the way it runs its initialization scripts, and they now run as the container's "postgres" user, not as root. We had to change the script for importing ConceptNet's Postgres data to accommodate this.
Data changes:
- Some Chinese data from the "Pet Game" GWAP was being read with its surface text in the wrong slots
- Chinese now uses the `/r/HasA` relation to represent the sentence frame "{1} 有 {2}。"
Library:
- Fixes to `conceptnet5.vectors` to ensure that the ConceptNet 5.5 paper is reproducible
Build process:
- Fixed the dependency graph, ensuring that the DB is ready before steps that require it
- The file of SHA-256 sums for downloadable data is updated when that data changes
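For readers who want to verify downloaded data against a file of SHA-256 sums, the following sketch computes a digest in the same `<hash>  <filename>` layout that the `sha256sum` tool prints. The function names are hypothetical and the build's own checksum handling may differ.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_line(path):
    """One line in sha256sum's text format: hash, two spaces, file name."""
    return f"{sha256_of_file(path)}  {path}"
```

Regenerating these lines whenever a data file changes keeps the sums file consistent with the downloads.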
API fixes:
- Enabled CORS
- Fixed a bug that caused `/api/related` queries to fail when filtered for Chinese results
- Fixed a bug that would serve cached API responses in one format to clients that requested a different format
- Fixed the JSON-LD type of the "surfaceText" field
- Some errors were being served with HTTP code 200, fixed to be 500
- Error code 500 no longer gets the default "Internal server error" page
- API errors are reported using Sentry
Data changes:
- Removed some vandalism from the OMCS data
ConceptNet 5.5 contains many changes to make ConceptNet easier to work with and more powerful. The API has changed, but the data model is the same as ConceptNet 5.4.
Significant changes:
- A new Web API, with changes to make its results compatible with JSON-LD
- You can query by 'node' and 'other' in the API, matching edges connected to a node regardless of whether that node is considered the 'start' or 'end'
- New code for working with state-of-the-art word embeddings (`conceptnet5.vectors`)
- A new browsable Web interface, hosted at http://conceptnet.io
- Data is stored in PostgreSQL, providing reliable, fast queries
- Words can be connected to other Linked Data resources using the ExternalURL relation
- Words in all languages can be connected to their root form with the FormOf relation
- Because of that, English words no longer have to be reduced to a root form to be looked up. You can look up `/c/en/words` and get results.
- Converting text to its URI form is easy now: you lowercase the text and replace spaces with underscores. Languages such as Hindi are no longer mangled in URIs.
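The text-to-URI rule in the last item is simple enough to write out directly. The helper below, `text_to_uri_sketch`, is a hypothetical illustration of just that rule (lowercase, spaces to underscores); the library's own conversion function may apply further normalization.

```python
def text_to_uri_sketch(text, language="en"):
    """Lowercase the text and replace spaces with underscores."""
    term = text.strip().lower().replace(" ", "_")
    return f"/c/{language}/{term}"


print(text_to_uri_sketch("words"))
print(text_to_uri_sketch("Ice Cream"))
```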
Data changes:
- Rewrote the Wiktionary parser to be more informative and accurate
- Added data from Open Multilingual WordNet
- Revised data from OpenCyc
- Rewrote the DBPedia reader to focus on terms that are also represented in other sources
- Removed GlobalMind as a data source. It helped us get started in various languages, but we aren't confident enough about its quality.
Process changes:
- Use Snakemake instead of Ninja to manage the build process
- Use Docker Compose to make the process of installing and building ConceptNet straightforward and reproducible
Bug fixes:
- The API server was resolving some URLs incorrectly and non-deterministically. Fixed the URL rules so that they don't overlap anymore.
- A wildcard match for `*` in the `MANIFEST.in` was matching directories, making `setup.py install` and `pip install conceptnet` crash.
This version fixes some bugs that existed in the build process of 5.4.0:
- Some of the raw data files that are needed to build ConceptNet were not being included in the distributed package of raw data.
- The data was unpacked into the wrong directory.
- The language codes 'zh-tw' and 'zh-cn' were not being recognized as being equivalent to 'zh'.
ConceptNet 5.4 includes small updates to the source data, a significant simplification to how texts are represented as URIs, and a new build process.
As of August 10, its code is available on the `version5.4` branch, and its raw data is available at http://conceptnet5.media.mit.edu/downloads/v5.4/ . Other downloads and the Web API are still being prepared.
- We've updated the data from nadya.jp, a game that collects relational knowledge in Japanese.
- Updated DBPedia to its 2014 release.
- Dropped some not-very-useful relations that snuck in from old experiments with ConceptNet, such as `/r/InheritsFrom`.
- We've simplified the way that natural language texts are represented as ConceptNet URIs. Instead of an English-specific, machine-learned tokenizer from NLTK, we use a simpler regex based on Unicode character classes to split the text into words. The most noticeable change is that hyphens are now token boundaries just like spaces: `/c/en/mother-in-law` is now `/c/en/mother_in_law`, and the prefix `/c/en/proto-` is now `/c/en/proto`.
- Assertions now store the original texts of the terms that they relate in the `surfaceStart` and `surfaceEnd` fields. The assertion whose `surfaceText` is `[[Fire]] is [[hot]]` has properties including `{"surfaceStart": "Fire", "surfaceEnd": "hot"}`.
- The Makefile for ConceptNet was becoming unwieldy, so we've replaced it with a Ninja file. Ninja is a build system that's similar in spirit to Make, but deals better with parallel builds and with build steps that produce many outputs.
- Uses the `langcodes` module to parse language names and codes more robustly, especially those from Wiktionary.
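Two of the changes in this list lend themselves to a short sketch: regex-based tokenization where hyphens act as token boundaries, and pulling `surfaceStart` and `surfaceEnd` out of a bracketed surface text. The regex `\w+` is a stand-in, since the changelog does not give the actual pattern, and both function names are hypothetical. The second function also assumes that the first bracketed term is the start and the second is the end, which matches the example above but is not stated as a general rule.

```python
import re

WORD = re.compile(r"\w+")              # Unicode-aware in Python 3
BRACKETED = re.compile(r"\[\[(.*?)\]\]")


def tokenized_uri_sketch(text, language="en"):
    """Split on anything that is not a word character, then join with '_'."""
    tokens = WORD.findall(text.lower())
    return f"/c/{language}/" + "_".join(tokens)


def surface_fields(surface_text):
    """Extract the two bracketed terms from a surface text."""
    terms = BRACKETED.findall(surface_text)
    if len(terms) != 2:
        raise ValueError(f"expected two [[...]] terms in {surface_text!r}")
    return {"surfaceStart": terms[0], "surfaceEnd": terms[1]}


print(tokenized_uri_sketch("mother-in-law"))
print(tokenized_uri_sketch("proto-"))
print(surface_fields("[[Fire]] is [[hot]]"))
```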
ConceptNet 5.3 introduces changes in many areas:
- The search index is now implemented in pure Python, using SQLite. Solr is no longer a dependency.
- The API now uses this new search index. One effect of this is that it matches only complete path components, not any prefix of a URI. Searching for "/c/en/cat" will get "/c/en/cat" and "/c/en/cat/n/animal", but not "/c/en/catamaran".
- Exact matches are also possible. Searching for "/c/en/cat/." -- with a dot as the final component -- will find only "/c/en/cat".
- Because the search index no longer uses Solr, the "score" attribute no longer appears on edges. This attribute was an artifact of Solr that represented the product of the edge's "weight" with Solr's built-in search weight. If you were using "score", you should use "weight" instead.
- ConceptNet now imports data from Umbel, a Semantic Web-style view of OpenCyc.
- Indonesian (`id`) and Malay (`ms`) concepts have been unified into the Malay macrolanguage (also designated `ms`), similarly to the way we already unify Chinese, because of their highly overlapping vocabularies. In a later version, we may be able to make distinctions between languages within a macrolanguage when necessary.
- We've implemented a better Wiktionary reader using Grako, a framework for writing recursive parsers in Python. This parser is able to understand the structure of a Wiktionary entry, giving more results and fewer errors than what we did before.
- Wiktionary parsing now covers entries written in German as well as English. (As before, the entries are about words in hundreds of languages.)
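The matching rule for the new search index (complete path components only, with a trailing `/.` for exact matches) can be expressed as a small predicate. This is an illustrative sketch with a hypothetical function name, not the SQLite-backed implementation itself.

```python
def matches_query(query, uri):
    """True if `uri` would be found by searching for `query`.

    A query matches itself and any URI that extends it by whole path
    components. A query ending in '/.' matches only the exact URI.
    """
    if query.endswith("/."):
        return uri == query[:-2]
    return uri == query or uri.startswith(query + "/")


for candidate in ["/c/en/cat", "/c/en/cat/n/animal", "/c/en/catamaran"]:
    print(candidate, matches_query("/c/en/cat", candidate))
```

Appending `/` before the prefix test is what keeps `/c/en/catamaran` from matching a search for `/c/en/cat`.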
The intermediate format for lists of ConceptNet edges is now msgpack instead of JSON. This format is compatible with JSON but saves disk space and parsing time.
The "assoc-space", a dimensionality-reduced vector space of which words are like other words, uses an updated version of the `assoc_space` package. It can now be built in shards that are combined to form the complete space, instead of having to be built all at once, making it possible to run using a reasonable amount of RAM.
Some of ConceptNet's data is available under the Creative Commons Attribution (CC-By) license, even though the dataset as a whole requires the Creative Commons Attribution-ShareAlike license (CC-By-SA). This information is marked on each edge, but in ConceptNet 5.2, there was no easy way to get the CC-By subset.
By now, there are enough CC-By-SA data sources that it doesn't make sense to attempt a complete build of ConceptNet without them. However, ConceptNet 5.3's downloads include a file containing only the CC-By edges, as individual edges that aren't grouped into assertions.
ConceptNet 5.3's support code still runs on Python 2, but we would like to drop support for Python 2 in an upcoming version. As has been the case since version 5.2, the data cannot be built correctly on Python 2.
The data files described by `MANIFEST.in` are now also installed as `package_data` in `setup.py`, making them available when installed as a package. (This accomplishes what the last bullet point in 5.2.3 was supposed to be about.)
- Fix a typo in the Makefile that prevented it from downloading the initial raw data.
- Enforce the rate limit in the API.
- Merge in NLP code from `metanl`, instead of having it as an external dependency. The dependency is now on the simpler package `ftfy`.
- Add a `MANIFEST.in` so that the necessary data can still be found after a `pip install` or `setup.py install`.
- Fix the accidental omission of nadya.jp data.
5.2.1 is a significant revision to the code that builds ConceptNet, but it retains mostly the same representation and almost all of the same knowledge as 5.2.0. The cases where they differ are largely due to bugs that were discovered in the refactor.
- Reorganized much of the code for working with nodes and edges in ConceptNet.
- The code is now designed for Python 3 as its primary environment. A small amount of compatibility code makes sure that it will still run on Python 2.7 as well, but it will not necessarily get the same results from all Unicode operations.
- Removed a fair amount of dead code.
- Added test cases that cover most of the code; removed tests for 5.0 that clearly wouldn't work anymore.
- Combined assertions (such as what the 5.2 API returns) keep track of their full list of sources and their first-seen dataset, so they can be searched like edges in 5.1.
A change will be noticeable in the Web API, because for a while it was serving the union of ConceptNet 5.1 and 5.2 data structures, with both separate edges and combined assertions. Now it is only serving the combined assertions. The results should be similar, but with less duplication.
- The set of knowledge sources has changed. JMdict is in. ReVerb is out, because we couldn't filter it well enough.
- Some bugs in building from existing sources were fixed.
- ConceptNet can now be built from its raw data using a Makefile. (See Build process)
- The code comes with everything you need to build and query "assoc spaces" -- vector spaces representing semantic connections between concepts -- thanks to the open-source release of assoc_space by Luminoso.
- The API now returns one result per assertion, even if that assertion comes from multiple sources.
- Because of that, the representation of knowledge sources has changed. The sources used to be lists of reasons that an assertion got added, and each one implicitly represented a conjunction. The "sources" field in the API now always contains one element for each assertion, and that element contains the full AND-OR tree of sources.
Version 5.1 has a new, simpler representation of nodes and edges than ConceptNet 4.0 or 5.0, making it suitable to represent ConceptNet 5 with downloadable flat files and efficient search indexes.
- Made base URIs shorter. For example, `/concept/en/dog` becomes `/c/en/dog`.
- Changed the representation of assertions. Assertions are a bundle of edges (hyperedges, really) that connect two arguments and a relation. These edges are labeled with all the appropriate metadata.
- Created JSON and CSV flat-files.
- Created a Solr index and an accompanying API. The MongoDB is deprecated.
ConceptNet 5.1.1 was an incremental update that maintains full API compatibility with 5.1.
- First API for ConceptNet 5.
- All assertions were reified as nodes, with edges for arguments. This turned out to be an ineffective representation.