OpenRefine Extension to Resolve Ambiguity
OpenRefine is a well-established tool for cleaning data.
However, I think it needs more features to match or parse dirty, user-specific data more easily and more automatically.
Some services already address these needs (e.g., dedupe, Elasticsearch), but I think it is worth implementing them in OpenRefine because of its rich existing features and its established community.
The destination project is the project to which you want to add information from another project.
The source project is the project that supplies additional information from its rows to the destination project.
Matching is done by comparing each key column (field) of the source project with the corresponding key column of the destination project.
Currently, the matching algorithm is based on wolfgarbe's SymSpell.
It finds at most k similar rows within edit distance d.
At present, the only supported edit distance is the optimal string alignment distance.
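To make the distance concrete, here is a minimal Python sketch of the optimal string alignment distance, together with a brute-force "at most k rows within distance d" search. It is only an illustration: the extension itself relies on a SymSpell-based index rather than a linear scan, and the names `osa_distance` and `top_k_within` are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ab" -> "ba"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def top_k_within(query, candidates, max_distance, k):
    """Hypothetical helper: return up to k (distance, candidate) pairs
    whose distance to the query is at most max_distance, closest first."""
    scored = [(osa_distance(query, c), c) for c in candidates]
    hits = sorted(pair for pair in scored if pair[0] <= max_distance)
    return hits[:k]
```

For instance, `osa_distance("abcd", "abdc")` counts the swapped "cd" as a single edit, so a max distance of 1 would still accept that pair.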
Some text normalization or standardization (e.g., CNTK normalization, lowercasing) is applied during comparison.
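As an illustration of this kind of preprocessing, the sketch below assumes Unicode NFKC compatibility normalization followed by lowercasing. That choice is an assumption made for the example; the exact steps the extension applies may differ, and `normalize_key` is a hypothetical name.

```python
import unicodedata


def normalize_key(value: str) -> str:
    """Hypothetical normalization applied to a key before comparison.

    Assumption: NFKC folding (e.g. full-width letters and ligatures are
    mapped to their plain equivalents), then lowercasing.
    """
    folded = unicodedata.normalize("NFKC", value)
    return folded.lower()
```

With such a step, values that differ only in letter case or character width compare as equal before any edit distance is computed.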
You can use this fuzzy match through the GREL function "fuzzyCross" as follows.
- Create indices from the column header menu of the key columns of the source project, in the source project's view.
- Get row objects with the GREL function "fuzzyCross" in "edit" or "transform" in the destination project's view.
It is similar to the "cross" function in OpenRefine, but more complex. Here, "similar" currently means only "having common characters in a similar alignment".
For example:

    fuzzyCross(
        row, // destination row object
        ["DestinationKeyColumnName1", "DestinationKeyColumnName2"], // destination key column names
        sourceProjectName,
        ["SourceKeyColumnName1", "SourceKeyColumnName2"], // source key column names
        [1, 3], // max distances for each key
        10, // max number of returned rows
        [15, 5] // optional, prefix lengths to compare
    )

Each element of each array argument corresponds to one key pair.
That is, it compares the value of the "DestinationKeyColumnName1" column in the destination row with the value of "SourceKeyColumnName1" in the source rows, using max edit distance 1 and prefix length 15.
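The per-key pairing can be modelled in a few lines of Python. This is a rough sketch and not the extension's code. It assumes that a source row matches only when every key pair is within its own max distance, and that results are ordered by total distance; neither point is stated above. The function `fuzzy_match_rows` and the sample column names are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (adjacent transpositions cost 1)."""
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def fuzzy_match_rows(dest_row, dest_keys, source_rows, source_keys,
                     max_distances, max_rows, prefix_lengths=None):
    """Hypothetical model of the argument layout: index i of every list
    argument describes the i-th (destination key, source key) pair."""
    matches = []
    for src in source_rows:
        total = 0
        for i, (dest_key, src_key) in enumerate(zip(dest_keys, source_keys)):
            left, right = str(dest_row[dest_key]), str(src[src_key])
            if prefix_lengths is not None:
                # Compare only the leading prefix_lengths[i] characters.
                left = left[:prefix_lengths[i]]
                right = right[:prefix_lengths[i]]
            dist = osa_distance(left, right)
            if dist > max_distances[i]:
                break  # assumption: every key pair must be within its limit
            total += dist
        else:
            matches.append((total, src))
    matches.sort(key=lambda m: m[0])  # assumption: closest rows first
    return [src for _, src in matches[:max_rows]]
```

A call such as `fuzzy_match_rows(dest, ["name", "city"], sources, ["full_name", "town"], [1, 3], 10, [15, 5])` mirrors the argument order of the GREL example, with plain dictionaries standing in for row objects.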
- Indices cannot be refreshed automatically when the source project changes. This is due to a limitation of OpenRefine.
  If you change cell values of the key fields, or the rows, of the source project, you must recreate the indices yourself from the column menu.
- Index construction is slow for long sentences (longer than about 15) combined with a long prefix and a distance that is not short (> 5).
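One way to see why index construction slows down is to count index entries. The sketch below assumes the index follows SymSpell's symmetric-delete scheme, in which the prefix of each indexed value is expanded into every string obtainable by deleting up to d characters. `delete_variants` is a hypothetical helper, not part of the extension.

```python
from itertools import combinations


def delete_variants(term, max_distance, prefix_length):
    """Return the prefix of term plus every string obtained by deleting
    up to max_distance characters from that prefix (symmetric-delete idea)."""
    prefix = term[:prefix_length]
    variants = {prefix}
    for n_deleted in range(1, min(max_distance, len(prefix)) + 1):
        n_kept = len(prefix) - n_deleted
        # Choose which character positions survive the deletion.
        for kept in combinations(range(len(prefix)), n_kept):
            variants.add("".join(prefix[i] for i in kept))
    return variants


if __name__ == "__main__":
    term = "abcdefghijklmno"  # 15 distinct characters
    for d in (1, 3, 5):
        print(d, len(delete_variants(term, d, 15)))
```

Under this assumption, the number of entries per value grows combinatorially with both the prefix length and the max distance, which is consistent with the slowdown described above.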
Copy this project directory into the extensions directory of your OpenRefine installation (removing the .git folder is recommended). For details, please read the introduction to extensions in the OpenRefine documentation.
- OpenRefine 2.8
- Java 8
Copy this project into the "extensions" folder of OpenRefine.
Vern1erCa11per - Initial work
This project is licensed under Apache 2.0 - see the LICENSE.md file for details. The License files for the dependencies are in LICENSES.
The current version is under development, and breaking changes may be introduced.