refine-fuzzymatch-extension

OpenRefine extension to resolve ambiguity

project purposes

OpenRefine is a well-established tool for cleaning data.

However, I think it needs more features to match or parse dirty, "user-specific" data more easily or more automatically.

Of course, some services already address these points (e.g. dedupe, Elasticsearch), but I think it is worth implementing them in OpenRefine because of its rich existing features and its established community.

current feature

experimental

fuzzy matching of records between projects

1. fuzzy match

What is this

The destination project is the project to which you want to add information from another project.
The source project is the project that supplies the destination project with additional information from its rows.

The match is done by comparing each key column (field) of the source project with the corresponding key column of the destination project.
Currently, the match algorithm is based on wolfgarbe's SymSpell.
It finds at most k similar rows within edit distance d.
The only edit distance currently supported is the optimal string alignment distance.
Some text normalization or standardization (e.g. NFKC normalization, conversion to lower case) is applied during comparison.
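
As a rough illustration of the comparison described above, the following Java sketch normalizes values, computes the optimal string alignment distance, and collects at most k candidates within distance d. The names `FuzzyMatchSketch`, `normalize`, `osaDistance`, and `findSimilar` are hypothetical and are not taken from this extension. The sketch assumes the normalization is Unicode NFKC plus lower-casing, and it scans every candidate by brute force rather than using a SymSpell index.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of this extension.
final class FuzzyMatchSketch {

    /** Assumed normalization: Unicode NFKC, then lower case. */
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC).toLowerCase(Locale.ROOT);
    }

    /**
     * Optimal string alignment distance: insertions, deletions, substitutions,
     * and transpositions of two adjacent characters each cost 1.
     */
    static int osaDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                    d[i - 1][j - 1] + cost);
                // Adjacent transposition, e.g. "ab" <-> "ba".
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    /** Brute-force scan: returns up to k candidates within maxDistance, in input order. */
    static List<String> findSimilar(String query, List<String> candidates, int maxDistance, int k) {
        List<String> result = new ArrayList<>();
        String q = normalize(query);
        for (String candidate : candidates) {
            if (result.size() >= k) break;
            if (osaDistance(q, normalize(candidate)) <= maxDistance) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

With this sketch, `osaDistance("abcd", "abdc")` is 1 because swapping two adjacent characters counts as a single edit, whereas plain Levenshtein distance would give 2.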

How to use

You can use this fuzzy match through the GREL function "fuzzyCross" as follows.

Create indices from the column header menu of each key column of the source project, in the source project's view.
Get row objects with the GREL function "fuzzyCross" in "edit" or "transform" in the destination project's view.
It is similar to the "cross" function in OpenRefine, but more complex.

Here, "similar" currently means only "having common characters in a similar alignment".
For example,
```
fuzzyCross(
    row,                                                        // destination row object
    ["DestinationKeyColumnName1", "DestinationKeyColumnName2"], // destination key column names
    sourceProjectName,                                          // source project name
    ["SourceKeyColumnName1", "SourceKeyColumnName2"],           // source key column names
    [1, 3],                                                     // max edit distance for each key
    10,                                                         // max number of returned rows
    [15, 5]                                                     // optional: prefix lengths to compare
)
```
Each element of each array argument corresponds to one key pair.
That is, the call above compares the "DestinationKeyColumnName1" value of the destination row with the "SourceKeyColumnName1" value of each source row, using max edit distance 1 and prefix length 15.
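
The positional pairing of the array arguments can be pictured with the following Java sketch. `KeyPairSpec` and `zip` are hypothetical names used only for illustration and are not part of this extension's API. Rejecting arrays of different lengths is an assumption about sensible behavior, not something this README documents.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the settings of one destination/source key pair.
final class KeyPairSpec {
    final String destColumn;
    final String sourceColumn;
    final int maxDistance;
    final Integer prefixLength; // null when the optional argument is omitted

    KeyPairSpec(String destColumn, String sourceColumn, int maxDistance, Integer prefixLength) {
        this.destColumn = destColumn;
        this.sourceColumn = sourceColumn;
        this.maxDistance = maxDistance;
        this.prefixLength = prefixLength;
    }

    /** Combines the i-th element of every array into the i-th KeyPairSpec. */
    static List<KeyPairSpec> zip(String[] destColumns, String[] sourceColumns,
                                 int[] maxDistances, int[] prefixLengths) {
        int n = destColumns.length;
        if (sourceColumns.length != n || maxDistances.length != n
                || (prefixLengths != null && prefixLengths.length != n)) {
            // Assumption: every array must carry exactly one element per key pair.
            throw new IllegalArgumentException("all arrays must have one element per key pair");
        }
        List<KeyPairSpec> specs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Integer prefix = prefixLengths == null ? null : Integer.valueOf(prefixLengths[i]);
            specs.add(new KeyPairSpec(destColumns[i], sourceColumns[i], maxDistances[i], prefix));
        }
        return specs;
    }
}
```

Applied to the example call, the first spec pairs "DestinationKeyColumnName1" with "SourceKeyColumnName1" (distance 1, prefix 15) and the second pairs "DestinationKeyColumnName2" with "SourceKeyColumnName2" (distance 3, prefix 5).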

known limitation

Indices are not refreshed when the source project changes. This is due to a limitation of OpenRefine.
If you change a cell value in a key field, or change the rows, of the source project,
you must create the indices again from the column menu yourself.
Index construction is slow for long sentences (longer than about 15) combined with a long prefix and a large maximum distance (> 5).
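
To give a feel for why a long prefix and a large distance are costly, the Java sketch below computes an upper bound on the number of delete variants a single value can contribute. It assumes the index follows SymSpell's symmetric-delete approach, in which every string obtained by deleting up to d characters from the first p characters of a value is stored. `DeleteVariantEstimate` and `upperBound` are hypothetical names, and the actual index built by this extension may behave differently.

```java
import java.math.BigInteger;

// Hypothetical estimator for illustration only; not part of this extension.
final class DeleteVariantEstimate {

    /**
     * Upper bound on the distinct strings obtained by deleting at most maxDistance
     * characters from a prefix of length prefixLength:
     * C(p, 0) + C(p, 1) + ... + C(p, min(d, p)).
     */
    static BigInteger upperBound(int prefixLength, int maxDistance) {
        BigInteger total = BigInteger.ZERO;
        BigInteger binom = BigInteger.ONE; // C(p, 0)
        int limit = Math.min(maxDistance, prefixLength);
        for (int i = 0; i <= limit; i++) {
            total = total.add(binom);
            // C(p, i + 1) = C(p, i) * (p - i) / (i + 1); the division is always exact.
            binom = binom.multiply(BigInteger.valueOf(prefixLength - i))
                         .divide(BigInteger.valueOf(i + 1));
        }
        return total;
    }
}
```

Under these assumptions, a prefix length of 15 gives at most 16 variants per value at distance 1 but up to 4944 at distance 5, so the index grows quickly as the distance increases.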

production

Getting Started

Copy this project directory into the extensions directory of your OpenRefine installation (removing the .git folder is recommended). For details, please read the introduction to extensions in the OpenRefine documentation.

Prerequisites

OpenRefine 2.8
Java 8

Installing

Copy this project into the "extensions" folder of OpenRefine.

Running the tests

Break down into end to end tests

And coding style tests

Deployment

Built With

Contributing

Versioning

Authors

Vern1erCa11per - Initial work

License

This project is licensed under the Apache License 2.0; see the LICENSE.md file for details. The license files for the dependencies are in LICENSES.

Acknowledgments

Notice

The current version is under development, and breaking changes may be introduced.
