UEDI: Unsupervised Evaluation of Dataset Integration

Evaluation is a bottleneck in data integration processes: it is performed by domain experts through onerous manual data inspections. This task is particularly heavy in real business scenarios, where the large amount of data makes checking all integrated tuples infeasible. Our idea is to address this issue by providing the experts with an unsupervised measure, based on word frequencies, which quantifies how representative one dataset is of another, giving an indication of how good the integration process is. The paper motivates and introduces the measure and provides extensive experimental evaluations that show the effectiveness and efficiency of the approach.
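
To build intuition for what a word-frequency-based representativeness score can look like, here is a minimal sketch. It is not the measure defined in the paper or implemented in this repository: the function names `token_counts` and `toy_overlap`, the whitespace tokenization, and the overlap formula are all assumptions made for illustration only.

```python
from collections import Counter


def token_counts(records):
    """Count lowercase, whitespace-separated tokens across all records."""
    counts = Counter()
    for record in records:
        counts.update(record.lower().split())
    return counts


def toy_overlap(source_records, integrated_records):
    """Share of the source token mass that also appears in the integrated data.

    Hypothetical illustration only; this is NOT the UEDI measure.
    """
    source = token_counts(source_records)
    integrated = token_counts(integrated_records)
    total = sum(source.values())
    if total == 0:
        return 0.0
    # For each source token, count only the occurrences the integrated data can match.
    shared = sum(min(count, integrated[token]) for token, count in source.items())
    return shared / total


# Made-up records, not taken from any dataset in this repository.
source = ["alpha beta", "beta gamma"]
integrated = ["alpha beta delta"]
score = toy_overlap(source, integrated)
print(f"{score:0.4f}")  # 0.5000
```

A score of 1.0 in this toy version means every token occurrence of the source is matched in the integrated data, while 0.0 means none is. The actual measure is described in the paper.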

For a detailed description of the work please read our paper. Please cite the paper if you use the code from this repository in your work.

@inproceedings{10.1145/3477314.3507688,
    author = {Paganelli, Matteo and Buono, Francesco Del and Guerra, Francesco and Ferro, Nicola},
    title = {Evaluating the Integration of Datasets},
    year = {2022},
    publisher = {Association for Computing Machinery},
    booktitle = {Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing},
    pages = {347–356},
    series = {SAC '22}
}

Library

Requirements

Python: Python 3.*
Packages: requirements.txt

Installation

$ cd source

$ virtualenv -p python3 venv

$ source venv/bin/activate

$ pip install -r requirements.txt

Install the datasets/models that the nltk functions need in order to work. Example:

import nltk
nltk.download()

How to Use

Input and Output Representativeness

import pandas as pd
from uedi.evaluation import prepare_dataset
from uedi.representativeness import representativness

# Load the dataset and prepare the two sources (df1, df2) and the integrated dataset (dfi)
filename = 'data/Structured_Fodors-Zagats.csv'
columns = ['name', 'addr', 'city', 'phone', 'type', 'class']
df = pd.read_csv(filename)
df1, df2, dfi = prepare_dataset(df, columns)

# Compute input and output representativeness of source 1 with respect to the integrated dataset
input_repr, output_repr = representativness(df_s=df1, df_i=dfi)
print('\nSource 1 Representativeness')
print(f'Input Representativeness: {input_repr:0.4f}')
print(f'Output Representativeness: {output_repr:0.4f}')

Input and Output Ranking Representativness

import numpy as np
import pandas as pd
from uedi.evaluation import prepare_dataset
from uedi.rank import input_ranking, output_ranking

# Load the dataset and prepare the two sources (df1, df2) and the integrated dataset (dfi)
filename = 'data/Structured_Fodors-Zagats.csv'
columns = ['name', 'addr', 'city', 'phone', 'type', 'class']
df = pd.read_csv(filename)
df1, df2, dfi = prepare_dataset(df, columns)

# Compute input ranking (scores for the records of source df1)
input_ranks = input_ranking(df_s=df1, df_i=dfi)
idx = np.argmin(input_ranks)
print('\nThe least represented source record is: ')
print(df1.iloc[idx])

# Compute output ranking (scores for the records of the integrated dataset dfi)
output_ranks = output_ranking(df_list=[df1, df2], df_i=dfi)
idx = np.argmin(output_ranks)
print('\nThe least represented integrated record is: ')
print(dfi.iloc[idx])
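
When a single record is not enough, it can help to inspect the few lowest-scoring records instead. The sketch below assumes, as the snippet above suggests, that the ranking functions return one score per record, aligned with row positions, and that lower scores mean less represented. The helper `k_lowest` is hypothetical and not part of the `uedi` package, and the scores are made-up placeholders.

```python
import numpy as np


def k_lowest(scores, k=5):
    """Return the positions of the k smallest scores, lowest first."""
    scores = np.asarray(scores, dtype=float)
    k = min(k, scores.size)
    # A stable sort keeps the original order among tied scores.
    return np.argsort(scores, kind="stable")[:k]


# Placeholder scores standing in for the output of input_ranking / output_ranking.
example_scores = [0.9, 0.1, 0.5, 0.3]
lowest = k_lowest(example_scores, k=2)
print(lowest)  # [1 3]
```

With real scores, the returned positions can be passed to `df1.iloc[lowest]` or `dfi.iloc[lowest]` to display the corresponding records.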

Please feel free to contact me if you need any further information.
