Duplicate samples and overlap between train and test #44

Closed
Britefury opened this issue Aug 31, 2017 · 1 comment · Fixed by #45
@Britefury

I hope I have got this right, but it seems that 43 samples are duplicated in the training set and 1 sample is duplicated in the test set. In addition, 10 samples in the training set also appear in the test set. I found these by comparing the samples at the byte level.

Here is a list of the duplicates:

Training set duplicates:
[601, 39865]
[831, 24228]
[1826, 23718]
[2024, 53883]
[4974, 6293]
[5520, 49165]
[5790, 11845]
[5822, 33399]
[6139, 37731]
[6280, 41036]
[8485, 31238]
[8841, 28184]
[12571, 56657]
[14096, 32343]
[14710, 22159]
[15587, 28635]
[19308, 20114]
[19668, 21571]
[19760, 39489]
[19888, 24443]
[21072, 32800]
[22852, 28789]
[23052, 57107]
[23413, 33731]
[24785, 46015]
[25297, 40077]
[25629, 49588]
[26314, 49351]
[27045, 40033]
[27421, 31627]
[32113, 38337]
[32300, 33730]
[32303, 56840]
[32888, 41918]
[32922, 54584]
[36634, 39841]
[38261, 41877]
[42756, 53842]
[46667, 57724]
[46782, 54829]
[47929, 54185]
[48480, 59607]
[48955, 51368]
Test set duplicates:
[6334, 8569]
Training set samples overlapping with test set:
Train samples [3763] overlap with test samples [7243]
Train samples [4944] overlap with test samples [7781]
Train samples [6168] overlap with test samples [9227]
Train samples [12404] overlap with test samples [4037]
Train samples [15943] overlap with test samples [6659]
Train samples [22403] overlap with test samples [7762]
Train samples [34617] overlap with test samples [4990]
Train samples [35772] overlap with test samples [7216]
Train samples [48228] overlap with test samples [5867]
Train samples [52205] overlap with test samples [9560]

The code used to generate the above output is as follows (assuming the input images are in the variables train_X and test_X):

def sample_bytes(x):
    # Serialise each sample to raw bytes so it can be used as a dict key
    return [x[i].tobytes() for i in range(len(x))]

train_h = sample_bytes(train_X)
test_h = sample_bytes(test_X)

# Map each distinct byte string to the indices of the samples that produce it
train_dict = {}
test_dict = {}
for i, h in enumerate(train_h):
    train_dict.setdefault(h, []).append(i)
for i, h in enumerate(test_h):
    test_dict.setdefault(h, []).append(i)

print('Training set duplicates:')
for k, v in train_dict.items():
    if len(v) > 1:
        # Confirm the samples really are equal element by element
        for j in range(1, len(v)):
            assert (train_X[v[0]] == train_X[v[j]]).all()
        print(v)

print('Test set duplicates:')
for k, v in test_dict.items():
    if len(v) > 1:
        for j in range(1, len(v)):
            assert (test_X[v[0]] == test_X[v[j]]).all()
        print(v)

print('Training set samples overlapping with test set:')
for k, v in train_dict.items():
    if k in test_dict:
        assert (train_X[v[0]] == test_X[test_dict[k][0]]).all()
        print('Train samples {} overlap with test samples {}'.format(v, test_dict[k]))

# Final check: this assertion fails for as long as train and test share samples
overlap = set(train_h).intersection(set(test_h))
print(len(overlap))
assert overlap == set()
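
For anyone who wants to filter the arrays locally, here is a minimal sketch of how the same byte-level keys could be used to drop the duplicates. It is only an illustration: the helper name dedup_indices and the toy arrays are hypothetical, and it is not necessarily how the linked fix handles the problem. It keeps the first occurrence of each training sample and discards any training sample that also appears in the test set.

```python
import numpy as np

def dedup_indices(train_X, test_X):
    """Return indices of training samples to keep (hypothetical helper).

    A sample is kept if it is the first occurrence of its byte string in
    the training set and that byte string does not occur in the test set.
    """
    test_keys = {x.tobytes() for x in test_X}
    seen = set()
    keep = []
    for i, x in enumerate(train_X):
        key = x.tobytes()
        if key in seen or key in test_keys:
            continue  # duplicate within train, or overlaps with test
        seen.add(key)
        keep.append(i)
    return np.array(keep, dtype=np.int64)

# Toy data standing in for the real image arrays (made-up values)
train_X = np.array([[0, 1], [2, 3], [0, 1], [4, 5]], dtype=np.uint8)
test_X = np.array([[4, 5], [6, 7]], dtype=np.uint8)

keep = dedup_indices(train_X, test_X)
print(keep)  # [0 1]
train_X_clean = train_X[keep]
```

If labels are stored in a parallel array, the same index array would need to be applied to it so that images and labels stay aligned.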
@hanxiao
Collaborator

hanxiao commented Aug 31, 2017

🙇‍♂️ Many thanks for finding this issue! Check my PR #45.
