Ensure publication_code, issue_code and item_code uniqueness #174

Open
3 tasks
griff-rees opened this issue Aug 17, 2023 · 4 comments
Labels
bug Something isn't working

Comments


griff-rees commented Aug 17, 2023

A recent uniqueness check suggests there are 76 duplicated newspaper publication_code values (each shared with just 1 other record, so a count of 2).

  • 76 same publication_code records
  • 81520 same issue_code records
  • 3670454 same item_code records

These might be cases of multiple editions of an issue on the same day (following @kmcdono2 in #120), or actual duplicate records (meaning... just wrong). I think the majority of the publication_code cases are the latter (and thankfully quite a few have no related issues, and by extension no items):

>>> from django.db.models import QuerySet
>>> from newspaper.models import Newspaper, Issue, Item
>>> from lwmdb.utils import similar_records

>>> newspaper_same_codes: QuerySet = similar_records(Newspaper.objects.all(), check_fields=('publication_code',))
>>> issue_same_codes: QuerySet = similar_records(Issue.objects.all(), check_fields=('issue_code',))
>>> item_same_codes: QuerySet = similar_records(Item.objects.all(), check_fields=('item_code',))
>>> len(newspaper_same_codes)
76
>>> len(issue_same_codes)
81520
>>> len(item_same_codes)
3670454
>>> all(record for record in newspaper_same_codes if record['id__count'] == 2)
True
>>> all(record for record in issue_same_codes if record['id__count'] == 2)
True
>>> all(record for record in item_same_codes if record['id__count'] == 2)
True
@griff-rees

see #55 and #93

@griff-rees added the bug label Aug 18, 2023
@griff-rees self-assigned this Aug 18, 2023
@griff-rees added this to the Alpha release v0.1.12 milestone Aug 18, 2023
@griff-rees

Updated the description to make it easier to split into separate tasks.

@griff-rees changed the title from "Ensure Newspaper publication_code uniqueness" to "Ensure publication_code, 'issue_code' and 'item_code' uniqueness" Dec 10, 2023
@griff-rees changed the title from "Ensure publication_code, 'issue_code' and 'item_code' uniqueness" to "Ensure publication_code, issue_code and item_code uniqueness" Dec 10, 2023
@griff-rees

  • Add a collection field in case there are newspapers duplicated across collections
    • Document from current configuration
    • Roadmap for changes to come
    • Enough for first database export
  • FullText
    • Until duplicate question is solved leave that separate
    • Have a workshop to address those
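
To illustrate the collection field idea, here is a rough sketch of uniqueness enforced on the pair (collection, publication_code) rather than on publication_code alone. It uses the standard-library `sqlite3` module only so that it runs on its own. The table, the column names, the collection labels and the `add_newspaper` helper are all assumptions for illustration, not the project's schema. In Django the closest equivalent would presumably be a `UniqueConstraint` over both fields in the model's `Meta.constraints`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE newspaper (
        id INTEGER PRIMARY KEY,
        collection TEXT NOT NULL,
        publication_code TEXT NOT NULL,
        -- The same code may appear in different collections, but only once in each.
        UNIQUE (collection, publication_code)
    )
    """
)


def add_newspaper(collection: str, publication_code: str) -> bool:
    """Insert a row; return False if the (collection, code) pair already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO newspaper (collection, publication_code) VALUES (?, ?)",
                (collection, publication_code),
            )
    except sqlite3.IntegrityError:
        return False
    return True


first = add_newspaper("collection-a", "0000001")
other_collection = add_newspaper("collection-b", "0000001")  # same code, new collection
true_duplicate = add_newspaper("collection-a", "0000001")  # rejected by the constraint
row_count = conn.execute("SELECT COUNT(*) FROM newspaper").fetchone()[0]
```

Under this scheme a publication_code repeated across collections is accepted, while a repeat within one collection is refused at the database level, which separates the "legitimately shared" case from the "just wrong" case.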

@griff-rees

#119 is also related.
