
Repeated annotation of large files is slow #256

Closed
ledsoft opened this issue Feb 12, 2024 · 1 comment · Fixed by #257
Labels
performance Performance issue

Comments


ledsoft commented Feb 12, 2024

When text analysis is invoked on an already annotated large file (ca. 1 MB) containing many term occurrences, processing its results can take minutes to finish. This makes the feature practically unusable: the user cannot tell whether it is normal for the application to show Please wait... for several minutes, and may leave or attempt to refresh the page.

Analysis of repeated annotation of the metropolitan plan shows the following times:

  • Invocation of text analysis: 8.5s
  • Resolution of occurrences in the file: 47s
  • Saving occurrences: 5min 31s

The goal should be to get the total under a minute, preferably better.

@ledsoft ledsoft added the performance Performance issue label Feb 12, 2024

ledsoft commented Feb 12, 2024

After a bit more investigation, it seems repeated annotation is actually faster, because most of the existing annotations can be reused and nothing needs to change in the repository. The problem is saving new annotations: in MPP there are 4386 term occurrences, and since each occurrence usually has two selectors, this amounts to over 7800 instances to be saved.

Asynchronous saving of term occurrences could be used to improve the performance of text analysis as a whole.
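The idea above can be sketched roughly as follows. This is a minimal, self-contained illustration, not TermIt's actual API: `AsyncOccurrenceSaver` and its methods are hypothetical names, and the "persist" step stands in for what would be a repository write in the real application. The point is that `saveAll` hands the ~7800 instances to a background executor and returns immediately, so the user-facing request no longer waits minutes for persistence.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: save term occurrences on a background executor so
// that the text-analysis request can finish before all occurrences (and
// their selectors) are persisted.
class AsyncOccurrenceSaver {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final ConcurrentLinkedQueue<String> saved = new ConcurrentLinkedQueue<>();

    // Enqueue occurrences for persistence; returns immediately.
    CompletableFuture<Void> saveAll(List<String> occurrences) {
        return CompletableFuture.runAsync(() -> occurrences.forEach(this::persist), executor);
    }

    private void persist(String occurrence) {
        // In the real application this would be a repository/triple-store write.
        saved.add(occurrence);
    }

    int savedCount() {
        return saved.size();
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

One trade-off of this approach: the client may see the analysis as "done" before all occurrences are actually stored, so any immediate reload of the annotations has to tolerate partially saved data.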

ledsoft added a commit that referenced this issue Feb 12, 2024
…parate class.

This way an alternative implementation using asynchronous processing can be introduced.
ledsoft added a commit that referenced this issue Feb 12, 2024
…n processing performance.

Helps mainly when no occurrences existed originally.
ledsoft added a commit that referenced this issue Feb 12, 2024
Should decrease number of iterations over occurrences in annotated source.
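Reducing iterations over occurrences typically means building an index once instead of rescanning the full list for every new annotation. A minimal sketch of that idea, with illustrative names (`Occurrence`, `OccurrenceIndex` are not TermIt classes):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: group existing occurrences by the term they
// reference once up front, so matching a new annotation against them is a
// map lookup instead of another pass over the whole occurrence list.
record Occurrence(String termId, String selector) {}

class OccurrenceIndex {
    private final Map<String, List<Occurrence>> byTerm;

    OccurrenceIndex(List<Occurrence> occurrences) {
        // One pass over the occurrences builds the index.
        this.byTerm = occurrences.stream()
                .collect(Collectors.groupingBy(Occurrence::termId));
    }

    List<Occurrence> forTerm(String termId) {
        return byTerm.getOrDefault(termId, List.of());
    }
}
```

With 4386 occurrences, this turns repeated O(n) scans into a single O(n) grouping pass followed by O(1) lookups.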
ledsoft added a commit that referenced this issue Feb 13, 2024
…currences in analyzed file.

Since the same terms are likely to occur multiple times in a file, it makes sense to cache existence check results, thus improving performance of term occurrence resolution.
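The caching described in that commit message can be sketched as a simple memoization wrapper. This is an illustration, not the actual TermIt code: `ExistenceCheckCache` and the `repositoryCheck` predicate are hypothetical stand-ins for the real (expensive) repository lookup. Because the same term occurs many times in a file, the underlying check runs at most once per distinct term.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical sketch: memoize per-term existence checks during term
// occurrence resolution so each repository lookup happens at most once
// per distinct term, regardless of how often the term occurs in the file.
class ExistenceCheckCache {
    private final Map<String, Boolean> cache = new ConcurrentHashMap<>();
    private final Predicate<String> repositoryCheck; // the expensive lookup
    private int lookups = 0;                         // counts actual lookups

    ExistenceCheckCache(Predicate<String> repositoryCheck) {
        this.repositoryCheck = repositoryCheck;
    }

    boolean exists(String termId) {
        // computeIfAbsent runs the expensive check only on a cache miss.
        return cache.computeIfAbsent(termId, id -> {
            lookups++;
            return repositoryCheck.test(id);
        });
    }

    int lookupCount() {
        return lookups;
    }
}
```

For a file where each term occurs, say, 10 times on average, this cuts the number of existence queries roughly tenfold.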
@ledsoft ledsoft linked a pull request Feb 13, 2024 that will close this issue
@ledsoft ledsoft closed this as completed Feb 13, 2024