
Optimize the code for retrieving annotations #7748

Merged: 6 commits into develop, Apr 23, 2024

Conversation

@Marishka17 (Contributor) commented Apr 10, 2024

Motivation and context

After analyzing the current implementation, it turned out that we evaluate the queryset and iterate over it only once, when merging table rows and initializing the custom structure for storing objects. This means we can disable Django's default internal queryset caching. This approach reduces the amount of memory used when there is a large number of objects.
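The effect of disabling queryset caching can be sketched in plain Python (hypothetical row shape; `QuerySet.iterator()` does the analogous thing at the database-cursor level):

```python
import sys

def fetch_rows(n):
    """Simulate a database cursor that streams rows one at a time."""
    for i in range(n):
        yield {"id": i, "frame": i % 100}

# Django's default behaviour caches the evaluated queryset,
# so the whole result set is held in memory at once:
cached = list(fetch_rows(100_000))

# QuerySet.iterator() streams rows instead; this is safe here
# because the merging code iterates over the rows exactly once:
count = sum(1 for _ in fetch_rows(100_000))

assert count == len(cached)
# The generator object itself stays tiny no matter how many rows it yields:
assert sys.getsizeof(fetch_rows(100_000)) < sys.getsizeof(cached)
```

In the Django code itself the change amounts to iterating with `rows.iterator()` (optionally passing a `chunk_size`) instead of `for row in rows`.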

How has this been tested?

I've checked the amount of memory used, the number of queries to the database, and the required time.

Without iterator:
Memory usage: 18GB

Profile details
filename: /home/maya/Documents/cvat/cvat/apps/dataset_manager/task.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    116    290.7 MiB    290.7 MiB           1   @profile
    117                                         def _merge_table_rows(rows, keys_for_merge, field_id):
    118                                             # It is necessary to keep a stable order of original rows
    119                                             # (e.g. for tracked boxes). Otherwise prev_box.frame can be bigger
    120                                             # than next_box.frame.
    121    290.7 MiB      0.0 MiB           1       merged_rows = OrderedDict()
    122                                         
    123                                             # Group all rows by field_id. In grouped rows replace fields in
    124                                             # accordance with keys_for_merge structure.
    125  19053.0 MiB  17140.9 MiB     1721804       for row in rows: #.iterator():
    126  19053.0 MiB   -567.5 MiB     1721803           row_id = row[field_id]
    127  19053.0 MiB   -567.6 MiB     1721803           if not row_id in merged_rows:
    128  19053.0 MiB    127.0 MiB      373063               merged_rows[row_id] = dotdict(row)
    129  19053.0 MiB   -162.9 MiB      746126               for key in keys_for_merge:
    130  19053.0 MiB   -111.0 MiB      373063                   merged_rows[row_id][key] = []
    131                                         
    132  19053.0 MiB  -1121.3 MiB     3443606           for key in keys_for_merge:
    133  19053.0 MiB  -2753.5 MiB    10330818               item = dotdict({v.split('__', 1)[-1]:row[v] for v in keys_for_merge[key]})
    134  19053.0 MiB   -485.9 MiB     1721803               if item.id is not None:
    135  19053.0 MiB   -525.5 MiB     1573530                   merged_rows[row_id][key].append(item)
    136                                         
    137                                             # Remove redundant keys from final objects
    138  19053.0 MiB      0.0 MiB           7       redundant_keys = [item for values in keys_for_merge.values() for item in values]
    139  19053.0 MiB  -5728.4 MiB      373064       for i in merged_rows:
    140  19053.0 MiB -22913.7 MiB     1492252           for j in redundant_keys:
    141  19053.0 MiB -17185.2 MiB     1119189               del merged_rows[i][j]
    142                                         
    143  19052.9 MiB     -0.0 MiB           1       return list(merged_rows.values())
filename: /home/maya/Documents/cvat/cvat/apps/dataset_manager/task.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   749    290.7 MiB    290.7 MiB           1       @profile
   750                                             def init_from_db(self):
   751    290.7 MiB      0.0 MiB           1           self._init_tags_from_db()
   752  19052.3 MiB  18761.7 MiB           1           self._init_shapes_from_db()
   753  19052.3 MiB      0.0 MiB           1           self._init_tracks_from_db()
   754  19052.3 MiB      0.0 MiB           1           self._init_version_from_db()

Screenshot from 2024-04-10 13-57-33
With iterator:
Memory usage: 5.5GB

Profile details
filename: /home/maya/Documents/cvat/cvat/apps/dataset_manager/task.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   116    290.9 MiB    290.9 MiB           1   @profile
   117                                         def _merge_table_rows(rows, keys_for_merge, field_id):
   118                                             # It is necessary to keep a stable order of original rows
   119                                             # (e.g. for tracked boxes). Otherwise prev_box.frame can be bigger
   120                                             # than next_box.frame.
   121    290.9 MiB      0.0 MiB           1       merged_rows = OrderedDict()
   122                                         
   123                                             # Group all rows by field_id. In grouped rows replace fields in
   124                                             # accordance with keys_for_merge structure.
   125   4345.7 MiB   3783.7 MiB     1721804       for row in rows.iterator():
   126   4345.7 MiB     24.2 MiB     1721803           row_id = row[field_id]
   127   4345.7 MiB      0.0 MiB     1721803           if not row_id in merged_rows:
   128   4345.7 MiB     78.9 MiB      373063               merged_rows[row_id] = dotdict(row)
   129   4345.7 MiB      5.9 MiB      746126               for key in keys_for_merge:
   130   4345.7 MiB      0.0 MiB      373063                   merged_rows[row_id][key] = []
   131                                         
   132   4345.7 MiB      0.3 MiB     3443606           for key in keys_for_merge:
   133   4345.7 MiB    152.9 MiB    10330818               item = dotdict({v.split('__', 1)[-1]:row[v] for v in keys_for_merge[key]})
   134   4345.7 MiB      9.0 MiB     1721803               if item.id is not None:
   135   4345.7 MiB      0.0 MiB     1573530                   merged_rows[row_id][key].append(item)
   136                                         
   137                                             # Remove redundant keys from final objects
   138   4345.7 MiB      0.0 MiB           7       redundant_keys = [item for values in keys_for_merge.values() for item in values]
   139   4345.7 MiB      0.0 MiB      373064       for i in merged_rows:
   140   4345.7 MiB      0.0 MiB     1492252           for j in redundant_keys:
   141   4345.7 MiB      0.0 MiB     1119189               del merged_rows[i][j]
   142                                         
   143   4348.5 MiB      2.8 MiB           1       return list(merged_rows.values())
Filename: /home/maya/Documents/cvat/cvat/apps/dataset_manager/task.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   590    290.9 MiB    290.9 MiB           1       @profile
   591                                             def _init_shapes_from_db(self):
   592    290.9 MiB      0.0 MiB           3           db_shapes = self.db_job.labeledshape_set.prefetch_related(
   593    290.9 MiB      0.0 MiB           1               "label",
   594    290.9 MiB      0.0 MiB           1               "labeledshapeattributeval_set"
   595    290.9 MiB      0.0 MiB           2           ).values(
   596    290.9 MiB      0.0 MiB           1               'id',
   597    290.9 MiB      0.0 MiB           1               'label_id',
   598    290.9 MiB      0.0 MiB           1               'type',
   599    290.9 MiB      0.0 MiB           1               'frame',
   600    290.9 MiB      0.0 MiB           1               'group',
   601    290.9 MiB      0.0 MiB           1               'source',
   602    290.9 MiB      0.0 MiB           1               'occluded',
   603    290.9 MiB      0.0 MiB           1               'outside',   
   604    290.9 MiB      0.0 MiB           1               'z_order',
   605    290.9 MiB      0.0 MiB           1               'rotation',
   606    290.9 MiB      0.0 MiB           1               'points',
   607    290.9 MiB      0.0 MiB           1               'parent',
   608    290.9 MiB      0.0 MiB           1               'labeledshapeattributeval__spec_id',
   609    290.9 MiB      0.0 MiB           1               'labeledshapeattributeval__value',
   610    290.9 MiB      0.0 MiB           1               'labeledshapeattributeval__id',
   611    290.9 MiB      0.0 MiB           1               ).order_by('frame')                         
   618   4328.6 MiB   4037.7 MiB       2           db_shapes = _merge_table_rows(
   619    290.9 MiB      0.0 MiB           1               rows=db_shapes,
   620    290.9 MiB      0.0 MiB           1               keys_for_merge={
   621    290.9 MiB      0.0 MiB           1                   'labeledshapeattributeval_set': [
   622                                                                       'labeledshapeattributeval__spec_id',
   623                                                                       'labeledshapeattributeval__value',
   624                                                                       'labeledshapeattributeval__id',
   625                                                                  ],
   626                                                                  },
   627    290.9 MiB      0.0 MiB           1               field_id='id',
   628                                                 )
   649   4385.6 MiB      0.0 MiB           1           serializer = serializers.LabeledShapeSerializerFromDB(list(shapes.values()), many=True)
   650   5990.2 MiB   1604.6 MiB           1           self.ir_data.shapes = serializer.data
Filename: /home/maya/Documents/cvat/cvat/apps/dataset_manager/task.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   749    290.6 MiB    290.6 MiB           1       @profile
   750                                             def init_from_db(self):
   751    290.9 MiB      0.3 MiB           1           self._init_tags_from_db()
   752   5990.2 MiB   5699.3 MiB           1           self._init_shapes_from_db()
   753   5990.2 MiB      0.0 MiB           1           self._init_tracks_from_db()
   754   5990.2 MiB      0.0 MiB           1           self._init_version_from_db()

Screenshot from 2024-04-10 12-36-09

Checklist

- [x] I submit my changes into the develop branch
- [x] I have created a changelog fragment
- [ ] I have updated the documentation accordingly
- [ ] I have added tests to cover my changes
- [ ] I have linked related issues (see GitHub docs)
- [ ] I have increased versions of npm packages if it is necessary
  (cvat-canvas, cvat-core, cvat-data and cvat-ui)

License

- [x] I submit my code changes under the same MIT License that covers the project.
  Feel free to contact the maintainers if that's a concern.

@Marishka17 Marishka17 requested a review from azhavoro April 12, 2024 12:02
@Marishka17 Marishka17 marked this pull request as ready for review April 12, 2024 12:02
@Marishka17 (Contributor, Author) commented Apr 18, 2024

@azhavoro, @zhiltsov-max
I've re-measured the metrics. Besides that, I figured out that the prefetch_related used here was useless because of Django's implementation. To put it briefly, prefetch_related(...) will not work if values(...) is used: the prefetch cache is saved to the _prefetched_objects_cache attribute of each model instance, but after values(...) the queryset's _iterable_class is ValuesIterable, so the queryset yields dicts, and it is not possible to set an attribute on a dict. (Link to the code block)
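The incompatibility boils down to a plain Python fact: model instances can carry the prefetch cache as an attribute, while the dicts that `.values()` yields cannot. A minimal illustration (the class is a stand-in, not Django code; only the attribute name mirrors Django's internals):

```python
class FakeModelInstance:
    """Stands in for a Django model instance, which supports attribute assignment."""

# A model instance can hold the prefetch cache:
obj = FakeModelInstance()
obj._prefetched_objects_cache = {"labeledshapeattributeval_set": []}

# After .values(), the queryset yields plain dicts, and setting an
# attribute on a dict raises AttributeError, so the cache is never stored:
row = {"id": 1}
try:
    row._prefetched_objects_cache = {}
    prefetch_cache_stored = True
except AttributeError:
    prefetch_cache_stored = False

assert prefetch_cache_stored is False
```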

| Queryset | Time for retrieving annotations (min) | Number of queries | Memory usage (GB) | Chunk size |
| --- | --- | --- | --- | --- |
| original | 17.8 | 17 | 18.3 | default 100 |
| original + iterator | 18.3 | 17 | 5.6 | default 2000 |
| original - prefetch + iterator | 18 | 17 | 5.6 | default 2000 |
| original - prefetch + iterator | 18.5 | 17 | 5.6 | 10000 |

@azhavoro (Contributor) commented:

Also, please add a note to the changelog

@Marishka17 Marishka17 requested a review from nmanovic as a code owner April 22, 2024 11:40
@azhavoro azhavoro merged commit 2c64721 into develop Apr 23, 2024
32 checks passed
@Marishka17 Marishka17 deleted the mk/optimize-retrieving-annotations branch April 23, 2024 10:00
@cvat-bot mentioned this pull request Apr 26, 2024