
Optimize the code for retrieving annotations #7748

Merged: 6 commits into develop, Apr 23, 2024

Conversation

@Marishka17 (Contributor) commented Apr 10, 2024

Motivation and context

After analyzing the current implementation, it turned out that we evaluate the queryset and iterate over it only once, when merging table rows and initializing the custom structure for storing objects. This means we can disable Django's default internal queryset caching. This approach reduces the amount of memory used when there is a large number of objects.
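The effect of disabling queryset caching can be sketched in plain Python (hypothetical row shape; `QuerySet.iterator()` does the analogous thing at the database-cursor level):

```python
import sys

def fetch_rows(n):
    """Simulate a database cursor that streams rows one at a time."""
    for i in range(n):
        yield {"id": i, "frame": i % 100}

# Django's default behaviour caches the evaluated queryset,
# so the whole result set is held in memory at once:
cached = list(fetch_rows(100_000))

# QuerySet.iterator() streams rows instead; this is safe here
# because the merging code iterates over the rows exactly once:
count = sum(1 for _ in fetch_rows(100_000))

assert count == len(cached)
# The generator object itself stays tiny no matter how many rows it yields:
assert sys.getsizeof(fetch_rows(100_000)) < sys.getsizeof(cached)
```

In the Django code itself the change amounts to iterating with `rows.iterator()` (optionally passing a `chunk_size`) instead of `for row in rows`.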

How has this been tested?

I've checked the amount of memory used, the number of queries to the database, and the required time.

Without iterator:
Memory usage: 18GB

Profile details
filename: /home/maya/Documents/cvat/cvat/apps/dataset_manager/task.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    116    290.7 MiB    290.7 MiB           1   @profile
    117                                         def _merge_table_rows(rows, keys_for_merge, field_id):
    118                                             # It is necessary to keep a stable order of original rows
    119                                             # (e.g. for tracked boxes). Otherwise prev_box.frame can be bigger
    120                                             # than next_box.frame.
    121    290.7 MiB      0.0 MiB           1       merged_rows = OrderedDict()
    122                                         
    123                                             # Group all rows by field_id. In grouped rows replace fields in
    124                                             # accordance with keys_for_merge structure.
    125  19053.0 MiB  17140.9 MiB     1721804       for row in rows: #.iterator():
    126  19053.0 MiB   -567.5 MiB     1721803           row_id = row[field_id]
    127  19053.0 MiB   -567.6 MiB     1721803           if not row_id in merged_rows:
    128  19053.0 MiB    127.0 MiB      373063               merged_rows[row_id] = dotdict(row)
    129  19053.0 MiB   -162.9 MiB      746126               for key in keys_for_merge:
    130  19053.0 MiB   -111.0 MiB      373063                   merged_rows[row_id][key] = []
    131                                         
    132  19053.0 MiB  -1121.3 MiB     3443606           for key in keys_for_merge:
    133  19053.0 MiB  -2753.5 MiB    10330818               item = dotdict({v.split('__', 1)[-1]:row[v] for v in keys_for_merge[key]})
    134  19053.0 MiB   -485.9 MiB     1721803               if item.id is not None:
    135  19053.0 MiB   -525.5 MiB     1573530                   merged_rows[row_id][key].append(item)
    136                                         
    137                                             # Remove redundant keys from final objects
    138  19053.0 MiB      0.0 MiB           7       redundant_keys = [item for values in keys_for_merge.values() for item in values]
    139  19053.0 MiB  -5728.4 MiB      373064       for i in merged_rows:
    140  19053.0 MiB -22913.7 MiB     1492252           for j in redundant_keys:
    141  19053.0 MiB -17185.2 MiB     1119189               del merged_rows[i][j]
    142                                         
    143  19052.9 MiB     -0.0 MiB           1       return list(merged_rows.values())
filename: /home/maya/Documents/cvat/cvat/apps/dataset_manager/task.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   749    290.7 MiB    290.7 MiB           1       @profile
   750                                             def init_from_db(self):
   751    290.7 MiB      0.0 MiB           1           self._init_tags_from_db()
   752  19052.3 MiB  18761.7 MiB           1           self._init_shapes_from_db()
   753  19052.3 MiB      0.0 MiB           1           self._init_tracks_from_db()
   754  19052.3 MiB      0.0 MiB           1           self._init_version_from_db()

Screenshot from 2024-04-10 13-57-33
With iterator:
Memory usage: 5.5GB

Profile details
filename: /home/maya/Documents/cvat/cvat/apps/dataset_manager/task.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   116    290.9 MiB    290.9 MiB           1   @profile
   117                                         def _merge_table_rows(rows, keys_for_merge, field_id):
   118                                             # It is necessary to keep a stable order of original rows
   119                                             # (e.g. for tracked boxes). Otherwise prev_box.frame can be bigger
   120                                             # than next_box.frame.
   121    290.9 MiB      0.0 MiB           1       merged_rows = OrderedDict()
   122                                         
   123                                             # Group all rows by field_id. In grouped rows replace fields in
   124                                             # accordance with keys_for_merge structure.
   125   4345.7 MiB   3783.7 MiB     1721804       for row in rows.iterator():
   126   4345.7 MiB     24.2 MiB     1721803           row_id = row[field_id]
   127   4345.7 MiB      0.0 MiB     1721803           if not row_id in merged_rows:
   128   4345.7 MiB     78.9 MiB      373063               merged_rows[row_id] = dotdict(row)
   129   4345.7 MiB      5.9 MiB      746126               for key in keys_for_merge:
   130   4345.7 MiB      0.0 MiB      373063                   merged_rows[row_id][key] = []
   131                                         
   132   4345.7 MiB      0.3 MiB     3443606           for key in keys_for_merge:
   133   4345.7 MiB    152.9 MiB    10330818               item = dotdict({v.split('__', 1)[-1]:row[v] for v in keys_for_merge[key]})
   134   4345.7 MiB      9.0 MiB     1721803               if item.id is not None:
   135   4345.7 MiB      0.0 MiB     1573530                   merged_rows[row_id][key].append(item)
   136                                         
   137                                             # Remove redundant keys from final objects
   138   4345.7 MiB      0.0 MiB           7       redundant_keys = [item for values in keys_for_merge.values() for item in values]
   139   4345.7 MiB      0.0 MiB      373064       for i in merged_rows:
   140   4345.7 MiB      0.0 MiB     1492252           for j in redundant_keys:
   141   4345.7 MiB      0.0 MiB     1119189               del merged_rows[i][j]
   142                                         
   143   4348.5 MiB      2.8 MiB           1       return list(merged_rows.values())
Filename: /home/maya/Documents/cvat/cvat/apps/dataset_manager/task.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   590    290.9 MiB    290.9 MiB           1       @profile
   591                                             def _init_shapes_from_db(self):
   592    290.9 MiB      0.0 MiB           3           db_shapes = self.db_job.labeledshape_set.prefetch_related(
   593    290.9 MiB      0.0 MiB           1               "label",
   594    290.9 MiB      0.0 MiB           1               "labeledshapeattributeval_set"
   595    290.9 MiB      0.0 MiB           2           ).values(
   596    290.9 MiB      0.0 MiB           1               'id',
   597    290.9 MiB      0.0 MiB           1               'label_id',
   598    290.9 MiB      0.0 MiB           1               'type',
   599    290.9 MiB      0.0 MiB           1               'frame',
   600    290.9 MiB      0.0 MiB           1               'group',
   601    290.9 MiB      0.0 MiB           1               'source',
   602    290.9 MiB      0.0 MiB           1               'occluded',
   603    290.9 MiB      0.0 MiB           1               'outside',   
   604    290.9 MiB      0.0 MiB           1               'z_order',
   605    290.9 MiB      0.0 MiB           1               'rotation',
   606    290.9 MiB      0.0 MiB           1               'points',
   607    290.9 MiB      0.0 MiB           1               'parent',
   608    290.9 MiB      0.0 MiB           1               'labeledshapeattributeval__spec_id',
   609    290.9 MiB      0.0 MiB           1               'labeledshapeattributeval__value',
   610    290.9 MiB      0.0 MiB           1               'labeledshapeattributeval__id',
   611    290.9 MiB      0.0 MiB           1               ).order_by('frame')                         
   618   4328.6 MiB   4037.7 MiB       2           db_shapes = _merge_table_rows(
   619    290.9 MiB      0.0 MiB           1               rows=db_shapes,
   620    290.9 MiB      0.0 MiB           1               keys_for_merge={
   621    290.9 MiB      0.0 MiB           1                   'labeledshapeattributeval_set': [
   622                                                                       'labeledshapeattributeval__spec_id',
   623                                                                       'labeledshapeattributeval__value',
   624                                                                       'labeledshapeattributeval__id',
   625                                                                  ],
   626                                                                  },
   627    290.9 MiB      0.0 MiB           1               field_id='id',
   628                                                 )
   649   4385.6 MiB      0.0 MiB           1           serializer = serializers.LabeledShapeSerializerFromDB(list(shapes.values()), many=True)
   650   5990.2 MiB   1604.6 MiB           1           self.ir_data.shapes = serializer.data
Filename: /home/maya/Documents/cvat/cvat/apps/dataset_manager/task.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   749    290.6 MiB    290.6 MiB           1       @profile
   750                                             def init_from_db(self):
   751    290.9 MiB      0.3 MiB           1           self._init_tags_from_db()
   752   5990.2 MiB   5699.3 MiB           1           self._init_shapes_from_db()
   753   5990.2 MiB      0.0 MiB           1           self._init_tracks_from_db()
   754   5990.2 MiB      0.0 MiB           1           self._init_version_from_db()

Screenshot from 2024-04-10 12-36-09

Checklist

- [x] I submit my changes into the develop branch
- [x] I have created a changelog fragment
- [ ] I have updated the documentation accordingly
- [ ] I have added tests to cover my changes
- [ ] I have linked related issues (see GitHub docs)
- [ ] I have increased versions of npm packages if it is necessary
  (cvat-canvas, cvat-core, cvat-data and cvat-ui)

License

- [x] I submit my code changes under the same MIT License that covers the project.
  Feel free to contact the maintainers if that's a concern.

@Marishka17 Marishka17 requested a review from azhavoro April 12, 2024 12:02
@Marishka17 Marishka17 marked this pull request as ready for review April 12, 2024 12:02
@Marishka17 (Contributor, Author) commented Apr 18, 2024

@azhavoro, @zhiltsov-max
I've re-measured the metrics. Besides that, I figured out that the prefetch_related used here was useless because of Django's implementation. To put it briefly, prefetch_related(...) will not work if values(...) is used: the prefetch cache is saved to the _prefetched_objects_cache attribute of each model instance, but after values(...) the queryset's _iterable_class is ValuesIterable, so the queryset yields dicts, and it is not possible to set an attribute on a dict. (Link to the code block)
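The incompatibility boils down to a plain Python fact: model instances can carry the prefetch cache as an attribute, while the dicts that `.values()` yields cannot. A minimal illustration (the class is a stand-in, not Django code; only the attribute name mirrors Django's internals):

```python
class FakeModelInstance:
    """Stands in for a Django model instance, which supports attribute assignment."""

# A model instance can hold the prefetch cache:
obj = FakeModelInstance()
obj._prefetched_objects_cache = {"labeledshapeattributeval_set": []}

# After .values(), the queryset yields plain dicts, and setting an
# attribute on a dict raises AttributeError, so the cache is never stored:
row = {"id": 1}
try:
    row._prefetched_objects_cache = {}
    prefetch_cache_stored = True
except AttributeError:
    prefetch_cache_stored = False

assert prefetch_cache_stored is False
```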

| Queryset | Time for retrieving annotations (min) | Number of queries | Memory usage (GB) | Chunk size |
| --- | --- | --- | --- | --- |
| original | 17.8 | 17 | 18.3 | default 100 |
| original + iterator | 18.3 | 17 | 5.6 | default 2000 |
| original - prefetch + iterator | 18 | 17 | 5.6 | default 2000 |
| original - prefetch + iterator | 18.5 | 17 | 5.6 | 10000 |

@azhavoro (Contributor) commented:

Also, please add a note to the changelog

@Marishka17 Marishka17 requested a review from nmanovic as a code owner April 22, 2024 11:40
@azhavoro azhavoro merged commit 2c64721 into develop Apr 23, 2024
32 checks passed
@Marishka17 Marishka17 deleted the mk/optimize-retrieving-annotations branch April 23, 2024 10:00
@cvat-bot mentioned this pull request Apr 26, 2024