
Draft: Resolve "Accumulation processing leads to frame drop"

Closes #204

The proposed solution parallelizes the accumulation computation with OpenMP to reduce latency. Benchmarks compare the new implementation against the previous accumulation implementation (including the SSE2 variant) to check for performance regressions.
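As an illustration of the approach, here is a minimal sketch of a per-pixel accumulation loop parallelized with OpenMP. It is not the code from this merge request: the function name `accumulate`, the use of `std::vector`, and the 16-bit input / 32-bit accumulator types are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the Lima API): adds one 16-bit frame into a
// 32-bit accumulator, pixel by pixel.
// Build with -fopenmp (GCC/Clang) to enable the parallel loop; without it
// the pragma is ignored and the loop runs on a single thread.
void accumulate(const std::vector<std::uint16_t>& frame,
                std::vector<std::uint32_t>& acc)
{
    assert(frame.size() == acc.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(frame.size());

    // Each iteration touches a distinct pixel, so iterations are independent
    // and can be split across threads without synchronization.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i)
        acc[i] += frame[i];
}
```

A static schedule is used here because every pixel costs the same amount of work, so equal-sized chunks per thread should balance well; this is a design assumption, not something measured in the benchmarks of this merge request.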

The results on a Lima HPC computer (lbm18det02) are reported as implementation/image size/pixel type, where the pixel type is one of [2 = Bpp8, 4 = Bpp16, 6 = Bpp32].

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
new_accumulation/256/2        22643 ns        22643 ns        40380
new_accumulation/512/2        48078 ns        48078 ns        14654
new_accumulation/1024/2      251763 ns       251766 ns         2869
new_accumulation/2048/2     1451334 ns      1451329 ns          487
new_accumulation/4096/2    12922450 ns     12490339 ns           48
new_accumulation/8192/2    64303693 ns     64302455 ns           10
new_accumulation/256/4        22073 ns        22074 ns        31269
new_accumulation/512/4        46212 ns        46211 ns        15506
new_accumulation/1024/4      332070 ns       332074 ns         2099
new_accumulation/2048/4     1936410 ns      1936296 ns          361
new_accumulation/4096/4    15943537 ns     15943429 ns           42
new_accumulation/8192/4    68159469 ns     60116026 ns           10
new_accumulation/256/6        28222 ns        28221 ns        24431
new_accumulation/512/6        73269 ns        73267 ns         9666
new_accumulation/1024/6      485025 ns       485032 ns         1446
new_accumulation/2048/6     3106539 ns      3106583 ns          224
new_accumulation/4096/6    21079117 ns     21078442 ns           33
new_accumulation/8192/6    91660830 ns     91596113 ns            8
old_accumulation/256/2        31891 ns        31889 ns        22046
old_accumulation/512/2       218290 ns       218286 ns         3208
old_accumulation/1024/2     1178676 ns      1178683 ns          594
old_accumulation/2048/2     7042587 ns      7042259 ns           99
old_accumulation/4096/2    34643647 ns     34635841 ns           20
old_accumulation/8192/2   141520690 ns    141520852 ns            5
old_accumulation/256/4        30143 ns        30141 ns        23446
old_accumulation/512/4       301168 ns       301163 ns         2316
old_accumulation/1024/4     1395829 ns      1395710 ns          501
old_accumulation/2048/4     8508620 ns      8508321 ns           81
old_accumulation/4096/4    39624346 ns     39619240 ns           18
old_accumulation/8192/4   163721332 ns    163714306 ns            4
old_accumulation/256/6        55187 ns        55184 ns        12692
old_accumulation/512/6       461330 ns       461325 ns         1519
old_accumulation/1024/6     1958700 ns      1958593 ns          357
old_accumulation/2048/6    11719967 ns     11719743 ns           59
old_accumulation/4096/6    52775585 ns     52768370 ns           13
old_accumulation/8192/6   215689088 ns    215688728 ns            3
mean_accumulation/256/2       63211 ns        63208 ns        10949
mean_accumulation/512/2      201760 ns       201749 ns         3509
mean_accumulation/1024/2     847474 ns       847445 ns          824
mean_accumulation/2048/2    3933608 ns      3933521 ns          177
mean_accumulation/4096/2   25083964 ns     25083968 ns           30
mean_accumulation/8192/2  102817259 ns    102792224 ns            6
mean_accumulation/256/4       78008 ns        77996 ns         8718
mean_accumulation/512/4      231844 ns       231788 ns         2733
mean_accumulation/1024/4    1025582 ns      1025511 ns          689
mean_accumulation/2048/4    4834133 ns      4832675 ns          132
mean_accumulation/4096/4   26618747 ns     26617998 ns           26
mean_accumulation/8192/4  115617255 ns    115402342 ns            6
mean_accumulation/256/6       69077 ns        69077 ns        10076
mean_accumulation/512/6      227790 ns       227790 ns         3103
mean_accumulation/1024/6    1096889 ns      1096796 ns          639
mean_accumulation/2048/6    5771428 ns      5770878 ns          114
mean_accumulation/4096/6   30447064 ns     30445621 ns           23
mean_accumulation/8192/6  130108632 ns    129381404 ns            5

Focusing on the BM18 Iris use case (16Mpx 30fps@16bit), 4096/4:

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
old_accumulation/4096/4    39624346 ns     39619240 ns           18 # single thread
new_accumulation/4096/4    38430731 ns     38428978 ns           18 # single thread
new_accumulation/4096/4    15943537 ns     15943429 ns           42 # multi thread
mean_accumulation/4096/4   81741949 ns     81728061 ns            8 # single thread
mean_accumulation/4096/4   26618747 ns     26617998 ns           26 # multi thread

The acquisition period is 30 ms. The accumulation "with mean" takes about 80 ms with a single thread and about 26 ms with 4 threads, so only the multi-threaded version keeps up with the acquisition.
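The comparison against the acquisition period can be written down explicitly. The sketch below takes the 4096/4 timings from the table above and the 30 ms period as given; the helper names `speedup` and `fits_in_period` are made up for this example.

```cpp
// Timings (wall time, ns) copied from the 4096/4 benchmark rows above.
constexpr double kNewSingleNs  = 38430731.0;
constexpr double kNewMultiNs   = 15943537.0;
constexpr double kMeanSingleNs = 81741949.0;
constexpr double kMeanMultiNs  = 26618747.0;

// Acquisition period stated in the description: 30 ms.
constexpr double kPeriodNs = 30.0e6;

// Hypothetical helper: speedup of the multi-threaded run over the
// single-threaded run.
constexpr double speedup(double single_ns, double multi_ns)
{
    return single_ns / multi_ns;
}

// Hypothetical helper: true when one accumulation completes within one
// acquisition period, i.e. processing does not fall behind the detector.
constexpr bool fits_in_period(double time_ns, double period_ns)
{
    return time_ns <= period_ns;
}

// Compile-time checks of the claims made in the text.
static_assert(!fits_in_period(kMeanSingleNs, kPeriodNs),
              "single-threaded mean accumulation exceeds the period");
static_assert(fits_in_period(kMeanMultiNs, kPeriodNs),
              "multi-threaded mean accumulation fits in the period");
static_assert(fits_in_period(kNewMultiNs, kPeriodNs),
              "multi-threaded accumulation fits in the period");
```

With these figures the mean accumulation gains roughly a factor of 3 from multi-threading, and the plain accumulation roughly a factor of 2.4.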

Edited by Samuel Debionne
