Draft: Resolve "Accumulation processing leads to frame drop"
Closes #204
The proposed solution is to parallelize the accumulation computation with OpenMP to reduce latency. Benchmarks compare the new implementation against the previous one (including the SSE2 variant) to check for performance regressions.
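To illustrate the idea, here is a minimal sketch of an element-wise accumulation loop parallelized with OpenMP. It is not the code from this merge request: the function name `accumulate`, the use of `std::vector`, and the choice of a wider accumulator type are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not Lima's API): adds one frame into an accumulator
// whose element type is wide enough to hold the running sum.
template <typename Pixel, typename Acc>
void accumulate(const std::vector<Pixel>& frame, std::vector<Acc>& acc)
{
    assert(acc.size() >= frame.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(frame.size());

    // Each iteration reads and writes a distinct element, so the loop can be
    // split across threads without synchronization. Without OpenMP enabled at
    // compile time the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        const std::size_t idx = static_cast<std::size_t>(i);
        acc[idx] += static_cast<Acc>(frame[idx]);
    }
}
```

With GCC or Clang the pragma takes effect when compiling with `-fopenmp`. The number of threads can then be set through the standard `OMP_NUM_THREADS` environment variable.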
The results on a Lima HPC computer (lbm18det02) are given below. Benchmark names follow the pattern implementation/image size/pixel type, with pixel type 2 = Bpp8, 4 = Bpp16, 6 = Bpp32.
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
new_accumulation/256/2        22643 ns        22643 ns        40380
new_accumulation/512/2        48078 ns        48078 ns        14654
new_accumulation/1024/2      251763 ns       251766 ns         2869
new_accumulation/2048/2     1451334 ns      1451329 ns          487
new_accumulation/4096/2    12922450 ns     12490339 ns           48
new_accumulation/8192/2    64303693 ns     64302455 ns           10
new_accumulation/256/4        22073 ns        22074 ns        31269
new_accumulation/512/4        46212 ns        46211 ns        15506
new_accumulation/1024/4      332070 ns       332074 ns         2099
new_accumulation/2048/4     1936410 ns      1936296 ns          361
new_accumulation/4096/4    15943537 ns     15943429 ns           42
new_accumulation/8192/4    68159469 ns     60116026 ns           10
new_accumulation/256/6        28222 ns        28221 ns        24431
new_accumulation/512/6        73269 ns        73267 ns         9666
new_accumulation/1024/6      485025 ns       485032 ns         1446
new_accumulation/2048/6     3106539 ns      3106583 ns          224
new_accumulation/4096/6    21079117 ns     21078442 ns           33
new_accumulation/8192/6    91660830 ns     91596113 ns            8
old_accumulation/256/2        31891 ns        31889 ns        22046
old_accumulation/512/2       218290 ns       218286 ns         3208
old_accumulation/1024/2     1178676 ns      1178683 ns          594
old_accumulation/2048/2     7042587 ns      7042259 ns           99
old_accumulation/4096/2    34643647 ns     34635841 ns           20
old_accumulation/8192/2   141520690 ns    141520852 ns            5
old_accumulation/256/4        30143 ns        30141 ns        23446
old_accumulation/512/4       301168 ns       301163 ns         2316
old_accumulation/1024/4     1395829 ns      1395710 ns          501
old_accumulation/2048/4     8508620 ns      8508321 ns           81
old_accumulation/4096/4    39624346 ns     39619240 ns           18
old_accumulation/8192/4   163721332 ns    163714306 ns            4
old_accumulation/256/6        55187 ns        55184 ns        12692
old_accumulation/512/6       461330 ns       461325 ns         1519
old_accumulation/1024/6     1958700 ns      1958593 ns          357
old_accumulation/2048/6    11719967 ns     11719743 ns           59
old_accumulation/4096/6    52775585 ns     52768370 ns           13
old_accumulation/8192/6   215689088 ns    215688728 ns            3
mean_accumulation/256/2       63211 ns        63208 ns        10949
mean_accumulation/512/2      201760 ns       201749 ns         3509
mean_accumulation/1024/2     847474 ns       847445 ns          824
mean_accumulation/2048/2    3933608 ns      3933521 ns          177
mean_accumulation/4096/2   25083964 ns     25083968 ns           30
mean_accumulation/8192/2  102817259 ns    102792224 ns            6
mean_accumulation/256/4       78008 ns        77996 ns         8718
mean_accumulation/512/4      231844 ns       231788 ns         2733
mean_accumulation/1024/4    1025582 ns      1025511 ns          689
mean_accumulation/2048/4    4834133 ns      4832675 ns          132
mean_accumulation/4096/4   26618747 ns     26617998 ns           26
mean_accumulation/8192/4  115617255 ns    115402342 ns            6
mean_accumulation/256/6       69077 ns        69077 ns        10076
mean_accumulation/512/6      227790 ns       227790 ns         3103
mean_accumulation/1024/6    1096889 ns      1096796 ns          639
mean_accumulation/2048/6    5771428 ns      5770878 ns          114
mean_accumulation/4096/6   30447064 ns     30445621 ns           23
mean_accumulation/8192/6  130108632 ns    129381404 ns            5
Focusing on the BM18 Iris use case (16 Mpx, 30 fps @ 16 bit), 4096/4:
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
old_accumulation/4096/4    39624346 ns     39619240 ns           18  # single thread
new_accumulation/4096/4    38430731 ns     38428978 ns           18  # single thread
new_accumulation/4096/4    15943537 ns     15943429 ns           42  # multi thread
mean_accumulation/4096/4   81741949 ns     81728061 ns            8  # single thread
mean_accumulation/4096/4   26618747 ns     26617998 ns           26  # multi thread
The acquisition period is 30 ms. The accumulation "with mean" takes about 80 ms with a single thread and about 26 ms with 4 threads, which fits within the acquisition period.
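The frame-budget argument can be checked with a few lines of arithmetic. The sketch below uses the 4096/4 timings from the table above and the 30 ms period quoted in this description. The names `fits_in_period` and `speedup` are hypothetical helpers introduced only for this example.

```cpp
// Timings (ns) copied from the 4096/4 benchmark rows; period as stated above.
constexpr double period_ns      = 30e6;
constexpr double old_single_ns  = 39624346;
constexpr double new_single_ns  = 38430731;
constexpr double new_multi_ns   = 15943537;
constexpr double mean_single_ns = 81741949;
constexpr double mean_multi_ns  = 26618747;

// True when processing one frame takes no longer than the acquisition period,
// i.e. the accumulation keeps up with the detector and no frame is dropped.
constexpr bool fits_in_period(double processing_ns, double period)
{
    return processing_ns <= period;
}

// Ratio of single-threaded to multi-threaded time.
constexpr double speedup(double single_ns, double multi_ns)
{
    return single_ns / multi_ns;
}

static_assert(!fits_in_period(old_single_ns, period_ns), "old: exceeds the period");
static_assert(fits_in_period(new_multi_ns, period_ns), "new, multi thread: fits");
static_assert(!fits_in_period(mean_single_ns, period_ns), "mean, single thread: exceeds");
static_assert(fits_in_period(mean_multi_ns, period_ns), "mean, multi thread: fits");
```

With these numbers, `speedup(new_single_ns, new_multi_ns)` is roughly 2.4 and `speedup(mean_single_ns, mean_multi_ns)` is roughly 3.1.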
Edited by Samuel Debionne