Analyzing missing vectorization speedup

Hello,

I am using the icc v16 compiler to vectorize this part of my program:

#pragma simd assert
for(int i=0; i<nParticlesInUse; i++) {
  if (particles[i].id == INVALID)
    continue;
  particles[i].ax = GRAVITY_X;
  particles[i].ay = GRAVITY_Y;

  for (int j=0; j<nParticlesInUse; j++) {
    if (particles[j].id != INVALID) {
      double dx = particles[j].x - particles[i].x;
      double dy = particles[j].y - particles[i].y;
      double r2 = dx * dx + dy * dy;
      if( r2 > cutoff*cutoff ) {

      } else {

        r2 = fmax( r2, min_r*min_r );
        double r = sqrt( r2 );
        double coef = ( 1 - cutoff / r ) / r2 / mass;
        particles[i].ax += coef * dx;
        particles[i].ay += coef * dy;

        if (particles[i].ax*particles[i].ax + particles[i].ay*particles[i].ay > 10000000) {
          particles[i].ax = 0;
          particles[i].ay = 0;
        }
      }
    }
  }
}

gprof indicates that this function accounts for by far the largest share of the run time.
The vectorization report estimates a potential speedup of 2.29x:

LOOP BEGIN at noc8x8.cpp(1813,6)
   remark #15328: vectorization support: gather was emulated for the variable this: strided by 14   [ noc8x8.cpp(1815,7) ]
   remark #15329: vectorization support: scatter was emulated for the variable this: masked, strided by 7   [ noc8x8.cpp(1817,3) ]
   remark #15329: vectorization support: scatter was emulated for the variable this: masked, strided by 7   [ noc8x8.cpp(1818,3) ]
   remark #15328: vectorization support: gather was emulated for the variable this: masked, strided by 7   [ noc8x8.cpp(1827,33) ]
   remark #15328: vectorization support: gather was emulated for the variable this: masked, strided by 7   [ noc8x8.cpp(1828,33) ]
   remark #15329: vectorization support: scatter was emulated for the variable this: masked, strided by 7   [ noc8x8.cpp(1840,5) ]
   remark #15328: vectorization support: gather was emulated for the variable this: masked, strided by 7   [ noc8x8.cpp(1840,5) ]
   remark #15329: vectorization support: scatter was emulated for the variable this: masked, strided by 7   [ noc8x8.cpp(1841,5) ]
   remark #15328: vectorization support: gather was emulated for the variable this: masked, strided by 7   [ noc8x8.cpp(1841,5) ]
   remark #15328: vectorization support: gather was emulated for the variable this: masked, strided by 7   [ noc8x8.cpp(1843,9) ]
   remark #15328: vectorization support: gather was emulated for the variable this: masked, strided by 7   [ noc8x8.cpp(1843,25) ]
   remark #15328: vectorization support: gather was emulated for the variable this: masked, strided by 7   [ noc8x8.cpp(1843,43) ]
   remark #15328: vectorization support: gather was emulated for the variable this: masked, strided by 7   [ noc8x8.cpp(1843,59) ]
   remark #15329: vectorization support: scatter was emulated for the variable this: masked, strided by 7   [ noc8x8.cpp(1845,6) ]
   remark #15329: vectorization support: scatter was emulated for the variable this: masked, strided by 7   [ noc8x8.cpp(1846,6) ]
   remark #15305: vectorization support: vector length 4
   remark #15309: vectorization support: normalized vectorization overhead 0.019
   remark #15301: SIMD LOOP WAS VECTORIZED
   remark #15452: unmasked strided loads: 8
   remark #15453: unmasked strided stores: 6
   remark #15460: masked strided loads: 1
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 679
   remark #15477: vector loop cost: 296.000
   remark #15478: estimated potential speedup: 2.290
   remark #15482: vectorized math library calls: 1
   remark #15488: --- end vector loop cost summary ---
   remark #25015: Estimate of max trip count of loop=15000
   LOOP BEGIN at noc8x8.cpp(1819,3)
      remark #25460: No loop optimizations reported
      remark #25015: Estimate of max trip count of loop=60000
   LOOP END
LOOP END

Unfortunately, vectorization actually slows down execution: 22.26 seconds without simd versus 24.28 seconds with simd.
Can somebody give me some pointers on the best practice for debugging this behavior?
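
One experiment that could be tried here, as a minimal sketch only: the report lists emulated gathers and scatters "strided by 7" and "strided by 14", which suggests that the array-of-structures layout of particles is forcing non-unit-stride memory access. The version below stores each field in its own contiguous array and accumulates the acceleration in local variables, so the inner loop reads x and y with unit stride. The names ParticlesSoA and computeAccelerations, the value of INVALID, and passing the constants as parameters are all assumptions made for illustration; they are not taken from noc8x8.cpp. No simd pragma is included, since which loop to vectorize is exactly what would need measuring.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sentinel; the real value of INVALID is not shown in the post.
const int INVALID = -1;

// Hypothetical structure-of-arrays layout: one contiguous array per field,
// so consecutive particles are adjacent in memory for each field.
struct ParticlesSoA {
    std::vector<int> id;
    std::vector<double> x, y, ax, ay;
};

// Same arithmetic as the loop in the post, with the constants passed in
// as parameters (an assumption, to keep the sketch self-contained).
void computeAccelerations(ParticlesSoA& p, double cutoff, double min_r,
                          double mass, double gravity_x, double gravity_y) {
    const std::size_t n = p.id.size();
    const double cutoff2 = cutoff * cutoff;
    const double min_r2 = min_r * min_r;

    for (std::size_t i = 0; i < n; i++) {
        if (p.id[i] == INVALID)
            continue;

        // Accumulate in locals instead of writing particles[i] on every j.
        double ax = gravity_x;
        double ay = gravity_y;
        const double xi = p.x[i];
        const double yi = p.y[i];

        for (std::size_t j = 0; j < n; j++) {
            if (p.id[j] == INVALID)
                continue;
            const double dx = p.x[j] - xi;  // unit-stride load in j
            const double dy = p.y[j] - yi;
            double r2 = dx * dx + dy * dy;
            if (r2 > cutoff2)
                continue;
            r2 = std::fmax(r2, min_r2);
            const double r = std::sqrt(r2);
            const double coef = (1 - cutoff / r) / r2 / mass;
            ax += coef * dx;
            ay += coef * dy;
            // This reset depends on the running sum, so it is a
            // loop-carried dependency across j.
            if (ax * ax + ay * ay > 10000000) {
                ax = 0;
                ay = 0;
            }
        }
        p.ax[i] = ax;
        p.ay[i] = ay;
    }
}
```

Timing this against the original with and without vectorization would show whether the memory layout is the limiting factor. Note that the reset of ax and ay inside the inner loop makes each iteration depend on the previous one, which may by itself limit what the compiler can do with that loop.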

Best regards,

Tim
