The following code should produce a rather simple output of increasing real parts and decreasing imaginary parts, something along the lines of:
0+0i
1-1i
2-2i
3-3i
4-4i
5-5i
6-6i
etc...
99-99i
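For comparing a run against this expected output, a small reference check could look like the sketch below. It is only an illustration: it uses std::complex<double> as a stand-in for MKL_Complex16, and the helper names expected_result and mismatched_indices are hypothetical, not part of the code in question.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Reference values: element i is expected to be i - i*I.
std::vector<std::complex<double>> expected_result(std::size_t n) {
    std::vector<std::complex<double>> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double v = static_cast<double>(i);
        out[i] = std::complex<double>(v, -v);
    }
    return out;
}

// Returns the indices at which `got` differs from `want`
// (in the real part, the imaginary part, or both).
std::vector<std::size_t> mismatched_indices(
        const std::vector<std::complex<double>>& got,
        const std::vector<std::complex<double>>& want) {
    assert(got.size() == want.size());
    std::vector<std::size_t> bad;
    for (std::size_t i = 0; i < got.size(); ++i) {
        if (got[i] != want[i]) bad.push_back(i);
    }
    return bad;
}
```

Copying M into a std::vector<std::complex<double>> and passing it together with expected_result(100) would list exactly the indices that came out wrong.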
The code in question:
#include <iostream>
#include <iomanip>
#include <new>
#include <omp.h>
#include <mkl/mkl.h>

using namespace std;

typedef long long ll;

ostream& operator<<(ostream& lhs, MKL_Complex16& rhs)
{
    lhs << fixed << setprecision(20) << rhs.real << '+'
        << fixed << setprecision(20) << rhs.imag << "*I";
    return lhs;
}

int main(void)
{
    omp_set_num_threads(8);

    volatile ll ena = 1;
    //#define ena 1

    ll temp_it = 100;
    ll ld = 2;

    MKL_Complex16** temp1 = new MKL_Complex16*[ld + ld]();
    for (ll i = 0; i < ld; i++) {
        temp1[ld + i] = (MKL_Complex16*)mkl_calloc(temp_it, sizeof(MKL_Complex16), 64);
        for (ll j = 0; j < temp_it; j++) {
            temp1[ld + i][j].real = i * j;
            temp1[ld + i][j].imag = -i * j;
        }
    }

    MKL_Complex16* M = (MKL_Complex16*)mkl_calloc(temp_it, sizeof(MKL_Complex16), 64);

    #pragma omp parallel for
    for (ll i = 0; i < temp_it; i++) {
        M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
        M[i].imag = temp1[ld + 0][i].imag + temp1[ld + ena][i].imag;
    }

    for (ll j = 0; j < ld; j++) {
        mkl_free(temp1[ld + j]);
    }
    delete[] temp1;

    for (ll i = 0; i < temp_it; i++) {
        cout << M[i] << endl;
    }

    mkl_free(M);
    system("pause");
    return 0;
}
This is indeed the case with the code shown above. The generated assembly for the addition is then:
for (ll i = 0; i < temp_it; i++) {
00007FF769FB1530  inc         r11
    M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
00007FF769FB1533  mov         rbp,qword ptr [r10]
00007FF769FB1536  shl         rbp,3
00007FF769FB153A  movsd       xmm0,mmword ptr [rcx+r9]
00007FF769FB1540  lea         rbp,[rbp+r14*8]
00007FF769FB1545  mov         rbp,qword ptr [r8+rbp]
00007FF769FB1549  addsd       xmm0,mmword ptr [rbp+r9]
00007FF769FB1550  movsd       mmword ptr [rax+r9],xmm0
    M[i].imag = temp1[ld + 0][i].imag + temp1[ld + ena][i].imag;
00007FF769FB1556  mov         rbp,qword ptr [r10]
00007FF769FB1559  shl         rbp,3
00007FF769FB155D  movsd       xmm1,mmword ptr [rcx+r9+8]
00007FF769FB1564  lea         rbp,[rbp+r14*8]
00007FF769FB1569  mov         rbp,qword ptr [r8+rbp]
00007FF769FB156D  addsd       xmm1,mmword ptr [rbp+r9+8]
00007FF769FB1574  movsd       mmword ptr [rax+r9+8],xmm1
That looks about right, as is the result.
However, assume we swap which definition of "ena" is commented out, like so:
//volatile ll ena = 1;
#define ena 1
Doing so means that the compiler no longer has to check the value of "ena" at every step, and the code should be ever so slightly faster, but yield the same result. Instead, we get the following:
0.00000000000000000000+0.00000000000000000000*I
0.00000000000000000000+-1.00000000000000000000*I
0.00000000000000000000+-2.00000000000000000000*I
0.00000000000000000000+-3.00000000000000000000*I
0.00000000000000000000+-4.00000000000000000000*I
0.00000000000000000000+-5.00000000000000000000*I
0.00000000000000000000+-6.00000000000000000000*I
0.00000000000000000000+-7.00000000000000000000*I
8.00000000000000000000+-8.00000000000000000000*I
9.00000000000000000000+-9.00000000000000000000*I
10.00000000000000000000+-10.00000000000000000000*I
11.00000000000000000000+-11.00000000000000000000*I
12.00000000000000000000+-12.00000000000000000000*I
0.00000000000000000000+-13.00000000000000000000*I
0.00000000000000000000+-14.00000000000000000000*I
0.00000000000000000000+-15.00000000000000000000*I
0.00000000000000000000+-16.00000000000000000000*I
0.00000000000000000000+-17.00000000000000000000*I
0.00000000000000000000+-18.00000000000000000000*I
0.00000000000000000000+-19.00000000000000000000*I
0.00000000000000000000+-20.00000000000000000000*I
21.00000000000000000000+-21.00000000000000000000*I
22.00000000000000000000+-22.00000000000000000000*I
23.00000000000000000000+-23.00000000000000000000*I
24.00000000000000000000+-24.00000000000000000000*I
25.00000000000000000000+-25.00000000000000000000*I
0.00000000000000000000+-26.00000000000000000000*I
0.00000000000000000000+-27.00000000000000000000*I
0.00000000000000000000+-28.00000000000000000000*I
0.00000000000000000000+-29.00000000000000000000*I
0.00000000000000000000+-30.00000000000000000000*I
0.00000000000000000000+-31.00000000000000000000*I
0.00000000000000000000+-32.00000000000000000000*I
0.00000000000000000000+-33.00000000000000000000*I
34.00000000000000000000+-34.00000000000000000000*I
35.00000000000000000000+-35.00000000000000000000*I
36.00000000000000000000+-36.00000000000000000000*I
37.00000000000000000000+-37.00000000000000000000*I
38.00000000000000000000+-38.00000000000000000000*I
0.00000000000000000000+-39.00000000000000000000*I
0.00000000000000000000+-40.00000000000000000000*I
0.00000000000000000000+-41.00000000000000000000*I
0.00000000000000000000+-42.00000000000000000000*I
0.00000000000000000000+-43.00000000000000000000*I
0.00000000000000000000+-44.00000000000000000000*I
0.00000000000000000000+-45.00000000000000000000*I
0.00000000000000000000+-46.00000000000000000000*I
47.00000000000000000000+-47.00000000000000000000*I
48.00000000000000000000+-48.00000000000000000000*I
49.00000000000000000000+-49.00000000000000000000*I
50.00000000000000000000+-50.00000000000000000000*I
51.00000000000000000000+-51.00000000000000000000*I
0.00000000000000000000+-52.00000000000000000000*I
0.00000000000000000000+-53.00000000000000000000*I
0.00000000000000000000+-54.00000000000000000000*I
0.00000000000000000000+-55.00000000000000000000*I
0.00000000000000000000+-56.00000000000000000000*I
0.00000000000000000000+-57.00000000000000000000*I
0.00000000000000000000+-58.00000000000000000000*I
0.00000000000000000000+-59.00000000000000000000*I
60.00000000000000000000+-60.00000000000000000000*I
61.00000000000000000000+-61.00000000000000000000*I
62.00000000000000000000+-62.00000000000000000000*I
63.00000000000000000000+-63.00000000000000000000*I
0.00000000000000000000+-64.00000000000000000000*I
0.00000000000000000000+-65.00000000000000000000*I
0.00000000000000000000+-66.00000000000000000000*I
0.00000000000000000000+-67.00000000000000000000*I
0.00000000000000000000+-68.00000000000000000000*I
0.00000000000000000000+-69.00000000000000000000*I
0.00000000000000000000+-70.00000000000000000000*I
0.00000000000000000000+-71.00000000000000000000*I
72.00000000000000000000+-72.00000000000000000000*I
73.00000000000000000000+-73.00000000000000000000*I
74.00000000000000000000+-74.00000000000000000000*I
75.00000000000000000000+-75.00000000000000000000*I
0.00000000000000000000+-76.00000000000000000000*I
0.00000000000000000000+-77.00000000000000000000*I
0.00000000000000000000+-78.00000000000000000000*I
0.00000000000000000000+-79.00000000000000000000*I
0.00000000000000000000+-80.00000000000000000000*I
0.00000000000000000000+-81.00000000000000000000*I
0.00000000000000000000+-82.00000000000000000000*I
0.00000000000000000000+-83.00000000000000000000*I
84.00000000000000000000+-84.00000000000000000000*I
85.00000000000000000000+-85.00000000000000000000*I
86.00000000000000000000+-86.00000000000000000000*I
87.00000000000000000000+-87.00000000000000000000*I
0.00000000000000000000+-88.00000000000000000000*I
0.00000000000000000000+-89.00000000000000000000*I
0.00000000000000000000+-90.00000000000000000000*I
0.00000000000000000000+-91.00000000000000000000*I
0.00000000000000000000+-92.00000000000000000000*I
0.00000000000000000000+-93.00000000000000000000*I
0.00000000000000000000+-94.00000000000000000000*I
0.00000000000000000000+-95.00000000000000000000*I
96.00000000000000000000+-96.00000000000000000000*I
97.00000000000000000000+-97.00000000000000000000*I
98.00000000000000000000+-98.00000000000000000000*I
99.00000000000000000000+-99.00000000000000000000*I
As we can see, we get a repeating pattern of 8 wrong answers (only the real parts are wrong) followed by 4 or 5 correct ones. Setting the number of threads to lower values only makes this worse, with more of the results being wrong: namely, 96 wrong followed by 4 correct. The calculation of the imaginary parts doesn't change much, although the real and imaginary computations (which start out in the same source loop) end up in two different loops. The following is the part that computes most of the real values (the wrong ones, not the ones that come out correct):
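The wrong/correct pattern can be reproduced with a little arithmetic. The sketch below is only a model under two assumptions that the post does not confirm: that the parallel loop uses a static schedule in which the first n % threads threads get one extra iteration, and that the faulty unrolled block consumes 8 elements per pass (the offsets 0h to 78h in the listing span eight 16-byte elements), with the leftover iterations of each thread going through a correct scalar path. The function name vector_path_mask is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Marks the iterations that would be handled by the unrolled vector block,
// ASSUMING a static schedule (the first n % threads threads get one extra
// iteration) and a vector block that consumes `width` elements per pass.
std::vector<bool> vector_path_mask(std::size_t n, std::size_t threads,
                                   std::size_t width) {
    assert(threads > 0 && width > 0);
    std::vector<bool> mask(n, false);
    const std::size_t base = n / threads;
    const std::size_t extra = n % threads;
    std::size_t start = 0;
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t len = base + (t < extra ? 1 : 0);
        const std::size_t vec = (len / width) * width;  // full passes only
        for (std::size_t k = 0; k < vec; ++k) {
            mask[start + k] = true;
        }
        start += len;  // the remaining len - vec iterations run scalar
    }
    return mask;
}
```

With n = 100, threads = 8 and width = 8, the chunks are 13, 13, 13, 13, 12, 12, 12, 12 iterations long, so each thread runs 8 iterations through the vector block and 5 or 4 through the scalar remainder, which matches the indices with a zero real part in the output above. With a single thread the same model gives 96 vectorized iterations and 4 scalar ones.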
for (ll i = 0; i < temp_it; i++) {
00007FF68B4D14F0  add         r13,8
    M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
00007FF68B4D14F4  mov         rdx,qword ptr [r8+rsi*8]
00007FF68B4D14F8  add         rdx,r14
00007FF68B4D14FB  movsd       xmm1,mmword ptr [rdx+r15]
00007FF68B4D1501  movhpd      xmm1,qword ptr [rdx+r15+8]
00007FF68B4D1508  movsd       xmm0,mmword ptr [rdx+r15+10h]
00007FF68B4D150F  movaps      xmm2,xmm1
00007FF68B4D1512  movhpd      xmm0,qword ptr [rdx+r15+18h]
00007FF68B4D1519  unpcklpd    xmm2,xmm0
00007FF68B4D151D  unpckhpd    xmm1,xmm0
00007FF68B4D1521  addpd       xmm2,xmm1
00007FF68B4D1525  movsd       mmword ptr [r15+r11],xmm2
00007FF68B4D152B  movhpd      qword ptr [r15+r11+10h],xmm2
00007FF68B4D1532  mov         rdx,qword ptr [r8+rsi*8]
00007FF68B4D1536  add         rdx,r14
00007FF68B4D1539  movsd       xmm4,mmword ptr [rdx+r15+20h]
00007FF68B4D1540  movhpd      xmm4,qword ptr [rdx+r15+28h]
00007FF68B4D1547  movsd       xmm3,mmword ptr [rdx+r15+30h]
00007FF68B4D154E  movaps      xmm5,xmm4
00007FF68B4D1551  movhpd      xmm3,qword ptr [rdx+r15+38h]
00007FF68B4D1558  unpcklpd    xmm5,xmm3
00007FF68B4D155C  unpckhpd    xmm4,xmm3
00007FF68B4D1560  addpd       xmm5,xmm4
00007FF68B4D1564  movsd       mmword ptr [r15+r11+20h],xmm5
00007FF68B4D156B  movhpd      qword ptr [r15+r11+30h],xmm5
00007FF68B4D1572  mov         rdx,qword ptr [r8+rsi*8]
00007FF68B4D1576  add         rdx,r14
00007FF68B4D1579  movsd       xmm1,mmword ptr [rdx+r15+40h]
00007FF68B4D1580  movhpd      xmm1,qword ptr [rdx+r15+48h]
00007FF68B4D1587  movsd       xmm0,mmword ptr [rdx+r15+50h]
00007FF68B4D158E  movaps      xmm2,xmm1
00007FF68B4D1591  movhpd      xmm0,qword ptr [rdx+r15+58h]
00007FF68B4D1598  unpcklpd    xmm2,xmm0
00007FF68B4D159C  unpckhpd    xmm1,xmm0
00007FF68B4D15A0  addpd       xmm2,xmm1
00007FF68B4D15A4  movsd       mmword ptr [r15+r11+40h],xmm2
00007FF68B4D15AB  movhpd      qword ptr [r15+r11+50h],xmm2
00007FF68B4D15B2  mov         rdx,qword ptr [r8+rsi*8]
00007FF68B4D15B6  add         rdx,r14
00007FF68B4D15B9  movsd       xmm4,mmword ptr [rdx+r15+60h]
00007FF68B4D15C0  movhpd      xmm4,qword ptr [rdx+r15+68h]
00007FF68B4D15C7  movsd       xmm3,mmword ptr [rdx+r15+70h]
00007FF68B4D15CE  movaps      xmm0,xmm4
00007FF68B4D15D1  movhpd      xmm3,qword ptr [rdx+r15+78h]
00007FF68B4D15D8  unpcklpd    xmm0,xmm3
00007FF68B4D15DC  unpckhpd    xmm4,xmm3
00007FF68B4D15E0  addpd       xmm0,xmm4
00007FF68B4D15E4  movsd       mmword ptr [r15+r11+60h],xmm0
00007FF68B4D15EB  movhpd      qword ptr [r15+r11+70h],xmm0
To the best of my understanding, the loop is unrolled into four groups of instructions, each handling two elements, although rdx is reloaded before every group (there's no reason to do that, the array isn't going anywhere). However, the mind-blowing thing is what a single group does (I might have gotten something wrong, but the result is consistent with my understanding of this code).
It starts off by loading one element of temp1[0] or temp1[1] (I can't quite tell which one) into xmm1, with the real part in the lower half and the imaginary part in the upper half. So far so good. It then loads the next element of THE SAME ARRAY into xmm0 in the same way, which is no longer good: each of the two registers now holds a complex number, but not the ones we'd want. It then copies xmm1 into xmm2. The next step is somewhat amazing: the two unpack instructions separate the values, so that xmm2 now holds both real parts (the one from xmm1 in the lower half and the one from xmm0 in the upper half) and xmm1 holds both imaginary parts in the same order. It then adds the two registers (so, real + imaginary, something that rarely makes sense; the result is clearly zero because of the way we filled the initial arrays) and stores the two sums into the real parts of two consecutive elements of M.
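To make the walk-through easier to follow, here is a plain C++ model of one such group. It is a sketch rather than the compiler's actual semantics in full: the Xmm struct, the Complex16 stand-in for MKL_Complex16 and the function replay_group are hypothetical names, and the lane behaviour of unpcklpd, unpckhpd and addpd is modelled as described in the x86 instruction reference (low lane first).

```cpp
#include <cassert>

// Plain C++ model of a 128-bit register holding two doubles.
struct Xmm {
    double lo;
    double hi;
};

// unpcklpd dst, src: dst = { dst.lo, src.lo }
Xmm unpcklpd(Xmm dst, Xmm src) { return Xmm{dst.lo, src.lo}; }

// unpckhpd dst, src: dst = { dst.hi, src.hi }
Xmm unpckhpd(Xmm dst, Xmm src) { return Xmm{dst.hi, src.hi}; }

// addpd dst, src: lane-wise addition
Xmm addpd(Xmm dst, Xmm src) { return Xmm{dst.lo + src.lo, dst.hi + src.hi}; }

// Hypothetical stand-in for MKL_Complex16.
struct Complex16 {
    double real;
    double imag;
};

// Replays one group of the listing for the consecutive elements a[i], a[i+1].
// The returned lanes are what gets stored to M[i].real and M[i+1].real.
Xmm replay_group(const Complex16* a, int i) {
    assert(a != nullptr && i >= 0);
    Xmm xmm1{a[i].real, a[i].imag};          // movsd + movhpd
    Xmm xmm0{a[i + 1].real, a[i + 1].imag};  // movsd + movhpd
    Xmm xmm2 = xmm1;                         // movaps
    xmm2 = unpcklpd(xmm2, xmm0);             // { real[i], real[i+1] }
    xmm1 = unpckhpd(xmm1, xmm0);             // { imag[i], imag[i+1] }
    return addpd(xmm2, xmm1);                // { real[i]+imag[i], real[i+1]+imag[i+1] }
}
```

Feeding it two elements filled the way the code in question fills temp1[ld + 1], for example {3, -3} and {4, -4}, yields 0 in both lanes, which is the zero real part seen in the output. Whichever register is labelled as holding the real or the imaginary parts, the sum is the same.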
This was tested on i7-4720HQ with the following command line:
/MP /GS- /Qopenmp-offload:gfx /Qopenmp /GA /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc140.pdb" /Quse-intel-optimized-headers /Qgpu-arch=haswell /D "_MBCS" /Qipo /Qoffload-arch:haswell:visa3.1 /Zc:forScope /Qopt-matmul /Oi /MT /Fa"x64\Release\" /EHsc /nologo /Qparallel /Fo"x64\Release\" /Qprof-dir "x64\Release\" /Ot /Fp"x64\Release\DMRG.pch"
Adding AVX2 support simply vectorizes some of these operations, but the logic behind them (or rather the lack thereof) remains the same, as does the result.
Changing the order within the code (imaginary first) also has no effect on the end results.
For reference, I am using Windows 10 with Visual Studio 2015 Update 3 and Intel Parallel Studio XE Update 3 (so, the latest compiler).
Thank you for your time.