Compiler bug - OpenMP

The following code should produce a rather simple output of increasing real parts and decreasing imaginary parts, something along the lines of:
0+0i
1-1i
2-2i
3-3i
4-4i
5-5i
6-6i
etc...
99-99i

The code in question:

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <new>
#include <omp.h>
#include <mkl/mkl.h>

using namespace std;

typedef long long ll;

ostream& operator<<(ostream& lhs, MKL_Complex16& rhs) {
	lhs << fixed << setprecision(20) << rhs.real << '+'<< fixed << setprecision(20) << rhs.imag << "*I";
	return lhs;
}

int main(void) {
	omp_set_num_threads(8);
	volatile ll ena = 1;
//#define ena 1
	ll temp_it = 100;
	ll ld = 2;
	MKL_Complex16** temp1 = new MKL_Complex16*[ld + ld]();
	for (ll i = 0; i < ld; i++) {
		temp1[ld + i] = (MKL_Complex16*)mkl_calloc(temp_it, sizeof(MKL_Complex16), 64);
		for (ll j = 0; j < temp_it; j++) {
			temp1[ld + i][j].real = i * j;
			temp1[ld + i][j].imag = -i * j;
		}
	}
	MKL_Complex16* M = (MKL_Complex16*)mkl_calloc(temp_it, sizeof(MKL_Complex16), 64);
#pragma omp parallel for
	for (ll i = 0; i < temp_it; i++) {
		M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
		M[i].imag = temp1[ld + 0][i].imag + temp1[ld + ena][i].imag;
	}
	for (ll j = 0; j < ld; j++) {
		mkl_free(temp1[ld + j]);
	}
	delete[] temp1;
	for (ll i = 0; i < temp_it; i++) {
		cout << M[i] << endl;
	}
	mkl_free(M);
	system("pause");
	return 0;
}

This is indeed the case with the code shown above. The generated assembly for the addition is:

	for (ll i = 0; i < temp_it; i++) {
00007FF769FB1530  inc         r11
		M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
00007FF769FB1533  mov         rbp,qword ptr [r10]
00007FF769FB1536  shl         rbp,3
00007FF769FB153A  movsd       xmm0,mmword ptr [rcx+r9]
00007FF769FB1540  lea         rbp,[rbp+r14*8]
00007FF769FB1545  mov         rbp,qword ptr [r8+rbp]
00007FF769FB1549  addsd       xmm0,mmword ptr [rbp+r9]
00007FF769FB1550  movsd       mmword ptr [rax+r9],xmm0
		M[i].imag = temp1[ld + 0][i].imag + temp1[ld + ena][i].imag;
00007FF769FB1556  mov         rbp,qword ptr [r10]
00007FF769FB1559  shl         rbp,3
00007FF769FB155D  movsd       xmm1,mmword ptr [rcx+r9+8]
00007FF769FB1564  lea         rbp,[rbp+r14*8]
00007FF769FB1569  mov         rbp,qword ptr [r8+rbp]
00007FF769FB156D  addsd       xmm1,mmword ptr [rbp+r9+8]
00007FF769FB1574  movsd       mmword ptr [rax+r9+8],xmm1

That looks about right, as is the result.

However, suppose we swap which of the two definitions of "ena" is commented out, like so:

	//volatile ll ena = 1;
#define ena 1

Doing so means that the compiler no longer has to check the value of "ena" at every step, and the code should be ever so slightly faster, but yield the same result. Instead, we get the following:

0.00000000000000000000+0.00000000000000000000*I
0.00000000000000000000+-1.00000000000000000000*I
0.00000000000000000000+-2.00000000000000000000*I
0.00000000000000000000+-3.00000000000000000000*I
0.00000000000000000000+-4.00000000000000000000*I
0.00000000000000000000+-5.00000000000000000000*I
0.00000000000000000000+-6.00000000000000000000*I
0.00000000000000000000+-7.00000000000000000000*I
8.00000000000000000000+-8.00000000000000000000*I
9.00000000000000000000+-9.00000000000000000000*I
10.00000000000000000000+-10.00000000000000000000*I
11.00000000000000000000+-11.00000000000000000000*I
12.00000000000000000000+-12.00000000000000000000*I
0.00000000000000000000+-13.00000000000000000000*I
0.00000000000000000000+-14.00000000000000000000*I
0.00000000000000000000+-15.00000000000000000000*I
0.00000000000000000000+-16.00000000000000000000*I
0.00000000000000000000+-17.00000000000000000000*I
0.00000000000000000000+-18.00000000000000000000*I
0.00000000000000000000+-19.00000000000000000000*I
0.00000000000000000000+-20.00000000000000000000*I
21.00000000000000000000+-21.00000000000000000000*I
22.00000000000000000000+-22.00000000000000000000*I
23.00000000000000000000+-23.00000000000000000000*I
24.00000000000000000000+-24.00000000000000000000*I
25.00000000000000000000+-25.00000000000000000000*I
0.00000000000000000000+-26.00000000000000000000*I
0.00000000000000000000+-27.00000000000000000000*I
0.00000000000000000000+-28.00000000000000000000*I
0.00000000000000000000+-29.00000000000000000000*I
0.00000000000000000000+-30.00000000000000000000*I
0.00000000000000000000+-31.00000000000000000000*I
0.00000000000000000000+-32.00000000000000000000*I
0.00000000000000000000+-33.00000000000000000000*I
34.00000000000000000000+-34.00000000000000000000*I
35.00000000000000000000+-35.00000000000000000000*I
36.00000000000000000000+-36.00000000000000000000*I
37.00000000000000000000+-37.00000000000000000000*I
38.00000000000000000000+-38.00000000000000000000*I
0.00000000000000000000+-39.00000000000000000000*I
0.00000000000000000000+-40.00000000000000000000*I
0.00000000000000000000+-41.00000000000000000000*I
0.00000000000000000000+-42.00000000000000000000*I
0.00000000000000000000+-43.00000000000000000000*I
0.00000000000000000000+-44.00000000000000000000*I
0.00000000000000000000+-45.00000000000000000000*I
0.00000000000000000000+-46.00000000000000000000*I
47.00000000000000000000+-47.00000000000000000000*I
48.00000000000000000000+-48.00000000000000000000*I
49.00000000000000000000+-49.00000000000000000000*I
50.00000000000000000000+-50.00000000000000000000*I
51.00000000000000000000+-51.00000000000000000000*I
0.00000000000000000000+-52.00000000000000000000*I
0.00000000000000000000+-53.00000000000000000000*I
0.00000000000000000000+-54.00000000000000000000*I
0.00000000000000000000+-55.00000000000000000000*I
0.00000000000000000000+-56.00000000000000000000*I
0.00000000000000000000+-57.00000000000000000000*I
0.00000000000000000000+-58.00000000000000000000*I
0.00000000000000000000+-59.00000000000000000000*I
60.00000000000000000000+-60.00000000000000000000*I
61.00000000000000000000+-61.00000000000000000000*I
62.00000000000000000000+-62.00000000000000000000*I
63.00000000000000000000+-63.00000000000000000000*I
0.00000000000000000000+-64.00000000000000000000*I
0.00000000000000000000+-65.00000000000000000000*I
0.00000000000000000000+-66.00000000000000000000*I
0.00000000000000000000+-67.00000000000000000000*I
0.00000000000000000000+-68.00000000000000000000*I
0.00000000000000000000+-69.00000000000000000000*I
0.00000000000000000000+-70.00000000000000000000*I
0.00000000000000000000+-71.00000000000000000000*I
72.00000000000000000000+-72.00000000000000000000*I
73.00000000000000000000+-73.00000000000000000000*I
74.00000000000000000000+-74.00000000000000000000*I
75.00000000000000000000+-75.00000000000000000000*I
0.00000000000000000000+-76.00000000000000000000*I
0.00000000000000000000+-77.00000000000000000000*I
0.00000000000000000000+-78.00000000000000000000*I
0.00000000000000000000+-79.00000000000000000000*I
0.00000000000000000000+-80.00000000000000000000*I
0.00000000000000000000+-81.00000000000000000000*I
0.00000000000000000000+-82.00000000000000000000*I
0.00000000000000000000+-83.00000000000000000000*I
84.00000000000000000000+-84.00000000000000000000*I
85.00000000000000000000+-85.00000000000000000000*I
86.00000000000000000000+-86.00000000000000000000*I
87.00000000000000000000+-87.00000000000000000000*I
0.00000000000000000000+-88.00000000000000000000*I
0.00000000000000000000+-89.00000000000000000000*I
0.00000000000000000000+-90.00000000000000000000*I
0.00000000000000000000+-91.00000000000000000000*I
0.00000000000000000000+-92.00000000000000000000*I
0.00000000000000000000+-93.00000000000000000000*I
0.00000000000000000000+-94.00000000000000000000*I
0.00000000000000000000+-95.00000000000000000000*I
96.00000000000000000000+-96.00000000000000000000*I
97.00000000000000000000+-97.00000000000000000000*I
98.00000000000000000000+-98.00000000000000000000*I
99.00000000000000000000+-99.00000000000000000000*I

As we can see, we get a pattern of 8 wrong answers (only the real parts are wrong) followed by 5 correct ones (4 correct ones from index 60 onward). Setting the number of threads to lower values only makes this worse, with more of the results being wrong: namely, 96 wrong and 4 correct, in that order. In this case the calculation of the imaginary part doesn't change much, although the two assignments, which start out in the same loop, end up in two different loops. The following is the part that computes most of the real values (but not all of them: not the ones that are correct):

	for (ll i = 0; i < temp_it; i++) {
00007FF68B4D14F0  add         r13,8
		M[i].real = temp1[ld + 0][i].real + temp1[ld + ena][i].real;
00007FF68B4D14F4  mov         rdx,qword ptr [r8+rsi*8]
00007FF68B4D14F8  add         rdx,r14
00007FF68B4D14FB  movsd       xmm1,mmword ptr [rdx+r15]
00007FF68B4D1501  movhpd      xmm1,qword ptr [rdx+r15+8]
00007FF68B4D1508  movsd       xmm0,mmword ptr [rdx+r15+10h]
00007FF68B4D150F  movaps      xmm2,xmm1
00007FF68B4D1512  movhpd      xmm0,qword ptr [rdx+r15+18h]
00007FF68B4D1519  unpcklpd    xmm2,xmm0
00007FF68B4D151D  unpckhpd    xmm1,xmm0
00007FF68B4D1521  addpd       xmm2,xmm1
00007FF68B4D1525  movsd       mmword ptr [r15+r11],xmm2
00007FF68B4D152B  movhpd      qword ptr [r15+r11+10h],xmm2
00007FF68B4D1532  mov         rdx,qword ptr [r8+rsi*8]
00007FF68B4D1536  add         rdx,r14
00007FF68B4D1539  movsd       xmm4,mmword ptr [rdx+r15+20h]
00007FF68B4D1540  movhpd      xmm4,qword ptr [rdx+r15+28h]
00007FF68B4D1547  movsd       xmm3,mmword ptr [rdx+r15+30h]
00007FF68B4D154E  movaps      xmm5,xmm4
00007FF68B4D1551  movhpd      xmm3,qword ptr [rdx+r15+38h]
00007FF68B4D1558  unpcklpd    xmm5,xmm3
00007FF68B4D155C  unpckhpd    xmm4,xmm3
00007FF68B4D1560  addpd       xmm5,xmm4
00007FF68B4D1564  movsd       mmword ptr [r15+r11+20h],xmm5
00007FF68B4D156B  movhpd      qword ptr [r15+r11+30h],xmm5
00007FF68B4D1572  mov         rdx,qword ptr [r8+rsi*8]
00007FF68B4D1576  add         rdx,r14
00007FF68B4D1579  movsd       xmm1,mmword ptr [rdx+r15+40h]
00007FF68B4D1580  movhpd      xmm1,qword ptr [rdx+r15+48h]
00007FF68B4D1587  movsd       xmm0,mmword ptr [rdx+r15+50h]
00007FF68B4D158E  movaps      xmm2,xmm1
00007FF68B4D1591  movhpd      xmm0,qword ptr [rdx+r15+58h]
00007FF68B4D1598  unpcklpd    xmm2,xmm0
00007FF68B4D159C  unpckhpd    xmm1,xmm0
00007FF68B4D15A0  addpd       xmm2,xmm1
00007FF68B4D15A4  movsd       mmword ptr [r15+r11+40h],xmm2
00007FF68B4D15AB  movhpd      qword ptr [r15+r11+50h],xmm2
00007FF68B4D15B2  mov         rdx,qword ptr [r8+rsi*8]
00007FF68B4D15B6  add         rdx,r14
00007FF68B4D15B9  movsd       xmm4,mmword ptr [rdx+r15+60h]
00007FF68B4D15C0  movhpd      xmm4,qword ptr [rdx+r15+68h]
00007FF68B4D15C7  movsd       xmm3,mmword ptr [rdx+r15+70h]
00007FF68B4D15CE  movaps      xmm0,xmm4
00007FF68B4D15D1  movhpd      xmm3,qword ptr [rdx+r15+78h]
00007FF68B4D15D8  unpcklpd    xmm0,xmm3
00007FF68B4D15DC  unpckhpd    xmm4,xmm3
00007FF68B4D15E0  addpd       xmm0,xmm4
00007FF68B4D15E4  movsd       mmword ptr [r15+r11+60h],xmm0
00007FF68B4D15EB  movhpd      qword ptr [r15+r11+70h],xmm0

To the best of my understanding, the loop is unrolled several times, although rdx is reloaded before every one of these blocks (there's no reason to do that, the array isn't going anywhere). However, the mind-blowing thing is what a single block does (I might have gotten something wrong, but the result is consistent with my understanding of this code).

It starts off by loading one element of temp1[0] or temp1[1] (I can't quite tell which one) into the register xmm1, with the real part in the lower half and the imaginary part in the upper half. So far so good. It then loads the next element of temp1[0] or temp1[1] (THE SAME ARRAY, which is no longer good...) into xmm0. At this point each of the two registers holds a complex number, but not the ones we'd want. It then makes a copy of xmm1 and stores it in xmm2. The next move is somewhat amazing: the two unpack instructions separate these values, so that xmm2 now holds both real parts (the one from xmm1 in the lower half and the one from xmm0 in the upper half) and xmm1 holds both imaginary parts in the same order. It then proceeds to add the two registers (so, real + imaginary, something that rarely makes sense; the result is clearly zero, given the way we filled the initial arrays) and saves the two sums into the real parts of two consecutive elements of the array M.

This was tested on i7-4720HQ with the following command line:

/MP /GS- /Qopenmp-offload:gfx /Qopenmp /GA /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc140.pdb" /Quse-intel-optimized-headers /Qgpu-arch=haswell /D "_MBCS" /Qipo /Qoffload-arch:haswell:visa3.1 /Zc:forScope /Qopt-matmul /Oi /MT /Fa"x64\Release\" /EHsc /nologo /Qparallel /Fo"x64\Release\" /Qprof-dir "x64\Release\" /Ot /Fp"x64\Release\DMRG.pch"

Adding AVX2 support simply vectorizes some of these operations, but the logic behind them (or rather the lack of it) remains the same, as does the result.

Changing the order within the code (imaginary first) also has no effect on the end results.

For reference, I am using Windows 10 with Visual Studio 2015 Update 3 and Intel Parallel Studio XE Update 3 (so, the latest compiler).

Thank you for your time.
