Intel C++ compiler for academic non-commercial use

March 17, 2017, 6:49 am

Latest and popular articles on Intel Technologies

Hello,

About three years ago I could download Intel Composer XE 2013 for free for academic non-commercial use. Now I can see that only some libraries (MKL, etc.) are free, but I have read in some places that the C++ compiler could become free for academia in the future. Is that true?

↧

Code optimization fails

March 18, 2017, 7:02 am

Hi everyone,

I recently ran into an issue I don't understand. I wrote an iterative flow solver that carries out some calculations in each iteration, then calculates a residual and starts over if the residual is still too large. I moved the calculation part into a function that is called in each iteration. It looks like this:

for (int j = 1; j < (ny - 1); j++)
	{
		for (int i = boundLeft; i < (bruttoLength-boundRight); i++)
		{
			i0 = i + j * bruttoLength;

			utilde[i0] = 0;

			// differences in x-direction

			if (i == 1) {
				utilde[i0] = utilde[i0] - (-v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * u[i0 + 3];

				utilde[i0] = utilde[i0] - (v / (2 * a * dx) + 1 / (3 * pow(dx, 2))) * u[i0 + 2];

				utilde[i0] = utilde[i0] - (-3 * v / (2 * a*dx) + 1 / (2 * pow(dx, 2))) * u[i0 + 1];

				utilde[i0] = utilde[i0] - (v / (4 * a*dx) + 11 / (12 * pow(dx, 2))) * utilde[i0 - 1];

				own_share = own_share + (5 * v / (6 * a*dx) - 5 / (3 * pow(dx, 2)));
			}
			else
			{
				if (i == (bruttoLength - 2)) {

					utilde[i0] = utilde[i0] - (v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * utilde[i0 - 3];

					utilde[i0] = utilde[i0] - (-v / (2 * a * dx) + 1 / (3 * pow(dx, 2))) * utilde[i0 - 2];

					utilde[i0] = utilde[i0] - (3 * v / (2 * a * dx) + 1 / (2 * pow(dx, 2))) * utilde[i0 - 1];

					utilde[i0] = utilde[i0] - (-v / (4 * a * dx) + 11 / (12 * pow(dx, 2))) * u[i0 + 1];

					own_share = own_share + (-5 * v / (6 * a*dx) - 5 / (3 * pow(dx, 2)));
				}
				else
				{
					utilde[i0] = utilde[i0] - (v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * u[i0 + 2];

					utilde[i0] = utilde[i0] - (-2 * v / (3 * a * dx) + 4 / (3 * pow(dx, 2))) * u[i0 + 1];

					utilde[i0] = utilde[i0] - (2 * v / (3 * a * dx) + 4 / (3 * pow(dx, 2))) * utilde[i0 - 1];

					utilde[i0] = utilde[i0] - (-v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * utilde[i0 - 2];

					own_share = own_share + (-5 / (2 * pow(dx, 2)));
				}
			}

			// repeat equivalent code for y-direction. It has the same structure as above

			// SOR-Share
			utilde[i0] = utilde[i0] / own_share * w + (1 - w) * u[i0];
			own_share = 0;
		}
	}

As long as the code is executed as a function call (arguments are 3 integers and 2 double pointers) the performance is really bad. As soon as I copy the code directly into my loop there is a massive speed up.

I tried both versions with and without code optimization enabled (/O2) and measured the average execution time of the code snippet above. It looks like there is only minor optimization for the version with the function call, as the execution time did not improve much (3x faster, compared to 12x faster without the function call).

I'm not sure if this is the root of the problem though. Can anybody give me some advice? Of course I could leave the whole calculation part inside my while-loop, but that looks very confusing. It would be much clearer to move it into a separate function.

I'm using the compiler that comes with Intel Parallel Studio XE 2017.

Best regards.

Thread Topic:

Question

↧

__builtin_clrsbl undefined

March 20, 2017, 11:46 pm

__builtin_clrsb and __builtin_clrsbll exist, but __builtin_clrsbl seems to be missing. Quick test:

#include <stdlib.h>
#include <stdio.h>

int main (void) {
  printf("int: %d\n", __builtin_clrsb(-1));
  printf("long: %d\n", __builtin_clrsbl(-1));
  printf("long long: %d\n", __builtin_clrsbll(-1));

  return EXIT_SUCCESS;
}

When attempting to compile:

nemequ@peltast:~/t$ icc -o clrsb clrsb.c
clrsb.c(6): warning #266: function "__builtin_clrsbl" declared implicitly
    printf("long: %d\n", __builtin_clrsbl(-1));
                         ^

/tmp/icc9KCY5p.o: In function `main':
clrsb.c:(.text+0x43): undefined reference to `__builtin_clrsbl'
nemequ@peltast:~/t$ icc --version
icc (ICC) 17.0.2 20170213
Copyright (C) 1985-2017 Intel Corporation.  All rights reserved.
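
Until the builtin is available, a portable stand-in could look like the sketch below. `clrsb_long` is a hypothetical name, and the loop is written for clarity, not speed. It follows the documented GCC semantics of counting the leading bits that are identical to the sign bit, not counting the sign bit itself.

```cpp
#include <climits>

// Hypothetical replacement for __builtin_clrsbl: number of leading redundant
// sign bits in x (the sign bit itself is not counted).
inline int clrsb_long(long x)
{
    // Map negative inputs onto non-negative ones so that only leading zeros
    // need to be counted.
    const unsigned long u = (x < 0) ? ~static_cast<unsigned long>(x)
                                    : static_cast<unsigned long>(x);
    const int bits = static_cast<int>(sizeof(long) * CHAR_BIT);
    int leading_zeros = 0;
    for (unsigned long mask = 1UL << (bits - 1);
         mask != 0 && (u & mask) == 0; mask >>= 1) {
        ++leading_zeros;
    }
    return leading_zeros - 1;
}
```

With this definition, `clrsb_long(-1)` and `clrsb_long(0)` both return the width of `long` in bits minus one, and `clrsb_long(LONG_MIN)` returns 0.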

Thread Topic:

Bug Report

↧

__STDC_NO_THREADS__ undefined

March 21, 2017, 12:23 am

ICC defines __STDC_VERSION__ to 201112L, but doesn't define __STDC_NO_THREADS__. glibc doesn't currently support the C11 threads API, so it should be defined (per § 6.10.8.3 of the C11 spec).

I know this is partially a libc problem. I believe GCC resolves this by implicitly including <stdc-predef.h> from glibc (which, in sufficiently recent glibc versions, defines __STDC_NO_THREADS__).

I'm working around this right now by checking __STDC_NO_THREADS__ after including <limits.h> (<limits.h> includes <features.h> which includes <stdc-predef.h>).

Testing is simple, just put this before any includes:

#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201102L) && !defined(__STDC_NO_THREADS__)
#  include <threads.h>
#endif
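
The workaround described above could be sketched as follows. `MYLIB_HAVE_C11_THREADS` is a hypothetical project-local macro; the only thing taken from the post is the idea of including <limits.h> first so that glibc's <stdc-predef.h> has been seen before the check.

```cpp
// On glibc, <limits.h> pulls in <features.h>, which pulls in <stdc-predef.h>,
// so __STDC_NO_THREADS__ is visible (if glibc defines it) before the test.
#include <limits.h>

// MYLIB_HAVE_C11_THREADS is a hypothetical name chosen for this sketch.
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) && \
    !defined(__STDC_NO_THREADS__)
#  define MYLIB_HAVE_C11_THREADS 1
#else
#  define MYLIB_HAVE_C11_THREADS 0
#endif
```

Code that wants <threads.h> would then test `MYLIB_HAVE_C11_THREADS` instead of the raw macros. When the file is compiled as C++, `__STDC_VERSION__` is normally not defined, so the macro evaluates to 0.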

Thread Topic:

Bug Report

↧

How to specify __mm512 to coreside with __mm256 or _mm128

March 21, 2017, 10:27 am

While working with the intrinsics guide for AVX512, one of the intrinsics I want to use is

__m512d _mm512_broadcastsd_pd(__m128d a)

However, I'd like to avoid using a mov to move data from register to register; I'd rather use a cast.

The __m128 registers are co-resident with the low 8 __m512 registers. So, what I am asking for is

__m512d foo compilerDirective(use a register in range of 0:7);

Or one to specifically specify the zmm register to use (in range of 0:7). (I'd rather not regress to assembler)

There would be a similar issue with specifying zmm to overlay with ymm.

Jim Dempsey

↧

omp_get_num_procs() doesn't returns all processor

March 21, 2017, 1:08 pm

Hi, I have a C++ application that uses the Qt libraries.

When I call omp_get_num_procs() from anywhere in the program, it doesn't return the maximum number of processors my machine has, so the threads are distributed only across the processors it reports.

But if I call omp_get_num_procs() from main.cpp, before the QApplication constructor, I get all processors, and the threads are distributed across all processors the machine has.

I've tried to find out what exactly omp_get_num_procs() does that changes the processors available to the application.

Best Regards,

Leonardo

Zone:

Windows*

Thread Topic:

Question

↧

Linking issue with Intel Debug build ...

March 22, 2017, 3:18 am

Hello,

I get the following linking error when building a debug build with Intel 16.0 compiler:

error LNK2019: unresolved external symbol ___intel_ssse3_strncpy referenced in function

Does anyone know the corresponding lib where this symbol resides?

Many thanks,

Andrew.

P.S. Using Windows with VS2013 and Intel Compiler 16.0.

↧

icpc HUNGS on small program

March 28, 2017, 11:27 am

Hi!

I found that icpc from Intel Parallel Studio XE 2017 hangs on my Linux machine when compiling my project. After a few attempts, I reduced my code to the following:

#include <list>

class Outer {
public:
	class InnerBase {
	public:
	};

	class InnerDerrive : public InnerBase {} ;

	struct Item {
		InnerBase* x;
	};

private:
	InnerDerrive obj;
	using List = std::list<Item>;
	List saversMap{ { Item{&obj } } };

public:
};

compiling with /opt/intel/bin/icpc -std=c++14 -c file.cpp

If I replace InnerBase with InnerDerrive at line 12, the compiler will work properly.

Please tell me: is this an Intel compiler bug, and if so, how do I report it to Intel?

↧

Issue with O2/O3 optimisation using Intel compiler 2017 update 2

March 12, 2017, 4:01 pm

Hello,

While trying to compare parallelism between OMP, MPI, Cuda and OpenACC, I've encountered some dangerous behavior using the "Intel ICPC 17.0.2 20170213". The code below is solving the classical temperature distribution in 2D (just for test):

// /opt/intel/bin/icpc -O2 main.cpp && time ./a.out
#include <iostream>
#include <cmath>

int main()
{

  const int nx = 2800;
  const int ny = 2800;
  const float lx = 1.f;
  const float ly = 1.f;
  float dx = lx / (float)nx;
  float dy = ly / (float)ny;
  const float lambda = 0.001f;
  float dt = 0.01f/lambda/(1.0f/(dx*dx)/12.f+1.0f/(dy*dy)/12.f);
  float dtdx = dt*lambda/dx/dx/12.f;
  float dtdy = dt*lambda/dy/dy/12.f;

  int nxt = nx + 4;
  int nyt = ny + 4;

  float* T = new float[nxt*nyt];
  float* Told = new float[nxt*nyt];

  for (int i = 0; i < nxt; i++) {
    for (int j = 0; j < nyt; j++) {
      int ind = i*nyt + j;
      T[ind] = 1.0f + exp(-32.f*(pow((float)(j-nyt/2)/(float)nyt,2)+pow((float)(i-nxt/2)/(float)nxt,2)));
      Told[ind] = T[ind];
    }
  }

  for (int step=0; step<1000; step++) {

    for (int i = 2; i < nxt-2; i++) {
      for (int j = 2; j < nyt-2; j++) {
        int ind = i*nyt + j;
        T[ind] = Told[ind] + dtdx * (-Told[(i-2)*nyt + (j)]+16.f*Told[(i-1)*nyt + (j)]-30.f*Told[(i)*nyt + (j)]+16.f*Told[(i+1)*nyt + (j)]-Told[(i+2)*nyt + (j)])
        + dtdy * (-Told[(i)*nyt + (j-2)]+16.f*Told[(i)*nyt + (j-1)]-30.f*Told[(i)*nyt + (j)]+16.f*Told[(i)*nyt + (j+1)]-Told[(i)*nyt + (j+2)]);
      }
    }

    for (int i = 0; i < nxt*nyt; i++) {
        Told[i] = T[i];
    }

  }

  float sum = 0.0f;
  for (int i = 0; i < nxt*nyt; i++) {
      sum += T[i];
  }

  std::cout << sum/(float)(nxt*nyt) << std::endl;

  return 0;
}

After 1000 "time" iterations, the code is supposed to give the results: 1.08712 (using O0 or O1 compile flag)

Without any parallelism (no openmp), the code is compiled with O2 optimisation and gives the following (wrong) results: 1.09783

/opt/intel/bin/icpc -O2 main.cpp && time ./a.out

Using LLVM C++ (Apple) or GNU G++, with the O2 or even O3 optimization flag, gives the expected result (1.08712).

It seems that the Intel compiler with the O2 or O3 flag applies aggressive optimizations to floating-point code. To get the expected result, the -fp-model precise flag needs to be added to the command line.

Why does the O2 flag enable such aggressive optimizations (while GNU or LLVM do not)? I mean, I know that O3 can be a dangerous flag to use, but I thought O2 was, at least, usable. If you are not aware of the -fp-model flag, your results may be completely wrong...
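
For anyone wondering how a reordering alone can change a float result, here is a minimal sketch. It does not show what ICC actually did to the program above; it only illustrates that float addition is not associative, so a reduction that is split into lanes (as a vectorizer may do when the floating-point model permits reassociation) can return a different sum than the sequential loop. The function names are hypothetical.

```cpp
#include <cstddef>

// Sequential sum, in the order the source code is written.
float sum_sequential(const float* data, std::size_t n)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// The same sum accumulated in four interleaved partial sums, which emulates
// a 4-lane vectorized reduction followed by a horizontal add.
float sum_four_lanes(const float* data, std::size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < n; ++i) {
        lane[i % 4] += data[i];
    }
    return ((lane[0] + lane[1]) + lane[2]) + lane[3];
}
```

With an input of one large value (for example 1.0e8f) followed by many 1.0f entries, the sequential version absorbs every 1.0f into rounding, while the lane version keeps most of them in the other partial sums. Neither order is inherently the correct one; accumulating in double is the usual way to make the final average far less sensitive to summation order.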

Thank you.

Thread Topic:

Question

↧

_GFX_offload weird behaviour

March 20, 2017, 8:17 am

Hi,

I'm targeting Intel Graphics Technology with the API-based offloading for asynchronous offloading. To begin, I tried to offload this algorithm:

for (int i = 0; i < size; i++){
  A[i] = i;
}

So I wrote this code:

__declspec(target(gfx_kernel))
void fill(int * A, int size){
  _Cilk_for(int i = 0; i < size; i++){
    A[i] = i;
  }
}

int main() {
  int N = 1024;
  int * A = malloc(sizeof(int) * N);

  _GFX_share(A,N);

  _GFX_offload((void*)fill, A, N);
  _GFX_wait(0,-1);

  _GFX_unshare(A);
  free(A);

  return 0;
}

This code compiles and executes, but only the first 780 elements of A are actually changed. I guess that's because of the maximum number of groups and threads, but the number seems weird to me (_GFX_get_device_hardware_thread_count() returns 336).

So I have two questions: why 780? And how can I write a kernel that I can call with

_GFX_offload((void *)fill, A, N);

that does what I want it to do?

Thanks, and have a nice day

Mathieu

Thread Topic:

Help Me

↧

Compiler doesn't vectorize even with simd directive

March 31, 2017, 2:22 am

I have this function taken from [here][1]:

    bool interpolate(const Mat &im, float ofsx, float ofsy, float a11, float a12, float a21, float a22, Mat &res)
    {
       bool ret = false;
       // input size (-1 for the safe bilinear interpolation)
       const int width = im.cols-1;
       const int height = im.rows-1;
       // output size
       const int halfWidth  = res.cols >> 1;
       const int halfHeight = res.rows >> 1;
       float *out = res.ptr<float>(0);
       for (int j=-halfHeight; j<=halfHeight; ++j)
       {
          const float rx = ofsx + j * a12;
          const float ry = ofsy + j * a22;
          for(int i=-halfWidth; i<=halfWidth; ++i)
          {
             float wx = rx + i * a11;
             float wy = ry + i * a21;
             const int x = (int) floor(wx);
             const int y = (int) floor(wy);
             if (x >= 0 && y >= 0 && x < width && y < height)
             {
                // compute weights
                wx -= x; wy -= y;
                // bilinear interpolation
                *out++ =
                   (1.0f - wy) * ((1.0f - wx) * im.at<float>(y,x)   + wx * im.at<float>(y,x+1)) +
                   (       wy) * ((1.0f - wx) * im.at<float>(y+1,x) + wx * im.at<float>(y+1,x+1));
             } else {
                *out++ = 0;
                ret =  true; // touching boundary of the input
             }
          }
       }
       return ret;
    }

As suggested by [Intel Advisor][2], I added:

#pragma omp simd
for(int i=-halfWidth; i<=halfWidth; ++i)

However, while compiling I got:

warning #15552: loop was not vectorized with "simd"

Googling it, I found [this][3], but it's still not clear to me how I could solve this and vectorize this loop.

[1]: https://github.com/perdoch/hesaff/blob/master/helpers.cpp
[2]: https://www.google.it/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&u...
[3]: https://software.intel.com/en-us/articles/fdiag13379
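
One restructuring that might be worth trying, as a minimal sketch with no guarantee that ICC will vectorize it: write each output element through an index instead of `*out++`, compute the value unconditionally into a local, and turn the `ret` flag into a reduction. The helper below works on a raw row-major float buffer instead of `Mat` so that it is self-contained; `interpolate_row`, `stride` and the parameter order are assumptions for illustration, and `width`/`height` keep the meaning from the original (columns minus one, rows minus one).

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: fills one output row of 2 * halfWidth + 1 samples and
// returns true if any sample fell outside the input image.
bool interpolate_row(const float* im, int width, int height, std::ptrdiff_t stride,
                     float rx, float ry, float a11, float a21,
                     int halfWidth, float* out)
{
    int touched = 0;
#ifdef _OPENMP
#pragma omp simd reduction(|:touched)
#endif
    for (int i = -halfWidth; i <= halfWidth; ++i) {
        float wx = rx + i * a11;
        float wy = ry + i * a21;
        const int x = static_cast<int>(std::floor(wx));
        const int y = static_cast<int>(std::floor(wy));
        const bool inside = (x >= 0 && y >= 0 && x < width && y < height);
        float value = 0.0f;
        if (inside) {
            wx -= x;
            wy -= y;
            const float* p = im + y * stride + x;
            // bilinear interpolation over the 2x2 neighbourhood
            value = (1.0f - wy) * ((1.0f - wx) * p[0] + wx * p[1]) +
                    wy * ((1.0f - wx) * p[stride] + wx * p[stride + 1]);
        }
        out[i + halfWidth] = value;   // indexed store, no pointer bump
        touched |= inside ? 0 : 1;    // boundary flag as a reduction
    }
    return touched != 0;
}
```

The image reads remain data-dependent gathers, so even if the loop vectorizes, the speed-up may be modest. The outer loop over `j` would call this once per row with `out` advanced by `2 * halfWidth + 1`.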

↧

Problem integrate Intel C++ compiler with IDE

April 1, 2017, 5:16 pm

Hello,

I'm trying to integrate the Intel C++ compiler with either Xcode or Eclipse on macOS. However, the Intel C++ compiler does not show up in my Xcode build rules settings, and no Eclipse extension exists in my installation directory.

My macOS version is 10.12.4 and my Xcode version is 8.3. My installed Intel version is Parallel Studio Composer 2017.

Is there a compatibility problem between my Intel compiler and my system?

Thread Topic:

Question

↧

"typedef double v4df __attribute__((vector_size(32)));" with OpenMP makes strange output.

April 4, 2017, 1:31 am

Hello,

"typedef double v4df __attribute__((vector_size(32)));" with OpenMP makes strange output.

Please decompress the attached file and run "make" and "make test"; you will then see the following.

"icpc & -DV4DF=auto", "g++ & -DV4DF=auto", and "g++ & -DV4DF=v4df" produce the right output.

On the other hand, "icpc & -DV4DF=v4df" produces the wrong output.

Thank you

Attachment: test_program.tar.gz (2.44 KB)

Thread Topic:

Bug Report

↧

Build fails using ICC but not GCC

April 5, 2017, 1:28 pm

Hi all,

I am building ITK from source. I have had the same issue on Centos 7 and Ubuntu 14.04 systems running on Xeon Broadwell systems.

Building with GCC

When I use cmake to configure a standard build with gcc, I use the commands below. The compilation completes and works well.

mkdir build; cd build;
cmake -D Module_PerformanceBenchmarking:BOOL=ON ..
make

Building with ICC (latest release 2017; update 2 16 Feb 2017)

When I use cmake to configure a standard build with the Intel compiler, I use the commands below. The compilation fails with the errors shown after the commands.

mkdir build; cd build;
CC=/opt/bin/icc CXX=/opt/bin/icc cmake -DCMAKE_CXX_COMPILER:FILEPATH=/opt/bin/icc -DCMAKE_C_COMPILER:FILEPATH=/opt/bin/icc -DModule_PerformanceBenchmarking:BOOL=ON ..
cmake -D Module_PerformanceBenchmarking:BOOL=ON ..
make

On CentOS 7 I get this problem:

Scanning dependencies of target ITKTransform-all
[ 58%] Built target ITKTransform-all
[ 58%] Building CXX object Modules/Core/ImageFunction/test/CMakeFiles/ITKImageFunctionTestDriver.dir/itkRayCastInterpolateImageFunctionTest.cxx.o
/src/ITK/Modules/Core/ImageFunction/include/itkRayCastInterpolateImageFunction.hxx(38): error: member "<unnamed>::RayCastHelper<TInputImage, TCoordRep>::InputImageDimension [with TInputImage=itk::Image<unsigned char, 3U>, TCoordRep=double]" was referenced but not defined
    itkStaticConstMacro(InputImageDimension, unsigned int,
    ^

compilation aborted for /src/ITK/Modules/Core/ImageFunction/test/itkRayCastInterpolateImageFunctionTest.cxx (code 2)
make[2]: *** [Modules/Core/ImageFunction/test/CMakeFiles/ITKImageFunctionTestDriver.dir/itkRayCastInterpolateImageFunctionTest.cxx.o] Error 2
make[1]: *** [Modules/Core/ImageFunction/test/CMakeFiles/ITKImageFunctionTestDriver.dir/all] Error 2
make: *** [all] Error 2

On Ubuntu 14.04 I get this problem:

Building CXX object Modules/Filtering/LabelMap/test/CMakeFiles/ITKLabelMapTestDriver.dir/itkBinaryImageToLabelMapFilterTest2.cxx.o
[ 68%] icc: error #10106: Fatal error in /opt/compilers_and_libraries_2017.2.174/linux/bin/intel64/mcpcom, terminated by kill signal
compilation aborted for /path/ITK/ITK-IntelCompiler/build/Modules/Segmentation/SignedDistanceFunction/test/ITKSignedDistanceFunctionTestDriver.cxx (code 1)
icc: error #10106: Fatal error in /opt/compilers_and_libraries_2017.2.174/linux/bin/intel64/mcpcom, terminated by kill signal
compilation aborted for /path/ITK/ITK-IntelCompiler/build/Modules/IO/BioRad/test/ITKIOBioRadTestDriver.cxx (code 1)
make[2]: *** [Modules/IO/XML/test/CMakeFiles/ITKIOXMLTestDriver.dir/ITKIOXMLTestDriver.cxx.o] Error 1

I need help getting past this issue. Thank you.
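
For context on the CentOS message, here is a minimal sketch of the general C++ rule that a "was referenced but not defined" diagnostic on a static const member usually points at. It is an assumption that this is what happens in the ITK header; I have not checked how itkStaticConstMacro expands in this ITK version, and `RayCastHelperSketch` is a hypothetical stand-in, not ITK's class. A static const integral member with an in-class initializer is only declared; if it is odr-used (for example bound to a reference), it also needs an out-of-class definition.

```cpp
namespace {

// Hypothetical stand-in for a helper class in an unnamed namespace.
template <typename TImage>
class RayCastHelperSketch {
public:
    // Declaration with an in-class initializer; this alone is not a definition.
    static const unsigned int InputImageDimension = 3;

    // Returning a reference odr-uses the member, so a definition is required.
    static const unsigned int& DimensionByReference()
    {
        return InputImageDimension;
    }
};

// Out-of-class definition (no initializer repeated here).
template <typename TImage>
const unsigned int RayCastHelperSketch<TImage>::InputImageDimension;

} // namespace
```

Compilers differ in how strictly they diagnose a missing definition, which could explain why one compiler accepts code that another rejects.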

Thread Topic:

Help Me

↧

C++14 recursive lambda missing operator()

April 5, 2017, 4:09 pm

void foo() {
  auto bar = [](auto& self) -> void {
    return self(self);
  };
  bar(bar);
}

icc 17.0.2 produces the following error:

main.cc(3): error: call of an object of a class type without appropriate operator() or conversion functions to pointer-to-function type
      return self(self);
             ^
          detected during instantiation of function "lambda [](auto &)->void [with <auto-1>=lambda [](auto &)->void]" at line 5
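
A possible workaround, as a sketch that I have not tested with icc 17.0.2: spell out the closure type by hand as a function object with a templated call operator, which is roughly what a generic lambda stands for. `Factorial` is a hypothetical example chosen so that the recursion terminates.

```cpp
// Hand-written equivalent of a self-passing generic lambda.
struct Factorial {
    template <typename Self>
    unsigned long operator()(Self& self, unsigned int n) const
    {
        // The object receives itself as the first argument and recurses
        // through it, just like bar(bar) in the lambda version.
        return n <= 1 ? 1UL : n * self(self, n - 1);
    }
};
```

Usage mirrors the lambda version: `Factorial f; f(f, 5);` computes 5!.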

Thread Topic:

Bug Report

↧

Using SVM for Intel Graphics Technology

April 6, 2017, 7:27 am

Hi,

I'm writing a benchmark to compare different technologies and their performance across various platforms (on Linux). One of the platforms is an Intel Broadwell-H (Core i7-5775C with integrated GPU Iris Pro 6200), so I'm testing the various ways to offload code to my GPU using Cilk Plus. Right now I'm trying to use SVM, so I followed this tutorial, but I'm facing some problems. Here's my code:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cilk/cilk.h>
#include <gfx/gfx_rt.h>

#define SIZE 64

int main(){
  int * in = (int*)_GFX_svm_alloc(sizeof(int)*SIZE);

#pragma offload target(gfx)
  _Cilk_for (int i = 0; i < SIZE; i++){
    in[i] = 1;
  }

  for (int i = 0; i < SIZE; i++){
    assert(in[i] == 1);
  }

  _GFX_svm_free(in);
  return 0;
}

Then I compile with

$ icc -qoffload-svm test.c
test.c(12): error: *GFX* pointer variable "in" in this offload region must be specified in an in/out/inout/nocpy clause
  #pragma offload target(gfx)
  ^

compilation aborted for test.c (code 2)

I thought maybe SVM is not allowed on all platforms, so I compiled with

$ icc -qoffload-arch=broadwell -qoffload-svm test.c
$ ./a.out
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
a.out test.c:18: main: Assertion 'in[i] == 1' failed.
Abandon (core dumped)

So I guess specifying the platform helps the code compile, but the execution fails.

If I add inout(in : length(SIZE)) to the pragma, compilation and execution work well with the first command line, but with the second one I get the same execution problem. The point is: I don't want to add the inout clause, and I shouldn't have to. I assume my compilation line is wrong, but I can't tell in which way.

So my question is: do you see something wrong in my code or compilation?

Thanks a lot for your time,

Mathieu

↧

crash using GCC style vector types

April 7, 2017, 3:53 pm

I'm seeing a strange compiler crash when using GCC style vector types. I'm using vector types to load/store because using _mm_loadu_si128 to load vectors seems to ignore the restrict qualifier, causing constant values to be reloaded from memory. I'll make another post for that. My function calculates the mean and standard deviation of an array. The clever part is reusing the loop body to process the remainder, to reduce the cache footprint. But it seems it's this complex control flow that's causing the crash. If I comment out the goto handleRemainder or comment out the innermost do-while loop, it compiles. And of course, using __m128i instead of vector types makes the crash go away. The crash happens in both ICC 14 and the latest ICC 17. I'd appreciate a reasonable workaround, explanation, or patch.

#include <immintrin.h>
#include <stdint.h>
#include <unistd.h>
#include <math.h>
#include <algorithm>
using namespace std;

#define CAST_VSHORT(x) x
#define ROUND_DOWN(a, b) (a & (~(b - 1)))
#define MAX_INTENSITY 4096
#define FORCE_INLINE inline __attribute__ ((always_inline))

#if 1
// crashes
typedef int32_t __attribute__((vector_size(16))) VINT;
typedef int16_t __attribute__((vector_size(16))) VSHORT;
typedef int16_t __attribute__((vector_size(16), aligned(1))) UNALIGNED_VSHORT;
#else
typedef __m128i VINT;
typedef __m128i VSHORT;
#endif

FORCE_INLINE __m128i PartialVectorMask(ssize_t n)
{
  return _mm_set1_epi16(0xffff);   // incomplete for brevity
}

FORCE_INLINE int64_t VectorSum(VINT x)
{
  __m128i lo = _mm_cvtepi32_epi64(x),
          hi = _mm_cvtepi32_epi64(_mm_srli_si128(x, 8));
  __m128i sum = _mm_add_epi64(lo, hi);
  return _mm_extract_epi64(_mm_add_epi64(sum, _mm_srli_si128(sum, 8)), 0);
}

void CalculateMeanAndStdev(float &mean, float &stdev,
                           int16_t *in, ssize_t size)
{
  ssize_t i;
  double sum = 0, squareSum = 0;
  VINT zero = _mm_set1_epi32(0),
    vSquareSum = zero,
    vSum = zero;
    VSHORT data;
    ssize_t blockEnd;
    const ssize_t VECTOR_WIDTH = 8;
    // elements you can accumulate before square sum can overflow
    const ssize_t BLOCK_SIZE = ROUND_DOWN((UINT32_MAX / ((MAX_INTENSITY - 1) * (MAX_INTENSITY - 1))) * 4, VECTOR_WIDTH);
    ssize_t roundedSize = ROUND_DOWN(size, VECTOR_WIDTH);
    for (i = 0; i <= size - VECTOR_WIDTH; )
    {
      blockEnd = min(i + BLOCK_SIZE, roundedSize);
      // process a block whose size is a multiple of 8, except when processing the SIMD remainder
      do
      {
          data = _mm_loadu_si128((__m128i *)&in[i]);
          //data = *(UNALIGNED_VSHORT *)&in[i];
      handleRemainder:
          VINT unpacked0 = _mm_srai_epi32(_mm_unpacklo_epi16(data, data), 16),
               unpacked1 = _mm_srai_epi32(_mm_unpackhi_epi16(data, data), 16);

          vSquareSum = _mm_add_epi32(_mm_madd_epi16(data, data), vSquareSum);
          vSum = _mm_add_epi32(unpacked0, vSum);
          vSum = _mm_add_epi32(unpacked1, vSum);
          i += VECTOR_WIDTH;
      } while (i < blockEnd);

      squareSum += VectorSum(vSquareSum);
      sum += VectorSum(vSum);
      vSum = zero;
      vSquareSum = zero;
  }
  if (i < size)
  {
      // handle remainder by setting invalid elements to 0
      data = _mm_and_si128(_mm_loadu_si128((__m128i *)&in[i]), PartialVectorMask((size % VECTOR_WIDTH) * sizeof(int16_t)));
      blockEnd = size;
      goto handleRemainder;     // share code to reduce machine code size
  }
  mean = sum / size;
  stdev = sqrtf((squareSum - sum * sum / size) / (size - 1));
}

int main()
{
  const size_t N = 4096;
  int16_t __attribute__((aligned(16))) image[N];
  float mean, stdev;
  for (int i = 0; i < 1000000; ++i)
  {
    CalculateMeanAndStdev(mean, stdev, image, N);
  }
  return mean;
}
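
While the crash is being investigated, a plain scalar version can serve as a baseline for checking whatever workaround ends up being used. This is a minimal sketch with a hypothetical name, `ReferenceMeanAndStdev`; it uses the same final formulas as the function above (sample standard deviation with `size - 1` in the denominator) and assumes `size >= 2`.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Scalar reference implementation; assumes size >= 2.
void ReferenceMeanAndStdev(float& mean, float& stdev,
                           const int16_t* in, std::size_t size)
{
    double sum = 0.0;
    double squareSum = 0.0;
    for (std::size_t i = 0; i < size; ++i) {
        const double v = in[i];
        sum += v;
        squareSum += v * v;
    }
    mean = static_cast<float>(sum / size);
    stdev = static_cast<float>(
        std::sqrt((squareSum - sum * sum / size) / (size - 1)));
}
```

Accumulating in double avoids the overflow bookkeeping that the blocked integer version needs, at the cost of speed, which is acceptable for a test oracle.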

Thread Topic:

Bug Report

↧

Intel C++ compiler 2017 MPI 64-bit C++ support

April 9, 2017, 9:57 am

Hey,

Is there any way to get 64-bit support for the Intel MPI libraries in 2017 for C++ applications? It used to be part of the 2016 package and it was removed. I need some of the newer features of many of the other packages provided by Cluster XE, but the MPI has, in my opinion, gone through a downgrade.

Thanks,

Will

↧

Passing a stack size value to a linker at compile time using -Xlinker compiler option

April 10, 2017, 11:36 am

Passing a stack size value to a linker at compile time using -Xlinker compiler option

↧

Preprocessor Macro and _Quad (float128)

April 10, 2017, 1:10 pm

Hey,

Is there a preprocessor macro that tells if the argument "/Qoption,cpp,--extended_float_type" was set?

I'm searching for some macro like "__LONG_DOUBLE_SIZE__" in mathimf.h if "/Qlong-double" was set. I want to remove some overloaded methods and some classes if the compiler doesn't support _Quad, or if the flag was not set. So it would also be nice if I could get it gcc compatible.

The long double didn't give exactly the same results as a double, but the number of significant digits didn't seem to increase. It seems the _Quad provides enough precision.
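
In case no such predefined macro exists, one fallback I am considering is to let the build system define its own macro on the same command line that passes the option. The sketch below assumes exactly that: `MYPROJ_ENABLE_INTEL_QUAD` and `MYPROJ_HAVE_QUAD` are hypothetical names, not compiler-provided macros. For the GCC side it relies on `__SIZEOF_FLOAT128__`, which GCC predefines on targets where `__float128` is available.

```cpp
// MYPROJ_ENABLE_INTEL_QUAD is hypothetical: the build system would pass it
// together with /Qoption,cpp,--extended_float_type.
#if defined(__INTEL_COMPILER) && defined(MYPROJ_ENABLE_INTEL_QUAD)
typedef _Quad quad_t;
#  define MYPROJ_HAVE_QUAD 1
#elif !defined(__INTEL_COMPILER) && defined(__SIZEOF_FLOAT128__)
typedef __float128 quad_t;  // GCC-compatible 128-bit floating type
#  define MYPROJ_HAVE_QUAD 1
#else
#  define MYPROJ_HAVE_QUAD 0
#endif

#if MYPROJ_HAVE_QUAD
// Sanity check that the selected type really is 128 bits wide.
static_assert(sizeof(quad_t) == 16, "quad_t is expected to be 128 bits wide");
#endif
```

The overloads and classes that need the type would then be wrapped in `#if MYPROJ_HAVE_QUAD`.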

Thanks,
Christian

Zone:

Windows*

↧