Channel: Intel® Software - Intel® C++ Compiler
Viewing all 2797 articles

Intel C++ compiler for academic non-commercial use

Hello,

About three years ago I could download Intel Composer XE 2013 for free for academic non-commercial use. Now I can see that only some libraries (MKL, etc.) are free, but I have read in some places that the C++ compiler could become free for academia in the future. Is that true?


Code optimization fails

Hi everyone,

I recently ran into an issue I don't understand. I wrote an iterative flow solver that carries out some calculations in each iteration, then calculates a residual and starts over if the residual is still too large. I moved the calculation part into a function that is called in each iteration. It looks like this:

for (int j = 1; j < (ny - 1); j++)
	{
		for (int i = boundLeft; i < (bruttoLength-boundRight); i++)
		{
			i0 = i + j * bruttoLength;

			utilde[i0] = 0;

			// differences in x-direction

			if (i == 1) {
				utilde[i0] = utilde[i0] - (-v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * u[i0 + 3];

				utilde[i0] = utilde[i0] - (v / (2 * a * dx) + 1 / (3 * pow(dx, 2))) * u[i0 + 2];

				utilde[i0] = utilde[i0] - (-3 * v / (2 * a*dx) + 1 / (2 * pow(dx, 2))) * u[i0 + 1];

				utilde[i0] = utilde[i0] - (v / (4 * a*dx) + 11 / (12 * pow(dx, 2))) * utilde[i0 - 1];

				own_share = own_share + (5 * v / (6 * a*dx) - 5 / (3 * pow(dx, 2)));
			}
			else
			{
				if (i == (bruttoLength - 2)) {

					utilde[i0] = utilde[i0] - (v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * utilde[i0 - 3];

					utilde[i0] = utilde[i0] - (-v / (2 * a * dx) + 1 / (3 * pow(dx, 2))) * utilde[i0 - 2];

					utilde[i0] = utilde[i0] - (3 * v / (2 * a * dx) + 1 / (2 * pow(dx, 2))) * utilde[i0 - 1];

					utilde[i0] = utilde[i0] - (-v / (4 * a * dx) + 11 / (12 * pow(dx, 2))) * u[i0 + 1];

					own_share = own_share + (-5 * v / (6 * a*dx) - 5 / (3 * pow(dx, 2)));
				}
				else
				{
					utilde[i0] = utilde[i0] - (v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * u[i0 + 2];

					utilde[i0] = utilde[i0] - (-2 * v / (3 * a * dx) + 4 / (3 * pow(dx, 2))) * u[i0 + 1];

					utilde[i0] = utilde[i0] - (2 * v / (3 * a * dx) + 4 / (3 * pow(dx, 2))) * utilde[i0 - 1];

					utilde[i0] = utilde[i0] - (-v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * utilde[i0 - 2];

					own_share = own_share + (-5 / (2 * pow(dx, 2)));
				}
			}

			// repeat equivalent code for y-direction. It has the same structure as above

			// SOR-Share
			utilde[i0] = utilde[i0] / own_share * w + (1 - w) * u[i0];
			own_share = 0;
		}
	}

As long as the code is executed as a function call (arguments are 3 integers and 2 double pointers) the performance is really bad. As soon as I copy the code directly into my loop there is a massive speed up.

I tried both versions with and without code optimization enabled (/O2) and measured the average execution time of the code snippet above. It looks like the version with the function call is optimized only slightly: with /O2 it runs about 3x faster, compared to 12x faster without the function call.

I'm not sure if this is the root of the problem, though. Can anybody give me some advice? Of course I could leave the whole calculation part inside my while-loop, but that looks very confusing. It would be much clearer to move it into a separate function.

I'm using the compiler that comes with Intel Parallel Studio XE 2017.

Best regards.

Thread Topic: Question

__builtin_clrsbl undefined

__builtin_clrsb and __builtin_clrsbll exist, but __builtin_clrsbl seems to be missing.  Quick test:

#include <stdlib.h>
#include <stdio.h>

int main (void) {
  printf("int: %d\n", __builtin_clrsb(-1));
  printf("long: %d\n", __builtin_clrsbl(-1));
  printf("long long: %d\n", __builtin_clrsbll(-1));

  return EXIT_SUCCESS;
}

When attempting to compile:

nemequ@peltast:~/t$ icc -o clrsb clrsb.c
clrsb.c(6): warning #266: function "__builtin_clrsbl" declared implicitly
    printf("long: %d\n", __builtin_clrsbl(-1));
                         ^

/tmp/icc9KCY5p.o: In function `main':
clrsb.c:(.text+0x43): undefined reference to `__builtin_clrsbl'
nemequ@peltast:~/t$ icc --version
icc (ICC) 17.0.2 20170213
Copyright (C) 1985-2017 Intel Corporation.  All rights reserved.
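
Until the builtin is available, a portable fallback can stand in for it. The sketch below assumes the GCC-documented meaning of clrsb (the number of leading redundant sign bits, i.e. bits after the sign bit that are identical to it); the name `clrsbl_fallback` is hypothetical and not part of any compiler.

```cpp
#include <climits>

// Hypothetical stand-in for __builtin_clrsbl: counts the bits that follow
// the sign bit and are equal to it.
int clrsbl_fallback(long x)
{
    const int bits = static_cast<int>(sizeof(long) * CHAR_BIT);
    unsigned long u = static_cast<unsigned long>(x);
    if (x < 0) {
        u = ~u;  // turn leading one-bits into leading zero-bits
    }
    int count = 0;
    // Walk down from the bit just below the sign bit until a set bit is found.
    for (int i = bits - 2; i >= 0 && ((u >> i) & 1UL) == 0UL; --i) {
        ++count;
    }
    return count;
}
```

For -1 and 0 every bit below the sign bit is redundant, so the result is one less than the width of long.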

Thread Topic: Bug Report

__STDC_NO_THREADS__ undefined

ICC defines __STDC_VERSION__ to 201112L, but doesn't define __STDC_NO_THREADS__.  glibc doesn't currently support the C11 threads API, so it should be defined (per § 6.10.8.3 of the C11 spec).

I know this is partially a libc problem.  I believe GCC resolves this by including <stdc-predef.h> from glibc (which includes a definition of __STDC_NO_THREADS__ in glibc >= __STDC_NO_THREADS__).

I'm working around this right now by checking __STDC_NO_THREADS__ after including <limits.h> (<limits.h> includes <features.h> which includes <stdc-predef.h>).

Testing is simple, just put this before any includes:

#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) && !defined(__STDC_NO_THREADS__)
#  include <threads.h>
#endif
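
The workaround described above (checking the macro only after a header that pulls in <stdc-predef.h>) could be sketched as follows. The macro `EXAMPLE_HAVE_C11_THREADS` and the helper `example_have_c11_threads` are hypothetical names introduced for illustration; the preprocessor part is valid in both C and C++, and in C++ `__STDC_VERSION__` is normally undefined, so the result there is 0.

```cpp
// On glibc, <limits.h> includes <features.h>, which includes <stdc-predef.h>,
// so __STDC_NO_THREADS__ is visible (if defined) after this line.
#include <limits.h>

#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) && !defined(__STDC_NO_THREADS__)
#  include <threads.h>
#  define EXAMPLE_HAVE_C11_THREADS 1
#else
#  define EXAMPLE_HAVE_C11_THREADS 0
#endif

// Hypothetical helper so callers can also branch at run time.
inline bool example_have_c11_threads()
{
    return EXAMPLE_HAVE_C11_THREADS != 0;
}
```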

 

Thread Topic: Bug Report

How to specify __m512 to co-reside with __m256 or __m128

While working with the intrinsics guide for AVX512, one of the intrinsics I want to use is

__m512d _mm512_broadcastsd_pd(__m128d a)

However, I'd like to avoid using a mov to move data from register to register, I'd rather use a cast.

The __m128 registers are co-resident with the low 8 __m512 registers. So, what I am asking for is

__m512d foo compilerDirective(use a register in range of 0:7);

Or one to specifically specify the zmm register to use (in range of 0:7). (I'd rather not regress to assembler)

There would be a similar issue with specifying zmm to overlay with ymm.

Jim Dempsey

omp_get_num_procs() doesn't return all processors

Hi, I have a C++ application that uses the Qt libraries.

When I call omp_get_num_procs() from any part of the program, it doesn't return the maximum number of processors that my machine has, so all the threads are distributed only among the processors it reports as available.

But if I call omp_get_num_procs() from main.cpp, before the QApplication constructor, I get all processors, and the threads are distributed across all the processors that the machine has.

I've tried to find out what exactly omp_get_num_procs() does that changes the processors available to the application.

Best Regards,

Leonardo

Thread Topic: Question

Linking issue with Intel Debug build ...

Hello,

I get the following linking error when building a debug build with Intel 16.0 compiler:

 error LNK2019: unresolved external symbol ___intel_ssse3_strncpy referenced in function

Does anyone know the corresponding lib where this symbol resides?  

Many thanks,

Andrew.

PS: Using Windows with VS2013, Intel Compiler 16.0.

icpc hangs on a small program

Hi!

I found that icpc from Intel Parallel Studio XE 2017 hangs on my Linux machine when compiling my project. After some attempts, I reduced my code to the following:

#include <list>

class Outer {
public:
	class InnerBase {
	public:
	};

	class InnerDerrive : public InnerBase {} ;

	struct Item {
		InnerBase* x;
	};

private:
	InnerDerrive obj;
	using List = std::list<Item>;
	List saversMap{ { Item{&obj } } };

public:
};

compiling with /opt/intel/bin/icpc -std=c++14 -c file.cpp

If I replace InnerBase with InnerDerrive at line 12, the compiler will work properly.

Please tell me: is this an Intel compiler bug and, if so, how do I report it to Intel?


Issue with O2/O3 optimisation using Intel compiler 2017 update 2

Hello,

While trying to compare parallelism between OMP, MPI, Cuda and OpenACC, I've encountered some dangerous behavior using "Intel ICPC 17.0.2 20170213". The code below solves the classical 2D temperature distribution problem (just as a test):

// /opt/intel/bin/icpc -O2 main.cpp && time ./a.out
#include <iostream>
#include <cmath>

int main()
{

  const int nx = 2800;
  const int ny = 2800;
  const float lx = 1.f;
  const float ly = 1.f;
  float dx = lx / (float)nx;
  float dy = ly / (float)ny;
  const float lambda = 0.001f;
  float dt = 0.01f/lambda/(1.0f/(dx*dx)/12.f+1.0f/(dy*dy)/12.f);
  float dtdx = dt*lambda/dx/dx/12.f;
  float dtdy = dt*lambda/dy/dy/12.f;

  int nxt = nx + 4;
  int nyt = ny + 4;

  float* T = new float[nxt*nyt];
  float* Told = new float[nxt*nyt];

  for (int i = 0; i < nxt; i++) {
    for (int j = 0; j < nyt; j++) {
      int ind = i*nyt + j;
      T[ind] = 1.0f + exp(-32.f*(pow((float)(j-nyt/2)/(float)nyt,2)+pow((float)(i-nxt/2)/(float)nxt,2)));
      Told[ind] = T[ind];
    }
  }

  for (int step=0; step<1000; step++) {

    for (int i = 2; i < nxt-2; i++) {
      for (int j = 2; j < nyt-2; j++) {
        int ind = i*nyt + j;
        T[ind] = Told[ind] + dtdx * (-Told[(i-2)*nyt + (j)]+16.f*Told[(i-1)*nyt + (j)]-30.f*Told[(i)*nyt + (j)]+16.f*Told[(i+1)*nyt + (j)]-Told[(i+2)*nyt + (j)])
        + dtdy * (-Told[(i)*nyt + (j-2)]+16.f*Told[(i)*nyt + (j-1)]-30.f*Told[(i)*nyt + (j)]+16.f*Told[(i)*nyt + (j+1)]-Told[(i)*nyt + (j+2)]);
      }
    }

    for (int i = 0; i < nxt*nyt; i++) {
        Told[i] = T[i];
    }

  }

  float sum = 0.0f;
  for (int i = 0; i < nxt*nyt; i++) {
      sum += T[i];
  }

  std::cout << sum/(float)(nxt*nyt) << std::endl;

  return 0;
}

After 1000 "time" iterations, the code is supposed to give the result 1.08712 (as it does with the O0 or O1 compile flag).

Without any parallelism (no OpenMP), the code compiled with O2 optimisation gives the following (wrong) result: 1.09783

/opt/intel/bin/icpc -O2 main.cpp && time ./a.out

Using LLVM C++ (Apple) or GNU G++, with the O2 or even the O3 optimization flag, gives the correct result (1.08712).

It seems that the Intel compiler with the O2 or O3 compile flag performs aggressive optimizations on floating-point data. To get the correct result, the -fp-model precise flag needs to be added to the command line.

Why does the O2 flag enable such aggressive optimizations (while GNU or LLVM do not)? I mean, I know that O3 can be a dangerous flag to use, but I thought O2 was, at least, usable. If you are not aware of the -fp-model flag, your results may be completely wrong...

Thank you.
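
One plausible contributor, which is an assumption and not something confirmed in this thread, is the final reduction: it adds about 7.9 million float values into a single float accumulator, and the rounding of such a sum depends on the order of the additions, which a vectorizer may change. The sketch below isolates that effect with two hypothetical helpers, `sum_in_float` and `sum_in_double`, that differ only in accumulator type.

```cpp
#include <cstddef>

// Adds `value` to a float accumulator `count` times, strictly in order.
float sum_in_float(std::size_t count, float value)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        sum += value;
    }
    return sum;
}

// Same loop, but the running sum is kept in double precision.
double sum_in_double(std::size_t count, float value)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        sum += value;
    }
    return sum;
}
```

With count = 1 << 25 and value = 1.0f, the double accumulator yields exactly 33554432. A strictly sequential float accumulator evaluated under IEEE rules cannot grow past 2^24 = 16777216, because 16777217 is not representable in float. Accumulating the checksum in double would make the comparison between compilers and flags less sensitive to summation order.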

Thread Topic: Question

_GFX_offload weird behaviour

Hi,

I'm targeting Intel Graphics Technology with API-based offloading for asynchronous offload. To begin, I am trying to offload this algorithm:

for (int i = 0; i < size; i++){
  A[i] = i;
}

So I wrote this code:

__declspec(target(gfx_kernel))
void fill(int * A, int size){
  _Cilk_for(int i = 0; i < size; i++){
    A[i] = i;
  }
}

int main() {
  int N = 1024;
  int * A = malloc(sizeof(int) * N);

  _GFX_share(A,N);

  _GFX_offload((void*)fill, A, N);
  _GFX_wait(0,-1);

  _GFX_unshare(A);
  free(A);

  return 0;
}

This code compiles and executes, but only the first 780 elements of A are actually changed. I guess that's because of the maximum number of groups and threads, but the number seems weird to me (_GFX_get_device_hardware_thread_count() returns 336).

So I have two questions: why 780? And how can I write a kernel that I can call with

_GFX_offload((void *)fill, A, N);

that does what I want it to do?

Thanks, and have a nice day

Mathieu

Thread Topic: Help Me

Compiler doesn't vectorize even with simd directive

I have this function taken from [here][1]:

    bool interpolate(const Mat &im, float ofsx, float ofsy, float a11, float a12, float a21, float a22, Mat &res)
    {
       bool ret = false;
       // input size (-1 for the safe bilinear interpolation)
       const int width = im.cols-1;
       const int height = im.rows-1;
       // output size
       const int halfWidth  = res.cols >> 1;
       const int halfHeight = res.rows >> 1;
       float *out = res.ptr<float>(0);
       for (int j=-halfHeight; j<=halfHeight; ++j)
       {
          const float rx = ofsx + j * a12;
          const float ry = ofsy + j * a22;
          for(int i=-halfWidth; i<=halfWidth; ++i)
          {
             float wx = rx + i * a11;
             float wy = ry + i * a21;
             const int x = (int) floor(wx);
             const int y = (int) floor(wy);
             if (x >= 0 && y >= 0 && x < width && y < height)
             {
                // compute weights
                wx -= x; wy -= y;
                // bilinear interpolation
                *out++ =
                   (1.0f - wy) * ((1.0f - wx) * im.at<float>(y,x)   + wx * im.at<float>(y,x+1)) +
                   (       wy) * ((1.0f - wx) * im.at<float>(y+1,x) + wx * im.at<float>(y+1,x+1));
             } else {
                *out++ = 0;
                ret =  true; // touching boundary of the input
             }
          }
       }
       return ret;
    }

As suggested by [Intel Advisor][2], I added:

    #pragma omp simd
    for(int i=-halfWidth; i<=halfWidth; ++i)

However, while compiling I got:

    warning #15552: loop was not vectorized with "simd"

Googling it, I found [this][3], but it's still not clear to me how I could solve this and vectorize this loop. 

  [1]: https://github.com/perdoch/hesaff/blob/master/helpers.cpp
  [2]: https://www.google.it/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&u...
  [3]: https://software.intel.com/en-us/articles/fdiag13379
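
A sketch of one restructuring that tends to make such a loop easier to vectorize, assuming the obstacles are the `*out++` pointer increment and the `ret` flag written inside the branch (the post does not say which diagnostic applies). It works on a raw row-major float buffer instead of `Mat`; the function `interpolate_row` and its parameters `imStride` and `outRow` are hypothetical.

```cpp
#include <cmath>

// Hypothetical row kernel: bilinear interpolation of one output row.
// `im` is a row-major float image with `imStride` floats per row;
// `width` and `height` are the input size minus one, as in the original.
// Returns true if any sample fell outside the input.
bool interpolate_row(const float* im, int imStride, int width, int height,
                     float rx, float ry, float a11, float a21,
                     int halfWidth, float* outRow)
{
    int touched = 0;  // reduction variable instead of a flag set in a branch
    #pragma omp simd reduction(|:touched)
    for (int i = -halfWidth; i <= halfWidth; ++i) {
        float wx = rx + i * a11;
        float wy = ry + i * a21;
        const int x = static_cast<int>(std::floor(wx));
        const int y = static_cast<int>(std::floor(wy));
        float value = 0.0f;
        if (x >= 0 && y >= 0 && x < width && y < height) {
            wx -= x;
            wy -= y;
            const float* p = im + y * imStride + x;
            value = (1.0f - wy) * ((1.0f - wx) * p[0] + wx * p[1])
                  + wy * ((1.0f - wx) * p[imStride] + wx * p[imStride + 1]);
        } else {
            touched |= 1;  // touching boundary of the input
        }
        outRow[i + halfWidth] = value;  // indexed store replaces *out++
    }
    return touched != 0;
}
```

The outer loop over j would call this once per row with `outRow = out + (j + halfHeight) * (2 * halfWidth + 1)`. The gathers from `im` may still limit the speedup, so the vectorization report should be checked again after the change.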

Problem integrating Intel C++ compiler with IDE

Hello,

I'm trying to integrate the Intel C++ compiler with either Xcode or Eclipse on Mac OS. However, the Intel C++ compiler does not show up in my Xcode build rules settings, and no Eclipse extension exists in my installation directory.

My Mac OS version is 10.12.4 and my Xcode version is 8.3. My installed Intel version is Parallel Studio Composer 2017.

Is there a compatibility problem between my Intel compiler and my system?

 

 

Thread Topic: Question

"typedef double v4df __attribute__((vector_size(32)));" with OpenMP makes strange output.

Hello,

"typedef double v4df __attribute__((vector_size(32)));" with OpenMP makes strange output.

Please decompress the attached file and run "make" and then "make test". You will find the following:

"icpc & -DV4DF=auto", "g++ & -DV4DF=auto", and "g++ & -DV4DF=v4df" produce the right output.

On the other hand, "icpc & -DV4DF=v4df" produces the wrong output.

Thank you

Attachment: test_program.tar.gz (application/x-gzip, 2.44 KB)

Thread Topic: Bug Report

Build fails using ICC but not GCC

Hi all,

I am building ITK from source. I have had the same issue on Centos 7 and Ubuntu 14.04 systems running on Xeon Broadwell systems.

Building with GCC

When I use cmake to configure a standard build using gcc, I use the method below. The compilation completes and works well.

mkdir build; cd build;
cmake -D Module_PerformanceBenchmarking:BOOL=ON ..
make 

Building with ICC (latest release 2017; update 2 16 Feb 2017)

When I use cmake to configure a standard build using the Intel compiler, I use the method below. The compilation fails with the errors shown further down.

mkdir build; cd build;
CC=/opt/bin/icc CXX=/opt/bin/icc cmake -DCMAKE_CXX_COMPILER:FILEPATH=/opt/bin/icc -DCMAKE_C_COMPILER:FILEPATH=/opt/bin/icc -DModule_PerformanceBenchmarking:BOOL=ON ..
cmake -D Module_PerformanceBenchmarking:BOOL=ON ..
make 

There is this problem with Centos 7

Scanning dependencies of target ITKTransform-all
[ 58%] Built target ITKTransform-all
[ 58%] Building CXX object Modules/Core/ImageFunction/test/CMakeFiles/ITKImageFunctionTestDriver.dir/itkRayCastInterpolateImageFunctionTest.cxx.o
/src/ITK/Modules/Core/ImageFunction/include/itkRayCastInterpolateImageFunction.hxx(38): error: member "<unnamed>::RayCastHelper<TInputImage, TCoordRep>::InputImageDimension [with TInputImage=itk::Image<unsigned char, 3U>, TCoordRep=double]" was referenced but not defined
    itkStaticConstMacro(InputImageDimension, unsigned int,
    ^

compilation aborted for /src/ITK/Modules/Core/ImageFunction/test/itkRayCastInterpolateImageFunctionTest.cxx (code 2)
make[2]: *** [Modules/Core/ImageFunction/test/CMakeFiles/ITKImageFunctionTestDriver.dir/itkRayCastInterpolateImageFunctionTest.cxx.o] Error 2
make[1]: *** [Modules/Core/ImageFunction/test/CMakeFiles/ITKImageFunctionTestDriver.dir/all] Error 2
make: *** [all] Error 2

I get this problem when I use Ubuntu 14.04

Building CXX object Modules/Filtering/LabelMap/test/CMakeFiles/ITKLabelMapTestDriver.dir/itkBinaryImageToLabelMapFilterTest2.cxx.o
[ 68%] icc: error #10106: Fatal error in /opt/compilers_and_libraries_2017.2.174/linux/bin/intel64/mcpcom, terminated by kill signal
compilation aborted for /path/ITK/ITK-IntelCompiler/build/Modules/Segmentation/SignedDistanceFunction/test/ITKSignedDistanceFunctionTestDriver.cxx (code 1)
icc: error #10106: Fatal error in /opt/compilers_and_libraries_2017.2.174/linux/bin/intel64/mcpcom, terminated by kill signal
compilation aborted for /path/ITK/ITK-IntelCompiler/build/Modules/IO/BioRad/test/ITKIOBioRadTestDriver.cxx (code 1)
make[2]: *** [Modules/IO/XML/test/CMakeFiles/ITKIOXMLTestDriver.dir/ITKIOXMLTestDriver.cxx.o] Error 1

I need help getting past this issue. Thank you.

Thread Topic: Help Me

C++14 recursive lambda missing operator()

void foo() {
  auto bar = [](auto& self) -> void {
    return self(self);
  };
  bar(bar);
}

 

icc 17.0.2 produces the following error:

main.cc(3): error: call of an object of a class type without appropriate operator() or conversion functions to pointer-to-function type
      return self(self);
             ^
          detected during instantiation of function "lambda [](auto &)->void [with <auto-1>=lambda [](auto &)->void]" at line 5
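
A possible workaround, not tested with icc 17.0.2 and offered only as a sketch, is to replace the generic lambda with a named function object whose call operator is a template. The pattern of passing the callable to itself stays the same. The names `Factorial` and `factorial` are hypothetical, and the example uses a terminating recursion rather than the endless one in the reproducer.

```cpp
// Hypothetical function object standing in for the self-referencing lambda.
struct Factorial {
    template <class Self>
    int operator()(Self& self, int n) const
    {
        // Recurse by calling the object that was passed in.
        return n <= 1 ? 1 : n * self(self, n - 1);
    }
};

int factorial(int n)
{
    Factorial f;
    return f(f, n);
}
```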

 

Thread Topic: Bug Report

Using SVM for Intel Graphics Technology

Hi,

I'm writing a benchmark to compare different technologies and their performance across various platforms (on Linux). One of the platforms is an Intel Broadwell-H (Core i7-5775C with integrated GPU Iris Pro 6200), so I'm testing the various ways to offload code to my GPU using Cilk Plus. Right now, I'm trying to use SVM, so I followed this tutorial, but I'm facing some problems. Here's my code:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cilk/cilk.h>
#include <gfx/gfx_rt.h>

#define SIZE 64

int main(){
  int * in = (int*)_GFX_svm_alloc(sizeof(int)*SIZE);

#pragma offload target(gfx)
  _Cilk_for (int i = 0; i < SIZE; i++){
    in[i] = 1;
  }

  for (int i = 0; i < SIZE; i++){
    assert(in[i] == 1);
  }

  _GFX_svm_free(in);
  return 0;
}

Then I compile with

 - $ icc -qoffload-svm test.c
test.c(12): error: *GFX* pointer variable "in" in this offload region must be specified in an in/out/inout/nocpy clause
  #pragma offload target(gfx)
  ^

compilation aborted for test.c (code 2)

I thought SVM might not be supported on all platforms, so I compiled with

- $ icc -qoffload-arch=broadwell -qoffload-svm test.c
- $ ./a.out
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
a.out test.c:18: main: Assertion 'in[i] == 1' failed.
Abandon (core dumped)

So I guess specifying the platform helps it compile, but the execution fails.

If I add inout(in : length(SIZE)) to the pragma, compilation and execution work with the first command line, but with the second one I get the same execution failure. The point is: I don't want to add the inout clause, and I shouldn't have to. I assume my compilation line is wrong, but I can't tell how.

So my question is: do you see something wrong in my code or compilation?

Thanks a lot for your time,

Mathieu

crash using GCC style vector types

I'm seeing a strange compiler crash when using GCC style vector types. I'm using vector types for loads and stores because using _mm_loadu_si128 to load vectors seems to ignore the restrict qualifier, causing constant values to be reloaded from memory. I'll make another post for that. My function calculates the mean and standard deviation of an array. The clever part is reusing the loop body to process the remainder, which reduces the cache footprint. But it seems to be this complex control flow that causes the crash. If I comment out the goto handleRemainder or comment out the innermost do-while loop, it compiles. And of course, using __m128i instead of vector types makes the crash go away. The crash happens in both ICC 14 and the latest ICC 17. I would appreciate a reasonable workaround, explanation, or patch.

#include <immintrin.h>
#include <stdint.h>
#include <unistd.h>
#include <math.h>
#include <algorithm>
using namespace std;

#define CAST_VSHORT(x) x
#define ROUND_DOWN(a, b) (a & (~(b - 1)))
#define MAX_INTENSITY 4096
#define FORCE_INLINE inline __attribute__ ((always_inline))

#if 1
// crashes
typedef int32_t __attribute__((vector_size(16))) VINT;
typedef int16_t __attribute__((vector_size(16))) VSHORT;
typedef int16_t __attribute__((vector_size(16), aligned(1))) UNALIGNED_VSHORT;
#else
typedef __m128i VINT;
typedef __m128i VSHORT;
#endif

FORCE_INLINE __m128i PartialVectorMask(ssize_t n)
{
  return _mm_set1_epi16(0xffff);   // incomplete for brevity
}

FORCE_INLINE int64_t VectorSum(VINT x)
{
  __m128i lo = _mm_cvtepi32_epi64(x),
          hi = _mm_cvtepi32_epi64(_mm_srli_si128(x, 8));
  __m128i sum = _mm_add_epi64(lo, hi);
  return _mm_extract_epi64(_mm_add_epi64(sum, _mm_srli_si128(sum, 8)), 0);
}

void CalculateMeanAndStdev(float &mean, float &stdev,
                           int16_t *in, ssize_t size)
{
  ssize_t i;
  double sum = 0, squareSum = 0;
  VINT zero = _mm_set1_epi32(0),
    vSquareSum = zero,
    vSum = zero;
    VSHORT data;
    ssize_t blockEnd;
    const ssize_t VECTOR_WIDTH = 8;
    // elements you can accumulate before square sum can overflow
    const ssize_t BLOCK_SIZE = ROUND_DOWN((UINT32_MAX / ((MAX_INTENSITY - 1) * (MAX_INTENSITY - 1))) * 4, VECTOR_WIDTH);
    ssize_t roundedSize = ROUND_DOWN(size, VECTOR_WIDTH);
    for (i = 0; i <= size - VECTOR_WIDTH; )
    {
      blockEnd = min(i + BLOCK_SIZE, roundedSize);
      // process a block whose size is a multiple of 8, except when processing the SIMD remainder
      do
      {
          data = _mm_loadu_si128((__m128i *)&in[i]);
          //data = *(UNALIGNED_VSHORT *)&in[i];
      handleRemainder:
          VINT unpacked0 = _mm_srai_epi32(_mm_unpacklo_epi16(data, data), 16),
               unpacked1 = _mm_srai_epi32(_mm_unpackhi_epi16(data, data), 16);

          vSquareSum = _mm_add_epi32(_mm_madd_epi16(data, data), vSquareSum);
          vSum = _mm_add_epi32(unpacked0, vSum);
          vSum = _mm_add_epi32(unpacked1, vSum);
          i += VECTOR_WIDTH;
      } while (i < blockEnd);

      squareSum += VectorSum(vSquareSum);
      sum += VectorSum(vSum);
      vSum = zero;
      vSquareSum = zero;
  }
  if (i < size)
  {
      // handle remainder by setting invalid elements to 0
      data = _mm_and_si128(_mm_loadu_si128((__m128i *)&in[i]), PartialVectorMask((size % VECTOR_WIDTH) * sizeof(int16_t)));
      blockEnd = size;
      goto handleRemainder;     // share code to reduce machine code size
  }
  mean = sum / size;
  stdev = sqrtf((squareSum - sum * sum / size) / (size - 1));
}

int main()
{
  const size_t N = 4096;
  int16_t __attribute__((aligned(16))) image[N];
  float mean, stdev;
  for (int i = 0; i < 1000000; ++i)
  {
    CalculateMeanAndStdev(mean, stdev, image, N);
  }
  return mean;
}
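
While looking for a workaround, a plain scalar version of the same formulas can serve as a reference to check any rewritten SIMD variant against. This is only a sketch; the name `MeanAndStdevReference` is hypothetical, and it assumes size is at least 2.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference using the same formulas as the SIMD code:
// mean = sum / size, stdev = sqrt((squareSum - sum * sum / size) / (size - 1)).
void MeanAndStdevReference(float& mean, float& stdev,
                           const std::int16_t* in, std::size_t size)
{
    double sum = 0.0, squareSum = 0.0;
    for (std::size_t i = 0; i < size; ++i) {
        const double x = in[i];
        sum += x;
        squareSum += x * x;
    }
    mean = static_cast<float>(sum / size);
    stdev = static_cast<float>(
        std::sqrt((squareSum - sum * sum / size) / (size - 1)));
}
```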

Thread Topic: Bug Report

Intel C++ compiler 2017 MPI 64-bit C++ support

Hey,

Is there any way to get 64-bit support for the Intel MPI libraries in 2017 for C++ applications? It used to be part of the 2016 package and it was removed. I need some of the newer features of many of the other packages provided by Cluster XE, but the MPI has, in my opinion, been downgraded.

Thanks,

Will

Passing a stack size value to a linker at compile time using -Xlinker compiler option

Passing a stack size value to a linker at compile time using -Xlinker compiler option

Preprocessor Macro and _Quad (float128)

Hey,

Is there a preprocessor macro that tells if the argument "/Qoption,cpp,--extended_float_type" was set?

I'm searching for some macro like "__LONG_DOUBLE_SIZE__" in mathimf.h if "/Qlong-double" was set. I want to remove some overloaded methods and some classes if the compiler doesn't support _Quad, or if the flag was not set. So it would also be nice if I could get it gcc compatible.

The long double didn't give exactly the same results as a double, but the number of significant digits didn't seem to increase. It seems the _Quad provides enough precision.

Thanks,
Christian
