C++ Accelerated Massive Parallelism: Digest

C++ AMP is coming with Visual Studio 2011. Here are information and good articles introducing it:

Herb Sutter: Heterogeneous Computing and C++ AMP

MSDN overview page

PDF: “C++ AMP : Language and Programming Model” – the specification

A list of articles:

http://blogs.msdn.com/b/vcblog/archive/2011/06/15/introducing-amp.aspx

http://msdn.microsoft.com/en-us/magazine/hh882446.aspx

http://msdn.microsoft.com/en-us/magazine/hh882447.aspx

Notes about improving your Visual C++ debugging skill

Good Start Point

One useful link to start: PDB Files: What Every Developer Must Know

Summary from the blog article:

1) PDB files are as important as source code!

2) Every development shop must set up a Symbol Server.

Other useful things

Microsoft Windows Symbol Server is introduced here.

Source Server A tool to further facilitate debugging public build.

Symbol Proxy Advanced tool to give you centralised symbol store and performance boost on debugging. 

 

MFC GetParentFrame() returns NULL

I write this little post to record a small fact that you may have to pay attention to when you create your new CView inside your CFrameWnd.

By default, when you use Visual Studio’s resource editor to add a new dialog template for your CView class, the window style of that dialog template includes WS_POPUP. This normally doesn’t crash your program. However, we may want to call GetParentFrame() function within our CView from time to time, and if WS_POPUP is present, the function will return NULL pointer, causing memory access violation exception.

So, whenever you create CView within a frame window, pay attention to window style given by Visual Studio, and change WS_POPUP to WS_CHILD if needed.

Link: 2011-March

Visual C++ exception handling

This article describes a problem with the default handling of exceptions in Microsoft’s Visual C++ compilers. The problem is caused by the compiler’s extension to the C++ exception handling mechanism. I then give a technique that properly exploits this extension, bringing it into line with normal C++ exception handling. This allows programs to deal with exceptions that are normally very difficult to handle, such as memory access violations.

SXS Activation Context — Activate and Deactivate

C++ Standard Notes – Program Execution

C++ Standard (Draft n3126)

Program Execution (1.9)

Context of this section is single-thread execution. Single-thread execution environment consists of full-expressions.

1. Full-expression

A full-expression is an expression that is not a subexpression of another expression.
Normally a full-expression is a single line of C++ code that ends with ‘;’, or expression in conditional clause like “if”, “while”. In contrast, a subexpression is always part of a full-expression.

If a language construct is defined to produce an implicit call of a function, a use of the language construct is considered to be an expression for the purposes of this definition.
For example, call of casting operator, or class constructor function.

A call to a destructor generated at the end of the lifetime of an object other than a temporary object is an implicit full-expression.
This means normal implicit destructor call is full expression. A temporary object may be created and destroyed, in situation like function argument value computation. In this case, destructor call to the temporary object is subexpression.

Conversions applied to the result of an expression in order to satisfy the requirements of the language construct in which the expression appears are also considered to be part of the full-expression.
This includes glvalue-to-prvalue conversion, array-to-pointer conversion, and so on.

2. Evaluation of expression

Evaluation of an expression (or a sub-expression) in general includes both value computations and initiation of side effects.
Side effect itself is not part of evaluation of expression. Side effect can be understood as any action that changes memory content or triggers external I/O.

The execution of unsequenced evaluations can overlap.
How to understand this? Because every evaluation can be translated into multiple assembly codes, assembly codes from unsequenced evaluations can interleave.

Except where noted, evaluations of operands of individual operators and of subexpressions of individual expressions are unsequenced.(1) The value computations of the operands of an operator are sequenced before the value computation of the result of the operator.(2) If a side effect on a scalar object is unsequenced relative to either another side effect on the same scalar object or a value computation using the value of the same scalar object, the behavior is undefined.(3)
We can understand this by understanding example given below it:

void f(int, int);
void g(int i, int* v) {
	i = v[i++]; // the behavior is undefined
	// It is undefined because of statement (3). On right side of assignment, i++ has side effect of modifying i, and i is also used in subscripting expression(5.2.1) v[...].

	i = 7, i++, i++; // i becomes 9
	// This is equivalently (i = 7), i++, i++. It is well-formed because comma operator(5.18) is sequenced from left to right.

	i = i++ + 1; // the behavior is undefined
	// It is undefined because of statement (3). This is same as first example, because subscripting operator is essentially pointer + integral type.

	i = i + 1; // the value of i is incremented
	// It is well-formed. No side effect.

	f(i = -1, i = -1); // the behavior is undefined
	// It is undefined because of statement (1) and (3). But behaviour is expected in reality because both side effects assign -1 to i.
}

Link below is a thread discussion about why behaviour of i = v[i++] is undefined:
http://www.rhinocerus.net/forum/language-c/645839-i-v-i-results-undefined-behavior-cant-understand-why.html#post2656227

In my opinion, the reasoning there is overkilling and sometimes wrong.

C++ Standard Notes – Memory Model

C++ Standard (Draft n3126)

This series of notes is for C++ beginners to get more fundamental understanding of C++ by reading parts of C++ Standard (or draft), sometimes with small code experiment. C++ Standard is not a linear textbook. It often references concepts and definitions explained in later sections. And sometimes concepts or rules defined are simply reflection of underlying hardware design or constraint.

1.7 The C++ memory model

Some definitions we need to understand this section is summarised below:

Bit-fields (9.6)
Detail definition can be found in “9.6 Bit-fields”. It is always class member.

Arithmetic Types (3.9.1)
Integral and floating types.

Scalar Types (3.9)
Arithmetic types (3.9.1), enumeration types, pointer types, pointer to member types (3.9.2), std::nullptr_t, and cv-qualified versions of these types.

Key concept defined in this section is Memory Location, which is “either an object of scalar type or a maximal sequence of adjacent bit-fields all having non-zero width”, and principle of accessing separate memory locations, that is “two threads of execution (1.10) can update and access separate memory locations without interfering with each other. … It is not safe to concurrently update two bit-fields in the same struct if all fields between them are also bit-fields of non-zero width.”

It indicates that, in parallel programming, if an object is shared among threads, we don’t always need to synchronise access to it, provided that no two threads access/modify same memory location in it. However, if some contiguous memory locations are within a small enough memory area, heavily accessing them by threads will cause False Sharing.

Effectively, memory location reflects the fact that smallest access unit of memory is byte.

Below is a small test to demonstrate how memory location affects C++ compiler, using Visual Studio 2010.

void TestMemoryLocation()
{
	struct BitField
	{
		int a;
		int b:3;
		int:0; // Comment out this line will have b and c in same memory location
		int c:29;
	};

	BitField bit = {1, 3, 16};

	bit.b = 5;
	bit.c = 19;
}

int main(int argc, char* argv[])
{
	TestMemoryLocation();
	return 0;
}
// Assembly code, when zero bit-field is commented out

	BitField bit = {1, 3, 16};
0010139E mov dword ptr [bit],1
001013A5 mov dword ptr [ebp-8],83h

	bit.b = 5;
001013AC mov eax,dword ptr [ebp-8]
001013AF and eax,0FFFFFFF8h
001013B2 or eax,5
001013B5 mov dword ptr [ebp-8],eax // Write to b from address DWORD PTR [ebp-8]
	bit.c = 19;
001013B8 mov eax,dword ptr [ebp-8]
001013BB and eax,7
001013BE or eax,98h
001013C3 mov dword ptr [ebp-8],eax // Write to c also from address DWORD PTR [ebp-8]
// Assembly code, when zero bit-field is declared

	BitField bit = {1, 3, 16};
00FD13A8 mov dword ptr [ebp-14h],1
00FD13AF mov dword ptr [ebp-10h],3
00FD13B6 mov dword ptr [ebp-0Ch],10h

	bit.b = 5;
00FD13BD mov eax,dword ptr [ebp-10h]
00FD13C0 and eax,0FFFFFFF8h
00FD13C3 or eax,5
00FD13C6 mov dword ptr [ebp-10h],eax // Write to b from address DWORD PTR [ebp-10h]
	bit.c = 19;
00FD13C9 mov eax,dword ptr [ebp-0Ch]
00FD13CC and eax,0E0000000h
00FD13D1 or eax,13h
00FD13D4 mov dword ptr [ebp-0Ch],eax // Write to c from address DWORD PTR [ebp-0Ch]

回调(Callback), 委托(Delegate), 事件(Event)

本文通过认识Observer Pattern,以及对比C++,C#对这个模式的实现来理解C#中的委托(delegate)和事件(event)。

Observer Pattern设计模式

http://en.wikipedia.org/wiki/Observer_pattern Observer Pattern绝不是什么新东西。大家能想到的最早最低级别的实际例子是什么?Intel x86系统中的中断向量表是我能想到的最低级别的实现。整个计算机就是subject,可以随时产生各种事件,INT指令就是事件触发,中断向量表中对应的中断处理函数就是observer对特定事件的反应。 下面是我写的两个小例子程序,用来演示Observer Pattern在C++和C#中是如何实现的。只是一个概念展示,程序不见得有实际用途。两个程序类似,都定义了一个class Subject,和对应的两个Observer。很容易看懂,就不做解释了。

回调函数(C++)

#include <iostream>
#include <list>
#include <algorithm>
using namespace std;

// Subject class that accepts callback function pointer
class Subject
{
	// Type declaration
	// Define Callback function type, which is essentially function pointer
	typedef void (*Callback)(void* subject, void* arg);
private:
	// List of Callback function pointers 
	list<Callback> _callbacks;
public:
	void RegisterNotification(Callback callback)
	{
		list<Callback>::iterator it = find(_callbacks.begin(), _callbacks.end(), callback);
		if (it == _callbacks.end())
			_callbacks.push_back(callback);
	}
	
	void UnregisterNotification(Callback callback)
	{
		list<Callback>::iterator it = find(_callbacks.begin(), _callbacks.end(), callback);
		if (it != _callbacks.end())
			_callbacks.erase(it);
	}
	
	// Invoke all callback functions
	void Notify()
	{
		if (_callbacks.empty())
			cout << "Nobody to notify" << endl;
		
		for (list<Callback>::iterator it = _callbacks.begin(); it != _callbacks.end(); ++it)
			(*it)(this, 0);
	}
	
	void Run()
	{
		// Do some work
		Notify();
	}
};

// Concrete observer
class Observer1
{
public:
	static void Notify(void* subject, void* arg)
	{
		cout << "Observer1 notified" << endl;
	}
};

// Concrete observer
class Observer2
{
public:
	static void Notify(void* subject, void* arg)
	{
		cout << "Observer2 notified" << endl;
	}
};

int main(int argc, char* argv[])
{
	Subject subject;
	
	subject.RegisterNotification(Observer1::Notify);
	subject.RegisterNotification(Observer2::Notify);
	
	subject.Run();
	
	subject.UnregisterNotification(Observer1::Notify);
	subject.UnregisterNotification(Observer2::Notify); subject.Run();
	
	return 0;
}

委托和事件(C#)

using System;
using System.Collections.Generic;
using System.Text;

namespace ObserverCs
{
	// Subject class that accepts callback function pointer
	class Subject
	{
		// Type declaration
		// Define Callback function type, which is essentially function pointer
		public delegate void Callback(object subject, object arg);
		
		// Event object public event Callback C;
		public void Run()
		{
			// Do some work
			if (C != null)
				C.Invoke(this, null);
			else Console.WriteLine("Nobody to notify");
		}
	}
	
	// Concrete observer
	class Observer1
	{
		public static void Notify(object subject, object arg)
		{
			Console.WriteLine("Observer1 notified");
		}
	}
	
	// Concrete observer
	class Observer2
	{
		public static void Notify(object subject, object arg)
		{
			Console.WriteLine("Observer2 notified");
		}
	}
	
	class Program
	{
		static void Main(string[] args)
		{
			Subject subject = new Subject();
			subject.C += Observer1.Notify;
			subject.C += Observer2.Notify;
			
			subject.Run();
			
			subject.C -= Observer1.Notify;
			subject.C -= Observer2.Notify;
			
			subject.Run();
		}
	}
}

C#委托和事件辨析

有些C#初学者对于委托和事件的理解不够明晰深刻。通过认识Observer Pattern和上面两个例子代码的对比,应该认识就会很深刻了。下面我们把两个例子中的关键代码行放在一起仔细对比一下:

// C++
typedef void (*Callback)(void* subject, void* arg);
// C#
delegate void Callback(object subject, object arg);

// C++
list&lt;Callback&gt; _callbacks;
// C#
event Callback C;

// C++
subject.RegisterNotification(Observer1::Notify);
// C#
subject.C += Observer1.Notify;

委托就是对拥有相同签名的函数/方法的类型定义;事件就是委托实例的集合,包含0个或多个委托函数实例。事件这个名字稍微有点迷惑性。他更多的表达的是这个委托集合在应用上的一般目的(处理事件)。

多处理器系统下的伪共享(false sharing)问题

从CSDN博客搬家过来,原文发表于2006年09月05日06:18:00

1. 背景介绍

首先简单说一下计算机中处理器-内存体系结构。由于CPU速度远大于内存访问速度,现代CPU设计中都引入了缓存(cache)作为CPU和内存两者之间交流的缓冲中介。缓存的速度也介于两者之间。缓存中存放了最经常被访问的内存数据,CPU在很大程度上只需要访问高速缓存,大大提高了系统性能。系统对缓存进行读写的单位被称作缓存行(cache line)。大家知道系统对内存的操作单位一般是word,如果对缓存操作也用word作单位,就显得太小,缺乏效率,因此一般cache line大概是8个word,这是缓存与内存沟通的最小单位。

在可以预见的几年内,计算机系统会逐渐向多核心CPU或者多CPU结构过渡,出现多个处理核心共享内存的局面。一个很自然的问题马上出现,那就是多个处理核心对单一内存资源的访问冲突。这个冲突本来不难解决,只要给内存访问加锁就可以了。但是,当把缓存纳入考虑范围时,情况就复杂了。缓存是集成在每个CPU内部的小内存,除了这个CPU,其他CPU不能访问。而按照单CPU系统的简单缓存设计,缓存并不能察觉除本CPU以外的外部因素对内存内容的修改。因此,假设出现下面的情况:处理器A将内存中某块内容C读入自己的缓存,并在缓存中修改了该内容,然后处理器B也将内存中这块内容C读入自己的缓存,那么,B看到的只是原始版本的内容,而看不到存在于A缓存中的更新的内容,这就产生了内容不一致的问题。多处理器系统一般是设计控制协议来协调各个CPU缓存读写,保证内容一致,以解决这种冲突。

2. 伪共享

顾名思义,“伪共享”就是“其实不是共享”。那什么是“共享”?多CPU同时访问同一块内存区域就是“共享”,就会产生冲突,需要控制协议来协调访问。会引起“共享”的最小内存区域大小就是一个cache line。因此,当两个以上CPU都要访问同一个cache line大小的内存区域时,就会引起冲突,这种情况就叫“共享”。但是,这种情况里面又包含了“其实不是共享”的“伪共享”情况。比如,两个处理器各要访问一个word,这两个word却存在于同一个cache line大小的区域里,这时,从应用逻辑层面说,这两个处理器并没有共享内存,因为他们访问的是不同的内容(不同的word)。但是因为cache line的存在和限制,这两个CPU要访问这两个不同的word时,却一定要访问同一个cache line块,产生了事实上的“共享”。显然,由于cache line大小限制带来的这种“伪共享”是我们不想要的,会浪费系统资源。

举例来说,当多进程程序操作同一个int型数组int a[100]时,如果进程0只访问a[0],进程1只访问a[1],进程2只访问a[2],… 那么,实际上每个进程不应该发生数据共享。但是,一般cache line可以包含几个int,因此访问同一个cache line内int数组元素的几个进程就需要系统花费额外资源和时间运用控制协议来协调,这是不必要的。在这种情况下,把每个数组元素单独放在一个cache line大小的内存区域里在时间上是最有效率的,然而空间上就变成最没效率的了。

计算机设计就是处理矛盾的艺术。

3. 程序演示

下面用一个简单的小程序来验证False Sharing对性能的影响。

这个小程序同时运行两个线程,对同一块内存区域的临近位置进行int大小的多次读写。线程1读写第1个int值,线程2读写第2个int值。按照预想,如果两个int相隔太近,落入同一个cache line,两个线程就会因为”伪共享”而花费额外的时间进行缓存一致性通信和维护,造成性能下降。

TestFalseSharing.cpp 是主程序,Thread.h/cpp 是对Win32 Thread API的简单封装,Timer.h/cpp是对CRT time函数的简单封装。

TestFalseSharing.cpp

// TestFalseSharing.cpp

#include "stdafx.h"
#include "Thread.h"
#include "Timer.h"

#define COUNT 3000000000
#define DISTANCE 4

DWORD WINAPI t1_p(void* a)
{
	int* arr = (int*)a;
	for (int i=0; ii++)
		arr[0] = 30;

	return 0;
}

DWORD WINAPI t2_p(void* a)
{
	int* arr = (int*)a;
	for (int i=0; i&lt;COUNT; i++)
		arr[DISTANCE] = 30;

	return 0;
}

int arr[64];

int main(int argc, char* argv[])
{
	yshi::Timer timer(std::cout);

	// Start threads
	yshi::WinThread t1(t1_p, arr);
	yshi::WinThread t2(t2_p, arr);

	// Wait for all threads to terminate
	t1.Wait();
	t2.Wait();

	timer.Stop();

	return 0;
}

Thread.h

// Thread.h

namespace yshi
{

typedef LPTHREAD_START_ROUTINE WinThreadProc; // int func_name(void*)

// This class is exported from the Win32.dll
class WIN32_API WinThread
{
public:
	WinThread(WinThreadProc aProc, void* aParam, bool aSuspend = false);
	void Resume();
	void Wait();
	virtual ~WinThread();
private:
	HANDLE _handle;
private:
	WinThread(const WinThread&);
	WinThread& operator=(const WinThread&);
};

}

Thread.cpp

// Thread.cpp

#include "stdafx.h"
#include "Thread.h"

namespace yshi
{

WinThread::WinThread(WinThreadProc aProc, void* aParam, bool aSuspend)
{
	HANDLE h = CreateThread(NULL, 0, aProc, aParam, aSuspend ? 0x04 : 0x00, NULL);
	if (h == NULL)
	{
		throw GetLastError();
	}

	_handle = h;
}

void WinThread::Resume()
{
	DWORD ret = ResumeThread(_handle);
	if (ret == -1)
	{
		throw GetLastError();
	}
}

void WinThread::Wait()
{
	DWORD ret = WaitForSingleObject(_handle, INFINITE);
	if (ret == WAIT_FAILED)
		throw GetLastError();
}

WinThread::~WinThread()
{
	if (_handle != 0)
		CloseHandle(_handle);
}

}

Timer.h

// Timer.h

namespace yshi
{

class DEBUGUTIL_API Timer
{
public:
	Timer(std::ostream& aOutput);
	~Timer();
	void Start();
	void Stop();
private:
	void Print(std::ostream& aOutput);
private:
	std::ostream& mOutput;
	unsigned long long mTimeDiff;
private:
	Timer&amp; operator=(const Timer&amp;);
};

}

Timer.cpp

// Timer.cpp

#include "Timer.h"

#include &lt;ctime&gt;

namespace yshi
{

Timer::Timer(std::ostream& aOutput) : mOutput(aOutput)
{
	Start();
}

Timer::~Timer()
{
}

void Timer::Start()
{
	mTimeDiff = time(0);
}

void Timer::Stop()
{
	mTimeDiff = time(0) - mTimeDiff;
	Print(mOutput);
}

void Timer::Print(std::ostream& aOutput)
{
	aOutput << "Time spent: " << mTimeDiff << " seconds\n";
}

}

4. 运行结果

数据距离(DISTANCE)    花费时间(秒)

4                                                    21
8                                                    21
16                                                  10
32                                                  10

可以看到,当距离达到16个int,也就是64字节的时候,”伪共享”消失,而64字节也就是我的机器的cache line大小。