64-bit integers on 32-bit Windows

I do a lot of programming involving almost real-time signal processing on Windows, and it involves many 64-bit integer values for purposes ranging from bit fields to values that exceed the 32-bit integer range.

Most of the problems with 64-bit integers relate to the bit fields needed to store things like settings or flags that control object functionality. For maximum source code simplicity and portability across 32-bit and 64-bit platforms I need to keep them as 64-bit. But the C/C++ compiler from the latest Vista SDK, even if it does a slightly better job than the older ones, does not implement bit set, clear and test operations in an optimal way, leaving a lot of "OR ...,0", "AND ...,0" and "XOR ...,0" instructions behind.
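As a minimal sketch of the kind of source involved (the helper and flag names are hypothetical, not taken from the actual library), 64-bit flag handling usually boils down to three helpers. On a 32-bit target each of them is carried out on two 32-bit halves, and for a constant mask at least one half of the mask is often zero; `halvesTouched` simply counts the halves that really need an instruction.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag bits, chosen only for illustration.
const std::uint64_t kFlagEnabled = 1ull << 0;   // sits in the low 32 bits
const std::uint64_t kFlagStereo  = 1ull << 40;  // sits in the high 32 bits

// Set, clear and test bits while keeping the flags 64 bits wide in the source.
inline std::uint64_t setBits(std::uint64_t v, std::uint64_t mask)   { return v | mask; }
inline std::uint64_t clearBits(std::uint64_t v, std::uint64_t mask) { return v & ~mask; }
inline bool          testBits(std::uint64_t v, std::uint64_t mask)  { return (v & mask) != 0; }

// Count the 32-bit halves of a mask that are non-zero. Only those halves
// need an OR/XOR or take part in a bit test; the other half would be an
// "OR ...,0" or "AND ...,0".
inline int halvesTouched(std::uint64_t mask) {
    const std::uint32_t lo = static_cast<std::uint32_t>(mask);
    const std::uint32_t hi = static_cast<std::uint32_t>(mask >> 32);
    return (lo != 0 ? 1 : 0) + (hi != 0 ? 1 : 0);
}
```

For a single-bit flag `halvesTouched` is always 1, which is why half of the generated 32-bit operations in such code can be expected to be redundant.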

The RDK combined with Visual Studio 2005 Express allowed me to experiment a little bit trying to optimize this code. But I encountered the following problems:

- After the "Canonicalize" phase I can see whether instructions need to be optimized, but even if my plugin does some cleanup, none of the phases that normally follow makes any serious improvement.

- Changing the plugin to a "recheck if changes were made" loop that calls the "Global optimization" and "Canonicalize" phases after each pass brings some serious improvements.

- In the end there are plenty of situations, even with a very simple example, where an "AND <reg>,<const>" instruction is immediately followed by a "CMP <reg>,0" instruction for a conditional jump. So far I don't want to mess with the generated assembly code as it may be affected by other optimizations but I couldn't find a way of changing the conditional branch so the compare instruction is no longer needed.
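To make the third point concrete, here is a small sketch of the source shape in question. The function name is hypothetical and the assembly in the comments is hand-written for illustration, not actual compiler output. On x86 the AND instruction already sets the zero flag from its result, so a following compare against zero adds nothing, and TEST performs the same AND without writing the register.

```cpp
#include <cassert>
#include <cstdint>

// Branch on a masked bit test.
inline bool flagSet(std::uint32_t flags, std::uint32_t mask) {
    assert(mask != 0);  // an empty mask would make the test meaningless
    // Illustrative x86 only (not real compiler output):
    //   and  eax, mask
    //   cmp  eax, 0        ; redundant: AND has already set ZF
    //   jne  taken
    // versus
    //   test eax, mask     ; sets ZF the same way, leaves eax untouched
    //   jne  taken
    return (flags & mask) != 0;
}
```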

In the end, if I post the plugin source code and whatever else I develop here, could such optimizations be embedded in the released compilers? Most of the time these are really trivial and safe changes, but they tend to increase code speed a lot and it's a shame not to have them. Similar problems affect the 3DNow!, MMX, SSE and SSE2 instructions, but I haven't got that far yet!

[2257 byte] By [Catalin_Ionesc_RO] at [2008-2-23]
# 1

Big breakthrough! Simply calling the "Global optimization" and "Canonicalize" phases before trying to do anything else removes all the undesired code. All that is left over is the second assignment after 64-bit operations, leaving "a=b;b=a;" situations, and those must be treated separately. And, of course, the third problem is still there!

So a simple plugin that provides a new phase after "Canonicalize" and does just the following in its Execute method already gives much better code:

Code Snippet

Phx::Phases::Phase ^ ph;

// Walk back through the earlier phases and re-run "Global optimization".
for (ph = this->Previous; ph != nullptr; ph = ph->Previous)
    if (ph->NameString=="Global optimization")
    {
        ph->DoPhase(unit);
        // Then walk back again and re-run "Canonicalize" on the result.
        for (ph = this->Previous; ph != nullptr; ph = ph->Previous)
            if (ph->NameString=="Canonicalize")
            {
                ph->DoPhase(unit);
                break;
            }
        break;
    }

Hopefully this will help other people!
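The "recheck if changes were made" loop mentioned in the first post can be modelled in plain C++ as below. This is only a sketch under stated assumptions: `ModelPhase`, `findPrevious` and `rerunUntilStable` are hypothetical stand-ins and not part of the Phoenix API, and the "unit" is just a vector of integers so the example stays self-contained.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Stand-in for a compiler phase; NOT the Phoenix Phase class.
struct ModelPhase {
    std::string name;
    std::function<bool(std::vector<int>&)> run;  // returns true if it changed the unit
    ModelPhase* previous;                        // phase that ran before this one
};

// Walk backwards from `from` and return the first phase with the given name.
inline ModelPhase* findPrevious(ModelPhase* from, const std::string& name) {
    for (ModelPhase* p = from; p != nullptr; p = p->previous)
        if (p->name == name) return p;
    return nullptr;
}

// Re-run the named phases, in order, until a full round changes nothing.
// maxRounds guards against phases that keep undoing each other.
// Returns the number of rounds executed.
inline int rerunUntilStable(ModelPhase* last,
                            const std::vector<std::string>& names,
                            std::vector<int>& unit,
                            int maxRounds = 8) {
    int rounds = 0;
    bool changed = true;
    while (changed && rounds < maxRounds) {
        changed = false;
        for (const std::string& n : names) {
            ModelPhase* p = findPrevious(last, n);
            if (p != nullptr && p->run(unit)) changed = true;
        }
        ++rounds;
    }
    return rounds;
}
```

The round limit is a design choice for the model only: without a reliable "did anything change" signal from the real phases, a fixed number of re-runs is the safer option.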

Catalin_Ionesc_RO at 2007-10-2 > top of Msdn Tech,Visual Studio,Phoenix...
# 2

The RDK is a snapshot of our ongoing development work. As such you will often see things that are still being worked on and improved by the Phoenix team. One of the areas we are busily improving is the quality of code generated by the compiler. The RDK includes some simple "post-lower" optimizations but we are working on more, including removing clearly redundant operations via idiom recognition.

I'm not exactly sure what your plugin does, but if you see large benefits from running gopt/canon afterwards, you might consider running your plugin earlier.

And, if you'd like to make your code available for others to see, you might look at setting up a Codeplex project (see www.codeplex.com).

AndyAyers-MSFT at 2007-10-2
# 3

So far I have found out that "Canonicalize" is the phase responsible for changing 64-bit integer operations into combinations of 32-bit integer operations. Up to that point I have found no information that would allow me to transform operand sizes. Let's consider just the first line of the following C/C++ example that I actually use for testing:

Code Snippet

#include <stdio.h>

__int64 Temp=0;

int main(void)
{ Temp|=1;
if (Temp&1) Temp++;
if (Temp&2) Temp=3;
printf("%I64d\n",Temp);
return 0;
}

To study the plugin's effect I use the following command line:

Code Snippet

cl -O2 -Fa -d2plugin:BitOps.dll -d2dumpallphases main.cpp > txt

And the intermediate code just before "Canonicalize" looks like this (just the first two lines are shown here to keep the code to a minimum):

Code Snippet

IR after X86 scalar Sse (control Sse)

Function Unit #1
$L1: (references=0) #6
{*StaticTag}, {*NotAliasedTag} = START _main(T) #6
_main: (references=1) #6
ENTERFUNCTION #6
tv307- = BITOR ?Temp@@3_JA, 1 #6
?Temp@@3_JA = ASSIGN tv307- #6
t278 = BITAND tv307-, 1 #7
tv308- = ASSIGN 0 #7
t279 = COMPARE(NE) 0, t278 #7
CONDITIONALBRANCH(True) t279, $L7, $L6 #7
$L7: (references=1) #7
?Temp@@3_JA = ADD tv307-, 1 #7
GOTO $L6 #8

So far there is no info on the size of the operands, or at least I haven't found any! The above code cannot be optimized any further. After "Canonicalize", though, things change seriously, giving the following intermediate code:

Code Snippet

IR after Canonicalize (control Canonicalize)

Function Unit #1
$L1: (references=0) #6
{*StaticTag}, {*NotAliasedTag} = START _main(T) #6
_main: (references=1) #6
ENTERFUNCTION #6
tv307-(RegisterCandidate) = BITOR ?Temp@@3_JA, 1 #6
tv307-(RegisterCandidate)+32 = BITOR ?Temp@@3_JA+32, 0 #6
?Temp@@3_JA = ASSIGN tv307-(RegisterCandidate) #6
?Temp@@3_JA+32 = ASSIGN tv307-(RegisterCandidate)+32 #6
tv278-(RegisterCandidate) = BITAND tv307-(RegisterCandidate), 1 #7
tv278-(RegisterCandidate)+32 = BITAND tv307-(RegisterCandidate)+32, 0 #7
tv308-(RegisterCandidate) = ASSIGN 0 #7
tv308-(RegisterCandidate)+32 = ASSIGN 0 #7
t309(RegisterCandidate) = COMPARE(NE) 0, tv278-(RegisterCandidate)+32 #7
CONDITIONALBRANCH(True) t309(RegisterCandidate), $L7, $L17 #7
$L17: (references=1) #7
t279(RegisterCandidate) = COMPARE(NE) 0, tv278-(RegisterCandidate) #7
CONDITIONALBRANCH(True) t279(RegisterCandidate), $L7, $L6 #7
$L7: (references=2) #7
?Temp@@3_JA = ADD tv307-(RegisterCandidate), 1 #7
?Temp@@3_JA+32 = ADDWITHCARRY tv307-(RegisterCandidate)+32, 0 #7
GOTO $L6 #8

With the above intermediate code, which isn't optimized at all no matter what options I use on the command line, it's obvious that something is going really wrong... or something is not happening at all. After working several hours to find a decent way to handle the "BITAND ,0" and "BITOR ,0" instructions, with relatively good results, I had the idea of simply re-executing the "Global optimization" and "Canonicalize" phases, without bothering with any fancier solution. So I built a new plugin that has just the previously mentioned code in its Execute method, and the plugin is executed immediately after the "Canonicalize" phase.

And this is the end result, after the plugin execution has ended:

Code Snippet
IR after Canonicalize (control Canonicalize)
Function Unit #1
$L1: (references=0) #6
{*StaticTag}, {*NotAliasedTag} = START _main(T) #6
_main: (references=1) #6
ENTERFUNCTION #6
tv348-(RegisterCandidate) = BITOR 1, ?Temp@@3_JA #6
tv349-(RegisterCandidate) = ASSIGN ?Temp@@3_JA+32 #6
?Temp@@3_JA = ASSIGN tv348-(RegisterCandidate) #6
?Temp@@3_JA+32 = ASSIGN tv349-(RegisterCandidate) #6
t324(RegisterCandidate) = BITAND 1, tv348-(RegisterCandidate) #7
t323(RegisterCandidate) = COMPARE(NE) 0, t324(RegisterCandidate) #7
CONDITIONALBRANCH(True) t323(RegisterCandidate), $L23, $L22 #7
$L7: (references=1) #7
tv350-(RegisterCandidate) = ADD 1, tv348-(RegisterCandidate) #7
?Temp@@3_JA = ASSIGN tv350-(RegisterCandidate) #7
?Temp@@3_JA+32 = ADDWITHCARRY 0, tv349-(RegisterCandidate) #7
tv352-(RegisterCandidate) = ASSIGN tv350-(RegisterCandidate) #8
GOTO $L6 #8

The difference is huge in both code quality and, most probably, speed! Because, in the end, if we get both smaller and faster code things are always better!

There are, of course, more optimizations to be done, some of them being easily visible!
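The cleanup visible between the two dumps (a "BITOR x, 0" turning into a plain ASSIGN, a "BITAND x, 0" collapsing to a constant) comes down to a few algebraic identities. The sketch below applies them to a toy instruction record; the `Instr` layout and the `simplify` function are hypothetical and have nothing to do with how Phoenix represents or rewrites its IR.

```cpp
#include <cassert>
#include <cstdint>

enum class Op { BitOr, BitAnd, BitXor, Assign };

// One instruction of a toy IR: dst = op(src, imm) with a 32-bit constant.
// For Assign, fromImm selects between "dst = imm" and "dst = src".
struct Instr {
    Op op;
    int dst;
    int src;
    std::uint32_t imm;
    bool fromImm;
};

// Apply algebraic identities for a constant operand.
// Returns true if the instruction was rewritten.
inline bool simplify(Instr& i) {
    const std::uint32_t allOnes = 0xFFFFFFFFu;
    switch (i.op) {
    case Op::BitOr:
    case Op::BitXor:
        if (i.imm == 0) {                          // x | 0 == x, x ^ 0 == x
            i.op = Op::Assign; i.fromImm = false;
            return true;
        }
        if (i.op == Op::BitOr && i.imm == allOnes) {  // x | ~0 == ~0
            i.op = Op::Assign; i.fromImm = true;
            return true;
        }
        return false;
    case Op::BitAnd:
        if (i.imm == 0) {                          // x & 0 == 0
            i.op = Op::Assign; i.fromImm = true;
            return true;
        }
        if (i.imm == allOnes) {                    // x & ~0 == x
            i.op = Op::Assign; i.fromImm = false;
            return true;
        }
        return false;
    case Op::Assign:
        return false;
    }
    return false;
}
```

Once the high half of a bit test has become "assign 0", the compare against zero on that half is a compare of two constants, which is presumably why the extra COMPARE/CONDITIONALBRANCH pair disappears in the last dump.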

Catalin_Ionesc_RO at 2007-10-2
# 4

If you pass -dumptypes you'll see information about operand types, including sizes (via cl you need to pass -d2dumptypes). So here's your example code, before canonicalize, with -dumptypes:

Code Snippet

IR before Canonicalize (control Canonicalize)

Function Unit #1
$L1: (references=0) #6
{*StaticTag}, {*NotAliasedTag} = START _main(T) #6
_main: (references=1) #6
ENTERFUNCTION #6
tv307-.i64 = BITOR ?Temp@@3_JA.i64.a64, 1.i64 #6
?Temp@@3_JA.i64.a64 = ASSIGN tv307-.i64 #6
t278.i64 = BITAND tv307-.i64, 1.i64 #7
tv308-.i64 = ASSIGN 0.i64 #7
t279.i32 = COMPARE(NE) 0.i64, t278.i64 #7
CONDITIONALBRANCH(True) t279.i32, $L7, $L6 #7

The suffixes on the operands show both the type and the bit size. For example, '.i64' means a signed integer type 64 bits wide. After canonicalize the code shows only 32-bit operands. Here the (i64+0).i32 suffix shows that the operand is the lower 32 bits of a 64-bit object; likewise (i64+32).i32 means the upper 32 bits.

Code Snippet

IR after Canonicalize (control Canonicalize)

Function Unit #1
$L1: (references=0) #6
{*StaticTag}, {*NotAliasedTag} = START _main(T) #6
_main: (references=1) #6
ENTERFUNCTION #6
tv307-(RegisterCandidate)(i64+0).i32 = BITOR ?Temp@@3_JA(i64+0).i32.a32, 1(i64+0).i32 #6
tv307-(RegisterCandidate)(i64+32).i32 = BITOR ?Temp@@3_JA(i64+32).i32.a32, 0.i32 #6
?Temp@@3_JA(i64+0).i32.a32 = ASSIGN tv307-(RegisterCandidate)(i64+0).i32 #6
?Temp@@3_JA(i64+32).i32.a32 = ASSIGN tv307-(RegisterCandidate)(i64+32).i32 #6
tv278-(RegisterCandidate)(i64+0).i32 = BITAND tv307-(RegisterCandidate)(i64+0).i32, 1(i64+0).i32 #7
tv278-(RegisterCandidate)(i64+32).i32 = BITAND tv307-(RegisterCandidate)(i64+32).i32, 0.i32 #7
tv308-(RegisterCandidate)(i64+0).i32 = ASSIGN 0(i64+0).i32 #7
tv308-(RegisterCandidate)(i64+32).i32 = ASSIGN 0.i32 #7
t309(RegisterCandidate).i32 = COMPARE(NE) 0.i32, tv278-(RegisterCandidate)(i64+32).i32 #7
CONDITIONALBRANCH(True) t309(RegisterCandidate).i32, $L7, $L17 #7

You are correct that the canonicalize phase is the one that breaks up large-integer operations into register-sized chunks, and that the expansions it does could be better optimized. Because of this breakup the individual parts can now fit into x86 registers.
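As a rough model of that breakup (a sketch only; `Pair32` and the function names are made up for illustration and are not Phoenix types), a 64-bit value can be treated as the two 32-bit parts named by the (i64+0).i32 and (i64+32).i32 suffixes. An ADD then becomes an ADD on the low parts plus an ADDWITHCARRY on the high parts, a bitwise operation becomes two independent 32-bit operations, and a compare against zero becomes two 32-bit compares.

```cpp
#include <cassert>
#include <cstdint>

// The two register-sized parts of a 64-bit value.
struct Pair32 {
    std::uint32_t lo;  // corresponds to (i64+0).i32
    std::uint32_t hi;  // corresponds to (i64+32).i32
};

inline Pair32 toPair(std::uint64_t v) {
    Pair32 p;
    p.lo = static_cast<std::uint32_t>(v);
    p.hi = static_cast<std::uint32_t>(v >> 32);
    return p;
}

inline std::uint64_t fromPair(Pair32 p) {
    return (static_cast<std::uint64_t>(p.hi) << 32) | p.lo;
}

// 64-bit ADD: ADD on the low parts, then ADDWITHCARRY on the high parts.
inline Pair32 add64(Pair32 a, Pair32 b) {
    Pair32 r;
    r.lo = a.lo + b.lo;                                    // wraps modulo 2^32
    const std::uint32_t carry = (r.lo < a.lo) ? 1u : 0u;   // carry out of the low add
    r.hi = a.hi + b.hi + carry;
    return r;
}

// Bitwise operations carry nothing between the parts.
inline Pair32 or64(Pair32 a, Pair32 b) {
    Pair32 r;
    r.lo = a.lo | b.lo;
    r.hi = a.hi | b.hi;
    return r;
}

// "value != 0" needs one compare per part, high part first as in the dump.
inline bool nonZero64(Pair32 a) {
    return a.hi != 0 || a.lo != 0;
}
```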

AndyAyers-MSFT at 2007-10-2
# 5

Thank you for the tip! So far I wasn't able to get a full list of c2 command line options because it was taking too long and I was hitting Ctrl+C after a few seconds. Now I see everything that I can use!

Anyway, it's obvious that trying to fiddle with operand sizes would be quite a dangerous thing to do, as it might cause undesired side effects. So, from my point of view, in order to keep the final code safe I shouldn't try to play with it before canonicalize.

Now I have moved on and want to test the optimization on my libraries and on one of my projects, actually the most processor-hungry one! As I need as many optimizations as possible, and it has proved to bring some extra speed, I use the /GL compiler option. That alone brought a speed increase on one of the test PCs, with processor usage dropping from 14% to 12%, and when I tracked down the source of the improvement it proved to be the inlining of the float-to-integer conversions.

But with the RDK compiler I have a major problem when SSE2 instructions are used. It may be that the same problem is visible with other multimedia instructions, but so far I haven't reached them. While the release compiler crashes silently, the debug one shows the following message:

Code Snippet

Phoenix Assertion Failure: d:\phoenixrdkmarch2007\src\phx\ir\ir.cpp, Line 2146

this->IsImmediateOperand || this->Instruction->FunctionUnit->Architecture->TypesAreCompatible(this->Field-> EnclosingType, field->EnclosingType) : Field assignment must maintain compatible enclosing type.

in (Function number 1) ?Compute_SSE2@RC_FFT2F32@@IAIXXZ [line 479] during CxxIL Reader

in (Module) F:\Public\Src\RC_SignProc\rcfft.cpp

in (PEModule)

And the error message is displayed many times, for different lines in my code.

The initial compile finished with no problems and the above errors come up only when finally processing the object files for building the binary.

A second test without /GL option gives the same problem, but during compile. The first error message that I see now is:

Code Snippet

The following instruction failed to match a legal form:

[tv1058-(RegisterCandidate)+tv977-(RegisterCandidate)]* = fld $t277[FramePointerRegister] #715

Legal forms:

Form 285: FPStackRegisters(ST0).(f80+0).f80, RegisterSet(8) = Memory.(f32+0).f32

Form 286: FPStackRegisters(ST0).(f80+0).f80, RegisterSet(8) = Memory.(f64+0).f64

Form 287: FPStackRegisters(ST0).(f80+0).f80, RegisterSet(8) = Memory.(f80+0).f80

Form 288: FPStackRegisters(ST0).(f80+0).f80, RegisterSet(8) = FPStackRegisters.(f80+0).f80

Phoenix Assertion Failure: d:\phoenixrdkmarch2007\src\targets\architectures\base\legalize.cpp, Line 3808

value != -1 : No legal forms

in (Function number 5) ?ProcessIn_AMDK7@RC_ResamplerF32@@IAIIPAMI@Z [line 715] during Lower

in (Module) rcresampler.cpp

The question right now is simple... Can I safely mix RDK compiler for normal code with the SDK one for MMX, 3DNow!, SSE and SSE2 support?

Catalin_Ionesc_RO at 2007-10-2
# 6

Hi Andy,

I have followed your suggestion and I have uploaded the current plugin, stripped of all code that would be useless at this point, at http://www.codeplex.com/RDKCodeOpt.

Catalin_Ionesc_RO at 2007-10-2
# 7

Can you tell me what you are doing to enable use of SSE2? That is, are you setting /arch:SSE2 on the command line or are you creating these instructions yourself in a plugin?

Assuming the former, it may be that there are bugs in the SSE2 path. I know we test /arch:SSE2 but we may not test it as widely as we could. If you can narrow things down to a small test case please submit a bug and we'll look into it, though I can't promise anything.

The second assertion looks like a different problem -- we are apparently trying to lower a memory-memory float copy and can't find an instruction that will work. It might be interesting to rerun this case and pass -d2dumptypes to see the operand types in question.

And as you noted, when you compile /GL the real "compilation" is deferred until link time, so that's when you see the assertions come out.

AndyAyers-MSFT at 2007-10-2
# 8

Hi Andy!

Sorry for the rather huge delay but my DNSs resolve forums.microsoft.com to 207.46.196.83 and there I see a totally blank page and some strange server error. In the end I have found a really strange combination of URLs that allows me replying to the forum.

In order to enable SSE2 I use /arch:SSE2, but I see the very same issues even with MMX intrinsics. I need to use the intrinsics as my library is portable across Win32 and Win64 versions. I will try to find the minimum code showing the problem and I will pass you my comments.

More recent tests with the /GL compiler option (PSDK compiler) show, in the finally generated assembly files, that plenty of optimizations are not done at all, even though they are quite obvious. I will try to start treating them one by one, as there are some code sequences that I use a lot in my source files.

But in order to be able to test the final result I would need to get over the MMX, SSE, SSE2 issue... Can I mix RDK compiler generated code with PSDK compiler generated code? Is it safe or is it a bomb waiting to explode?

Catalin_Ionesc_RO at 2007-10-2
# 9
Andy, please forget about forums.microsoft.com error... it seems it has been something strictly related to my account! Now I can immediately log into it!

Catalin_Ionesc_RO at 2007-10-2
# 10

You should be able to link together objects from the RDK and the PSDK or VS2005 compiler, provided that all the compilers are of similar vintage (for released compilers, the version string should start with 14).

AndyAyers-MSFT at 2007-10-2
# 11

Hi Andy!

Sorry for the very long silence but a really demanding project kept me away. And that project also allows me to experiment more with RDK as processor usage is really critical.

So, after getting the code into usable state I tried again to build it with RDK. Once again I encountered serious problems with SSE, SSE3 and 3DNow!+ support. While mixing SDK with RDK is OK, here are my findings:

1) For SSE there is a problem that seems to appear when trying to type-cast values. Briefly, here is a simplified situation that I can see going wrong:

Code Snippet

#define ForceAlign(a,b,x) a __declspec(align(x)) b

ForceAlign(const uint32_t,Negate_2_SSE[4],16)={0x00000000,0x00000000,0x80000000,0x00000000};

__m128 SC1;

SC1=_mm_xor_ps(SC1,*((const __m128 *)(const void *)Negate_2_SSE));

While the SDK compiler has no problem with the above code, the RDK one issues:

Phoenix Assertion Failure: d:\phoenixrdkmarch2007\src\phx\ir\ir.cpp, Line 2146

this->IsImmediateOperand || this->Instruction->FunctionUnit->Architecture->TypesAreCompatible(this->Field-> EnclosingType, field->EnclosingType) : Field assignment must maintain compatible enclosing type.

in (Function number 1) ?Compute_SSE2@RC_FFT2F32@@IAIXXZ [line 473] during CxxIL Reader

in (Module) F:\Public\Src\RC_SignProc\rcfft.cpp

in (PEModule)

2) For 3DNow!+ the situation seems to be more dramatic. For all lines where _m_pswapd and _m_pfpnacc instructions are found I get:

Phoenix Assertion Failure: d:\phoenixrdkmarch2007\src\targets\runtimes\vccrt\win32\x86\cil-intrin.cpp, Line 99

intrinsicEntry->type == IntrinsicType::None

in (Function number 2) ?Compute_AMDK7@RC_FFT2F32@@IAIXXZ [line 416] during CxxIL Reader

in (Module) F:\Public\Src\RC_SignProc\rcfft.cpp

in (PEModule)

This, most likely, means that I need to specify something extra to get 3DNow!+ instructions activated.

3) For SSE3 the question is rather simple... I can see the intrinsics in "intrin.h" in the PSDK but I see no other include file that actually mentions them. For all the others I can see mmintrin.h and so on. Is there a way to activate SSE3 intrinsics? There are pieces of code that would seriously benefit from them, if available.
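For readers unfamiliar with the trick in item 1, the XOR against Negate_2_SSE flips the sign bit of one packed float and leaves the other three alone. The scalar model below shows that effect without any intrinsics. It is only a sketch: `Vec4` and `xorLanes` are hypothetical names, the code assumes 32-bit IEEE-754 floats, and it uses memcpy instead of the pointer cast so that it does not depend on how a particular compiler treats the cast.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");

// Four packed floats, standing in for an __m128 value.
struct Vec4 {
    float lane[4];
};

// Scalar model of a lane-wise XOR between float bit patterns and a mask.
inline Vec4 xorLanes(Vec4 v, const std::uint32_t mask[4]) {
    for (int i = 0; i < 4; ++i) {
        std::uint32_t bits;
        std::memcpy(&bits, &v.lane[i], sizeof bits);   // read the float's bit pattern
        bits ^= mask[i];                               // 0x80000000 flips the sign bit
        std::memcpy(&v.lane[i], &bits, sizeof bits);   // write it back
    }
    return v;
}
```

With the mask {0, 0, 0x80000000, 0} from the snippet above, only the third element changes sign.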

Catalin_Ionesc_RO at 2007-10-2
# 12

I have reproduced the first two issues. At first glance these look to be "harmless asserts" within the compiler. From what I can tell the resulting code generation is correct, while internally the compiler suspects something is wrong, but I can't find anything that actually is wrong. So evidently the compiler is being overly paranoid: I filed a bug to get the asserts fixed. As a workaround you can try compiling with -d2assertlimit:0 to tell Phoenix not to abort a compilation because of an excess number of asserts (though of course you should verify that no other asserts are overlooked this way).

For the third issue, you can just use SSE3 intrinsics straightforwardly; for instance see http://msdn2.microsoft.com/en-us/library/8tf3ka85(VS.80).aspx.

AndyAyers-MSFT at 2007-10-2
