
Why can LLVM call a function that is never called?


<i>Whatever your dragon told you, he lied. Dragons are deceitful. You do not know what awaits you on the other side.</i>
Michael Swanwick, "The Iron Dragon's Daughter"
Not so long ago a post was published on Habr under the title "<a href="https://habrahabr.ru/company/infopulse/blog/338812/">How can a function that is never called be called?</a>". The conclusion of that article is simple: in the case of undefined behaviour, the compiler has the right to take any action, even a completely unexpected one. However, I was interested in the actual mechanism of this optimization, and I want to share the results of my small investigation with the distinguished Habr community.


Let me briefly recall the essence. In the source code below, the EraseAll function should not be called from main, and indeed it is not called when compiled with -O0, but it suddenly is called with optimization -O1 and higher.

#include <cstdlib>

typedef int (*Function)();

static Function Do;

static int EraseAll() {
  return system("rm -rf /");
}

void NeverCalled() {
  Do = EraseAll;
}

int main() {
  return Do();
}
This is explained as follows: in this code the variable Do is a pointer to a function, and its initial value is null. Calling a function through a null pointer is undefined behaviour (UB), and the compiler has the right to optimize UB in whatever way is convenient. In this case, the compiler simply assumed that the Do = EraseAll assignment had already been executed.
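Incidentally, the UB disappears if the program checks the pointer before calling through it; the compiler can then no longer assume that the assigning function must have run. Below is a minimal sketch of such a defensive variant (my own illustration, not code from the article; the harmless PrintHello stands in for EraseAll):

```cpp
#include <cstdio>

typedef int (*Function)();

static Function Do = nullptr;

static int PrintHello() {
    return std::printf("hello world\n");
}

void NeverCalled() {
    Do = PrintHello;
}

// A guarded call site: the explicit null check removes the UB, so the
// optimizer is not entitled to substitute an arbitrary target for Do.
int call_if_set() {
    if (Do == nullptr)
        return -1;  // pointer was never set: report an error instead of UB
    return Do();
}
```

With the check in place, the program returns an error code when NeverCalled was never executed, instead of crashing or doing something unexpected.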

Why it did so is what we will now try to figure out. In the rest of the text, LLVM and Clang version 5.0.0 are used as the compiler.

To begin with, let's look at the IR-code when optimizing with -O0 and -O1. Let's change the source code slightly to make it less dramatic:


#include <stdio.h>

typedef int (*Function)();

static Function Do;

static int PrintHello() {
  return printf("hello world\n");
}

void NeverCalled() {
  Do = PrintHello;
}

int main() {
  return Do();
}
And let's compile to IR code at -O0 (debugging information is omitted for clarity):

; ModuleID = 'test.c'
source_filename = "test.c"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@Do = internal global i32 (...)* null, align 8
@.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00", align 1

; Function Attrs: noinline nounwind optnone uwtable
define void @NeverCalled() #0 {
entry:
  store i32 (...)* bitcast (i32 ()* @PrintHello to i32 (...)*), i32 (...)** @Do, align 8
  ret void
}

; Function Attrs: noinline nounwind optnone uwtable
define i32 @main() #0 {
entry:
  %retval = alloca i32, align 4
  store i32 0, i32* %retval, align 4
  %0 = load i32 (...)*, i32 (...)** @Do, align 8
  %call = call i32 (...) %0()
  ret i32 %call
}

; Function Attrs: noinline nounwind optnone uwtable
define internal i32 @PrintHello() #0 {
entry:
  %call = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([13 x i8], [13 x i8]* @.str, i32 0, i32 0))
  ret i32 %call
}

declare i32 @printf(i8*, ...) #1
And at -O1:

; ModuleID = 'test.ll'
source_filename = "test.c"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00", align 1

; Function Attrs: noinline nounwind optnone uwtable
define void @NeverCalled() local_unnamed_addr #0 {
entry:
  ret void
}

; Function Attrs: noinline nounwind optnone uwtable
define i32 @main() local_unnamed_addr #0 {
entry:
  %retval = alloca i32, align 4
  store i32 0, i32* %retval, align 4
  %call = call i32 (...) bitcast (i32 ()* @PrintHello to i32 (...)*)()
  ret i32 %call
}

; Function Attrs: noinline nounwind optnone uwtable
define internal i32 @PrintHello() unnamed_addr #0 {
entry:
  %call = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([13 x i8], [13 x i8]* @.str, i32 0, i32 0))
  ret i32 %call
}

declare i32 @printf(i8*, ...) local_unnamed_addr #1
You can compile the executables and verify that in the first case a segmentation fault occurs, while in the second case "hello world" is printed. With other optimization levels, the result is the same as with -O1.

Now let's find the part of the compiler code that performs this optimization. Remember that in the LLVM architecture the frontend does not optimize by itself: cfe (Clang Frontend) always generates unoptimized code, which is what we see in the -O0 version, and optimization is performed by the opt utility:


With -O1, the following optimization passes are performed:

Those easily impressed, please look away: -targetlibinfo -tti -tbaa -scoped-noalias -assumption-cache-tracker -profile-summary-info -forceattrs -inferattrs -ipsccp -globalopt -domtree -mem2reg -deadargelim -domtree -basicaa -aa -instcombine -simplifycfg -basiccg -globals-aa -prune-eh -always-inline -functionattrs -domtree -sroa -basicaa -aa -memoryssa -early-cse-memssa -speculative-execution -domtree -basicaa -aa -lazy-value-info -jump-threading -lazy-value-info -correlated-propagation -simplifycfg -domtree -basicaa -aa -instcombine -libcalls-shrinkwrap -loops -branch-prob -block-freq -pgo-memop-opt -domtree -basicaa -aa -tailcallelim -simplifycfg -reassociate -domtree -loops -loop-simplify -lcssa-verification -lcssa -basicaa -aa -scalar-evolution -loop-rotate -licm -loop-unswitch -simplifycfg -domtree -basicaa -aa -instcombine -loops -loop-simplify -lcssa-verification -lcssa -scalar-evolution -indvars -loop-idiom -loop-deletion -loop-unroll -memdep -memcpyopt -sccp -domtree -demanded-bits -bdce -basicaa -aa -instcombine -lazy-value-info -jump-threading -lazy-value-info -correlated-propagation -domtree -basicaa -aa -memdep -dse -loops -loop-simplify -lcssa-verification -lcssa -aa -scalar-evolution -licm -postdomtree -adce -simplifycfg -domtree -basicaa -aa -instcombine -barrier -basiccg -rpo-functionattrs -globals-aa -float2int -domtree -loops -loop-simplify -lcssa-verification -lcssa -basicaa -aa -scalar-evolution -loop-rotate -loop-accesses -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -loop-distribute -branch-prob -block-freq -scalar-evolution -basicaa -aa -loop-accesses -demanded-bits -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -loop-vectorize -loop-simplify -scalar-evolution -aa -loop-accesses -loop-load-elim -basicaa -aa -instcombine -latesimplifycfg -domtree -basicaa -aa -instcombine -loops -loop-simplify -lcssa-verification -lcssa -scalar-evolution -loop-unroll -instcombine -loop-simplify -lcssa-verification -lcssa -scalar-evolution -licm -alignment-from-assumptions -strip-dead-prototypes -domtree -loops -branch-prob -block-freq -loop-simplify -lcssa-verification -lcssa -basicaa -aa -scalar-evolution -branch-prob -block-freq -loop-sink -lazy-branch-prob -lazy-block-freq -opt-remark-emitter -instsimplify -simplifycfg -verify
By disabling the passes one by one, we can find the one we are looking for: globalopt. We can leave only this optimization pass and verify that it, and nothing else, generates the code we need. Its source lives in lib/Transforms/IPO/GlobalOpt.cpp. You can see the full source code in the LLVM repository; here I will confine myself to the functions important for understanding how it works.
Let's see what this optimization pass does. To start with, it implements the runOnModule method, i.e. it sees and optimizes an entire module at a time (which, incidentally, is logical in this case). The function that performs the optimization is optimizeGlobalsInModule:

static bool optimizeGlobalsInModule(
    Module &M, const DataLayout &DL, TargetLibraryInfo *TLI,
    function_ref<DominatorTree &(Function &)> LookupDomTree) {
  SmallSet<const Comdat *, 8> NotDiscardableComdats;
  bool Changed = false;
  bool LocalChange = true;
  while (LocalChange) {
    LocalChange = false;

    NotDiscardableComdats.clear();
    for (const GlobalVariable &GV : M.globals())
      if (const Comdat *C = GV.getComdat())
        if (!GV.isDiscardableIfUnused() || !GV.use_empty())
          NotDiscardableComdats.insert(C);
    for (Function &F : M)
      if (const Comdat *C = F.getComdat())
        if (!F.isDefTriviallyDead())
          NotDiscardableComdats.insert(C);
    for (GlobalAlias &GA : M.aliases())
      if (const Comdat *C = GA.getComdat())
        if (!GA.isDiscardableIfUnused() || !GA.use_empty())
          NotDiscardableComdats.insert(C);

    // Delete functions that are trivially dead, ccc -> fastcc
    LocalChange |=
        OptimizeFunctions(M, TLI, LookupDomTree, NotDiscardableComdats);

    // Optimize global_ctors list.
    LocalChange |= optimizeGlobalCtorsList(M, [&](Function *F) {
      return EvaluateStaticConstructor(F, DL, TLI);
    });

    // Optimize non-address-taken globals.
    LocalChange |= OptimizeGlobalVars(M, TLI, LookupDomTree,
                                      NotDiscardableComdats);

    // Resolve aliases, when possible.
    LocalChange |= OptimizeGlobalAliases(M, NotDiscardableComdats);

    // Try to remove trivial global destructors if they are not removed
    // already.
    Function *CXAAtExitFn = FindCXAAtExit(M, TLI);
    if (CXAAtExitFn)
      LocalChange |= OptimizeEmptyGlobalCXXDtors(CXAAtExitFn);

    Changed |= LocalChange;
  }

  // TODO: Move all global ctors functions to the end of the module for code
  // layout.

  return Changed;
}
Let's try to describe in words what this function does. For each global variable in the module, it requests a Comdat object.
What is a Comdat? A COMDAT section is a section in an object file that holds objects which may be duplicated in other object files. Each object carries information for the linker telling it what to do when duplicates are found. The options are: Any - take any of them; ExactMatch - duplicates must match completely, otherwise an error occurs; Largest - take the object with the largest value; NoDuplicates - there must be no duplicates; SameSize - duplicates must have the same size, otherwise an error occurs.

In LLVM, Comdat data is represented by an enumeration:

enum SelectionKind {
  Any,          ///< The linker may choose any COMDAT.
  ExactMatch,   ///< The data referenced by the COMDAT must be the same.
  Largest,      ///< The linker will choose the largest COMDAT.
  NoDuplicates, ///< No other Module may specify this COMDAT.
  SameSize,     ///< The data referenced by the COMDAT must be the same size.
};
, and the Comdat class essentially represents a pair (Name, SelectionKind). (In reality, everything is more complicated.) All variables that for some reason cannot be deleted are placed in the NotDiscardableComdats set. We do the same with functions and global aliases: whatever cannot be deleted goes into NotDiscardableComdats. Then, separate optimization functions are called for global constructors, global functions, global variables, global aliases, and global destructors. The optimizations continue in a loop until an iteration performs no optimization. At each iteration of the loop, the NotDiscardableComdats set is cleared.
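To see where COMDATs come from in practice: in C++ they typically arise from entities that may legally be defined in several translation units, such as inline functions and template instantiations. A small illustration (my own header-style sketch; on typical ELF targets both definitions land in COMDAT groups with selection kind Any):

```cpp
// Every translation unit that includes this definition emits its own
// copy into a COMDAT section; the linker keeps one and drops the rest.
inline int answer() { return 42; }

// Template instantiations behave the same way: each TU that uses
// square<int> emits it, and the linker deduplicates the copies.
template <typename T>
T square(T x) { return x * x; }

int use_them() { return answer() + square(10); }
```

This is exactly why globalopt must be careful: a COMDAT member that looks unused in one module may still be needed to satisfy references resolved against it from another.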

Let's see which of the listed kinds of objects our test source contains.

Global variables:

1. @Do = internal global i32 (...)* null, align 8
2. @.str = private unnamed_addr constant [13 x i8] c"hello world\0A\00", align 1
(Running a little ahead: the optimizer will delete the first variable on the first iteration.)

Functions:

define void @NeverCalled()
define i32 @main()
define internal i32 @PrintHello()
declare i32 @printf(i8*, ...)
Note that printf is only declared (declare), but not defined (define).
There are no global aliases.

Let's walk through this optimization pass on our example and see how this result comes about. Of course, analyzing all the optimization options of even one pass is a very large task, since it provides for many different special cases. We will concentrate on our example, along the way considering the functions and data structures that are important for understanding how this optimization pass works.

Initially, the optimizer performs various checks that are uninteresting in our case, and then calls the processInternalGlobal function, which tries to optimize global variables. This function is also quite complex and does many different things, but we are interested in one thing:


if (GS.StoredType == GlobalStatus::StoredOnce && GS.StoredOnceValue) {
  ...
  // Try to optimize globals that are known to be assigned
  // a value exactly once, not counting the initializer.
  if (optimizeOnceStoredGlobal(GV, GS.StoredOnceValue, GS.Ordering, DL, TLI))
    return true;
  ...
}
The information that the global variable is assigned a value once and only once is extracted from the GS (GlobalStatus) structure. This structure is populated in the calling function:

static bool
processGlobal(GlobalValue &GV, TargetLibraryInfo *TLI,
              function_ref<DominatorTree &(Function &)> LookupDomTree) {
  if (GV.getName().startswith("llvm."))
    return false;

  GlobalStatus GS;

  if (GlobalStatus::analyzeGlobal(&GV, GS))
    return false;
  ...
Here we see one more interesting fact: objects whose names begin with "llvm." are not subject to optimization (these names are reserved for LLVM's own use, such as intrinsics). Incidentally, variable names in LLVM IR may contain dots (and may even consist of a single dot with an @ or % prefix). The analyzeGlobal function is a call into the LLVM API, and we will not consider its internal workings. The GlobalStatus structure deserves a closer look, since it carries very important information for optimization passes.

/// When we analyze a global variable, we store some information about it. If
/// we find that the address of the variable is taken, none of the information
/// in this structure will be valid.
struct GlobalStatus {
  /// True if the address of the global variable is used in comparison
  /// operations.
  bool IsCompared = false;

  /// True if the global variable is ever loaded (read). If it is not,
  /// it can be deleted.
  bool IsLoaded = false;

  /// How the value is stored.
  enum StoredType {
    /// No value is ever stored. The variable can be marked as a constant.
    NotStored,

    /// A store occurs, but it stores the same constant the variable
    /// was initialized with. This is only tracked for scalar types.
    InitializerStored,

    /// Stores occur only at initialization plus one more time with a
    /// different value. If the variable is StoredOnce, we record the
    /// stored value in the StoredOnceValue field below. This is only
    /// tracked for scalar variables.
    StoredOnce,

    /// Stores occur many times, or in a way we cannot track.
    Stored
  } StoredType = NotStored;

  /// If only one value (besides the initialization constant) is ever
  /// stored to this global variable, keep that value here.
  Value *StoredOnceValue = nullptr;
  ...
};
It is probably worth explaining why "if we find that the address of the variable is taken, none of the information in this structure will be valid". Indeed, if we take the address of a global variable and then write through that address rather than by name, tracking the variable becomes extremely difficult, and it is better to leave such variables as they are and not try to optimize them.
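The point about address-taken globals can be shown with a tiny C++ sketch (my own illustration): once the address escapes, stores no longer mention the variable by name, and a by-name analysis like analyzeGlobal has to give up:

```cpp
static int Counter = 0;  // a global the pass would like to track

// Returning the address makes Counter "address-taken": any code that
// holds this pointer can modify the variable without ever naming it.
int *expose() { return &Counter; }

int demo() {
    int *p = expose();
    *p = 7;          // a store the by-name analysis cannot attribute to Counter
    return Counter;  // the compiler must assume the value may have changed
}
```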

So we arrive at the optimizeOnceStoredGlobal function, which receives the variable (GV) and the stored value (StoredOnceVal) as arguments. Here they are:

@Do = internal unnamed_addr global i32 (...)* null, align 8 ; the variable
i32 (...)* bitcast (i32 ()* @PrintHello to i32 (...)*) ; the value
First the insignificant bitcast is stripped from the value, and then the following condition is checked for the variable:

if (GV->getInitializer()->getType()->isPointerTy() &&
    GV->getInitializer()->isNullValue()) {
  ...
that is, the variable must be initialized with a null pointer. If so, we create a new value SOVC, corresponding to StoredOnceVal cast to the type of GV:

if (Constant *SOVC = dyn_cast<Constant>(StoredOnceVal)) {
  if (GV->getInitializer()->getType() != SOVC->getType())
    SOVC = ConstantExpr::getBitCast(SOVC, GV->getInitializer()->getType());
Here, getBitCast is a method that returns a bitcast instruction, which casts between types in the LLVM IR language.
After that, the OptimizeAwayTrappingUsesOfLoads function is called; it receives the global variable GV and the constant LV.
The optimization itself is performed by OptimizeAwayTrappingUsesOfValue(Value *V, Constant *NewV).
For each use of a variable:

for (auto UI = V->user_begin(), E = V->user_end(); UI != E; ) {
  Instruction *I = cast<Instruction>(*UI++);
If it is a load instruction, we replace its operand with the new value:

if (LoadInst *LI = dyn_cast<LoadInst>(I)) {
  LI->setOperand(0, NewV);
  Changed = true;
}
If the variable is used as the callee of a call or invoke instruction (which is exactly what happens in our example), we turn the indirect call into a direct call of the new value:

if (isa<CallInst>(I) || isa<InvokeInst>(I)) {
  CallSite CS(I);
  if (CS.getCalledValue() == V) {
    // Calling through the pointer! Turn into a direct call, but be careful
    // that the pointer is not also being passed as an argument.
    CS.setCalledFunction(NewV);
    Changed = true;
    bool PassedAsArg = false;
    for (unsigned i = 0, e = CS.arg_size(); i != e; ++i)
      if (CS.getArgument(i) == V) {
        PassedAsArg = true;
        CS.setArgument(i, NewV);
      }
All other arguments to the function are simply copied.
Similar replacement algorithms also exist for cast and GEP instructions, but in our case neither occurs.

The further actions are as follows: we go through all uses of the global variable, trying to delete everything except the assignment of the value. If that succeeds, we can delete the Do variable itself.
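In source terms, the net effect of the pass on our example can be pictured like this (my own source-level rendition of the optimized module; optimized_main is a hypothetical name standing in for main):

```cpp
#include <cstdio>

// What the module effectively looks like after globalopt: the global Do
// is gone, the store in NeverCalled was removed together with it, and
// the indirect call in main became a direct call to PrintHello.
static int PrintHello() {
    return std::printf("hello world\n");
}

void NeverCalled() {}  // body emptied: the store had no other effect

int optimized_main() {
    return PrintHello();  // direct call substituted by the pass
}
```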

So, we have briefly reviewed the work of an LLVM optimization pass on a specific example. In principle, there is nothing super complicated here, but careful programming is required to provide for all possible combinations of instructions and variable types. Of course, all of this must be covered by tests. Studying the source code of LLVM's optimizers will help you write your own optimizations to improve the code in particular special cases.
Examples of some interesting optimizations in LLVM
Here are a few more examples of how LLVM can optimize code. They are not related to the example we just examined and are performed in other optimization passes; nevertheless, they are quite unusual and interesting.
The first example
Consider a code that sums numbers from 1 to n-1:

int sum(int n) {
  int s = 0;
  for(int i = 0; i < n; i++)
    s += i;
  return s;
}

Compile with -O1:

define i32 @sum(i32 %n) local_unnamed_addr #0 {
entry:
  %cmp6 = icmp sgt i32 %n, 0
  br i1 %cmp6, label %for.cond.cleanup.loopexit, label %for.cond.cleanup

for.cond.cleanup.loopexit:                        ; preds = %entry
  %0 = add i32 %n, -1
  %1 = zext i32 %0 to i33
  %2 = add i32 %n, -2
  %3 = zext i32 %2 to i33
  %4 = mul i33 %1, %3
  %5 = lshr i33 %4, 1
  %6 = trunc i33 %5 to i32
  %7 = add i32 %6, %n
  %8 = add i32 %7, -1
  br label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
  %s.0.lcssa = phi i32 [ 0, %entry ], [ %8, %for.cond.cleanup.loopexit ]
  ret i32 %s.0.lcssa
}
Suddenly there is no loop, but miraculous variables of type i33 (i.e., 33-bit integers) have appeared. How did this happen? LLVM converted the sum of the series into a formula: (n-1)*(n-2)/2 + n - 1. Since computing the intermediate values may overflow the 32-bit range, LLVM inserted variables of type i33. Note that it did this by analyzing the unoptimized IR code, which is rather difficult. Below is the unoptimized code for the same function, where the sum is computed straightforwardly in a loop:

Unoptimized code:

define i32 @sum(i32 %n) #0 {
entry:
  %n.addr = alloca i32, align 4
  %s = alloca i32, align 4
  %i = alloca i32, align 4
  store i32 %n, i32* %n.addr, align 4
  store i32 0, i32* %s, align 4
  store i32 0, i32* %i, align 4
  br label %for.cond

for.cond:                                         ; preds = %for.inc, %entry
  %0 = load i32, i32* %i, align 4
  %1 = load i32, i32* %n.addr, align 4
  %cmp = icmp slt i32 %0, %1
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %2 = load i32, i32* %i, align 4
  %3 = load i32, i32* %s, align 4
  %add = add nsw i32 %3, %2
  store i32 %add, i32* %s, align 4
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %4 = load i32, i32* %i, align 4
  %inc = add nsw i32 %4, 1
  store i32 %inc, i32* %i, align 4
  br label %for.cond

for.end:                                          ; preds = %for.cond
  %5 = load i32, i32* %s, align 4
  ret i32 %5
}
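Since the closed form is easy to get wrong, we can sanity-check it against the loop in plain C++ (my own sketch; 64-bit intermediates stand in for LLVM's i33 temporaries):

```cpp
#include <cstdint>

// The loop exactly as in the source example.
int sum_loop(int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += i;
    return s;
}

// The closed form LLVM derived: (n-1)*(n-2)/2 + n - 1.
// The 64-bit intermediates play the role of the i33 temporaries
// that protect the product from 32-bit overflow.
int sum_formula(int n) {
    if (n <= 0)
        return 0;  // mirrors the branch on %cmp6 in the optimized IR
    std::int64_t a = n - 1, b = n - 2;
    return static_cast<int>(a * b / 2 + (n - 1));
}
```

For n = 5, for example, both give 0+1+2+3+4 = 4*3/2 + 4 = 10.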

It's even more interesting to see what happens to this example in the backend. The i33 variable is converted to i64, and if your processor is 32-bit, instruction sequences are generated for multiplying and adding 64-bit numbers on a 32-bit machine. Even more interesting: if you change the data type in the original example to long, the argument and return value become i64, and the intermediate variables become i65!
The second example
Suppose we decided to write a function that changes the sign of a float by flipping bit 31 of its binary representation:

float sum(float x) {
  int val = *((int*) &x);
  int inv = val ^ (1 << 31);
  return *((float*)&inv);
}
When compiling for x86_64, nothing particularly interesting happens:

.LCPI0_0:
.long 2147483648 # float -0
.long 2147483648 # float -0
.long 2147483648 # float -0
.long 2147483648 # float -0
.text
.globl sum
.p2align 4, 0x90
.type sum,@function
sum: # @sum
.cfi_startproc
# BB#0: # %entry
xorps .LCPI0_0(%rip), %xmm0
retq
But if we compile for ARM64 (AArch64):

invert: // @invert
// BB#0: // %entry
fneg s0, s0
ret
LLVM recognized the flip of bit 31 as the fneg instruction, which flips the sign of a float. For comparison, GCC cannot do this and emits a "literal" version.
GCC 6.3 (ARM 64):

invert(float):
fmov w0, s0
eor w0, w0, -2147483648
fmov s0, w0
ret
This is an example of a target-specific optimization performed in the backend rather than in the opt utility.

A few more words about this example. Such pointer casts violate the strict aliasing rule, which can lead to undefined behavior when compiling with -fstrict-aliasing on some compilers and some platforms (in practice, in a vanishingly small number of cases). For this example, the error occurs when compiling with gcc 4.4.7 -m32 -O2 and disappears in more recent versions of gcc. In the link list at the end, I have included a link to an interesting lecture on aliasing.
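A strict-aliasing-safe way to express the same bit trick is to move the bytes with memcpy, a copy that compilers optimize away; a minimal sketch (the function name is mine):

```cpp
#include <cstdint>
#include <cstring>

// Flip bit 31 of a float's representation without violating strict
// aliasing: memcpy is the sanctioned way to reinterpret object bytes,
// and the unsigned shift avoids the questionable signed 1 << 31 too.
float invert_safe(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // float -> raw bits
    bits ^= UINT32_C(1) << 31;            // flip the sign bit
    std::memcpy(&x, &bits, sizeof bits);  // raw bits -> float
    return x;
}
```

Modern compilers typically recognize this pattern just as well and emit the same xorps or fneg as the pointer-cast version.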
The third example
Another example of target-specific optimization, this time for x86-64, or more precisely for the Haswell architecture.
Let's write a function that counts the set bits in a word:

int cntSetBits(int a) {
  int cnt = 0;
  while(a)
  {
    cnt++;
    a &= (a-1);
  }
  return cnt;
}
Compile with -O1 -march=haswell:

cntSetBits(int): # @cntSetBits(int)
popcnt eax, edi
ret
The entire function fits into a single popcnt instruction, which counts the number of set bits in a word.
Let's look at IR:

; Function Attrs: norecurse nounwind readnone uwtable
define i32 @cntSetBits(i32 %a) local_unnamed_addr #0 {
entry:
  %0 = call i32 @llvm.ctpop.i32(i32 %a), !range !2
  ret i32 %0
}
Here we see the llvm.ctpop.i32 intrinsic being used. It was inserted by the frontend, which uses high-level information about the code; the backend for this architecture knows this intrinsic and can replace it with a single special instruction.
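The same ctpop intrinsic can be reached from source without writing the loop: GCC and Clang expose __builtin_popcount (and C++20 adds std::popcount). A quick sketch comparing the builtin with the article's loop (assuming a GCC/Clang-style compiler):

```cpp
// Kernighan's loop from the example: each iteration clears the lowest
// set bit, so the loop body runs once per set bit.
int cntSetBits(int a) {
    int cnt = 0;
    while (a) {
        cnt++;
        a &= (a - 1);
    }
    return cnt;
}

// GCC/Clang builtin that maps directly onto llvm.ctpop.i32; with
// -march=haswell both functions become a single popcnt instruction.
int cntSetBitsBuiltin(int a) {
    return __builtin_popcount(a);
}
```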
Useful links
http://www.drdobbs.com/architecture-and-design/the-design-of-llvm/240001128 - Chris Lattner, "The Design of LLVM"
https://youtu.be/bSkpMdDe4g4 - Matt Godbolt, "What has my compiler done for me lately?"
https://youtu.be/ACW-7dktyDk - Dmitry Kashitsyn, "Trolleybus from a loaf: aliasing and vectorization in LLVM"
xially 6 november 2017, 11:47