
Flags

  • Current CFLAGS:
-pipe -Os -fgcse-las -flto=auto -fuse-linker-plugin -ffunction-sections -fdata-sections -fstack-protector-strong -fstack-clash-protection -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-ident -fno-plt -march=x86-64-v3 -mtls-dialect=gnu2
  • Current CXXFLAGS are identical to CFLAGS
  • Current LDFLAGS (CFLAGS are added for more effective LTO):
-Wl,-O1,-s,-z,defs,-z,noexecstack,-z,now,-z,pack-relative-relocs,-z,relro,-z,separate-code,-z,text,--as-needed,--gc-sections,--no-keep-memory,--relax,--sort-common,--enable-new-dtags,--hash-style=gnu,--build-id=none
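A minimal sketch of how the flags above could be exported for a build, assuming a POSIX shell and a build system that honors the CFLAGS, CXXFLAGS and LDFLAGS environment variables. Appending CFLAGS after the linker options is an assumption; the list above only says that CFLAGS are added to LDFLAGS.

```shell
# Sketch only: the flag values are copied from the list above.
CFLAGS="-pipe -Os -fgcse-las -flto=auto -fuse-linker-plugin -ffunction-sections -fdata-sections -fstack-protector-strong -fstack-clash-protection -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-ident -fno-plt -march=x86-64-v3 -mtls-dialect=gnu2"

# CXXFLAGS are identical to CFLAGS.
CXXFLAGS="$CFLAGS"

# CFLAGS are repeated in LDFLAGS so that LTO code generation at link time
# sees the same options (placement after the -Wl list is an assumption).
LDFLAGS="-Wl,-O1,-s,-z,defs,-z,noexecstack,-z,now,-z,pack-relative-relocs,-z,relro,-z,separate-code,-z,text,--as-needed,--gc-sections,--no-keep-memory,--relax,--sort-common,--enable-new-dtags,--hash-style=gnu,--build-id=none $CFLAGS"

export CFLAGS CXXFLAGS LDFLAGS
```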

CFLAGS (Ordered based on appearance in the GCC manual)

  • Use pipes over temporary files; more RAM but less disk usage
  • Ignored by clang
  • Compiling with -g0 or without -g at all should in theory result in no debugging information in the binaries
  • Some build systems misinterpret -g0 as -g, leaving debugging information in the binaries; this is a bug that should be reported to the respective upstream
  • Speeds up compilation time when -g is used
  • Does not make sense when -g is not used
  • Enabled for -O2 and disabled for -Os and -Oz
  • clang does not support -foptimize-strlen; clang implicitly performs this optimization
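One way to check the -g0 behavior described above is to look for .debug_* sections in a built binary. This is only a sketch: the helper name has_debug_sections is hypothetical, and it assumes readelf from binutils is installed.

```shell
# Hypothetical helper: succeeds if the ELF file given as $1 has any .debug_* section.
# Note: it also reports "no debug sections" when readelf is missing or the
# file cannot be read, so treat a negative result with care.
has_debug_sections() {
    readelf -S -W "$1" 2>/dev/null | grep -q '\.debug_'
}
```

It could then be used as `has_debug_sections ./some-binary && echo "debug info present"`, where ./some-binary stands for whatever the build produced.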

-fmodulo-sched, -fmodulo-sched-allow-regmoves and -fgcse-sm

  • Aggressive common subexpression elimination (CSE) and scheduling (particularly modulo scheduling) can dramatically increase register pressure
  • This leads to more loads and stores, causing spills and increasing code size
  • This makes performance worse than without them
  • It makes sense to have these off by default on register-starved machines like x86
  • Removes redundant load instructions which can reduce register pressure by reusing loaded values
  • Might reduce code size
  • Ignored by clang
  • Enables simple constant folding optimizations
  • Enabled by default on most targets; no need to mess with it
  • x86-64 does not have delay slots rendering this “legacy” optimization irrelevant
  • Should in theory decrease register pressure before allocation
  • Can decrease code size by preventing register pressure and subsequent spills in register allocation
  • No idea if it works with -fschedule-insns (which is not enabled by default at -O2 and -Os)
  • No idea if it works with -fschedule-insns2 (which is enabled by default at -O2 and -Os)
  • Permits the speculative motion of some load instructions before register allocation to minimize execution stalls due to data dependencies
  • Works with -fschedule-insns
  • -fschedule-insns2 is enabled by default at -O2 and -Os
  • Similar to -fsched-spec-load
  • Avoid options with “dangerous” in the name
  • Experimental option that might produce unreliable results and increase code size
  • Enabled by default at -O2
  • Do not enable manually; let PGO decide
  • Modulo scheduling is a software pipelining technique; thus it might increase code size with no proven performance gain
  • Most x86-64-v3 CPUs have hardware pipelining

-fselective-scheduling, -fselective-scheduling2, -fsel-sched-pipelining and -fsel-sched-pipelining-outer-loops

  • IA64 is probably the only target left requiring selective scheduling
  • Selective scheduling itself is in a poor state nowadays
  • -fsel-sched-pipelining has no effect without -fselective-scheduling or -fselective-scheduling2
  • -fsel-sched-pipelining-outer-loops has no effect without -fsel-sched-pipelining
  • https://gcc.gnu.org/pipermail/gcc-patches/2025-August/692322.html
  • This option is only for shared libraries/dynamic linking and breaks static binaries and libraries
  • Makes code built with -fPIC and LTO faster, and improves performance in general; might cause subtle ABI breakages
  • Breaks LD_PRELOAD which in turn breaks custom memory allocators like mimalloc
  • Contrary to popular belief, enabling this flag globally is safe (unless interposing symbols is required, for example when using different allocators on system libraries), but the reason for it not being enabled by default is to comply with the ELF standard. In contrast, this flag is part of the default when using Clang
  • https://maskray.me/blog/2021-05-09-fno-semantic-interposition
  • Abandoned and needs a major redesign
  • Does not scale, at least for now (according to openSUSE)
  • Increases memory usage and compilation time
  • Prone to having the compiler segfault with an internal compiler error which leads to all kinds of weird errors like duplicate case value (affected packages: bash, gcc, inetutils, libarchive, libedit, netbsd-curses, util-linux)
  • Prevents a lot of optimizations from gcc to produce output suitable for live-patching
  • Does not work with -flto
  • Graphite is not well maintained in gcc and will likely end up being removed entirely
  • Most of its developers moved to LLVM’s Polly
  • Can’t effectively optimize compared to baseline gcc
  • The optimizations it is supposed to perform are being implemented via other methods
  • Not necessarily buggy, but its benefits are rather doubtful nowadays
  • https://dl.acm.org/doi/full/10.1145/3674735
  • Required for isl to work
  • Previously called -floop-optimize-isl
  • The newer way to implement Graphite
  • Replaces -floop-interchange, -ftree-loop-linear, -floop-strip-mine and -floop-block
  • Still considered experimental
  • Increases register pressure and should not be used without -fsched-pressure
  • This takes a number and by default GCC only enables it for PowerPC, but disables it for other architectures
  • Not supported by clang
  • https://reviews.llvm.org/D4565
  • Has no use when -falign-functions is not used
  • Without the linker plugin LTO will not happen (particularly for static libraries as you will get the same code without -flto)
  • Spawns n parallel jobs based on the number of available CPU threads; similar to make -j
  • Use instead of -flto alone to get rid of the 128 LTRANS serial jobs message
  • gcc’s counterpart to ThinLTO is WHOPR; it used to be enabled with -fwhopr, but it is now the default LTO mode and -fwhopr has been removed from gcc’s options; -fno-fat-lto-objects is now the default as well
  • Available when zstd is the backend for LTO and it should in theory result in smaller binaries
  • clang does not support -flto-compression-level
  • Enabled by default if gcc is built with lto enabled
  • There is no actual guarantee that -fuse-linker-plugin will be used in cases where gcc is built without lto support and binutils is built without plugins support
  • Without this flag in the case above, -fwhole-program might be picked instead, which is not a good idea; pass -fuse-linker-plugin so that programs can rely on a linker plugin and successfully forward the LTO data to another linker (e.g. mold)
  • Ignored by clang
  • Enabled after decisions by PGO and shouldn’t be manually used everywhere; may cause regressions and produce bigger code that may or may not be fast

-ffunction-sections, -fdata-sections and -Wl,--gc-sections

  • -O2 uses the value 5, while -Os uses 1 which is more aggressive
  • gcc enables -funwind-tables by default and its documentation says that you normally do not need to enable this; instead, a language processor that needs this handling enables it on your behalf
  • It is better not to specify these options globally, as we don’t know whether they will be passed when building an executable or a shared library (passing -fpie/-fPIE when building a shared library is not a good thing)
  • It is better to have gcc configured with --enable-default-pie so that it knows when to pass these options
  • These options do not conflict with -fno-plt
  • -fno-pic can only be used by executables
  • -fpic can be used by both executables and shared objects
  • -fpie can only be used by executables
  • https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected
  • x86-64-v3 provides better performance and battery life
  • Automatically detected on modern 64-bit hosts and Linux targets
  • Automatically detected on modern 64-bit hosts and Linux targets
  • Remove -flto=auto -fuse-linker-plugin (and -flto-compression-level=3 if being used)
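Whether a given gcc was configured with --enable-default-pie, as recommended above, can be checked from its configuration line. A rough sketch, assuming the compiler is gcc (other compilers print a different -v banner); the helper name gcc_defaults_to_pie is hypothetical.

```shell
# Hypothetical helper: succeeds if the compiler in $CC (default: gcc)
# reports --enable-default-pie in its "Configured with:" line.
gcc_defaults_to_pie() {
    # gcc -v prints its configuration on stderr, hence the 2>&1.
    "${CC:-gcc}" -v 2>&1 | grep -q -e '--enable-default-pie'
}
```

If the check succeeds, the compiler already knows when to pass the PIE options, so there is no need to add -fpie/-fPIE to the global flags by hand.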

LDFLAGS (Ordered based on appearance in the GNU linker manual)

  • Enables linker optimizations which can reduce code size
  • bfd enables its optimizations for any non-zero value; the specific value makes no difference
  • lld also has a higher level, -O2, and uses -O1 by default
  • Ignored by mold

-x, --discard-all and -X, --discard-locals

  • Using -z,separate-code is good for security
  • Adding --rosegment when -z,separate-code is used makes resulting binaries smaller
  • Using -z,noseparate-code is a bad idea; remember how passing --disable-separate-code to binutils bloated every executable and shared library by at least 2 MB (for better huge page support)
  • With traditional -z,noseparate-code, bfd defaults to a RX/R/RW program header layout
  • With -z,separate-code (default on Linux/x86 from binutils 2.31 onwards), bfd defaults to a R/RX/R/RW program header layout
  • lld defaults to R/RX/RW(RELRO)/RW(non-RELRO)
  • With --rosegment, lld uses RX/RW(RELRO)/RW(non-RELRO)
  • Placing all R before RX is preferable as it can save one program header and reduce alignment costs
  • lld’s split of RW saves one maxpagesize alignment and can make the linked image smaller
  • This breaks some assumptions that the (so-called) “text segment” precedes the (so-called) “data segment”
  • If you use bfd’s noseparate-code or lld’s --no-rosegment, .rodata and .text will be placed in the same PT_LOAD segment
  • lld defaults to noseparate-code
  • --no-rosegment combines the read-only and the RX segments (output file will consume less address space at run-time)
  • AArch64 and PowerPC64 have a default MAXPAGESIZE of 65536 so -z noseparate-code default ensures that they will not experience unnecessary size increase
  • -z noseparate-code layouts waste half a huge page on unrelated content; switching to -z separate-code reclaims that half huge page but increases the file size
  • ld.bfd’s -z separate-code is essentially split into two options in lld: -z separate-code and --rosegment
  • The GitHub Actions workflow for rad uses ubuntu-latest, whose linker does not recognize --rosegment:
Nim Output /usr/bin/ld: unrecognized option '--rosegment'
... /usr/bin/ld: use the --help option for usage information
... collect2: error: ld returned 1 exit status
  • Provides “Stack Execution Protection”
  • Should already be gcc’s default behavior: it does not mark the stack as executable by default, and warns when that happens
  • Enforces Write XOR Execute (W^X)
  • lld defaults to -z,text
  • hardened_malloc builds with -z,text and -z,defs by default
  • lld does not support -z,x86-64-v3
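The unrecognized option '--rosegment' failure shown above can be avoided by probing the linker before adding the flag. A minimal sketch, assuming the linker lists its supported options in its --help output (not guaranteed for every linker); the helper ld_supports and the variable EXTRA_LDFLAGS are hypothetical names.

```shell
# Hypothetical helper: succeeds if the linker in $LD (default: ld)
# mentions the option given as $1 in its --help output.
ld_supports() {
    "${LD:-ld}" --help 2>&1 | grep -q -F -e "$1"
}

# Only add --rosegment when the linker appears to know about it.
EXTRA_LDFLAGS=""
if ld_supports --rosegment; then
    EXTRA_LDFLAGS="-Wl,--rosegment"
fi
```

The same probe works for other options that differ between bfd, lld and mold, as long as the linker in question documents them in --help.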

-Wl,--gc-sections and -Wl,-z,start-stop-gc


--no-keep-memory and --reduce-memory-overheads

  • Make memory consumption reasonable especially with the optimizations we are using (mainly LTO), at the expense of a slight increase in link time
  • lld and mold do not support --reduce-memory-overheads
  • Sorts COMMON symbols by decreasing alignment, which saves some padding resulting in minor size benefits
  • Can degrade performance if COMMON symbols in an object file have locality and --sort-common breaks that locality
  • Ignored by lld and mold
  • https://maskray.me/blog/2022-02-06-all-about-common-symbols
  • It does not make sense to compress “nonexistent” debug sections as we’re stripping everything with -s
  • The following flags are still being studied/tested:
# CFLAGS
-fgcse-las
-fsched-pressure
-fno-ident
# LDFLAGS
-z,defs
-z,separate-code
-z,text
--no-keep-memory
--build-id=none