Name

FDPR - Feedback-Directed Post-link Optimization for Linux on POWER


Synopsis

fdpr [--instrument [file]] [--train workload [--reset] ] [--optimize] [--log file] [-f, --profile-file file] [-o, --output-file file] [-V, --version] [-v, --verbose] [-h, --help] [fdprpro-options] [--] program

fdprpro -a action [fdprpro-options] program


Description

FDPR is a performance-tuning utility for reducing the execution time and the real-memory utilization of user-level application programs. The tool optimizes the executable image of a program by collecting information on the program's behavior under a typical workload and creating a new version of the program optimized for that workload. The new program generated by the post-link optimizer typically runs faster and uses less real memory than the original program.

Note: The post-link optimizer applies advanced optimization techniques to programs. Some aggressive optimizations may result in programs that do not behave as expected. Test the optimized program with at least the same test suite used to test the original program. An optimized program is not supported as input to the optimizer.

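The post-link optimizer builds an optimized executable program in three distinct phases:

  1. Instrumentation: creates an instrumented version of the program together with an empty profile file.

  2. Training: runs the instrumented program on a representative workload to fill the profile file.

  3. Optimization: generates the optimized program, using the collected profile.

See the corresponding options for further details.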

The three-phase process can be achieved by using fdpr or fdprpro.

fdpr provides a convenient user interface, enabling the three phases, or any legal combination thereof, to be performed in one command.

More experienced users may prefer to use fdprpro, which performs the actual processing. fdprpro provides explicit control over the actual processing and requires a separate activation to perform either the instrumentation or the optimization phases. This is specified by the action option -a|--action action, where the action term is "instr" to perform instrumentation or "opt" to perform optimization.

Note: The instrumented executable, created in the instrumentation phase and run in the training phase, typically runs several times slower than the original program. Due to the increased execution time required by the instrumented program, the executable should be invoked in such a way as to minimize execution duration, while still fully exercising the required code areas.
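The three phases, driven directly through fdprpro, might look like the sketch below. It uses only the -a, -f, -o and -O options described in this page; myprog, its arguments, and the helper names run and three_phases are placeholders invented for the illustration. By default the commands are only printed, so they can be reviewed before anything is executed.

```shell
#!/bin/sh
# Sketch only: the three phases driven directly through fdprpro.
# "myprog" and the workload arguments are placeholders.
# By default the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

three_phases() {
    prog=$1
    # Phase 1: instrumentation (creates the instrumented program and the profile file).
    run fdprpro -a instr -f "$prog.nprof" -o "$prog.instr" "$prog" || return 1
    # Phase 2: training; each run accumulates counts in the profile file.
    run "./$prog.instr" arg1 arg2 || return 1
    # Phase 3: optimization with the basic -O set, reading the collected profile.
    run fdprpro -a opt -O -f "$prog.nprof" -o "$prog.fdpr" "$prog"
}

three_phases myprog
```

The fdpr wrapper performs the equivalent steps in one command; the manual form is mainly useful when the training runs happen at a different time or place than the other two phases.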


fdpr options

--instrument [file]

Creates an instrumented executable program with the specified name (default program.instr). Default: no instrumentation phase.

--reset

Normally, each time the instrumented program runs, it accumulates profile information in the profile file (see --profile-file). Specifying this option causes the initial profile file, saved in profile-name.template, to be copied to the profile file. This effectively resets the profile information to its empty state. The option requires --train to be specified as well.

--train workload

Runs the instrumented program and creates the profile data. The workload is a script that accepts one parameter: the executing program. fdpr invokes the script with the path to the instrumented program. If the instrumentation phase is not specified, the instrumented program is assumed to be program.instr. Default: no profiling phase.

--optimize

Generates the optimized executable program file. Users can specify optimizations explicitly by passing optimization options to fdprpro (see fdprpro options below). If no fdprpro optimization option is specified, the fdprpro -O option is used.

-o output_file, --output-file output_file

The optimized output file. The default is program.fdpr.

-f, --profile-file file

The profile file. This is used as an output file in the instrumentation phase and as an input file in the optimization phase. The default is program.nprof.

-V, --version

Prints version information and exits.

-v, --verbose

Prints progress indication and statistical information during processing.

-h, --help

Prints usage information and exits.

The above options can be shortened to any unique sequence.

To disambiguate option parsing, separate the options from program by '--'. For example, because the parameter to --instrument is optional, the following command is illegal:

  $ fdpr --instr myprog

Instead, use the command:

  $ fdpr --instr -- myprog


Input files

The input file to fdpr should be an ELF executable or shared library (.so file). Both ELF32 and ELF64 are supported.

Note: The executable program should be built with relocation information. fdpr supports both the GCC and XLC compilers and the GNU linker. To leave the relocation information in the executable file, use the linker with the --emit-relocs (or -q) option. This can be specified in the GCC command by -Wl,-q.


Instrumentation and profiling

Along with the instrumented file, fdpr creates the profile file. The file is then filled with profile information (i.e., counts at various points in the program), while the instrumented program runs with its specified workload.

Note: The instrumented program requires a shared library called libfdprinst32.so (or libfdprinst64.so for ELF64 programs). A proper installation from the RPM file ensures the libraries are found. Alternatively, make sure the environment variable LD_LIBRARY_PATH is set to the directory containing these libraries.

The instrumented program expects the profile file to be in the same directory as the instrumented program. To override this, set the environment variable FDPR_PROF_DIR to the required directory. Specifying the profile file by its full pathname, either with the -f profile_file option or via FDPR_PROF_DIR, is important if the program changes its directory during runtime or if it is executed from a directory other than the one where it was built.
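One way to apply this advice is to pin FDPR_PROF_DIR to an absolute path inside the workload script itself. The following is a minimal sketch, assuming the profile file sits next to the instrumented program; the function name train and the arguments arg1 and arg2 are hypothetical.

```shell
#!/bin/sh
# Sketch of a --train workload script; train() and the arguments are hypothetical.

train() {
    prog=$1; shift
    # Pin the profile directory to an absolute path (here: the directory that
    # holds the instrumented program), unless the caller has already set it.
    if [ -z "$FDPR_PROF_DIR" ]; then
        FDPR_PROF_DIR=$(cd "$(dirname "$prog")" && pwd)
    fi
    export FDPR_PROF_DIR
    "$prog" "$@"
}

# fdpr passes the instrumented program as the only parameter of the script.
[ $# -eq 0 ] || train "$1" arg1 arg2
```

Because the variable is exported before the program starts, the setting stays valid even if the program later changes its working directory.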


Optimizations

By default, fdpr performs code reordering optimization together with the optimizations of branch prediction, branch folding, code alignment, and removal of redundant NOOP instructions (see the fdprpro option -O below).

Additional optimizations are available explicitly by indicating specific fdprpro options (see below).


Examples

The following are typical usage examples of fdpr.

  1. In this simple example, fdpr performs all three phases. Here, myprog is the input executable and test is a shell script that invokes myprog.

            $ fdpr --instr --train test --opt myprog

    The test script should look something like this:

            # code to exercise myprog
            $1 arg1 arg2 ...

    fdpr generates the instrumentation in myprog.instr, runs the script test, performs the default optimizations, and generates the output file in myprog.fdpr.

  2. Perform specific optimizations, producing the output in myprog.lro.

            $ fdpr --opt --link-register-optimization -RC -o myprog.lro myprog

    This command performs only link-register optimization and code reordering, using the profile information in myprog.nprof.


fdprpro options

fdprpro accepts a host of optimization-specific options. In addition, there are several options that create auxiliary files for debugging purposes (e.g., code disassembly).

Analysis Options:

-[no]aawc, --[no]analyze-assembly-written-csects

Analyze objects written in Assembly.

-acf analysis-configuration-file, --analysis-configuration-file analysis-configuration-file

Provide a configuration file of analysis information (advanced option).

-asd, --analyze-static-data

Analyze static data objects as distinct data elements for data reordering (unsafe for certain compilers).

-esa, --extra-safe-analysis

Limit analysis phase to compiler generated code.

-fca, --funcsect-analysis

Apply special analysis for an input executable that was compiled with the -qfuncsect compiler option.

-ff string, --file-format string

Input file format: can be LM (load module) or PO (program object).

-ifl file, --ignored-function-list file

Set the ignored function list. The file contains names of functions that are considered unsafe and thus are not modified.

-iinf, --ignore-info

Ignore .info sections produced with the -qfdpr option during compile time.

Instrumentation Options:

-ei, --embedded-instrumentation

Perform embedded instrumentation. The profile will be collected into the application's global data area. When the application terminates, the collected data will be lost.

-fd Fdesc, --file-descriptor Fdesc

Set the file descriptor number to be used when opening the profile file. The default of Fdesc is set to the maximum-allowed number of open files.

-icvp, --instr-call-value-profiling

Instrument the values of parameters passed in function calls.

-imullX, --mullX-instrumentation

Perform value profiling of RA and RB operands in mullX instructions.

-[no]iderat, --[no]derat-instrumentation

Perform value profiling of RA and RB operands in load/store indexed instructions.

-issu, --instrumentation-safe-stack-usage

Ensure that additional stack space is properly allocated for the instrumented run. Use this option if your application uses the stack extensively (e.g., when the program uses alloca()). Note that this option adds extra overhead on instrumentation code.

-iso offset, --instrumentation-stack-offset offset

Set the offset from the stack, a negative number, where the instrumentation's area for saving registers is kept at runtime. Use with care.

-M addr, --profile-map addr

Set the shared memory segment address for profiling. An alternative shared memory address is needed when the instrumented program conflicts with the shared-memory addresses reserved for profiling. Typical alternative values are 0x40000000, 0x50000000, ... up to 0xC0000000. The default is set to 0x3000000.

-ptm, --profile-to-memory

Use shared memory key instead of file mapping to obtain a shared memory area for the profile data.

-[no]ri, --[no]register-instrumentation

Instrument the input program file to collect profile information about indirect branches via registers. The default is set to collect the profile information.

-[no]sfp, --[no]save-floating-point-registers

Save the floating point registers in the instrumented code. The default is set to save floating point registers.

-shmkey key-number, --shared-memory-key key-number

Specify a shared memory key to use when creating a shared memory area for the profile. The default key is created by hashing the profile file name (with ftok).

Profile Files Options:

-af prof_file, --ascii-profile-file prof_file

Set the name of a text format profile file containing profile information.

-aop, --accept-old-profile

Accept the old profile file collected on previous versions of the input program file (requires the -f flag).

-f prof_file, --profile-file prof_file

Set the profile file name. The profile file is created during the instrumentation phase and read during the optimization phase. The profile file is updated each time you run the instrumented program.

-fdir prof_file_dir, --profile-file-directory prof_file_dir

Set the run-time location of the profile file. The profile is searched for at this location during the profiling phase. The default location is the path given in the profile file name (-f option). Applicable only at the instrumentation phase.

Optimization Options:

-A alignment, --align-code alignment

Specify code alignment strategy. 1: Use grouping rules of target machine (default), 2: Same as 1 but consider also hotness of branch targets. See -m for the selected machine model.

-abb factor, --align-basic-blocks factor

Align basic blocks that are hotter than the average by a given (float) factor. This is a lower-level machine-specific alignment compared to --align-code. Value of -1 (the default) disables this option.

-bf, --branch-folding

Eliminate branch to branch instructions.

-ccc threshold, --cold-code-connector threshold

Preserve the original order of code that is executed less frequently than the given threshold.

-bldcg, --build-dcg

Build a Data Connectivity Graph (DCG) for enhanced data reordering (applicable only with the -RD flag).

-bp, --branch-prediction

Set branch prediction bit for conditional branches according to the collected profile.

-btcar, --branch-table-csect-anchor-removal

Eliminate load instructions used when accessing branch tables.

-cbsi, --chain-based-selective-inline

Perform selective inlining of functions that produce long hot chains of code.

-cbtd, --convert-bss-to-data

Convert BSS section into a data section. This is useful for more aggressive tocload and RD optimizations.

-cRD, --conservativeRD

Perform conservative static data reordering by packing together all frequently referenced static variables.

-dce, --dead-code-elimination

Eliminate instructions related to unused local variables within frequently executed functions. This is useful mainly after applying function inlining optimization.

-dp, --data-prefetch

Insert data-cache prefetch instructions to improve data-cache performance.

-dpht threshold, --data-placement-hotness-threshold threshold

Set data placement algorithm hotness threshold between (0,1), where 0 reorders the static variables in large groups based on the control flow, and 1 reorders the variables in very small groups based on their access frequency. (This is applicable only with the -RD flag).

-dpnf factor, --data-placement-normalization-factor factor

Set data placement algorithm normalization factor between (0,1), where 0 causes static variables to be reordered regardless of their size, and 1 locates only small sized variables first. (Applicable only with the -RD flag).

-ece, --epilog-code-eliminate

Reduce code size by grouping common instructions in function epilogs, into a single unified code.

-fatc num_of_bytes, --fat-const num_of_bytes

Inflate constant areas in code section by adding num_of_bytes (entire set to 255) to each constant area.

-fatd num_of_bytes, --fat-data num_of_bytes

Inflate data section by adding num_of_bytes (entire set to 255) to each data basic unit.

-fatn num_of_nops, --fat-nop num_of_nops

Inflate the code section by adding num_of_nops NOP instructions to each code basic block.

-bined -binary_editor, --binary-editor -binary_editor

Edit existing binary code (advanced option).

-fc, --function-cloning

Enable function cloning phase only during function inlining optimizations (applicable only with function inlining flags: -i, -si, -ihf, -isf, -shci).

-hr, --hco-reschedule

Relocate instructions from frequently executed code to rarely executed code areas, when possible.

-hrf factor, --hco-resched-factor factor

Set the aggressiveness of the -hr optimization option according to a factor value between (0,1), where 0 is the least aggressive factor (applicable only with the -hr option).

-tasr, --toc-anchor-store-reschedule

Relocate TOC store instructions from frequently executed code to rarely executed code areas, when possible.

-i, --inline

Same as --selective-inline with --inline-small-funcs 12.

-ihf pct, --inline-hot-functions pct

Inline all function call sites to functions that have a frequency count greater than the given pct frequency percentage.

-isf size, --inline-small-funcs size

Inline all functions that are smaller than or equal to the given size in bytes.

-kr, --killed-registers

Eliminate stores and restores of registers that are killed (overwritten) after frequently executed function calls.

-lap, --load-address-propagation

Eliminate load instructions of variable addresses by re-using pre-loaded addresses of adjacent variables.

-las, --load-after-store

Add NOP instructions to place each load instruction further apart following a store instruction that references the same memory address.

-plas, --pattern-based-load-after-store

Optimizes inefficient memory access patterns in order to avoid load-after-store events.

-ebplas, --event-based-pattern-based-load-after-store

Optimizes inefficient memory access patterns in order to avoid load-after-store events. The optimization is possible if PM_MRK_LSU_REJECT_LHS profile is available.

-lro, --link-register-optimization

Eliminate saves and restores of the link register in frequently-executed functions.

-lu aggressiveness_factor, --loop-unroll aggressiveness_factor

Unroll short loops containing one to several basic blocks according to an aggressiveness factor between (1,9), where 1 is the least aggressive unrolling option for very hot and short loops.

-lun unrolling_number, --loop-unrolling-number unrolling_number

Set the number of unrolled iterations in each unrolled loop. The allowed range is between (2,50). Default is set to 2. (Applicable only with the -lu flag).

-lux unrolling_factor, --loop-unroll-extended unrolling_factor

Unroll hot loops using the given unrolling factor. The allowed values are integers that are powers of 2. A value of -1 disables the optimization; a value of 1 calculates the unrolling factor automatically for the given machine model.

-nop, --nop-removal

Remove NOP instructions from reordered code.

-O

Switch on basic optimizations only. Same as -RC -nop -bp -bf.

-O2

Switch on less aggressive optimization flags. Same as -O -hr -pto -isf 8 -tlo -kr -see 0.

-O3

Switch on aggressive optimization flags. Same as -O2 -RD -isf 12 -si -lro -las -vro -btcar (for XCOFF files) -lu 9 -rt 0 -so -see 1 -oderat.

-O4

Switch on aggressive optimization flags together with aggressive function inlining. Same as -O3 -sidf 50 -ihf 20 -sdp 9 -shci 90 and -bldcg (for XCOFF files).

-ocvp, --opt-call-value-profiling

Specialize function calls according to the values of their passed parameters.

-omullX, --mullX-optimization

Optimize mullX instructions by adding a run-time check on RA and RB and performing equivalent operations with lower penalty. The optimization requires the use of -imullX in the instrumentation phase.

-oderat, --derat-optimization

Optimize load/store indexed instructions by adding a run-time check on RA and RB and performing equivalent operations with lower penalty. The optimization requires the use of -iderat in the instrumentation phase.

-pbsi, --path-based-selective-inline

Perform selective inlining of dominant hot function calls based on the control flow paths leading to hot functions.

-pc, --preserve-csects

Preserve CSects' boundaries in reordered code.

-pca, --propagate-constant-area

Relocate the constant variables area to the top of the code section when possible.

-pfb, --preserve-first-bb

Preserve original location of the entry point basic block in program.

-pp, --preserve-functions

Preserve functions' boundaries in reordered code.

-[no]pr, --[no]ptrgl-r11

Perform removal of R11 load instruction in _ptrgl csect.

-pto, --ptrgl-optimization

Perform optimization of indirect call instructions via registers by replacing them with conditional direct jumps.

-ptoht heatness_threshold, --ptrgl-optimization-heatness-threshold heatness_threshold

Set the frequency threshold for indirect calls that are to be optimized by -pto optimization. Allowed range between 0 and 1. Default is set to 0.8. (Applicable only with -pto flag).

-ptosl limit_size, --ptrgl-optimization-size-limit limit_size

Set the limit of the number of conditional statements generated by -pto optimization. Allowed values are between 1 and 100. Default value is set to 3. (Applicable only with the -pto flag).

-RC, --reorder-code

Perform code reordering.

-rcaf aggressiveness_factor, --reorder-code-aggressivenes-factor aggressiveness_factor

Set the aggressiveness of code reordering optimization. Allowed values are [0 | 1 | 2], where 0 preserves the original code order and 2 is the most aggressive. Default is set to 1. (Applicable only with the -RC flag).

-rccrf reversal_factor, --reorder-code-condition-reversal-factor reversal_factor

Set the threshold fraction that determines when to enable condition reversal for each conditional branch during code reordering. Allowed input range is between 0.0 and 1.0 where 0.0 tries to preserve original condition direction and 1.0 ignores it. Default is set to 0.8 (Applicable only with the -RC flag).

-rcctf termination_factor, --reorder-code-chain-termination-factor termination_factor

Set the threshold fraction that determines when to terminate each chain of basic blocks during code reordering. Allowed input range is between 0.0 and 1.0 where 0.0 generates long chains and 1.0 creates single basic block chains. Default is set to 0.05. (Applicable only with the -RC flag).

-RD, --reorder-data

Perform static data reordering.

-ppcf, --pp-cross-func

Perform cross function path profiling.

-ppme, --pp-max-edges

Perform edges number limitation.

-rmte, --remove-multiple-toc-entries

Remove multiple TOC entries pointing to the same location in the input program file.

-rt removal_factor, --reduce-toc removal_factor

Perform removal of TOC entries according to a removal factor between (0,1), where 0 removes non-accessed TOC entries only and 1 removes all possible TOC entries.

-rtb, --remove-traceback-tables

Remove traceback tables in reordered code.

-rcs, --remove-csect-symbols

Remove csect symbols.

-sdp aggressiveness_factor, --stride-data-prefetch aggressiveness_factor

Perform data prefetching within frequently executed loops based on stride analysis, according to an aggressiveness factor between (1,9), where 1 is the least aggressive.

-sdpila instructions_number, --stride-data-prefetch-instruction-look-ahead instructions_number

Set the number of instructions for which data is prefetched into the cache ahead of time. Default value is platform dependent. (Applicable only with the -sdp flag).

-sdpms stride_min_size, --stride-data-prefetch-min-size stride_min_size

Set the minimal stride size in bytes, for which data will be considered a candidate for prefetching. Default value is set to 128 bytes. (Applicable only with the -sdp flag).

-ebp evt_based_prefetch, --event-based-prefetch evt_based_prefetch

Perform data prefetching based on the events file.

-ebpla instructions_number, --event-based-prefetch-look-ahead instructions_number

Set the number of instructions for which event based prefetch is performed. Default value is platform dependent. (Applicable only with the -ebp flag).

-see level

Use simplified prolog/epilog for functions that perform conditional early-exit. Use basic optimization with level=0 and maximal with level=1.

-shci pct, --selective-hot-code-inline pct

Perform selective inlining of functions in order to decrease the total number of execution counts, so that only functions with hotness above the given percentage are inlined.

-si, --selective-inline

Perform selective inlining of dominant hot function calls.

-sidf percentage_factor, --selective-inline-dominant-factor percentage_factor

Set a dominant factor percentage for selective inline optimization. The allowed range is between 0 and 100. Default is set to 80. (Applicable only with the -si and -pbsi flags).

-siht frequency_factor, --selective-inline-hotness-threshold frequency_factor

Set a hotness threshold factor percentage for selective inline optimization to inline all dominant function calls that have a frequency count greater than the given frequency percentage. Default is set to 100. (Applicable only with the -si and -pbsi flags).

-slbp, --spinlock-branch-prediction

Perform branch prediction bit setting for conditional branches in spinlock code containing l*arx and st*cx instructions. (Applicable after -bp flag).

-sldp, --spinlock-data-prefetch

Perform data prefetching for memory access instructions preceding spinlock code containing l*arx and st*cx instructions.

-sll Lib1:Prof1,...,LibN:ProfN, --static-link-libraries Lib1:Prof1,...,LibN:ProfN

Statically link hot code from specified dynamically linked libraries to the input program. The parameter consists of a comma-separated list of libraries and their profiles. IMPORTANT: Licensing rights of specified libraries should be observed when applying this copying optimization.

-sllht hotness_threshold, --static-link-libraries-hotness-threshold hotness_threshold

Set hotness threshold for the --static-link-libraries optimization. The allowed input range is between 0 (least aggressive) and 1, or -1, which does not require a profile and selects all code that might be called by the input program from the given libraries. Default is set at 0.5.

-so, --stack-optimization

Reduce the stack frame size of functions that are called with a small number of arguments.

-spc, --shortcut-plt-calls

Shortcut PLT calls in shared libraries to local functions if they exist. Note: Resolving to external symbols is disabled for such calls.

-stf, --stack-flattening

Merge the stack frames of inlined functions with the frames of the calling functions.

-tb, --preserve-traceback-tables

Force the restructuring of traceback tables in reordered code. If -tb option is omitted, traceback tables are automatically included only for C++ applications that use the Try & Catch mechanism.

-tlo, --tocload-optimization

Replace each load instruction that references the TOC with a corresponding add-immediate instruction via the TOC anchor register, where possible.

-ucde, --unreachable-code-data-elimination

Remove unreachable code and non-accessed static data.

-vro, --volatile-registers-optimization

Eliminate stores and restores of non-volatile registers in frequently executed functions by using available volatile registers.

-vrox, --volatile-registers-extended-optimization

Eliminate stores and restores of non-volatile registers in frequently executed functions by using available volatile registers. The extended version supports FP registers and transparency.

-dlo layout-file, --data-layout-optimization layout-file

.

Output Options:

-cep, --complement-edge-profile

Complements partial profile information given for the basic blocks' frequencies by adding missing basic block-to-basic block edge counts.

-d, --disassemble-text

Print the disassembled text section of the output program into output_file.dis_text file.

-dap, --dump-ascii-profile

Dump profile information in ASCII format into program.aprof (requires the -f flag).

-db, --disassemble-bss

Print the disassembled bss section of the output program into output_file.dis_bss file.

-dd, --disassemble-data

Print the disassembled data section of the output program into output_file.dis_data file.

-diap, --dump-initial-ascii-profile

Dump the given profile information in ASCII format into program.aprof.init (requires the -f flag).

-dim, --dump-instruction-mix

Dump instruction mix statistics based on gathered profile information.

-dm, --dump-mapper

Print a map of basic blocks and static variables with their respective new -> old addresses into a program.mapper file.

-o output_file, --output-file output_file

Set the name of the output file. The default instrumented file is program.instr. The default optimized file is program.fdpr.

-pif, --print-inlined-funcs

Print the list of inlined functions along with their corresponding calling functions into an output_file.inl_list file (requires the -si or -i or -isf flags).

-pds, --preserve-debug-symbols

Preserve debug symbols.

-plc, --preserve-linkage-conventions

Preserve linkage conventions.

-ppcf, --print-prof-counts-file

Print a text format of the profiling counters into a program.counts file (requires the -f flag).

-sf, --strip-file

Strip the output file.

-simo, --single-input-multiple-outputs

Optimize in parallel into multiple outputs as specified by option sets read from stdin.

General Options:

-h, --help

Print the online help.

-j jour_file, --journal jour_file

Output optimization journal information to jour_file.

-m machine-model, --machine machine-model

Generate code for the specified machine model. Target machine can be one of the following models: power2, power3, ppc405, ppc440, power4, ppc970, power5, power6, power7, ppe, spe, spe_edp, z10, z9. Default is power7.

-q, --quiet

Set the output mode to quiet, suppressing informational messages.

-st stat_file, --statistics stat_file

Output statistics information to stat_file. If stat_file is '-', the output goes to the standard output. See --verbose for the default.

-v level, --verbose level

Set verbose output mode level. When set, various statistics about the output program are printed into the file output_file.stat. Allowed level range is between 0 and 3. Default is set to 0.

-V, --version

Print the version number.

-w level, --warning-level level

Set the warning level so only errors of this level and below will be printed. The levels are: 1: errors, 2: warnings, 3: debug warning, 4: debug information. Default is 2.


Default values of options

The default values of options can be determined from the statistics file (see Human-readable output). The options listed under 'options. ...' are the user-specified options plus the ones enabled by them. In the example shown in that section, specifying -O3 entailed, among others, the setup of the -hr option (Hot-Cold Optimization rescheduling) and of the -hrf option (HCO Rescheduling Factor) with the value of 0.1.


ASCII profile

By default, the profile generated by fdprpro is in an internal binary format. To allow external tools to generate the profile, an ASCII profile is also supported (see --ascii-profile-file).

The format of the ASCII profile file is:

 <Simple> address execCount </Simple>
 <Cond> address execCount fallthruCount </Cond>
 <Reg> address execCount fallthruCount regIndex 
 type1 value1 execCount1
 type2 value2 execCount2
 ...
 typeN valueN execCountN 
 </Reg>

The profile file is a set of profile entries: Simple, Cond, and Reg. The types in <Reg> entries are Abs for absolute values, Text for text addresses, and Data for data addresses. No other tags are defined. There must be no white space between the letters of a tag, and comments are not allowed. Addresses and values can be in decimal or in hex form (starting with 0x).

For example -

 <Simple> 0x100000240 10 </Simple>
 <Simple> 0x100000250 20 </Simple>
 <Cond> 0x100000260 20  10 </Cond>
 <Simple> 0x100000270 20 </Simple>
 <Reg>  0x100000260 20  10 17
 Abs 23 5
 Text  0x100000300 5 
 Data  0x200000400 10
 </Reg>

The order of the profile entries is not important, although for better readability they should be sorted according to address. The ASCII profile file (extension .aprof) should contain entries for code executed at least once. Code with execCount = 0 should not be included (it is not forbidden, but it provides no information to fdpr). Generally it is sufficient to provide one profile entry for each executed basic block. The address of that profile entry can be any address within the basic block. Since fdprpro's internal basic block partitioning is not always known, several profile entries may be provided for a single basic block, up to the maximum of one profile entry for each instruction. When several profile entries are provided for a single basic block and they contain conflicting information (e.g., different execCount), fdprpro produces a warning starting with "Conflicting profiling" and ignores the later conflicting information.
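As a rough illustration of how an external tool might emit this format, the following sketch converts a list of "address execCount" pairs into <Simple> entries and drops entries whose count is zero. The two-column input layout and the function name counts_to_aprof are assumptions made for the example; they are not part of fdprpro.

```shell
#!/bin/sh
# Sketch: turn "address execCount" pairs (one per line, decimal counts)
# into <Simple> entries of the ASCII profile format.
# counts_to_aprof is a hypothetical helper name.
counts_to_aprof() {
    # Keep only well-formed two-column lines with a non-zero count.
    awk 'NF == 2 && $2 + 0 > 0 { printf "<Simple> %s %s </Simple>\n", $1, $2 }'
}

# Example input: the middle entry is skipped because its count is 0.
printf '%s\n' '0x100000240 10' '0x100000250 0' '0x100000270 20' | counts_to_aprof
# prints:
#   <Simple> 0x100000240 10 </Simple>
#   <Simple> 0x100000270 20 </Simple>
```

Cond and Reg entries need additional fields (fallthruCount, regIndex, and the typed value lines) and are left out of this sketch.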


Human-readable output

In addition to the optimized or instrumented program, fdprpro produces human readable output.

1. Standard output. The text that goes to standard output includes the sign-on message, progress information and sign-off message. The progress information displays the passage of fdprpro along the different phases of processing, as follows:

        fdprpro (FDPR) <version> Linux/POWER
        fdprpro -a opt -O3 li.linux.gcc32.base
        > reading_exe ...
        > adjusting_exe ...
        > analyzing ...
        > building_program_infrastructure ...
        ...
        > updating_executable ...
        > writing_executable ...
        bye.

If the --quiet option is specified, no output is produced here.

2. Standard error. As usual, warning and error messages are written to the standard error file. Note that fdprpro exits after the first error.

3. Statistics file. If the --verbose <level> option is selected, various kinds of statistics about the program will be written to the statistics file, output_file.stat. The file consists of a list of tables, typically in a form of <attribute> <value> per line. The amount of information is determined by level. The following is an example, corresponding to the above invocation:

        options. group               active_options
        options. optimization        -bf -bp -dp -hr -hrf 0.10 -kr -las -lro -lu 9 -isf 12 -nop -pr -RC -RD -rt 0.00 -si -tlo -vro
        options. output              -o 1.base
        global.use_try_and_catch:              0
        global.profile_info:            not_available
        file.input:                     li.linux.gcc32.base
        file.output:                    1.base
        file.statistics:                1.base.stat
        analysis.csects:                     347
        analysis.functions:                  343
        analysis.constants:                   13
        analysis.basic_blocks:              5360
        analysis.function_descriptors:         0
        analysis.branch_tables:               10
        analysis.branch_table_entries:       374
        analysis.unknown_basic_units:         17
        analysis.traceback_tables:             0
        ...

Note that the options listed in the optimization group are the ones actually enabled by the -O3 option. See below.
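Because the statistics file is line-oriented, it is easy to post-process. The following is a minimal sketch, assuming the file looks like the excerpt above: rows of the form attribute: value are collected into a dictionary, and rows without a colon (such as the "options." rows and elided lines) are skipped. The function name parse_stat_lines is hypothetical, and the layout is inferred from the example only; other verbosity levels may produce tables that this sketch does not handle.

```python
def parse_stat_lines(lines):
    """Collect 'attribute: value' rows from a .stat file into a dict.

    Rows without a colon (e.g. the 'options.' rows in the example) are
    skipped. Purely numeric values are converted to int; all other values
    are kept as stripped strings.
    """
    stats = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not an 'attribute: value' row
        value = value.strip()
        stats[key.strip()] = int(value) if value.isdigit() else value
    return stats
```

A typical use would be to open output_file.stat, pass the file object to parse_stat_lines, and compare counters such as analysis.basic_blocks between two runs.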


Importing code from shared libraries

Typically, fdprpro optimizes a single target module (an executable file or a shared library) without considering the cross-module flow of the program. The --static-link-libraries option allows fdprpro to go beyond the boundary of the target module and import hot (i.e., heavily used) code from other modules to which it is dynamically linked. These modules are referred to below as SLL libraries.

For example, to import hot code from mylib.so into myprog, using the profile mylib.so.prof, use the following command:

 $ fdprpro -sll mylib.so:mylib.so.prof -O3 -o myprog.fdpr -f myprog.prof myprog

For better performance results, it is highly recommended that users collect the profiles of the specified SLL libraries with the same workload as the one used for training the target program.

IMPORTANT: If an SLL library is later upgraded, the optimization must be rerun with the upgraded library to keep the correspondence valid between that library and the target module.

IMPORTANT: It is the responsibility of the user to ensure that code copying from SLL libraries is compliant with the usage license of these libraries.


Limited-Value Profiling (LVP)

Starting with release 5.4.0.18, fdprpro provides special optimizations that look for operations with specific values and replace them with an optimized sequence. Such optimizations, which are typically target-specific, require corresponding instrumentation that profiles the code to identify potential sequences. The first optimization that uses LVP is the -omullX optimization, which performs strength reduction on selected instances of integer multiplication. The user needs to specify -imullX for instrumentation and -omullX for optimization. To tune the optimization for Power6, also specify -m power6.
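As a reminder of what strength reduction means in general, the sketch below replaces a multiplication by a known non-negative constant with shifts and additions. It illustrates the concept only: the function name multiply_by_constant is hypothetical, and the code does not represent the instruction sequence that fdprpro actually emits, which this document does not describe.

```python
def multiply_by_constant(x, constant):
    """Compute x * constant (constant >= 0) using only shifts and adds.

    Each set bit of the constant contributes x shifted left by that bit's
    position, so a multiplication by a known value becomes a short
    sequence of cheaper operations.
    """
    result = 0
    shift = 0
    while constant:
        if constant & 1:
            result += x << shift
        constant >>= 1
        shift += 1
    return result
```

For a multiplier of 10 (binary 1010), the loop adds x shifted by 1 and x shifted by 3, that is, 2x + 8x.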


Conservative vs. aggressive data reordering

The data reordering algorithm of fdprpro is enabled by the -RD option and is available only for ELF64 (64-bit) programs. The algorithm reorders data elements to achieve better data cache efficiency as well as more effective instruction selection. It may operate on all data elements or only on a subset of them, depending on the selected aggressiveness. By default, a conservative algorithm is selected, which does not reorder the user's static data (i.e., data defined in the .bss and .data sections). This is needed to protect against data access optimizations used in GCC 4.3 and later. A more aggressive optimization, which considers all data elements, is possible with the --analyze-static-data (-asd) option.


Runtime instrumentation stack

fdprpro inserts certain code stubs during instrumentation which perform the necessary counting. To keep the program's state intact, the registers changed by these stubs are saved at the beginning of the stub and restored at the end. The save area lies at a negative offset from the stack pointer, and writing below the stack can cause segmentation violations in rare cases. This was found to occur in applications that use the alloca() function or that employ multi-threading. To overcome the segmentation violation, use the --instrumentation-safe-stack-usage (-issu) option. The option adds code that prevents the signal at the cost of increased code size (up to 20%). The user can also set the offset of the save area from the stack pointer, which must be negative, using the --instrumentation-stack-offset (-iso) option.


Alignment strategies

The alignment flag -A (--align-code) indicates the alignment strategy to use. The strategy codes are:

1 - An alignment strategy based on the instruction grouping of the selected target machine. See the -m (--machine) option for the possible machine models and the default value of this option.

2 - An alignment strategy based on instruction grouping as in (1) above, while also considering the hotness of the branch targets. This typically makes prefetching the target instruction stream more efficient.


Warnings and errors during profiling

In exceptional conditions during profiling (training), the instrumentation code produces warning and error messages. These messages are written to a special file named <profile file>.errors_pid_tid to avoid interleaving them with the regular text produced by the user's program. As with the profile file itself, the directory where the profile error file resides can be specified explicitly with the environment variable FDPR_PROF_DIR (see the Instrumentation and Profiling section above). If the directory in which the program runs changes, make sure FDPR_PROF_DIR is defined with the full path name.
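One way to follow the advice about full path names is to resolve the directory to an absolute path before launching the training run. The following is a minimal sketch, assuming the training run is started from a Python script; the function name training_environment and the directory name used in the usage note are hypothetical. Only the FDPR_PROF_DIR variable itself comes from this document.

```python
import os


def training_environment(profile_dir, base_env=None):
    """Return a copy of the environment with FDPR_PROF_DIR set.

    The directory is converted to an absolute path so that it stays valid
    even if the instrumented program runs in a different directory.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["FDPR_PROF_DIR"] = os.path.abspath(profile_dir)
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the instrumented program, for example with training_environment("profiles").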


Files

installed_dir/bin/fdpr

The wrapper script for fdprpro (by default installed_dir is /opt/ibm/fdprpro).

installed_dir/bin/fdprpro

The actual executable (binary) program.

installed_dir/lib/libfdprinst32.so

The shared library used during profiling for ELF32 executable files.

installed_dir/lib/libfdprinst64.so

The shared library used during profiling for ELF64 executable files.

output_file.dis_text

The disassembly file of program text, produced by the --disassemble-text option.

output_file.dis_data

The disassembly file of program data, produced by the --disassemble-data option.

output_file.dis_bss

The disassembly file of the program's uninitialized data (bss), produced by the --disassemble-bss option.

output_file.mapper

The map of basic blocks and static variables. See the --dump-mapper option.

output_file.aprof_init

The initial profile information in ASCII format. See the --dump-initial-ascii-profile option.

output_file.aprof

The ASCII-formatted profile file. See the --dump-ascii-profile option.

output_file.autoerr_log

In case of error, the file contains information related to the error. Please send it with the bug report to fdpr@il.ibm.com.

output_file.stat

If --verbose <level> is specified, the file contains statistics about the target program or about the optimization process.