************************************************************************
* Myricom GM networking software and documentation *
* Copyright (c) 2001, 2002 by Myricom, Inc. *
* All rights reserved. See the file `COPYING' for copyright notice. *
************************************************************************
README-linux for gm-1.6.3
README for linux distribution
Supported platforms: Linux 2.2 and 2.4 for IA32, PowerPC, Alpha.
Linux 2.4 for IA64 (Itanium).
- For Alphas, if you have 2 GB or more of memory,
we recommend kernel version 2.4.18 to install GM.
You must use kernel version 2.4.14 or later (2.4.9
also works).
Supported interfaces: LANai7 (PCI64, PCI64A), and LANai9 (PCI64B, PCI64C)
If you have LANai4, you will need to upgrade your
interface, or use a previous version of GM
such as gm-1.2.3 for 256K and gm-1.5.2 for larger
memory sizes.
(Please also note that Linux 2.4 is not supported
on gm-1.2.3).
For installation instructions of an earlier GM
version please refer to the respective README
and README- files.
WARNING: When building/linking GM applications, you must do so
on a linux box that matches the OS version of the machine
on which you will be running. You cannot compile on a 2.2.x
machine and run the executable on a 2.4.x machine.
Table of Contents:
-----------------
I. GM Installation
a. Configuring and compiling GM
b. Installing the GM driver
c. Running the GM Mapper
d. Testing the GM installation
II. Verifying the GM performance
III. Running IP over GM
IV. Improving IP Performance
V. Fork() Support
VI. Sample Scripts to automatically load GM and start the Mapper
VII. Operating-system-specific Caveats
a. Using Compaq Compilers for Alpha Linux (ccc cxx)
b. PCI Chipset Tweaks
c. APIC IRQ conflict on Tyan and AMD motherboards
d. AGP (nVidia and ATI) conflicts
VIII. Miscellaneous
a. Uninstallation of the GM driver
************************************************************************
If you encounter difficulties, please consult the FAQ at
http://www.myri.com/scs/GM_FAQ.html
Direct all technical support questions to help@myri.com.
************************************************************************
===================
I. GM Installation
===================
GM installation is performed in the following four steps.
1. Configuring and compiling GM:
---------------------------------------------
gunzip -c gm-1.6.3_Linux.tar.gz | tar xvf -
cd {GM_HOME}
./configure
make
By default, we assume that the kernel source and header files for your
Linux installation are located in /usr/src/linux. If they are located
elsewhere, you must configure with the following
option:
./configure --with-linux=
where specifies the directory for the linux
kernel source. The kernel header files MUST match the running kernel
exactly: not only should they both be from the same version, but they
should also contain the same kernel configuration options.
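The version half of this requirement can be checked before running
configure, roughly as sketched below. This is an illustration only: it
assumes a configured 2.2/2.4-style source tree in which
include/linux/version.h defines UTS_RELEASE, and the function names are
hypothetical, not part of GM. It does not compare kernel configuration
options, which still have to be checked by hand.

```shell
#!/bin/sh
# Hypothetical helper, not part of GM: print the release string recorded
# in a kernel source tree. Assumes include/linux/version.h defines
# UTS_RELEASE, as configured 2.2 and 2.4 trees normally do.
source_release() {
    sed -n 's/^#define[[:space:]]*UTS_RELEASE[[:space:]]*"\(.*\)".*/\1/p' \
        "$1/include/linux/version.h" | head -n 1
}

# Succeed only if the tree in $1 records the same release as $2
# (default: the running kernel, as reported by uname -r).
headers_match() {
    running=${2:-$(uname -r)}
    [ -n "$running" ] && [ "$(source_release "$1")" = "$running" ]
}
```

For example, "headers_match /usr/src/linux || echo mismatch" reports a
tree whose recorded release differs from the running kernel.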
Note: If you have a mixture of hosts with LANai4 and LANai7
(or LANai9) interfaces that need to talk to each other, you
must configure with --disable-new-features on all of the hosts.
For a complete listing of all options to configure, type:
./configure --help
Note: Do not use the configure flag --enable-directcopy. This
flag is not a valid option to GM 1.6.3. It will be re-enabled
in a future release.
2. Installing the GM driver:
---------------------------------------------
Select an installation directory path . It is
usually best for to be the path to an NFS
directory available on all machines that are to share this GM
installation. The directory must be accessible using
on all machines that are to share the
installation. must be an absolute path; it
must start with "/". However, may contain
symbolic links.
cd binary
./GM_INSTALL
You may omit to install the driver in /opt/gm/.
Next, you must run
su root
/sbin/gm_install_drivers
on each machine to install the drivers on that machine.
If you wish for the driver to auto-load at boot, you must
create appropriate links in the /etc/rcN directories to the
/etc/init.d/gm and /etc/init.d/myri scripts. Alternatively,
you may start and stop the drivers manually using
su root
/etc/init.d/gm start
/etc/init.d/gm stop
or
su root
/etc/init.d/gm restart
to start, stop, or restart the driver, respectively.
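The boot-time links mentioned above could be created along the following
lines. This is a sketch only: the rc directory layout (rcN.d under a base
directory), the runlevels, and the S90 link prefix are assumptions that
vary between distributions, so the function just prints the commands for
review instead of running them.

```shell
#!/bin/sh
# Print (do not run) the ln commands that would link an init script into
# the given runlevel directories. The rcN.d layout and the S90 start
# prefix are assumptions; adjust them for your distribution.
print_rc_links() {
    rc_base=$1; script=$2; shift 2
    for level in "$@"; do
        echo "ln -s /etc/init.d/$script $rc_base/rc$level.d/S90$script"
    done
}
```

For example, "print_rc_links /etc gm 3 5" prints one ln command per
runlevel; repeat with "myri" for the second script, then run the
reviewed commands as root.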
For directions on how to uninstall the GM driver, refer to the
"Miscellaneous" section.
Note: If the host is rebooted, you must reload the GM driver (and
rerun the GM mapper). There are sample scripts, contributed
by a customer, in {GM_HOME}/drivers/linux/scripts for
loading GM and running the mapper at reboot.
3. Running the GM Mapper
------------------------
Myrinet is a source-routed network; that is, each host must know the
route to all other hosts through the switching fabric. The GM
mapper automatically discovers all of the hosts connected to the
Myrinet network, computes a set of deadlock-free minimum-length
routes between the hosts, and distributes appropriate routes to each
host on the connected network. Loopback and point-to-point network
topologies require gm_simpleroute to be run instead of the
GM Mapper. (Refer to the GM README and the FAQ for details.) For
a switched network topology, the GM Mapper must be run before any
communication over Myrinet can be initiated.
Further technical details about the GM mapper can be found in mt/README.
Depending upon the user's needs, there are three different ways
in which the GM mapper may be used.
MAP_ONCE mapping:
----------------
The first way is by far the most common, and we shall refer to it
as "map_once". In this method, the mapper is run on one host in the
network (any of the hosts). It is rerun if a host (re)boots or a
hostname is changed or after a change of Myrinet topology (swapping of
ports on a switch). (If the Mapper must be rerun for any of these
reasons, it is strongly advised to run it on the same host.) The
command for this method of running the GM mapper is:
cd {GM_HOME}/binary/sbin/
su root
./mapper ../etc/gm/map_once.args
STATIC mapping:
--------------
The second way in which the GM mapper may be used is called "static
mapping" or "file mapping". In this method, an active mapper is run
once when ALL of the hosts are up and running the GM driver. This
initial active mapper will generate a map file and a host file.
These files are then copied to all of the hosts in the network, or
shared by NFS. An entry in the boot scripts will allow each host
to read the map file and the host file and update the routing table
on its local Myrinet interface(s). This method is particularly
appealing as no human intervention is needed and no traffic is
generated at boot time. The commands for this method of running the
GM mapper are:
cd {GM_HOME}/binary/sbin/
su root
./mapper ../etc/gm/static.args
Copy the 3 files created by this command (static.map, static.routes,
and static.hosts) to each {GM_HOME}/binary/sbin/
directory on each host if the gm tree is not mounted by NFS.
Add the following command to the boot scripts of the host
(scripts in /etc/init.d or /etc/rc.d/init.d).
cd {GM_HOME}/binary/sbin/
su root
./file_mapper ../etc/gm/file.args
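The copy step above can be sketched as follows for clusters that do not
share the gm tree over NFS. The use of scp, the host names, and the
assumption that the three files sit in the sbin directory are all
illustrative; the README only says to copy the files. The function prints
the commands so that they can be reviewed first.

```shell
#!/bin/sh
# Print the copy commands for distributing the three static mapping files.
# scp and the host names are assumptions, not something GM prescribes.
# $1 is the sbin directory ({GM_HOME}/binary/sbin); the rest are hosts.
print_static_copies() {
    sbin_dir=$1; shift
    for host in "$@"; do
        for f in static.map static.routes static.hosts; do
            echo "scp $sbin_dir/$f $host:$sbin_dir/"
        done
    done
}
```

For example, "print_static_copies /opt/gm/binary/sbin n01 n02" prints six
scp commands, three per host (n01 and n02 are placeholder host names).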
HA mapping:
-----------
The third way in which the GM mapper may be used is for the users who
have a need for High Availability (HA) in an aggressive computing
environment. The command for this method of running the GM Mapper is:
cd {GM_HOME}/binary/sbin/
su root
./mapper ../etc/gm/active.args &
It will continuously run the GM mapper in the background
to detect and add any new hosts or remove any non-responding hosts,
to detect any change of topology (change of slots in the switch, change
of innerswitch topology), and periodically update the routing tables of
the Myrinet cards (by default, every 30 seconds). Note that
this mapping method is quite intrusive. Users are strongly advised to
avoid this method of running the GM mapper if their applications produce
heavy network traffic (e.g., MPI applications), since the GM Mapper uses
unreliable messages that may be dropped under heavy contention,
leading to hosts that may be marked as "non-responding" and removed
because they are unreachable. A few expert customers use this mapping
method to satisfy their high availability constraints for GM applications
designed to handle a dynamic change of configuration (by design, MPI is
NOT a fault-tolerant application).
For the majority of users, the "map_once" GM mapping method is sufficient.
For users with more production-level constraints, "static
mapping" is the most suitable method. For fault-tolerant GM applications,
the third method (HA mapping) is the best alternative.
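The choice among the methods can be summarised in a small lookup, sketched
below. The method labels (map_once, static, file, ha) are names chosen for
this illustration; the command lines are the ones given in this section
and assume that the current directory is {GM_HOME}/binary/sbin/ and that
you are root.

```shell
#!/bin/sh
# Map a method label to the mapper command line given in this README.
# The labels are hypothetical; only the printed commands come from the
# README. Nothing is executed here.
mapper_command() {
    case $1 in
        map_once) echo "./mapper ../etc/gm/map_once.args" ;;
        static)   echo "./mapper ../etc/gm/static.args" ;;
        file)     echo "./file_mapper ../etc/gm/file.args" ;;
        ha)       echo "./mapper ../etc/gm/active.args &" ;;
        *)        echo "unknown mapping method: $1" >&2; return 1 ;;
    esac
}
```

"static" generates the map and host files once; "file" is the per-host
boot-time command that reads them.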
4. Testing the GM Installation
------------------------------
A variety of test scripts are available in {GM_HOME}/binary/bin to test your
GM installation. A README describing each of these tests can be found in
{GM_HOME}/tests/README. We recommend the following five tests to
validate your installation.
cd {GM_HOME}/binary/bin
1. Test that the Mapper has correctly detected all of the hosts in your
Myrinet network by typing the following command on several of the
hosts:
./gm_board_info
Note: In the output of this command, all hosts should be listed in
the routing table of each node.
If not all of the hosts are listed, then it is possible that a
cable is not connected, or GM is not properly loaded on all
hosts in the Myrinet network. A green LED should be lit up
on the switch for each connection that is active.
If you see *** No routes found *** in the output, this is
an indication that the GM Mapper has not been run. (See
README- for details.)
When ./gm_board_info successfully reports a list of hosts,
you can then run ./gm_allsize and ./gm_stress to test the
network.
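The check for missing routes can be scripted, for instance when testing
many hosts. The sketch below relies only on the "No routes found" text
quoted above; the function name is hypothetical.

```shell
#!/bin/sh
# Read gm_board_info output on stdin and fail if it contains the
# "No routes found" marker described above.
routes_present() {
    if grep -q 'No routes found'; then
        echo "no routes: run the GM mapper first" >&2
        return 1
    fi
    return 0
}
```

Typical use would be "./gm_board_info | routes_present" on each host.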
2. Test the basic connectivity of GM, by typing:
./gm_allsize --verbose --geometric
on one of the hosts in the Myrinet network.
Note: This loopback test will NOT work in a point-to-point (no switch)
configuration.
3. Test GM bandwidth between two hosts by typing (on the first host)
./gm_allsize --slave --size=15
and then type the following command (on the second host)
./gm_allsize --unidirectional --bandwidth --remote-host= \
--size=15 --geometric
where is the name of the first host.
These one-way tests are performed by running in slave mode on one
machine and master on the node to be tested. This is done by adding
'--slave' on the command line of the slave machine and '-h ' on
the command line of the master where is the name of the machine
running in slave mode. The name of each host is as specified in the
output of ./gm_board_info. The --size parameter indicates the maximum
length of message that will be sent, where 2^{size} is the value of
that length. In this example, the maximum length of message sent
is 2^{15}=32K. The --geometric parameter reduces the number of
message lengths that will be tested. The default for gm_allsize is
to test every length from 1 to 2^max_size incrementing one byte at
a time.
These tests take a long time to run, and generate data files suitable
for input to gnuplot.
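The relation between --size and the maximum message length is simple
shell arithmetic, shown here as a small worked example (the function name
is illustrative):

```shell
#!/bin/sh
# Maximum message length, in bytes, for a given --size value: 2^size.
max_message_length() {
    echo $((1 << $1))
}
```

"max_message_length 15" prints 32768, so without --geometric the test
would step through every length from 1 to 32768 bytes.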
4. Test GM latency between two hosts by typing (on the first host)
./gm_allsize --slave --size=15
and then type the following command (on the second host)
./gm_allsize --bidirectional --latency --remote-host= \
--size=15 --geometric
where is the name of the first host.
The --slave, --remote-host, --size, and --geometric options are used
exactly as in the bandwidth test of step 3.
These tests take a long time to run, and generate data files suitable for
input to gnuplot.
5. Run gm_stress on every host in the cluster to validate GM.
Complete details on running gm_stress can be found on
the FAQ.
http://www.myri.com/scs/faq/faq-html#debug-stress
This gm_stress command must be run simultaneously on each host, using
the same list of host names in each case. It can be run on any
subset of hosts on the network.
For a list of all possible runtime options for these commands, you can
issue the command with --help as the runtime option, e.g. ./gm_debug --help.
================================
II. Verifying the GM Performance
================================
We recommend the following test to verify the GM performance. View
the results of the hardware benchmark test of the PCI bus with the
DMA engine of the Myrinet adapter.
cd {GM_HOME}/binary/bin
./gm_debug --no-counters
Note: The output of this command gives the maximum sustained bandwidth
that can be obtained from the PCI bus. Refer to the section
entitled "GM Performance" in the {GM_HOME}/README for complete
details on expected GM performance.
=======================
III. Running IP over GM
=======================
The Linux command to enable IP over GM is as follows:
/sbin/ifconfig myri0 up
where you must replace 'myri0' with the appropriate name (myri1, myri2, etc.)
if you have more than one Myrinet interface per host. For more information,
please refer to the FAQ (http://www.myri.com/scs/GM_FAQ.html).
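On hosts with several Myrinet interfaces, the same command is repeated
for each one. A minimal sketch, assuming the interfaces are numbered
consecutively from myri0, prints the commands for review:

```shell
#!/bin/sh
# Print the ifconfig command for each of N Myrinet interfaces
# (myri0 .. myri<N-1>). Printing rather than running keeps this a dry
# run; consecutive numbering from 0 is an assumption.
print_myri_up() {
    i=0
    while [ "$i" -lt "$1" ]; do
        echo "/sbin/ifconfig myri$i up"
        i=$((i + 1))
    done
}
```

For example, "print_myri_up 2" prints the commands for myri0 and myri1.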
============================
IV. Improving IP performance
============================
To get good IP performance over Myrinet:
* use Linux-2.4 (Linux-2.4.19 is now available)
* configure GM with --enable-new-features (the default for gm-1.5 and later)
to get a larger 9000-byte MTU for IP-over-Myrinet
You definitely want to use Linux 2.4 instead of Linux 2.2, and NFS-v3
over TCP. Linux 2.4 has vastly better TCP/IP and UDP/IP numbers than
Linux 2.2. Also, there have been some recent patches to Linux 2.4
that help UDP performance.
If you are running Linux 2.2 or earlier, you should use the following
tuning options to get good NFS bandwidth. Otherwise, you are latency
dominated and Myrinet IP and Ethernet IP performance will be about the same.
- For Linux, increase the TCP windows:
echo "262144" > /proc/sys/net/core/rmem_max
echo "262144" > /proc/sys/net/core/wmem_max
echo "262144" > /proc/sys/net/core/wmem_default
echo "262144" > /proc/sys/net/core/rmem_default
- In linux/include/net/tcp.h, replace the value of
#define MAX_WINDOW 32767
with the value of your choice (200k~500k might be good)
- check that /proc/sys/net/ipv4/tcp_window_scaling is enabled
with the value 1 (as it should be by default).
- Play with the buffer sizes of netperf or your favorite net tester.
Note: These tuning options are not required for Linux 2.4.
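The /proc settings listed above can be applied and checked from a boot
script, roughly as sketched here. The function names are hypothetical,
and the directory arguments exist only so that the functions can be tried
on a scratch copy; by default they act on the real /proc files and must
be run as root.

```shell
#!/bin/sh
# Write the 262144-byte socket buffer limits listed above. $1 is the
# directory holding the four files (default: /proc/sys/net/core).
set_socket_buffers() {
    dir=${1:-/proc/sys/net/core}
    for name in rmem_max wmem_max wmem_default rmem_default; do
        echo 262144 > "$dir/$name" || return 1
    done
}

# Succeed if the given tcp_window_scaling file contains 1.
window_scaling_enabled() {
    [ "$(cat "${1:-/proc/sys/net/ipv4/tcp_window_scaling}")" = "1" ]
}
```

The MAX_WINDOW change in tcp.h is a kernel source edit and is not covered
by this sketch.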
==================
V. Fork() Support
==================
As of gm-1.5.2, GM has full support for fork() under
Linux. It works for all processor families. There are no
restrictions; a process can fork() with or without a GM port open.
However, if you have a choice between using vfork()
and fork(), vfork() gives better performance, since
the time to fork a process with vfork() is much shorter.
================================================================
VI. Sample Scripts to automatically load GM and start the Mapper
================================================================
The directory {GM_HOME}/share contains some sample initialization
scripts, contributed by customers, that can be customized to suit
your system to automatically load the gm driver and start the
GM Mapper.
=======================================
VII. Operating-system-specific Caveats
=======================================
---------------------------------------------------
a. Using Compaq Compilers for Alpha Linux (ccc cxx)
---------------------------------------------------
Under the C shell:
setenv CC ccc
setenv CXX cxx
setenv CXXFLAGS \
"-g -O2 -inline speed -x cxx -noexceptions -nocxxstd -using_std -w2"
setenv CFLAGS -gcc_messages
setenv KCC gcc
rm -f config.cache
./configure
or under a Bourne shell or Bash:
CC=ccc ; export CC
CXX=cxx ; export CXX
CXXFLAGS="-g -O2 -inline speed -x cxx -noexceptions -nocxxstd"
CXXFLAGS="${CXXFLAGS} -using_std -w2" ; export CXXFLAGS
CFLAGS=-gcc_messages ; export CFLAGS
KCC=gcc ; export KCC
rm -f config.cache
./configure
---------------------
b. PCI Chipset Tweaks
---------------------
In the file:
{GM_HOME}/drivers/linux/gm/gm_arch.c
If you have an i840 chipset, modify the flag to be
#define GM_INTEL_840 1
There are similar defines for:
#define GM_INTEL_860 1
#define GM_21154 1
#define GM_INTEL_450NX 1
#define GM_KT266A 1
Also from this file, please read this warning:
/****************** PCI CHIPSET TWEAKS: WARNING *************************
* *
* The patches below were supplied by customers who reported that *
* their PCI performance was improved when using these patches *
* on a particular chipset. *
* These patches tweak certain bits in the chipset and have not been *
* verified or reviewed by Myricom and may have other, possibly *
* negative, side-effects. Before applying one of these patches, *
* you may wish to check for a newer BIOS for your machine. *
* Also, a newer linux kernel may provide better PCI performance, *
* and might be a safer course of action than applying one of *
* these patches. *
* *
* Use these patches at your own risk. *
* *
***********************************************************************/
--------------------------------------------------
c. APIC IRQ conflict on Tyan and AMD motherboards
--------------------------------------------------
We have encountered APIC IRQ conflicts on several Tyan
and AMD motherboards.
The installation of GM will fail with an error message similar to the
following:
GM: LANai rate set to 198 MHz (max=2-2MHz)
GM: Board 0 page hash cache has 32768
GM: Allocated IRQ 11
GM: NOTICE:
GM: board interrupt (configured on IRQ 11) is not working
GM: NOTICE:
GM: Failed to initialize Myrinet Card
GM: gm: driver unloading
GM: WARNING:
GM: No Board Initialized
#############################
Error Installing GM driver module
#############################
or
GM: Version 1.5.2.1_Linux build 1.5.2.1_Linux xxxh@xxx.xx.xx Fri
Jul 19 14:03:17 EDT 2002
GM: NOTICE:
GM: Module not compiled from a real kernel build source tree
GM: This build might not be supported.
GM: Highmem memory configuration:
GM: PAGE_ZERO=0x0, HIGH_MEM=0x3ff80000, KERNEL_HIGH_MEM=0x38000000
GM: Memory available for registration: 224748 pages (877 MBytes)
GM: MCP for unit 0: L9 4K (new features)
GM: LANai rate set to 133 MHz (max = 134 MHz)
GM: Board 0 page hash cache has 32768 bins.
GM: Allocated IRQ5
GM: NOTICE:
GM: Board interrupt (configured on IRQ 5) is not working.
GM: NOTICE:
GM: Failed to initialize Myrinet Card
GM: gm: driver unloading
The IRQ error message means that the driver asked the Myrinet NIC to raise
the interrupt assigned by the BIOS, to check that it is working,
and did not receive it within the expected timeout. The driver therefore
cannot use the Myrinet board and aborts the initialization.
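To find affected nodes quickly, the kernel log can be searched for this
message. The sketch below is based only on the two sample logs above; the
function name is hypothetical, and matching is case-insensitive because
the samples differ in capitalisation.

```shell
#!/bin/sh
# Read kernel log text on stdin and print the IRQ number from each
# "board interrupt (configured on IRQ n) is not working" line.
failed_gm_irq() {
    grep -i 'board interrupt (configured on IRQ [0-9]*) is not working' |
        sed 's/.*[Ii][Rr][Qq] \([0-9]*\)).*/\1/'
}
```

Typical use would be "dmesg | failed_gm_irq"; no output means the
message was not found.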
The most frequent cause for this problem is:
* The interrupt lines are managed by an APIC (Advanced Programmable
Interrupt Controller) chipset and it is not supported correctly by the
BIOS and/or by the current Linux kernel.
Possible solutions:
1. Try a different PCI slot.
2. Upgrade the BIOS.
3. Upgrade the Linux kernel version if available.
4. Boot the Linux kernel without APIC support; pass the flag -noapic to
the booting kernel via the LILO boot prompt. In this case, the kernel
will use a safer compatibility mode.
It is important to note that if this error occurs on any node in the
cluster, all nodes in the cluster should be booted with -noapic.
Refer to the Myrinet FAQ for further details.
http://www.myri.com/scs/faq/faq-install.html#install6b
---------------------------------
d. AGP (nVidia and ATI) conflicts
---------------------------------
Two types of problems were reported.
1. If I load the GM module first, and then load the nVidia or ATI module,
it works. But if I load the nVidia or ATI module first, GM won't load.
The GM_INSTALL error message looks like:
n03 135# ./GM_INSTALL
Making device files in /dev.
ifconfig myri0 down - in case it was up
myri0: unknown interface: No such device
Adding new GM driver.
sbin/gm: init_module: No such device
Hint: insmod errors can be caused by incorrect module parameters, including invalid IO or IRQ parameters
****
Error installing GM driver module.
****
and then in the kernel log, you see something like:
GM: Version 1.5.2_Linux build 1.5.2_Linux x@x Wed Aug 21 16:17:08 PDT 2002
GM: NOTICE:
GM: Module not compiled from a real kernel build source tree
GM: This build might not be supported.
GM: Highmem memory configuration:
GM: PAGE_ZERO=0x0, HIGH_MEM=0x7fff0000, KERNEL_HIGH_MEM=0x38000000
GM: Memory available for registration: 451752 pages (1764 MBytes)
GM: NOTICE:
GM: pci_rev2: Could NOT map board into kernel (span = 0x1000000)
GM: WARNING:
GM: Can't map IO memory to system memory
GM: NOTICE:
GM: gm_instance_init failed
GM: NOTICE:
GM: Failed to initialize Myrinet Card
GM: gm: driver unloading
GM: WARNING:
GM: No board initialized
This is a shortage of kernel virtual address space (used for IO-mapping
PCI memory) in the Linux kernel. On configurations with a lot of physical
memory, Linux reserves only 128 MB of address space for dynamically
allocated virtual memory. Unfortunately the nVidia
card seems to consume as much of it as it can (it occupies at least
128 MB in PCI memory space), so if you load its module before the gm
module on such a configuration, GM fails to load with the error above.
If you have more than 768 MB of memory and an nVidia or ATI card, the
fix is to apply the following patch to your kernel:
--- arch/i386/kernel/setup.c Thu Aug 2 17:00:46 2001
+++ arch/i386/kernel/setup.c.2 Thu Oct 11 09:00:59 2001
@@ -815,7 +815,7 @@
/*
* 128MB for vmalloc and initrd
*/
-#define VMALLOC_RESERVE (unsigned long)(128 << 20)
+#define VMALLOC_RESERVE (unsigned long)(256 << 20)
#define MAXMEM (unsigned long)(-PAGE_OFFSET-VMALLOC_RESERVE)
#define MAXMEM_PFN PFN_DOWN(MAXMEM)
#define MAX_NONPAE_PFN (1 << 20)
Also make sure that the HIGHMEM option is enabled when configuring the
kernel.
As a quick test, or if you do not mind losing memory, you can
boot your current kernel with mem=768m to see whether the problem
disappears.
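The arithmetic behind the patch can be checked in the shell. The sketch
below uses only figures quoted above (the 128 MB and 256 MB reserves, the
"at least 128Mb" nVidia mapping, and the 0x1000000 span from the kernel
log); the function and variable names are illustrative. It ignores other
users of the vmalloc area, so passing the check is necessary but not
sufficient.

```shell
#!/bin/sh
# Succeed if the mappings (sizes in bytes, $2 onward) fit within a
# vmalloc reserve of $1 bytes. Other users of the area are ignored.
fits_in_reserve() {
    reserve=$1; shift
    total=0
    for size in "$@"; do
        total=$((total + size))
    done
    [ "$total" -le "$reserve" ]
}

NVIDIA=$((128 << 20))      # lower bound quoted in the text above
MYRINET=$((0x1000000))     # span reported in the kernel log (16 MB)
```

With these values, "fits_in_reserve $((128 << 20)) $NVIDIA $MYRINET"
fails, while the same call with $((256 << 20)) succeeds.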
Refer to the Myrinet FAQ for further details.
http://www.myri.com/scs/faq/faq-install.html#install6b
2. Overlapping of prefetchable memory for the AGP and PCI bridges.
This was reported on an SGI Visual Workstation 550 with several graphics
cards (nVidia Quadro, ATI Mach64 PCI graphics card, ATI Rage AGP).
On these machines, the prefetchable memory assigned by
the BIOS for the AGP and PCI bridges overlaps. This looks like
a BIOS problem, and we asked the customer to look into upgrading
the BIOS, or to experiment with the BIOS settings to get the BIOS to
do the right thing (things to try: toggling the plug-n-play OS setting,
changing the size of the AGP graphics aperture, reinitializing or
re-detecting the PCI space in the configuration space, etc.).
Specifically, it was seen that:
The memory for the Myrinet card is mapped at exactly the same spot
with the ATI Mach64 PCI graphics card as it is with the ATI Rage AGP
graphics card:
03:01.0 Non-VGA unclassified device: MYRICOM Inc.: Unknown device 8043 (rev 03)
Region 0: Memory at 82000000 (64-bit, prefetchable) [size=16M]
However, now look at the bridges leading to bus 3 (PCI where Myrinet
card is) and bus 1 (AGP) in the ATI Rage AGP config:
00:01.0 PCI bridge: Intel Corporation 82840 840 (Carmel) Chipset AGP Bridge (rev 01) (prog-if 00 [Normal decode])
Bus: primary=00, secondary=01, subordinate=01, sec-latency=64
Prefetchable memory behind bridge: 82300000-850fffff
00:02.0 PCI bridge: Intel Corporation 82840 840 (Carmel) Chipset PCI Bridge (Hub B) (rev 01) (prog-if 00 [Normal decode])
Bus: primary=00, secondary=02, subordinate=03, sec-latency=0
Prefetchable memory behind bridge: 81600000-831fffff
See how those prefetchable memory regions overlap? And, more
importantly, see how the prefetchable memory region of the bridge to
the AGP bus overlaps that of the Myrinet card? Note that the only
prefetchable memory on the AGP bus is for the Rage card and that this
memory is a small subset of the region the bridge is claiming:
01:00.0 VGA compatible controller: ATI Technologies Inc 3D Rage IIC AGP (rev 7a) (prog-if 00 [VGA])
Region 0: Memory at 84000000 (32-bit, prefetchable) [size=16M]
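The overlaps described above can be verified with shell arithmetic. In
this sketch the function name is illustrative, and the end address of the
Myrinet region (82ffffff) is derived from its start address and 16M size.

```shell
#!/bin/sh
# Succeed if two inclusive address ranges overlap. Arguments are hex
# digits without the 0x prefix, as shown in the listings above.
ranges_overlap() {
    a_start=$((0x$1)); a_end=$((0x$2))
    b_start=$((0x$3)); b_end=$((0x$4))
    [ "$a_start" -le "$b_end" ] && [ "$b_start" -le "$a_end" ]
}
```

"ranges_overlap 82300000 850fffff 81600000 831fffff" (the two bridge
windows) succeeds, as does "ranges_overlap 82000000 82ffffff 82300000
850fffff" (Myrinet card against the AGP bridge window).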
This issue is now resolved. You need to download BIOS version A9
from the SGI website.
===================
VIII. Miscellaneous
===================
------------------------------------
a. Uninstallation of the GM driver
------------------------------------
The gm_install_drivers script generates the script /sbin/gm_uninstall_drivers,
which can be used to uninstall the drivers.
The GM_INSTALL script generates the script /sbin/GM_UNINSTALL,
which can be used to uninstall GM.