LA-CC 03-070, C-03,066

Sections
========
  o PROCMON
  o Environment Variables/Settings
  o Usage
  o Other Products
  o Building

PROCMON
=======
This software provides a relatively easy way to get process information
(currently memory use and CPU usage) that is largely independent of the
UNIX architecture: it has been ported to several architectures, and it
should be easy to add others. Information is printed much like that for
a ps command.

Currently, this has been tested on AIX, IRIX, LINUX, and OSF1. See the
"Discussion" section below for information about how successful this
testing has been.

For the current build, see the file 'lib/libprocmon_info.settings' for
information about how this product was built.

Please email lmdm@lanl.gov if you have any questions/comments.

Environment Variables/Settings
==============================
Various parameters can be set via environment variables or command line
options (specifically for procmon_post.pl). For example, the time between
samplings is set by the environment variable PROCMON_TIME or the command
line option -time.

Please run the "procmon" executable without any options for a listing.

Usage
=====
There are several different ways provided to get process information:

- Command Line Tool
    Run the "procmon" executable via the command line. This is similar
    to how one might use a ps or top command. For example, to print out
    a sampling every second (1000 milliseconds) for a particular process
    ID of 123456, type:

      procmon 123456 -time 1000

    To sample every command with the name "xterm", belonging to uid 1234,
    sending output to the directory tmp, every second, type:

      procmon -command xterm -uid 1234 -dir tmp -time 1000

    Run "procmon" without any arguments for more information.

- Run Time Linker Environment Variable
    Some (most UNIX) operating systems allow automatic loading of shared
    object libraries at run time via the setting of a special environment
    variable. The library libprocmon_rld.so was built containing process
    constructor and destructor routines.
    So, by setting this special environment variable to point to
    libprocmon_rld.so, procmon reports process information when running
    an executable without having to recompile/relink anything.

    Sampling is done by using the POSIX timer and signal handler
    functionality. The signal used is PROCMON_SIGNAL (defined in
    procmon_info.h to be SIGRTMAX). This will conflict with processes
    that use this timer.

    The following are the architecture dependent settings:

    o OSF1:
        % setenv _RLD_LIST /libprocmon_rld.so:DEFAULT
        % foo_exe
        % unsetenv _RLD_LIST
    o SGI (64 bit execs):
        % setenv _RLD64_LIST \
            /libprocmon_rld.so:/usr/lib64/libmalloc.so:DEFAULT
        % foo_exe
        % unsetenv _RLD64_LIST
    o LINUX:
        % setenv LD_PRELOAD /libprocmon_rld.so
        % foo_exe
        % unsetenv LD_PRELOAD
    o AIX:
        NOT AVAILABLE

    After setting this environment variable, _ANY_ command typed in will
    be monitored until the environment variable is unset. So _REMEMBER_
    to unset this variable when finished.

- Re-Link of the Executable
    Sampling will be done by relinking an executable with
    libprocmon_rld.so. Unlike the "Run Time Linker Environment Variable"
    option listed above, only the relinked executable will be sampled
    (as opposed to every process started after the environment variable
    is set).

    Sampling is done by using the POSIX timer and signal handler
    functionality. The signal used is PROCMON_SIGNAL (defined in
    procmon_info.h to be SIGRTMAX). This will conflict with processes
    that use this timer.

    You will need to specify an rpath option on the link line as well as
    the library. The following are some examples:

    o SGI and OSF1:
        cc [...] -L -lprocmon_rld \
           -rpath
    o GCC:
        cc [...] -L -lprocmon_rld \
           -Wl,-rpath=

    Note: linking in libprocmon_rld.so gives you the ability to
    automatically sample your code as well as manually instrument your
    code (as linking in libprocmon_info.a does).
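The timer-plus-signal-handler sampling described above can be sketched
roughly as follows. This is only an illustration and is not PROCMON's code:
it assumes the simpler setitimer()/SIGALRM interface instead of the POSIX
timer and SIGRTMAX that PROCMON is described as using, and the names
sample_count, on_sample_signal, start_sampling, stop_sampling and sleep_ms
are hypothetical. A real monitor would collect process statistics where
this sketch only counts ticks.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical counter of timer ticks; sig_atomic_t is safe to update
 * from inside a signal handler. */
static volatile sig_atomic_t sample_count = 0;

/* Signal handler: a real monitor would take a sample here. */
static void on_sample_signal(int signo)
{
    (void)signo;
    sample_count++;
}

/* Start periodic sampling every interval_ms milliseconds.
 * Assumption: setitimer()/SIGALRM stands in for the POSIX timer and
 * SIGRTMAX named in the text. Returns 0 on success, -1 on failure. */
static int start_sampling(long interval_ms)
{
    struct sigaction sa;
    struct itimerval tv;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sample_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    tv.it_interval.tv_sec = interval_ms / 1000;
    tv.it_interval.tv_usec = (interval_ms % 1000) * 1000;
    tv.it_value = tv.it_interval;   /* first tick after one interval */
    return setitimer(ITIMER_REAL, &tv, NULL);
}

/* Stop sampling by disarming the interval timer. */
static int stop_sampling(void)
{
    struct itimerval tv;

    memset(&tv, 0, sizeof tv);
    return setitimer(ITIMER_REAL, &tv, NULL);
}

/* Sleep for roughly ms milliseconds, resuming whenever the sampling
 * signal interrupts the sleep. */
static void sleep_ms(long ms)
{
    struct timespec req, rem;

    req.tv_sec = ms / 1000;
    req.tv_nsec = (ms % 1000) * 1000000L;
    while (nanosleep(&req, &rem) != 0 && errno == EINTR)
        req = rem;
}
```

In this sketch, start_sampling( 1000 ) would play the role of a one-second
sampling interval. Because the handler owns SIGALRM, a program that already
uses that signal would clash with it, which is the same kind of conflict
noted above for PROCMON_SIGNAL.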
- Manual Instrumentation via libprocmon_info.a
    Manual sampling can be done inside a program by making calls to the
    libprocmon_info.a library (header file is src/procmon_info.h). A
    description of the function calls available and the struct members
    is located in "procmon_info.h". Also, please see the files in the
    examples directory.

    The simplest manual instrumentation is to use
    PROCMON_info_get_print() (see examples/wrapper_simple*):

      #include "procmon_info.h"
      PROCMON_info_get_print( "foo", 0 );   /* start of block "foo" */
      do_some_work();
      PROCMON_info_get_print( "foo", 1 );   /* stop of block "foo" */

    Simply name the code block ("foo") and mark the start (0) and then
    the stop (1).

    For more advanced instrumentation, see examples/instrument_simple.c.
    You will end up with something like this in your C code:

      #include "procmon_info.h"
      ...
      PROCMON_INFO_struct pinfo;
      PROCMON_info_init( &pinfo );
      while( cycling )
      {
        do_some_work();
        PROCMON_info_get( 0, &pinfo );
        /* warn when less than 5% of machine memory is free */
        if( pinfo.m_free < .05 * pinfo.m_size )
          fprintf( stderr,
                   "Warning - Less than 5%% machine memory free\n" );
      }

    You may choose to use PROCMON's printing routines to be able to use
    procmon_post.pl. See examples/instrument_print.c.

    Note: linking in libprocmon_rld.so gives you the ability to
    automatically sample your code as well as manually instrument your
    code (as linking in libprocmon_info.a does).

Once you have procmon output files, use procmon_post.pl to print out
plots of the data:

  procmon_post.pl

Discussion
==========
- Use of "Run Time Linker Environment Variable" and MPI:
    I have not had much success using this approach with MPI. You need
    to set the RLD environment variable before typing in the mpirun
    command. The MPI implementation might start off a whole slew of
    processes during initialization, each of which will be procmon'ed.
    This is where you should use the PROCMON_MIN environment variable to
    try to prune out these output files. You can try this method when
    using MPI, but be prepared to switch to another method if it fails.
- Interaction With Fork (MPI implementations like lampi):
    When a process is forked, timers are not inherited by the child
    processes. If you are using the automatic sampling method, the child
    process will no longer be sampled since the signal timer is reset.
    You may call PROCMON_rld_reset() after the fork to start up a new
    timer:

      MPI_Init( &argc, &argv );
      PROCMON_rld_reset();

    Also, the first sampling that is done [eg by calling
    PROCMON_info_get_print(), raise( PROCMON_SIGNAL ), ... ] will reset
    the environment as well.

- Values When Close to Running Out of Free Machine Memory
    In my tests that stressed using up all the machine memory, I would
    get strange values for some data (eg over 1000% cpu usage). My guess
    is that when the OS is having problems getting space, it is spending
    all its time swapping processes out instead of updating certain
    system values. So, be wary of values when running out of machine
    memory.

- Bad Data
    There are other times when the system does not seem to fill in the
    correct values for process info. This seems to happen periodically
    when the sampling is taken near process start or stop. In cases
    where I detect strange data (eg process time > wall clock time,
    negative times, process size of 0), I return PROCMON_WARNING from
    PROCMON_info_get(). The procmon executable and the automatic
    instrumentation methods will just ignore that sampling in hopes that
    future values will not be tainted. If you are doing manual
    instrumentation, you could just do the same. Or, you could put the
    call in a while loop until a good value is returned:

      while( PROCMON_info_get( 0, &pinfo ) == PROCMON_WARNING ) {}

    Another value that could be returned is PROCMON_ERROR. In this case,
    it is likely a good value will never be returned (eg could not open
    system data files for that process because the process is gone). So,
    you should not use the above while loop for any value other than
    PROCMON_WARNING.

- AIX
    o On AIX, you cannot use the Run Time Linker Environment Variable
      option.
    o The value for the stack did not change as I thought it should. So,
      I calculate it by the following:
        Stack Size = Process Size - Data Size - Text Size
      This might be wrong, but it is my best guess.
    o Could not get the amount of free/used memory on the machine
      without suid root. I do not want to depend on that, so those
      values are not reported.

- CPU Thrashing
    I have found the following to cause CPU thrashing (low %CPU Usage):
    o IO (reading/writing to disk or tty output)
        An important note: having procmon sample too often will also
        cause thrashing. I would not have PROCMON sample more than a few
        times a second.
    o Running out of memory (eg paging)
        On some machines (eg OSF1), your apps will suffer well before
        the machine finally runs out of memory. At about 20% free
        machine memory, your apps will slow down drastically. At 10%
        free memory, your apps will crash. You want to stop your program
        at 20% free memory and terminate gracefully.

Other Products
==============
- fit_plot.pl
    A tool for finding the linear fit of data and plotting it. Useful if
    doing scaling studies. Run the tool without any arguments for more
    information.

Building
========
To build this product in-tree:
------------------------------
- gzip -dc PROCMON.tar.gz | tar xvfp -
    Uncompress/untar file
- ./configure.pl -d . go
    Creates make.inc from configure.dat
- gmake
    Builds lib(s), exec(s), ... and then runs tests
- gmake test
    Just runs tests

To build in same source tree for different architecture:
--------------------------------------------------------
- ./configure.pl -d . go
- gmake clean
    Removes temp files (eg .o files) but keeps libs.
- gmake

To build out-of-tree (useful if building many times and you don't want
to have to 'gmake clean' all the time)
-----------------------------------------------------------
- Arch 1:
  o setenv PROCMON_PREFIX
  o ./configure.pl -d $PROCMON_PREFIX go
      Places make.inc and prepares temp files to be built in the build
      directory.
  o gmake
      With the environment variable set, you will not have to run
      configure.pl again.
- Arch 2:
  o setenv PROCMON_PREFIX
  o ./configure.pl -d $PROCMON_PREFIX go
      Places make.inc and prepares temp files to be built in the build
      directory.
  o gmake
      With the environment variable set, you will not have to run
      configure.pl again.

New Release
===========
o Update README
    % cvs_log.pl go
    See what's changed and add to this file (change date for release
    msg). Be sure to commit README.
o Tag repo with new version.
o on 1 machine
  - gmake distclean
  - gmake dist
  - gmake doc
  - gmake dist
o gmake clean
o gmake
o gmake install
o cd doc/Web_page; gmake install_web

==============================================================================
RELEASE MESSAGES BELOW
==============================================================================

======================
New Version v-02-02-02 (2006/07/11)
======================
OS Tested:
  Linux (32/64 bit OS and with bproc)
  OSF1
  AIX
Environment:
  Please see lib/libprocmon_info.settings.txt for the environment used.

Major Changes since v-02-02-01 (apart from various bug fixes)
==============================
- INTERFACE ADDITIONS:
  o Added a "-sync" flag to procmon_post.pl to sync all starting times
    to each other. This forces all processes to start at the same time.
    Not needed unless clocks are messed up (like on a bproc cluster).

======================
New Version v-02-02-01 (2005/10/27)
======================
OS Tested:
  Linux (and with bproc)
  OSF1
  AIX
Environment:
  Please see lib/libprocmon_info.settings.txt for the environment used.

Major Changes since v-02-02-00 (apart from various bug fixes)
==============================
- INTERFACE ADDITIONS:
  o fit_plot.pl
      Added script to find fits (currently just linear) to data and make
      plots. I was doing this a lot so thought I would make a script to
      make life easier.
  o procmon: -all
      Added the -all option to the procmon command. All processes will
      be sampled.
  o procmon_post.pl
      Added hierarchical output of data.
      This includes inclusive and exclusive data for code blocks.

======================
New Version v-02-02-00
======================
OS Tested:
  Linux (and with bproc)
  OSF1
  AIX
Environment:
  Please see lib/libprocmon_info.settings.txt for the environment used.

Major Changes since v-02-01-00 (apart from various bug fixes)
==============================
- INTERFACE ADDITIONS:
  o Environment Variable PROCMON_SIGNAL
      Allows you to choose the signal being used. You must use the
      integer value of the signal. If you are not using the POSIX
      itimer, you cannot change the signal used (SIGALRM).
  o PROCMON_info_get_print()
      This is a call to make it easier to instrument your code and print
      data to be used by procmon_post.pl:

        PROCMON_info_get_print( "do_work", 0 );
        do_work();
        PROCMON_info_get_print( "do_work", 1 );

      Just supply a name for the code block ("do_work") and the start
      (0) and stop (1) values. See the README and
      examples/wrapper_simple.c for more info.
  o PROCMON_rld_reset():
      When a process is forked, timers are not inherited by the child
      processes. If you are using the automatic sampling method, the
      child process will no longer be sampled since the signal timer is
      reset. You may call PROCMON_rld_reset() after the fork to start up
      a new timer:

        MPI_Init( &argc, &argv );
        PROCMON_rld_reset();

      Also, the first sampling that is done [eg by calling
      PROCMON_info_get_print(), raise( PROCMON_SIGNAL ), ... ] will
      reset the environment as well.
  o Different Signals used:
      POSIX timers are now used. The signal generated is defined in
      procmon_info.h as PROCMON_SIGNAL (currently SIGRTMAX).
- INTERFACE CHANGE:
  o Case Change for C interface routines:
      When first writing PROCMON, I had thought it was only going to be
      called from C/C++. I have now added a Fortran interface and need
      to insulate the C interface from Fortran. Thus, the need for C
      routines to be mixed case. The only change is that the first word
      (PROCMON) will now be capitalized.
  o procmon_plot.pl -> procmon_post.pl:
      Previously, procmon_plot.pl just plotted the columns of data in
      the output files. Now, some additional processing is done and
      statistics are produced as well as plots. All the functionality of
      procmon_plot.pl is in procmon_post.pl (as well as some additional
      features like "-filter memory" to ignore blocks where no process
      size change was detected).
  o Text output from procmon_post.pl no longer has parens, which allows
    for pasting into spreadsheets.
- Ported to AIX.

======================
New Version v-02-01-00 (2005/01/05)
======================
OS Tested:
  IRIX64
  Linux (and with bproc)
  OSF1
Environment:
  Please see lib/libprocmon_info.settings.txt for the environment used.

Major Changes since v-02-00-01 (apart from various bug fixes)
==============================
- INTERFACE CHANGE: Added psettings argument to procmon_info_print()
  (can just pass NULL to have original functionality). This was done so
  that the pretty printing setting is passed to procmon_print as well
  (formerly just used in the procmon exec).
- Added "-pretty" option to procmon. When spitting output to the screen,
  it will overwrite lines (like top does).
- Print key at end of plots to help determine which files correspond to
  which lines in the plots.
- Added creation of 'lib/libprocmon_info.settings', which contains build
  info.
- Added kludge to get it to work on bproc clusters. Currently there is a
  bug in the /proc info system where wall clock time is wrong. If a
  strange wall clock time is seen, the wall clock time is just set to
  the process time. Hopefully this will be fixed shortly.
- Changed datatype in procmon_info struct to be 8 bytes instead of 4
  bytes. This removes a struct padding conflict when using different
  compilers.
- Added notes in examples and README mentioning that sometimes the /proc
  system is not updated correctly. So, you might consider putting the
  procmon_info_get() call in a while loop until the return value is no
  longer PROCMON_WARNING.
- Fixed memory leak (oops - but the good news is procmon itself was used
  to find the memory leak).

======================
New Version v-02-00-01
======================
Machines Installed:
  theta/bluemountain, lambda, q(s)
Environment:
  osf:   CXX_6.5.2
  sgi:   MIPSpro_7.4.1
  linux: gcc_3.2.3

Major Changes since v-02-00-00 (apart from various bug fixes)
==============================
- More easily create metrics in tests to determine PROCMON overhead.
- Added instrumentation examples to examples/ dir.

======================
New Version v-02-00-00
======================
Machines Installed:
  theta/bluemountain, lambda, q(s)
Environment:
  osf:   CXX_6.5.2
  sgi:   MIPSpro_7.4.1
  linux: gcc_3.2.3

Major Changes since v-01-00-02 (apart from various bug fixes)
==============================
- Values printed out are now useful.
- Added more methods to get process information (previously just the
  _RLD environment setting and linking of libprocmon_rld.so):
  o Command line procmon executable
  o Manual instrumentation via libprocmon_info.a

======================
New Version v-01-00-02
======================
Machines Installed:
  theta/bluemountain, lambda, q(s)
Environment:
  osf:   CXX_6.5.1
  sgi:   MIPSpro_7.4
  linux: gcc_3.2.1

Major Changes since v-01-00-01 (apart from various bug fixes)
==============================
- procmon_plot.pl
  o Changed formatting to put more plots on a page.
  o Now creates a pdf file as well as a ps file.

======================
New Version v-01-00-01
======================
Machines Installed:
  theta/bluemountain, lambda, q(s)
Environment:
  osf:   fortran_5.4.1, CXX_6.5.1
  sgi:   MIPSpro_7.3.1.2m
  linux: LaheyFortran95Pro_6.1, gcc_3.2.1

Major Changes since v-01-00-00 (apart from various bug fixes)
==============================
This is the first "official" installation of PROCMON.