LA-CC 03-070, C-03,066

PROCMON - Process Monitor

Documentation

Executive Summary

This software provides a relatively easy way to get process information (currently memory and CPU usage) that is largely independent of the UNIX architecture.

Information is printed much like the output of a ps or top command (e.g., process size and overall/instantaneous CPU usage are printed at a specified time interval). Code can also be instrumented manually to get information about specific code blocks. This is done by adding "start" and "stop" calls (with a tag name) to the source code (C/C++ or Fortran).
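The start/stop idea can be sketched in a few lines. The following is a minimal Python illustration of the concept only, not PROCMON's API: the names BlockMonitor, start, stop, and summary are hypothetical, and peak resident set size from resource.getrusage stands in for process size (its units are platform dependent: kilobytes on Linux, bytes on macOS).

```python
import resource
import time
from collections import defaultdict


class BlockMonitor:
    """Hypothetical tracker for tagged code blocks (not the PROCMON API)."""

    def __init__(self):
        self._open = {}                    # tag -> (start time, start memory)
        self._records = defaultdict(list)  # tag -> [(delta time, delta memory)]

    @staticmethod
    def _peak_rss():
        # Peak resident set size so far; units are platform dependent.
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    def start(self, tag):
        self._open[tag] = (time.perf_counter(), self._peak_rss())

    def stop(self, tag):
        t0, m0 = self._open.pop(tag)
        self._records[tag].append(
            (time.perf_counter() - t0, self._peak_rss() - m0)
        )

    def summary(self):
        """Return {tag: (count, avg delta time, max delta time, max delta mem)}."""
        out = {}
        for tag, recs in self._records.items():
            times = [dt for dt, _ in recs]
            mems = [dm for _, dm in recs]
            out[tag] = (len(recs), sum(times) / len(times), max(times), max(mems))
        return out


monitor = BlockMonitor()
for _ in range(3):
    monitor.start("alloc")
    buf = bytearray(1_000_000)  # the code block being measured
    monitor.stop("alloc")

for tag, (n, avg_t, max_t, max_m) in monitor.summary().items():
    print(f"{tag}: blocks={n} avg_time={avg_t:.6f}s "
          f"max_time={max_t:.6f}s max_delta_mem={max_m}")
```

Each start/stop pair with the same tag contributes one record, so a tag placed inside a loop yields a count, an average, and a maximum for that block.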

The tool procmon_post.pl can be used to process this raw data output and create graphs/charts.

Output File Formats/Discussion

Each of the example runs below has a link to a directory of output files.

procmon_MACHINE_PID.txt files - Created by PROCMON sampling
These are the raw output files generated by PROCMON. Each process creates 1 file (with machine name and PID forming the filename). These files are used as input to procmon_post.pl to create graphs/tables. A comment block at the top of the file contains a brief description of the columns.
procmon_MACHINE_PID.txt.dat files, procmon.cmd - Created by procmon_post.pl
The .dat files contain the data that gnuplot will use to create the graphs. They are basically massaged data from the .txt files. Data from all samplings is listed first, followed by information about each block.
The actual gnuplot commands are stored in procmon.cmd.
procmon.out: - Created by procmon_post.pl
Text results printed to the screen after running procmon_post.pl. Information about each block is printed.
- Delta Mem - Change in memory (with average and max per block)
- Delta Time - Time spent in code block (again with average and max)
- # Blocks - Number of times that block was seen
- Block - the block name
.pdf/.ps graphs: - Created by procmon_post.pl
- Page 1
  - Overall CPU usage ((user+system)/wall time for the life of process)
  - Instantaneous CPU usage (usage since the last sampling)
  - Percent free memory on the machine
  - Process size
- Pages 2-4
  These pages have information about other stats (like stack size and page faults).
- Instrumentation Pages
  You can manually instrument code blocks by placing "start" and "stop" calls (along with a tag name) around sections of code. PROCMON will then report process information (e.g., elapsed time and increase in process size) for these blocks.
  Each page gives information about each code block.
  The final Instrumentation Page is named Final Res - Final Results. This is the code block bounded by the first and last sampling.
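The two CPU figures on Page 1 differ only in the window they cover. A minimal Python sketch of the arithmetic, assuming each sample records cumulative user time, system time, and wall-clock time (the Sample type, the function names, and the sample values are hypothetical, not taken from PROCMON):

```python
from typing import NamedTuple


class Sample(NamedTuple):
    user: float    # cumulative user CPU seconds
    system: float  # cumulative system CPU seconds
    wall: float    # wall-clock seconds since process start


def overall_cpu(s: Sample) -> float:
    """(user + system) / wall time over the life of the process."""
    return (s.user + s.system) / s.wall if s.wall > 0 else 0.0


def instantaneous_cpu(prev: Sample, cur: Sample) -> float:
    """CPU usage over the interval since the previous sampling."""
    dwall = cur.wall - prev.wall
    dcpu = (cur.user + cur.system) - (prev.user + prev.system)
    return dcpu / dwall if dwall > 0 else 0.0


# Made-up samples: busy for the first 10 s, then mostly waiting (e.g. on IO).
s1 = Sample(user=8.0, system=1.0, wall=10.0)
s2 = Sample(user=8.5, system=1.5, wall=20.0)

print(f"overall at s2:       {overall_cpu(s2):.0%}")
print(f"instantaneous at s2: {instantaneous_cpu(s1, s2):.0%}")
```

With these samples the overall figure is 50% while the instantaneous figure for the last interval is 10%, which is why a drop in activity shows up much more sharply on the instantaneous plot.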

Example Runs

Parallel Leak
Directory: procmon_parallel_leak
This is a 4-process MPI run in which every process loops, allocating memory; all processes except process 0 free their memory (process 0 has a leak). Note how the CPU usage becomes erratic once the machine's memory is exhausted. The executable continued to run until a memory allocation failed. On some systems, memory can be exhausted (free memory so low that performance becomes terrible) well before the process actually runs out of memory (the point at which memory allocation fails).
The trick is to detect the point where memory is exhausted and terminate gracefully there. PROCMON can be called from within C/Fortran code to determine this point. In this run, memory was exhausted at 20% machine free memory and ran out at 10% machine free memory.
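A guard of this kind can be sketched as follows. This is a minimal Python illustration, not PROCMON's interface: free_memory_percent and should_stop are hypothetical names, the sysconf keys used are available on Linux but not on every UNIX, and the 20% threshold is simply the exhaustion point observed in this run.

```python
import os


def free_memory_percent():
    """Percent of physical memory currently free, or None if unavailable."""
    try:
        total = os.sysconf("SC_PHYS_PAGES")
        avail = os.sysconf("SC_AVPHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # platform does not expose these values
    if total <= 0 or avail < 0:
        return None
    return 100.0 * avail / total


def should_stop(free_percent, threshold=20.0):
    """True once free memory has fallen to the threshold or below."""
    return free_percent is not None and free_percent <= threshold


chunks = []
for step in range(5):                    # stand-in for the real work loop
    if should_stop(free_memory_percent()):
        print("free memory below threshold; stopping gracefully")
        break
    chunks.append(bytearray(1_000_000))  # allocate a little more each step
```

Checking before each allocation step lets the code write out its state and exit cleanly instead of continuing into the region where performance collapses.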
Serial IO
Directory: procmon_io
A serial run in which the process loops through cycles of computing and doing IO (writing to a file). While it is doing IO, you can see the CPU usage drop.
4 Methods
Directory: procmon_methods
Four serial runs of a code that loops while allocating memory. The 4 lines in each plot correspond to the 4 ways in which PROCMON can be used to get process information.
1. Command Line Tool (like top)
2. Run Time Linker Environment Variable
3. Re-Link of the Executable
4. Manual Instrumentation via libprocmon_info.a