Structure of a dtrace script
#!/usr/sbin/dtrace -s
probe_which_is_something_to_trace
/ filter_optional /
{ actions }
probe_Something_else
/ filters_optional /
{ take some actions }
The script has sections that fire if the specified probe fires in the OS. For example, if do a send over TCP then my “something_to_trace” could be a probe (an event) called “tcp:::send” . I could further filter by receiving machine’s IP address. Then when a packet is sent over TCP and the receiver is the IP in the filter I can take some actions like state the size of the packet.
What can I trace? What are the possible “something_to_trace”, ie the probes? To get a list run of OS probes, run
dtrace -l
-l = list instead of enable probes
The list will be long. It’s a good to have some idea what one is searching for like “io”, “tcp”, “zfs” etc and grep for areas of interest.
The output format is 5 columns , first the probe id and followed by the name in 4 parts
id provider module function name
One can get the list of probes just for a subset using “-ln” flag and include part of the probe either provider, module, function or name. The provider, module,function and name are concatenated together with colons (:). I commonly only use provider and name, for example
sudo dtrace -ln tcp:::
ID PROVIDER MODULE FUNCTION NAME
7301 tcp ip tcp_input_data receive
7302 tcp ip tcp_input_listener receive
7303 tcp ip tcp_xmit_listeners_reset receive
7304 tcp ip tcp_fuse_output receive
-n = Specify probe name to trace or list
Now if I have a specific probe I want to trace, I can get more information about the data available for that probe using the verbose flag as in:
dtrace -lvn tcp:ip:tcp_input_data:receive
ID PROVIDER MODULE FUNCTION NAME
7301 tcp ip tcp_input_data receive
Argument Types
args[0]: pktinfo_t *
args[1]: csinfo_t *
args[2]: ipinfo_t *
args[3]: tcpsinfo_t *
args[4]: tcpinfo_t *
Now I see the arguments for this probe and I have access to these arguments when this probe fires, but what do these arguments contain? Here a crucial trick. You can look up these arguments on:
(also, can look in /usr/lib/dtrace or previously http://cvs.opensolaris.org/source/)
For example, I see that args[3] is “tcpsinfo_t. ” What is “tcpsinfo_t?” I can type “tcpsinfo_t” into the Symbol field as in
I get a list of appearances of this structure. I click on the first one as it looks like it’s probably the structure definition and get:
So now I can see the contents of the structure and variable types of each field in the structure. From the above I see the receivers address is “tcps_raddr” in arg[3] so I can now filter for the receivers address.
A good example of trying to understand dtrace probes and arguments is found here:http://blogs.oracle.com/observatory/entry/d_script_archeology
And now I’m ready to put together a script accessing the contents of these structures
#!/usr/sbin/dtrace -s
#pragma D option quiet
tcp:::send
/ args[3]->tcps_raddr == "192.168.1.1" /
{ printf("send %6d \n",args[2]->ip_plength);
}
tcp:::receive
/ args[3]->tcps_raddr == "192.168.1.1" /
{ printf("receive %6d \n", args[2]->ip_plength );
}
(added the line “#pragma D option quiet” to suppress the default output of the names of probes that fired and limit the output to only the print statements)
My main point of this blog was to show how to get a list of OS probes to trace and for those probes how to find out the contents of the data structures available for that probe. The access to these structures depends on “translators.” See
The translators are defined in /usr/lib/dtrace. So instead of looking up structures in the Illumos source one can look for them in the translator files in /usr/lib/dtrace. The translators provide a stable way to access the OS structures. The OS structures can be accessed directly without the translators but it’s more complicated and not stable across versions of the OS not to mention types of OS.
For Oracle, there are no translators though you could write them.
For Oracle, to get an idea what you can trace, get the PID of a running Oracle process, say in my case 11602
dtrace -ln pid11602:oracle:kcbr*:
The list will be quite large. It more or less requires knowing what you are looking for. Above I give “kcbr*” for Kernel Cache Buffer Recent. Checkout http://www.orafaq.com/wiki/X$_Table for an idea of the layers that make up the start of some of the function names. ( it would be a lot more fun trace Oracle with DTrace if one had access to the source!)
The best references for dtrace are first the dtrace book and dtrace community then Brendan Gregg’s blog at:
If you are specifically interested in TCP then Alan Maguire’s blog is good
And for some specific examples with Oracle check out:
http://alexanderanokhin.wordpress.com/2011/11/13/dynamic-tracing-of-oracle-logical-io/
http://andreynikolaev.wordpress.com/2010/10/28/appetizer-for-dtrace/
http://blog.tanelpoder.com/2009/04/24/tracing-oracle-sql-plan-execution-with-dtrace/
http://www.slideshare.net/dougburns/dtrace-miracle-scotland-database-forum
A few other notes:
Useful variables that can be filtered on or printed out
pid – process id
execname – executable name
timestamp – timestamp in nano-seconds
tid – thread id
cwd – current working directory
probeprov, probemod, probefunc, probename – probe’s provider, module, func and name
Dtrace variables (from Brendan’s blog see: http://dtrace.org/blogs/brendan/2011/11/25/dtrace-variable-types/ )
DTrace variable types, in order of preference:typeprefixscopeoverheadmulti-CPU safeexample assignmentaggregation@globallowyes@x = count();clause localthis->clause instance[1]very lowyesthis->x = 1;thread localself->threadmediumyesself->x = 1;scalarnonegloballow-mediumno[2]x = 1;associative arraynoneglobalmedium-highno[2]x[y] = 1;
The above list is a nice cheat sheet. I started off using scalar global variables but the right way is to use aggregations, clause local and thread local variables Definitely read about aggregates when getting started with Dtrace as aggregates are key to getting useful data out of dtrace
For formatting output data, wrap scripts with something like Perl and post process the data. Here is an example of wrapping (piping) dtrace data to perl and having perl format the output (Perl code is set up to aggregate similar types of data coming fromI/O, ZFS an NFS, but below is just the code for monitoring NFS, )
/usr/sbin/dtrace -n '
#pragma D option quiet
nfsv3:::op-read-start, nfsv3:::op-write-start ,nfsv4:::op-read-start {
tm[args[1]->noi_xid] = timestamp;
sz[args[1]->noi_xid] = args[2]->count ;
}
nfsv4:::op-write-start {
tm[args[1]->noi_xid] = timestamp;
sz[args[1]->noi_xid] = args[2]->data_len ;
}
nfsv3:::op-read-done, nfsv3:::op-write-done, nfsv4:::op-read-done, nfsv4:::op-write-done
/tm[args[1]->noi_xid]/
{ this->delta= (timestamp - tm[args[1]->noi_xid]);
this->type = probename == "op-write-done" ? "W" : "R";
@nfs_tm[this->type]=sum(this->delta);
@nfs_mx[this->type]=max( (this->type == "R" ? this->delta : 0));
@nfs_mx[this->type]=max( (this->type == "W" ? this->delta : 0));
@nfs_ct[this->type]=count();
@nfs_sz[this->type]=sum(sz[args[1]->noi_xid]);
tm[args[1]->noi_xid] = 0;
sz[args[1]->noi_xid] = 0;
}
profile:::tick-1sec
{ printa("nfs_tm ,%s,%@d\n",@nfs_tm);
printa("nfs_mx ,%s,%@d\n",@nfs_mx);
printa("nfs_ct ,%s,%@d\n",@nfs_ct);
printa("nfs_sz ,%s,%@d\n",@nfs_sz);
clear(@nfs_mx);
clear(@nfs_tm);
clear(@nfs_ct);
clear(@nfs_sz);
printf("!\n");
}
' | perl -e '
while (my $line = ) {
$line=~ s/\s+//g;
if ( $line eq "!" ) {
printf("--|------| %10s %10s %10s %10s\n", "avg_ms","MB/s","mx_ms","count");
foreach $type ('R','W') {
foreach $class ('nfs') {
$ct=${$class . "_ct"}{$type}||0;
$sz=${$class . "_sz"}{$type}||0;
$mx=${$class . "_mx"}{$type}||0;
$tm=${$class . "_tm"}{$type}||0;
if ( $ct > 0 ) {
$ms=(($tm/1000000)/$ct);
printf("$type | $class : %10.2f",$ms);
} else {
printf("$type | $class : %10.2f",0) ;
}
printf(" %10.2f",$sz/(1024*1024));
$mx_ms=$mx/1000000;
printf(" %10.2f",$mx_ms);
printf(" %10d",$ct);
print "\n";
}
}
} else {
($area, $r_w, $value)=split(",",$line);
${$area}{$r_w}=$value;
}
}'
PS I tried using AWK for formating but the data seems to get buffered so there isn’t immediate output. I tried using “fflush” and print /dev/stdout but couldn’t get it to consistently work
dtrace [code] | awk '
{
# hack to get AWK to flush out immediately
printf("%s\n",$0) > "/dev/stdout"
close("/dev/stdout")
#fflush(stdout)
#system("")
}'
PS one thing to keep in mind is overhead. The overhead of Dtrace is small but it’s not zero. See
For more info. From the above link from Brendan Gregg, here is the key part:
I’d suggest considering pid provider overhead as follows: Don’t worry too much about pid provider probe cost at < 1000 events/sec. At > 10,000 events/sec, pid provider probe cost will be noticeable. At > 100,000 events/sec, pid provider probe cost may be painful. Let’s discuss these using the 6 us per probe result from the previous tests. This could be faster (~2 us) or slower (> 15 us) depending on architecture and the probe action.
at 1000 events/sec, that could cost 6 milliseconds of CPU time. If the application was 100% on-CPU, it may slow by 0.6% – ie, don’t worry.
At 10,000 events/sec, that could cost 60 milliseconds of CPU time. For a 100% on-CPU application, then the slow down may be 6%. If it was idle some of the time (systems often are), then the overhead may be less, as there are spare cycles to run DTrace (really depends on the location of the probe in the code-path, though).
At 100,000 events/sec, the CPU cost could be 600 milliseconds. This can significantly slow down an application. Whether that’s a problem depends on the app and its function – how sensitive it is to latency.
DTrace is quite terse and limited in actions and functions. As I said above, for formatting data, wrap DTrace in perl and do calculations and formating in perl. Why? A few examples
DTrace doesn’t have floating point operations
DTrace aggregate arrays don’t have a method allowing you to query the keys and step through them, thus there is no way to do calculations on values in one aggregate array with values in another.
Just save yourself time and head ache and do calculations and formatting in perl.
DTrace is limited in functions. Many of the functions are hard to track down and/or have little documentation. Here is a list of functions:
Comments