On http://dboptimizer.com/2011/04/27/envisioning-nfs-performance/ I showed some code that displayed the send and receive sizes and times over TCP on Solaris 10 with the dtrace command (see tcp.d for the code). I took this code on another machine and got errors like
“dtrace: error on enabled probe ID 29 (ID 5461: tcp:ip:tcp_input_data:receive): invalid alignment (0xffffff516eb5e83a) in predicate at DIF offset 136”
Not quite sure why this was happening but by a process of elimination I found that accessing args[4] in tcp:::receive caused these errors.
Fortunately many of the values in args[4] are found in args[3] as well.
To find arguments to tcp:::receive , first run the following command (or alternatively look the TCP probes up on the wiki at http://wikis.sun.com/display/DTrace/tcp+Provider)
( for command line flags check out http://compute.cnr.berkeley.edu/cgi-bin/man-cgi?dtrace+7
-l = list instead of enable probes
-n = Specify probe name to trace or list
-v = Set verbose mode
i.e. list the verbose information about the probes named and don’t enable these probes, just list them)
$ dtrace -lvn tcp:::receive
5473 tcp ip tcp_output send
Argument Types
args[0]: pktinfo_t *
args[1]: csinfo_t *
args[2]: ipinfo_t *
args[3]: tcpsinfo_t *
args[4]: tcpinfo_t *
(by the way, there are a number of probes that match tcp:::receive, but they all have the same arguments, I didn’t notice all these different tcp:::receive until I actually ran the above command. Before running the command I’d depended on the wiki. Now I’m wondering what the difference is between some of these tcp:::receive and tcp:::send probes )
After finding the args for a probe, you can look up the definition of the structs at http://cvs.opensolaris.org/source/
tcp:::send and tcp:::receive both have the same arguments
args[3] is tcpsinfo_t
args[4] is tcpinfo_t
Looking up the structs at http://cvs.opensolaris.org/source/, I find the contents of the structs as follows:
tcpsinfo_t ( args[3] )
111 typedef struct tcpsinfo {
112 uintptr_t tcps_addr;
113 int tcps_local; /* is delivered locally, boolean */
114 int tcps_active; /* active open (from here), boolean */
115 uint16_t tcps_lport; /* local port */
116 uint16_t tcps_rport; /* remote port */
117 string tcps_laddr; /* local address, as a string */
118 string tcps_raddr; /* remote address, as a string */
119 int32_t tcps_state; /* TCP state */
120 uint32_t tcps_iss; /* Initial sequence # sent */
121 uint32_t tcps_suna; /* sequence # sent but unacked */
122 uint32_t tcps_snxt; /* next sequence # to send */
123 uint32_t tcps_rack; /* sequence # we have acked */
124 uint32_t tcps_rnxt; /* next sequence # expected */
125 uint32_t tcps_swnd; /* send window size */
126 int32_t tcps_snd_ws; /* send window scaling */
127 uint32_t tcps_rwnd; /* receive window size */
128 int32_t tcps_rcv_ws; /* receive window scaling */
129 uint32_t tcps_cwnd; /* congestion window */
130 uint32_t tcps_cwnd_ssthresh; /* threshold for congestion avoidance */
131 uint32_t tcps_sack_fack; /* SACK sequence # we have acked */
132 uint32_t tcps_sack_snxt; /* next SACK seq # for retransmission */
133 uint32_t tcps_rto; /* round-trip timeout, msec */
134 uint32_t tcps_mss; /* max segment size */
135 int tcps_retransmit; /* retransmit send event, boolean */
tcpinfo_t (args[4])
95 typedef struct tcpinfo {
96 uint16_t tcp_sport; /* source port */
97 uint16_t tcp_dport; /* destination port */
98 uint32_t tcp_seq; /* sequence number */
99 uint32_t tcp_ack; /* acknowledgment number */
100 uint8_t tcp_offset; /* data offset, in bytes */
101 uint8_t tcp_flags; /* flags */
102 uint16_t tcp_window; /* window size */
103 uint16_t tcp_checksum; /* checksum */
104 uint16_t tcp_urgent; /* urgent data pointer */
105 tcph_t *tcp_hdr; /* raw TCP header */
106 } tcpinfo_t;
In the script I accessed the TCP seq and ack in arg[4], which was giving me errors in tcp:::receive, so I just switch these for the equivalents in arg[3]. Now I’m not exactly clear on the equivalence, but it seems
args[4]->tcp_seq ?= args[3]->tcps_suna
args[4]->tcp_ack ?= args[3]->tcps_rack
The ack=rack seems solid as tcps_rack = “highest sequence number for which we have received and sent an acknowledgement”.
The seq=tcps_suna is less clear to me as tcps_suna = “lowest sequence number for which we have sent data but not received acknowledgement”
But for my purposes, these distinctions might be unimportant as I’ve stopped looking at seq and ack and now look at outstanding unacknowledge bytes on the receiver and sender, which is
tcps_snxt-tcps_suna ” gives the number of bytes pending acknowledgement “
tcps_rsnxt – tcps_rack ” gives the number of bytes we have received but not yet acknowledged”
but more about that later. Let’s look at the new version of the program and the output
#!/usr/sbin/dtrace -s
#pragma D option defaultargs
#pragma D option quiet
inline string ADDR=$$1;
dtrace:::BEGIN
{ TITLE = 10;
title = 0;
walltime=timestamp;
printf("starting up ...\n");
}
tcp:::send, tcp:::receive
/ title == 0 /
{ printf("%9s %8s %6s %8s %8s %12s %s \n",
"delta" ,
"cid" ,
"pid" ,
"send" ,
"recd" ,
"seq",
"ack"
);
title=TITLE;
}
tcp:::send
/ ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) /
{ nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
delta= timestamp-walltime;
walltime=timestamp;
printf("%9d %8d %6d %8d --> %8s %8d %8d %12s > %s \n",
delta/1000,
args[1]->cs_cid % 100000,
args[1]->cs_pid ,
args[2]->ip_plength - args[4]->tcp_offset,
"",
args[4]->tcp_seq -
args[3]->tcps_suna ,
args[4]->tcp_ack -
args[3]->tcps_rack ,
args[3]->tcps_raddr,
curpsinfo->pr_psargs
);
title--;
}
tcp:::receive
/ ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{ delta=timestamp-walltime;
walltime=timestamp;
printf("%9d %8d %6d %8s <-- %-8d %8d %8d %12s < %s \n",
delta/1000,
args[1]->cs_cid % 100000,
args[1]->cs_pid ,
"",
args[2]->ip_plength - args[4]->tcp_offset,
args[3]->tcps_rack % 10000,
args[3]->tcps_suna % 10000,
args[3]->tcps_raddr,
curpsinfo->pr_psargs
);
title--;
}
output
starting up ...
delta cid pid send recd
570 3904 845 <-- 0
34 3904 845 <-- 140
455 3904 845 <-- 0
24 3904 845 <-- 0
4789720 3904 845 124 -->
82 3904 845 244 -->
99 3904 845 132 -->
delta cid pid send recd
52 3904 845 1448 -->
28 3904 845 1448 -->
24 3904 845 1448 -->
36 3904 845 1448 -->
33 3904 845 1448 -->
26 3904 845 952 -->
86 3904 845 244 -->
212 3904 845 <-- 140
501 3904 845 <-- 132
60 3904 845 124 -->
256 3904 845 <-- 140
72 3904 845 <-- 0
39658 3904 845 <-- 0
What the heck is that huge time 4789720 us? ie 4 secs? The whole operation took me less than 1 second. I wouldn’t have found the answer to this without help from Adam Levanthal. Turns out that output is in order only per CPU, but different CPUs can output data in different order. Each CPU buffers up data and then passed the buffer back to userland dtrace, so on a one CPU machine, this code will always output in order, but on a multi-cpu machine there is no guarentee on the order of the output. Lets add CPU # to the output:
#!/usr/sbin/dtrace -s
#pragma D option defaultargs
#pragma D option quiet
inline string ADDR=$$1;
dtrace:::BEGIN
{ TITLE = 10;
title = 0;
walltime=timestamp;
printf("starting up ...\n");
}
tcp:::send, tcp:::receive
/ title == 0 /
{ printf("%9s %8s %6s %8s %8s %4s \n",
"delta" ,
"cid" ,
"pid" ,
"send" ,
"recd" ,
"cpu#"
);
title=TITLE;
}
tcp:::send
/ ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) /
{ nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
delta= timestamp-walltime;
walltime=timestamp;
printf("%9d %8d %6d %8d --> %8s %d \n",
delta/1000,
args[1]->cs_cid % 100000,
args[1]->cs_pid ,
args[2]->ip_plength - args[4]->tcp_offset,
"",
cpu
);
title--;
}
tcp:::receive
/ ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{ delta=timestamp-walltime;
walltime=timestamp;
printf("%9d %8d %6d %8s <-- %-8d %d \n",
delta/1000,
args[1]->cs_cid % 100000,
args[1]->cs_pid ,
"",
args[2]->ip_plength - args[4]->tcp_offset,
cpu
);
title--;
}
output
delta cid pid send recd cpu#
42 3904 845 244 --> 0
66 3904 845 124 --> 0
6091 3904 845 124 --> 2
81 3904 845 244 --> 2
96 3904 845 132 --> 2
31 3904 845 1448 --> 2
20 3904 845 1448 --> 2
18 3904 845 1448 --> 2
17 3904 845 1448 --> 2
16 3904 845 1448 --> 2
8910406 3904 845 0 --> 3
375 3904 845 <-- 0 3
24 3904 845 <-- 140 3
34 3904 845 0 --> 3
470 3904 845 <-- 140 3
410 3904 845 <-- 132 3
delta cid pid send recd cpu#
491 3904 845 <-- 140 3
393 3904 845 <-- 0 3
15 3904 845 952 --> 3
36 3904 845 <-- 0 3
delta cid pid send recd cpu#
19 3904 845 <-- 0 3
40167 3904 845 <-- 0 3
what we see is the data ordered by CPU. In other words for each CPU the data is ordered but which CPU get’s printed first is unknown. In dtrace each CPU buffers up it’s data and then sends it to the userland dtrace process. The only “fix” for now is to add a timestamp and order the data by timestamp. Unsorted, it looks like:
607858102997348 281 3904 845 124 --> 2
607858103608856 84 3904 845 244 --> 2
607858104125414 delta cid pid send recd cpu#
607858104143731 116 3904 845 132 --> 2
607858104176769 33 3904 845 1448 --> 2
607858104198187 21 3904 845 1448 --> 2
607858104215813 17 3904 845 1448 --> 2
607858104233004 17 3904 845 1448 --> 2
607858104267629 34 3904 845 1448 --> 2
607858104287379 19 3904 845 952 --> 2
607858102716187 11569935 3904 845 <-- 132 3
607858103245377 248 3904 845 <-- 0 3
607858103282421 37 3904 845 <-- 140 3
607858103339076 56 3904 845 244 --> 3
607858103524093 185 3904 845 <-- 140 3
607858103774417 165 3904 845 <-- 132 3
607858103823145 48 3904 845 124 --> 3
607858104027216 204 3904 845 <-- 140 3
607858104387780 100 3904 845 <-- 0 3
607858104401487 13 3904 845 <-- 0 3
607858104520815 119 3904 845 <-- 0 3
607858144436175 delta cid pid send recd cpu#
607858144454625 39933 3904 845 <-- 0 3
sorted it looks like
607858102716187 11569935 3904 845 <-- 132 3
607858102997348 281 3904 845 124 --> 2
607858103245377 248 3904 845 <-- 0 3
607858103282421 37 3904 845 <-- 140 3
607858103339076 56 3904 845 244 --> 3
607858103524093 185 3904 845 <-- 140 3
607858103608856 84 3904 845 244 --> 2
607858103774417 165 3904 845 <-- 132 3
607858103823145 48 3904 845 124 --> 3
607858104027216 204 3904 845 <-- 140 3
607858104125414 delta cid pid send recd cpu#
607858104143731 116 3904 845 132 --> 2
607858104176769 33 3904 845 1448 --> 2
607858104198187 21 3904 845 1448 --> 2
607858104215813 17 3904 845 1448 --> 2
607858104233004 17 3904 845 1448 --> 2
607858104267629 34 3904 845 1448 --> 2
607858104287379 19 3904 845 952 --> 2
607858104387780 100 3904 845 <-- 0 3
607858104401487 13 3904 845 <-- 0 3
607858104520815 119 3904 845 <-- 0 3
607858144436175 delta cid pid send recd cpu#
607858144454625 39933 3904 845 <-- 0 3
so now the strange long time delta is at the beginning where I’d expect it.
I’m not quite sure how to deal with this. Post processing the data by sorting the timestamp column works, but interactively processing the data to get it in the right order as it comes out seems problematic.
Comments