Amazon RDS cluster dashboard with Performance Insights
Amazon RDS Performance Insights (PI) doesn’t have a single pane of glass dashboard for clusters, yet. Currently PI has a dashboard that has to be looked at for each instance in a cluster.
On the other hand one can create a simple cluster dashboard using Cloudwatch.
PI, when enabled, automatically sends three metrics to Cloudwatch every minute.
These metrics are
DBLoad = DBLoadCPU + DBLoadNonCPU
These metrics are measured in units of Average Active Sessions (AAS). AAS is like the run queue on the UNIX top command except at the database level. AAS is the average number of SQL queries running concurrently in the database. In the case of DB Load AAS, the average is over 1 minute since the metrics are reported each minute, and represents the total average # of SQL queries running concurrently. The DBLoad AAS can be broken down into those queries that are runnable on CPU, which is DBLoadCPU, and those queries that are not ready to run on the CPU because they are waiting for some resource like an I/O to complete, a lock , a latch, or some resource that can only be accessed in single threaded mode like a latch or buffer.
These metrics can be used to look at the health of the database.
For example we can only have as many SQL running on the CPU as there are vCPUs. If DBLoadCPU goes above the # of vCPUs then we know that some of those queries that are runnable on the CPU are actually waiting for the CPU.
We also know that when DBLoadNonCPU is low or near 0 then the queries are not waiting for resources and can execute. When DBLoadNonCPU goes up significantly then that represents an opportunity to optimize. For example if queries are spending half their time waiting for IO then if we could buffer that IO we could remove the IO wait and theoretically the queries could go twice as fast, doubling throughput.
By looking at DBLoadCPU for each instance in a cluster we can see if the load is well balanced and we can see if the load goes above the maximum CPU resources of the instance which would indicate a CPU bottleneck.
By looking at the ratio or percentage of DBLoadNonCPU to total DBLoad we can see how much time is wasted waiting for resources instead of executing on CPU. By show this percentage for each instance in the cluster in one graph we can see if any particular instance is running into a bottleneck. If so we would want to look the performance insights dashboard for that instance to get more detail about what is happening.
So let’s set up a cluster dashboard using PI data.
Create a RDS database instance with PI enabled : https://console.aws.amazon.com/rds
PI is supported on all RDS Oracle, Aurora PostgreSQL, RDS PostgreSQL 10 , RDS SQL Server (except 2008) and on the most recent versions of Aurora MySQL 5.6, RDS MySQL 5.6 & 5.7. See docs for details: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html
In the case of this blog I created an Aurora MySQL 5.6 cluster with the original writer node and 3 reader nodes. You can create the initial instance with one reader node, then after creation, go to modify instanced and add reader node.
My nodes are shard1, shard2, shard3 and shard4.
After creating or modifying an instance to support PI, navigate to Cloudwatch and create a new dashboard (or add to an existing one)
after creating a dashboard (or modifying an existing one) add a new widget, click “Add widget” button
and for this example chose the line graph, the first option on the left of popup:
at the bottom of the screen enter “DBLoadCPU” into the search field
hit return and click on “per database metrics”
My cluster instances are shard1, shard2, shard3 and shard4 so I click those
and click “Create Widget” in the bottom left
I got a surprise, as each shard instance was suppose to have same load but can see something is wrong on shard4. Will investigate that as we go.
For now there are some options on the widget that I want changed. I want the graph to start at 0 (zero) and have a maximum of 4 , since my instances have 2vCPUs and I want to be able to look quickly at the graph to know where I’m at without having to read the axis everytime. My max available CPU load is 2 AAS since I have 2 vCPU. I set the max at 4 so there is some head room to be able to show load about 2.
There is pull down menu in top right of widget. Click it and choose “Edit”
Click the third tab “Graph options” and enter 0 for min and for max enter a value above the # of vCPUs you have on your instance. I have 2 vCPUs, so I enter 4.
click update in the bottom right.
I can see I’m running a pretty good load as shards 1-3 are running around 1.5 AAS on CPU i.e. our SQL are asking for about 75% of the CPU capacity of the box. ( 100% * (1.5 AAS CPU / 2 vCPU) ).
I also see shard4 has a lower DBLoadCPU. Turns out I had turned off the load to that shard this morning and forgot, so I restarted it.
Now lets add a widget to see how efficient our load is, i.e. what % of the load is waiting instead of being runnable on CPU.
Create a new widget and search for DBLoad and choose the 2 metrics DBLoadNonCPU & DBLoad for all 4 instances. We will use them in a mathematical expression.
create the widget, then edit it,
uncheck all the metrics
then we click “Add a math expression”
add the expression 100*(DBLoadNonCPU/DBLoad) for each instance
You can see I restarted the load on shard4 because the DBLoadCPU has gone up.
Now for the new graph click on title and type “% bottleneck”
edit it and add min 0 and max 100 ( i.e. 0-100% range), now it looks like
Be sure and click “Save dashboard” so you don’t loose you work.
Now what do we see? well now that I’ve restarted the load on shard4, we see on “DBLoadCPU”, the DBLoadCPU is pretty evenly balanced.
On “% bottleneck” we see it’s pretty low except for shard1. To find out what is happening we have to navigate to the PI dashboard for shard1. Looks like shard1 is spending a lot of it’s load waiting on resources.
Let’s go to the PI dashboard for shard1.
we can see that on the left most of the load was not CPU. CPU is green. All other colors are waiting for resources.
This is the write node and other activity is going on than the reader nodes which are only selects.
On the right hand side we can see CPU load went up so the ratio of Wait load in relation to CPU load and total load went down. This is what we see in the “% bottleneck” widget we made in Cloudwatch.
Now what are those resources that we are waiting on and what changed to make CPU go up? We can see that by exploring the PI dashboard.
For a demo on how to use PI to identify and solve performance issues see