So I had to make modifications to a 15,000-query-a-second, 10TB database with as little downtime as possible.
Many of the operations had the potential to cause locking, so it was important to know as fast as possible when performance issues were happening. As such, I have been loving Datadog DBM for database monitoring. You can see the red spikes representing the number of queries blocked when locking modifications were issued (mostly ALTER commands), along with the corresponding drop in throughput and increase in latency.

After two weeks of research and testing, I had my scripts ready to convert our core table into a partitioned table. Datadog had been up and running for those two weeks.
I had a narrow slice of time to modify the database, coordinated with the app team, who were bringing the application down at 11pm so I could run my scripts and then bring the system back up.

Ok ready to go

Guess what: Datadog stopped collecting data! Our window to modify the data layout was 11pm-12am on March 30:

At 10:08 the Datadog dashboard for the database went blank and didn't come back up until 11:54!

I didn't have time to debug.
I figured it was a backup in data ingestion on the Datadog side.
I then tried to get onto the GCP compute engine. No response.
Now I thought it was a GCP issue.
I bounced it. Things came online for 15 minutes, then silence.

Just as I was finishing the partition migration and the system was coming back up, Datadog started working again! And it has been working ever since.

What went wrong?
Took me a while to figure out.
The GCP host was small, only 4GB of memory and 2 vCPUs, but the only thing I was running on it was Datadog, and that Datadog agent was only monitoring 2 instances.
As I recall, I thought the Datadog agent was only supposed to use a maximum of 100MB per monitored instance. The agent is supposed to run on the same hosts as the database, so if it took up, say, 4GB of memory, that would be bad, right?
Looking at ps on the GCP compute engine, the agent almost immediately goes to 2.7GB of memory!
PS info (RSS is in KB):
                               PID      CPU  S  RSS      MAJFL  WCHAN
Tue Apr 11 19:39:57 UTC 2023
Tue Apr 11 19:40:57 UTC 2023
Tue Apr 11 19:41:57 UTC 2023   1826670  -    S  1929136  agent.pid
Tue Apr 11 19:42:57 UTC 2023   1826670  -    S  2728236  agent.pid
Tue Apr 11 19:43:57 UTC 2023   1826670  -    S  2722216  agent.pid
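The exact collection script isn't shown here, but a minimal sketch of the kind of timestamped ps loop that produces output like the above (the pidfile path is a guess, adjust to wherever your agent writes agent.pid):

# every 60s, print a UTC timestamp plus the agent's PID, %CPU, state, RSS (KB), major faults and wchan
while true; do
  date -u
  ps -o pid,pcpu,stat,rss,maj_flt,wchan -p "$(cat /opt/datadog-agent/run/agent.pid)"
  sleep 60
done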
Vmstat
You can see free memory plummet in the vmstat cut below:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
 r  b   swpd    free   buff   cache   si   so    bi    bo    in    cs us sy  id wa st                 UTC
 0  0      0 3653720   4272  213744    0    0   211    20    25    18 37  4  59  0  0 2023-04-11 19:37:14
 2  0      0 1081128   8360  341884    0    0  2185    73  2912  4163 36  6  57  1  0 2023-04-11 19:38:14
 0  0      0 3521936   8456  343120    0    0   211    20    25    18 37  4  59  0  0 2023-04-11 19:39:01
 0  0      0 3509864   8496  353700    0    0   174    11    24    41  0  0 100  0  0 2023-04-11 19:40:01
 0  0      0 3509612   8512  353708    0    0     0     4    16    30  0  0 100  0  0 2023-04-11 19:41:01
 1  0      0 1596776   8604  358292    0    0    59    50  3082  4444 37  5  58  0  0 2023-04-11 19:42:01
 1  0      0  814904   8684  359620    0    0    10    41  3224  5421 52  4  44  0  0 2023-04-11 19:43:01
 9  0      0  810144   8788  360816    0    0    10    43  2489  4180 44  3  53  0  0 2023-04-11 19:44:01
sudo sar -B 60
faults/s go through the roof, which on its own doesn't prove anything
major faults go up, and they should be 0
page frees go through the roof
It definitely looks like memory issues.
19:38:59   pgpgin/s  pgpgout/s    fault/s  majflt/s   pgfree/s
19:39:59     174.07      11.07      36.57      1.42      35.03
19:40:59       0.00       4.40      17.45      0.00      10.10
19:41:59      59.27      50.07   11431.55      0.58   15115.65
19:42:59       9.93      41.33    9354.88      0.10   10779.47
19:43:59       9.92      43.25    4565.30      0.08    9063.07
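sar shows system-wide paging; to pin the major faults on a particular process you can read its cumulative counter straight out of /proc. A quick sketch, using the agent PID from the ps output above (field 12 of /proc/PID/stat is majflt):

# cumulative major page faults for PID 1826670
# (naive whitespace split is fine here because the agent's comm field has no spaces)
awk '{print "majflt =", $12}' /proc/1826670/stat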
Perf
The output was inconsistent, but at some points I could see "shrink_inactive_list".
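shrink_inactive_list is a kernel page-reclaim function, so seeing it in a profile points at memory reclaim. Roughly the kind of system-wide sampling that surfaces it (not my exact invocation):

# sample all CPUs, kernel stacks included, for 30 seconds, then summarize where time went
sudo perf record -a -g -- sleep 30
sudo perf report --stdio | head -50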

Unfortunately this issue seems to be random.
I increased the size of the GCP compute host, then thought, hey, I want to debug more, so I put it back at the smaller machine size, but the issue didn't reproduce.
That sort of makes sense, as the agent had run for two weeks without an issue.
Note: GCP compute engines don't have swap, so some of the stats don't make sense. To be investigated more.
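A quick way to confirm a host really has no swap configured (worth doing before trusting the si/so and swap columns):

# prints nothing if no swap devices are configured; free's Swap row shows 0B
swapon --show
free -h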
I so look forward to Pressure Stall Information (PSI) on Linux, where I can see how much time processes wait on memory.
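On kernels that expose PSI (4.20+, and on some distros it has to be enabled with psi=1 on the kernel command line), memory stall time is just a file read:

# "some": share of time at least one task was stalled on memory; "full": all non-idle tasks stalled
# avg10/avg60/avg300 are rolling averages over 10s/60s/300s; total is in microseconds
cat /proc/pressure/memory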
Also of note, on the Oracle-L email list there was some praise of systems with no swap, because the OOM (out of memory) killer was supposed to kill big-memory processes when memory was exhausted. The feedback was that the OOM killer killed things as soon as there was a problem and the system stayed responsive. Folks felt it was better with no swap, letting OOM kill the larger-memory processes, so any memory issues were taken care of right away. For me, I dislike having no swap, because by the time OOM kills a process it's already too late.
With swap, I can monitor the page-out/in rates, see a problem when it starts, and then take my own action to analyze things and clear them up.
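The early-warning signal I mean on a host that does have swap is just the swap-in/out rate, for example with sysstat:

# pswpin/s and pswpout/s: pages swapped in/out per second, sampled every 60 seconds
sar -W 60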
Also, no-swap systems become unresponsive. The example above had no swap, and the system became unusable.
I saw the same thing when we were testing serverless pre-production at Amazon RDS. My RDS test systems were constantly becoming unusable as memory became exhausted, even though there was no swap. By the time we released, serverless memory was of course well managed, but while running performance tests pre-prod I lost a lot of time as systems became slow and unresponsive, even with no swap.