Kyle Hailey

Critical Upgrade Crisis: When Datadog on GCP Stalled Amid RAM Issues - How It Happened and Lessons Learned


So I had to make modifications to a 10 TB database running 15,000 queries a second, with as little downtime as possible.



Many of the operations had the potential to cause locking, so it was important to know as fast as possible when performance issues were happening. For that I have been loving Datadog DBM for database monitoring. You can see the red spikes representing the number of queries blocked when locking modifications were issued (mostly ALTER commands), and the corresponding drop in throughput and increase in latency.




After two weeks of research and testing, I had my scripts ready to convert our core table into a partitioned table. Datadog had been up and running for those two weeks.

I had a narrow slice of time to modify the database, coordinated with the app team: they would bring the application down at 11pm for me to run my scripts, then bring the system back up.




OK, ready to go.



Guess what: Datadog stopped collecting data! Our window to modify the data layout was 11pm to midnight on March 30:



At 10:08 the Datadog dashboard for the database went blank and didn't come back up until 11:54!




I didn't have time to debug.


I figured there was a backlog in data ingestion on the Datadog side.

I then tried to get on the GCP compute engine. No response.

Now I thought it was a GCP issue.

I bounced it. Things came online for 15 minutes, then silence.




Just as I was finishing the partition migration and the system was coming back up, Datadog started working again! And it has been working ever since.



What went wrong?

It took me a while to figure out.

The GCP host was small, only 4 GB of memory and 2 vCPUs, but the only thing running on it was Datadog. That Datadog agent was only monitoring 2 instances.

As I recall, I thought the Datadog agent was only supposed to use a maximum of about 100 MB per monitored instance. The agent is supposed to run on the same hosts as the database, so if it took up, say, 4 GB of memory, that would be bad, right?
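One guard I could have used here (my own idea, not something Datadog requires): cap the agent's memory with a systemd drop-in, so a runaway agent gets killed and restarted before it starves the whole host. Assuming the standard `datadog-agent.service` unit name, a sketch would be:

```ini
# /etc/systemd/system/datadog-agent.service.d/memory.conf
# Hypothetical cap: kill the agent cgroup if it exceeds 512 MB RSS,
# rather than letting it drive the host into reclaim.
[Service]
MemoryMax=512M
```

Then `sudo systemctl daemon-reload && sudo systemctl restart datadog-agent`. The right limit value is a judgment call; too low and you'd be restarting the agent under normal load.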


Looking at ps on the GCP compute engine, the agent almost immediately grows to 2.7 GB of memory!


PS info

                                 PID  CPU S   RSS  MAJFL              WCHAN                                    
Tue Apr 11 19:39:57 UTC 2023
Tue Apr 11 19:40:57 UTC 2023
Tue Apr 11 19:41:57 UTC 2023 1826670   - S 1929136        agent.pid
Tue Apr 11 19:42:57 UTC 2023 1826670   - S 2728236        agent.pid
Tue Apr 11 19:43:57 UTC 2023 1826670   - S 2722216        agent.pid
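The capture above came from sampling the agent's RSS via its pid file. A minimal version of that sampler looks like the following (it uses this shell's own pid for the demo; on a real host you'd read the pid from the agent's pid file, whose path depends on your install):

```shell
# Sample a process's resident set size (RSS, in KB) the way the
# capture above did. For the demo we monitor this shell itself;
# substitute the agent's pid from its pid file on a real host.
pid=$$
rss_kb=$(ps -o rss= -p "$pid" | tr -d ' ')
echo "$(date -u) pid=$pid rss_kb=${rss_kb}KB"
```

Run that in a loop with `sleep 60` and you get exactly the kind of minute-by-minute RSS trail shown above.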


Vmstat


You can see the free memory plummet:


vmstat output:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy  id wa st                 UTC
 0  0      0 3653720   4272 213744    0    0   211    20   25   18 37  4  59  0  0 2023-04-11 19:37:14
 2  0      0 1081128   8360 341884    0    0  2185    73 2912 4163 36  6  57  1  0 2023-04-11 19:38:14
 0  0      0 3521936   8456 343120    0    0   211    20   25   18 37  4  59  0  0 2023-04-11 19:39:01
 0  0      0 3509864   8496 353700    0    0   174    11   24   41  0  0 100  0  0 2023-04-11 19:40:01
 0  0      0 3509612   8512 353708    0    0     0     4   16   30  0  0 100  0  0 2023-04-11 19:41:01
 1  0      0 1596776   8604 358292    0    0    59    50 3082 4444 37  5  58  0  0 2023-04-11 19:42:01
 1  0      0  814904   8684 359620    0    0    10    41 3224 5421 52  4  44  0  0 2023-04-11 19:43:01
 9  0      0  810144   8788 360816    0    0    10    43 2489 4180 44  3  53  0  0 2023-04-11 19:44:01


sudo sar -B 60


fault/s goes through the roof (minor faults alone don't prove anything)

majflt/s goes up, and it should be 0

pgfree/s goes through the roof

It definitely looks like memory issues.


19:38:59     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s 
19:39:59       174.07     11.07     36.57      1.42     35.03      
19:40:59         0.00      4.40     17.45      0.00     10.10      
19:41:59        59.27     50.07  11431.55      0.58  15115.65      
19:42:59         9.93     41.33   9354.88      0.10  10779.47      
19:43:59         9.92     43.25   4565.30      0.08   9063.07      
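If sar isn't installed, the same fault rates can be derived by hand from the kernel's cumulative counters in /proc/vmstat, which is where sar -B gets them anyway. A minimal sketch:

```shell
# Read the cumulative pgfault/pgmajfault counters from /proc/vmstat
# twice, one second apart, and print the per-second rates -- the same
# fault/s and majflt/s columns that sar -B reports.
read_ctr() { awk -v k="$1" '$1 == k {print $2}' /proc/vmstat; }
f1=$(read_ctr pgfault); m1=$(read_ctr pgmajfault)
sleep 1
f2=$(read_ctr pgfault); m2=$(read_ctr pgmajfault)
echo "fault/s=$((f2 - f1)) majflt/s=$((m2 - m1))"
```

On a healthy host majflt/s should sit at or near 0; sustained major faults mean pages are being re-read from disk, i.e. real memory pressure.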

Perf


The output was inconsistent, but at some points I could see "shrink_inactive_list", a kernel function on the memory-reclaim path.

Unfortunately this issue seems to be random.

I increased the size of the GCP compute host, then thought, hey, I want to debug more, so I put it back at the smaller machine size, but the issue didn't reproduce.

That sort of makes sense, as the agent had run for two weeks without an issue.


Note: GCP compute engines don't have swap, so some of the stats don't make sense. To be investigated more.


I so look forward to Pressure Stall Information (PSI) on Linux, where I can see how much time processes wait on memory.
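On kernels that have it (4.20+, built with CONFIG_PSI), PSI is already readable today under /proc/pressure. A quick check:

```shell
# PSI memory pressure: the "some avg10" figure is the percentage of the
# last 10 seconds during which at least one task was stalled waiting on
# memory. Guarded, since older kernels don't expose /proc/pressure.
if [ -r /proc/pressure/memory ]; then
    psi=$(cat /proc/pressure/memory)
else
    psi="PSI not available on this kernel"
fi
echo "$psi"
```

A rising avg10 on the "some" line is exactly the early-warning signal that was missing during this incident: the host was stalling on memory reclaim long before anything was killed.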


Also of note, on the Oracle-L email list there was some praise of systems with no swap, because the OOM (out of memory) killer is supposed to kill big-memory processes when memory is exhausted. The feedback was that OOM killed things as soon as there was a problem and the system stayed responsive. Folks felt it was better with no swap and OOM killing the larger memory consumers, so any memory issues were taken care of right away. Me, I dislike no swap, because by the time OOM kills a process things are already too late.

With swap, I can monitor the page out/in rates, see a problem when it starts, and then take my own action to analyze things and clear them up.
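That early-warning monitoring is just watching two counters. A minimal sketch:

```shell
# pswpin/pswpout in /proc/vmstat count pages swapped in/out since boot.
# On a host with swap, a steadily rising pswpout is the early warning;
# on a swapless host these stay at 0 (or are absent) and the first
# symptom you see is reclaim stalls or the OOM killer.
swap_in=$(awk '$1 == "pswpin"  {print $2}' /proc/vmstat)
swap_out=$(awk '$1 == "pswpout" {print $2}' /proc/vmstat)
echo "pswpin=${swap_in:-0} pswpout=${swap_out:-0}"
```

Diff two samples a minute apart (as with the fault counters earlier) to get a rate; a sustained nonzero swap-out rate is the cue to go look before anything dies.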

Also, no-swap systems become unresponsive. The example above had no swap, and the system became unusable.

I saw the same thing when we were testing serverless pre-production at Amazon RDS. My RDS test systems were constantly becoming unusable as memory became exhausted, even though there was no swap. By the time we released, serverless memory was of course well managed, but while running performance tests pre-production I lost a lot of time as systems became slow and unresponsive despite having no swap.



Reference

Tools from Tanel:








