Kyle Hailey

The Case for a conf.yaml Linter and Quick Tester: An Experience with Datadog

A Tale of a Typo: How a Small Error Revealed UX Issues in Datadog

As an ardent user of Datadog's monitoring and analytics platform, I recently had a user experience that I believe highlights an opportunity for significant enhancement in the platform's functionalities.

In a recent tweet, I briefly touched on my encounter, stating,

"Dear @datadoghq, an enhancement request - we could really use a conf.yaml linter and quick tester. Here's my journey: On 2023-06-26 at 17:55:41, I started the agent on the host, followed by a datadog-agent check. A painstaking 15+ minutes later at 18:02:22, I discover a typo in my Postgres database name. Frustrating! #UXissues

Why this matters? This delayed feedback and requirement for host access turned what should've been a quick process into a 3-day investigation! After initially launching the agent and waiting in vain for 5-10 minutes, I moved on to other work... until I finally revisited and checked the logs today. Can't help but think a UI feature for adding a target without needing to edit a host file would've saved so much time! #BetterMonitoring"

A Journey Begins

On June 26, 2023, at exactly 17:55:41, I embarked on what seemed like a routine task. The mission was straightforward: start the Datadog agent on the host using the command

sudo service datadog-agent start

and subsequently run a

datadog-agent check

The latter process was performed with this command:

DD_LOG_LEVEL=debug \
datadog-agent check postgres -t 2 | tee /tmp/dd.debug

For those not fully immersed in the intricacies of the tech world, this command might come across as complex jargon. I agree! That command to monitor the agent launch is world-class propeller-head stuff. However, for me and many others in my field, it's a standard part of our workflow. I expected to watch the agent launch and pick up my database, affectionately named "dumbo."

An Unexpected Delay

The monitoring process unfolded as expected, or so it seemed at first. After waiting for about 5-10 minutes, I realized that "dumbo" was not present in the output from the datadog-agent check nor in the output file /tmp/dd.debug. Puzzled by this absence, I decided to switch gears and focus on my other work, keeping the unresolved issue on the back burner.

Fast forward three days. A colleague of mine asked about the replication delay on "dumbo." I turned to Datadog, hoping for insights, only to find that the database was still missing from the monitoring dashboard. My curiosity piqued, I decided it was time to revisit the issue and dig into the logs.

Upon inspecting the debug file on the host running the agent, I stumbled upon a service check timestamped at 18:02:22, nearly seven minutes after I had launched the agent at 17:55:41. The service check read:

=== Service Checks ===
    "check": "postgres.can_connect",
    "host_name": "dumbo",
    "timestamp": 1687802542,
    "status": 2,
    "message": "Error establishing connection to postgres://dumbo:/kylelf, error is FATAL:  database \"kylelf\" does not exist\n",
    "tags": [
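Digging that one entry out of a long debug file by eye is exactly the slow part. Here is a minimal sketch, in Python, of how the search could be automated. It assumes the debug output keeps one "key": value pair per line, as in the excerpt above; the function name failed_service_checks is my own invention, not part of any Datadog tool. Status 2 is Datadog's CRITICAL service check status, and 0 is OK.

```python
import json
import re
from datetime import datetime, timezone

# Matches one `"key": value` pair per line (an assumption about the debug format).
FIELD = re.compile(r'^\s*"(?P<key>\w+)":\s*(?P<value>.+?),?\s*$')

# The excerpt from /tmp/dd.debug shown above, kept verbatim (still truncated at "tags").
SAMPLE = r'''
=== Service Checks ===
    "check": "postgres.can_connect",
    "host_name": "dumbo",
    "timestamp": 1687802542,
    "status": 2,
    "message": "Error establishing connection to postgres://dumbo:/kylelf, error is FATAL:  database \"kylelf\" does not exist\n",
    "tags": [
'''


def failed_service_checks(text):
    """Return service check records whose status is not 0 (OK)."""
    records, current = [], None
    for line in text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue
        key, raw = match.group("key"), match.group("value")
        if key == "check":  # a "check" key starts a new record
            current = {}
            records.append(current)
        if current is None:
            continue
        try:
            current[key] = json.loads(raw)
        except json.JSONDecodeError:
            pass  # values spanning several lines, such as "tags": [, are skipped
    return [r for r in records if r.get("status", 0) != 0]


for rec in failed_service_checks(SAMPLE):
    when = datetime.fromtimestamp(rec["timestamp"], tz=timezone.utc)
    print(when.strftime("%H:%M:%S"), rec["check"], rec["message"].strip())
```

Run against the whole file instead of SAMPLE (for example open("/tmp/dd.debug").read()), it would surface the failing check and its error message without any scrolling.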

The Epiphany

Suddenly, the missing piece of the puzzle came to light. The feedback delay and the subsequent three-day investigation were due to a simple typo in the Postgres database name. What was initially intended to be a quick check turned into a drawn-out and time-consuming ordeal, all due to delayed error feedback and an inconvenient requirement for host access.

Reflecting on this experience, I couldn't help but think: there has to be a better way.

And indeed, I believe there is.

The Case for a conf.yaml Linter and Quick Tester

My experience highlights the pressing need for a conf.yaml linter and quick tester. Three additions would help:

  1. conf.yaml lint

  2. quick check on conf.yaml connectivity

  3. a UI in Datadog to add a new target - what a thought!
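To make the first two items concrete, here is a rough sketch in Python of what I have in mind. It is not Datadog's code and not a definitive design. It assumes conf.yaml has already been parsed into a dict (for example with a YAML library), that an instance uses the host, port, username, password, dbname and tags keys, and that dbname is treated as required, which is a choice of this sketch. The names lint_instances, quick_check and list_databases are hypothetical; list_databases stands in for a real query of the server's database list through whatever Postgres driver is at hand.

```python
import difflib

# Keys this sketch knows about; the real postgres integration accepts more.
KNOWN_KEYS = {"host", "port", "username", "password", "dbname", "tags"}
# Treating dbname as required is a choice made for this sketch.
REQUIRED_KEYS = {"host", "username", "dbname"}


def lint_instances(config):
    """Return a list of human-readable problems found in a parsed conf.yaml."""
    instances = config.get("instances") if isinstance(config, dict) else None
    if not isinstance(instances, list) or not instances:
        return ["'instances' must be a non-empty list"]
    problems = []
    for i, inst in enumerate(instances):
        if not isinstance(inst, dict):
            problems.append(f"instance {i}: expected a mapping")
            continue
        keys = {str(k) for k in inst}
        for key in sorted(REQUIRED_KEYS - keys):
            problems.append(f"instance {i}: missing '{key}'")
        for key in sorted(keys - KNOWN_KEYS):
            # Suggest the closest known key, which catches misspelled key names.
            hint = difflib.get_close_matches(key, sorted(KNOWN_KEYS), n=1)
            suffix = f" (did you mean '{hint[0]}'?)" if hint else ""
            problems.append(f"instance {i}: unknown key '{key}'{suffix}")
        port = inst.get("port", 5432)
        if not isinstance(port, int) or not 0 < port < 65536:
            problems.append(f"instance {i}: port {port!r} is not a valid TCP port")
    return problems


def quick_check(instance, list_databases):
    """Compare the configured dbname with the names the server reports.

    list_databases is a caller-supplied function (hypothetical here) that
    returns the database names visible on the instance's host.
    """
    available = list(list_databases(instance))
    dbname = instance["dbname"]
    if dbname in available:
        return None
    hint = difflib.get_close_matches(dbname, available, n=1)
    suffix = f"; did you mean '{hint[0]}'?" if hint else ""
    return f"database '{dbname}' does not exist on {instance['host']}{suffix}"


# A misspelled key is caught by the linter alone ("datadog" is a made-up username).
print(lint_instances({"instances": [{"host": "dumbo", "username": "datadog", "dbnmae": "kylelf"}]}))

# A misspelled value needs the server's answer; this database list is made up.
print(quick_check({"host": "dumbo", "dbname": "kylelf"}, lambda inst: ["postgres", "kylehf"]))
```

The split matters: a linter can only catch mistakes in the shape of the file, while a typo in a value such as the database name needs a quick round trip to the server. Both answers can come back in seconds rather than minutes.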
