Diagrams as Code

February 29, 2024

A major pain point in the process of maintaining documentation is that, while a product is in development, documentation tends to go stale quickly. This can occur for a number of reasons:

Engineers don’t know how to create useful documentation
Documentation is kept separately from the work being done
Only a small subset of engineers are tasked with creating and maintaining documentation

The first problem is a large challenge. Learning how to write good documentation is an entire course. Learning how to create good diagrams is an entire course.

Fortunately, the last two problems can be partially addressed relatively easily - by saving diagrams as code.

Definitions

Diagram: A visual representation of an engineered system
Code: Text that makes a machine do things
Version Control: A system that tracks atomic changes to a file

Put that all together:

Diagrams as code: A text file that is parsed to generate an image, and which can be committed to version control.

Why save diagrams as code?

Diagrams should not be an artistic exercise
Diagrams should be version-controlled with reliable tools
Diagrams should be useful for new team members

1. Engineering diagrams aren’t a form of artistic expression

Picture this scenario - you construct a perfect symmetrical system diagram, arranging subsystem components in rounded boxes at the vertices of an equilateral pentagon. It is beautiful; it is pristine.

And then someone decides to add a subsystem.

The solution is simple:

Care less about whether boxes in a diagram line up
Care more about what the boxes actually communicate

2. Diagrams should be committed to version-control

Many WYSIWYG/visual-first tools have poor internal implementations of “version-control”. These tools typically allow a user to “checkpoint” an image manually. However, the checkpoints often have cryptic names, such as “v.203”. If a mistake is made, there is no way to easily figure out the last “good” state of a diagram.

The solution here is to use a text-based diagramming tool, so one may take advantage of fully-featured version-control systems, such as git. Mistakes can be traced with git bisect.The commit history can easily be searched from the command line. All of the powerful capabilities of git can be used to track changes to a diagram.

3. Diagrams should be useful to new team members

Last, and most importantly - diagrams must be useful to new team members. Imagine that a new member joins the team, and needs to understand the architecture of a codebase. Naturally, they will reach for documentation. But they discover that the documentation is out of date.

Stale documentation can be worse than no documentation. New team members cannot distinguish stale from up-to-date documentation, and will develop an incorrect mental model of the system. This can be very difficult to correct once the misunderstanding is complete.

The solution is to keep documentation as close to code as possible. Ideally, it should live in the same repo as the code. A task for every pull request should be to review relevant documentation and include updates if needed. This extra work to maintain documentation in each PR will save significant time trying to re-explain how a system works, to a team member that has learned the wrong information.

How does one actually create a “diagram as code” diagram?

There has recently been a renaissance of “diagram as code” tools. With support from GitHub (including native rendering in repositories), mermaid.js appear to be leading the pack. Other popular options include ZenUML and PlantUML.

But what about tools like LucidChart, diagrams.net, and Microsoft Visio? These tools are popular for remote whiteboarding sessions. Why can’t the outputs of those tools simply be committed to version control?

Tool	Can be VC’d in e.g. git	Text -> Image	Addressable in PR
mermaid.js	Yes	Yes	Yes
ZenUML	Yes	Yes	Yes
PlantUML	Yes	Yes	Yes
draw.io/diagrams.net	Yes	No	No
LucidChart	Yes	No	No
MS Visio	Yes	No	No
Cell phone pictures of whiteboards	Yes	No	No

In the chart above, I have selected the following criteria to compare tools by:

Artifacts can be version-controlled in e.g. git
Artifacts can be defined using pur text, which is then parsed to create a diagram
Artifacts can be atomically addressed in a pull request

In theory, one may commit any file type to version control. In practice, there is limited value to using version-control to track changes to a .svg or .jpeg file type, file types which are used to represent vector graphics and images, respectively. A .svg contains too much non-value-add information, used to describe what a graphic looks like. The signal-to-noise ratio in a diff’d image file is extremely low, in other words.

On the other hand, diff’d text files have a much higher signal-to-noise ratio. Each diff’d character corresponds to a visible change in the generated output of the diagramming tool.

Examples

Enough pedantry, let’s take a look at a couple of examples. I have taken a liking to a tool called mermaid.js lately, so all of the following examples will use that tool.

Sequence Diagrams

…a sequence diagram captures the behavior of a single scenario. The diagram shows a number of example objects and the messages that are passed between these objects within the user case.

– Fowler, Martin. UML Distilled: A Brief Guide to the Standard Object Modeling Language. 3rd ed., 2003

As the textbook definition alludes to, a sequence diagram can be used to described any set of systems that share messages. To keep the analogy concrete, let’s look at an example of a real message transit service.

Let’s consider a system composed of an API subsystem, Platform subsystem, and IoT Service subsystem. The API is responsible for handling the external interface. The Platform is responsible for handling “business logic”. The IoT Service is responsible for hosting the MQTT messaging service.

sequenceDiagram

participant API as API
participant F as Platform
participant IoT as IoT Service

F->>IoT: attempt authenticated connection to MQTT broker
IoT-->>F: confirm connection

loop Every 20s
    F->>API: request messages
    API-->>F: send messages

    F->>IoT: post message to MQTT broker at topic {deviceID}/{msgId}
end

A minimalist, clean, and informative diagram (such as the one above) is created with the following mermaid.js code:

sequenceDiagram

participant API as API
participant F as Platform
participant IoT as IoT Service

F->>IoT: attempt authenticated connection to MQTT broker
IoT-->>F: confirm connection

loop Every 20s
    F->>API: request messages
    API-->>F: send messages

    F->>IoT: post message to MQTT broker at topic {deviceID}/{msgId}
end

What happens if one would like to add a new database service to the diagram, perhaps in-between the Platform and IoT Service subsystems?

sequenceDiagram

participant API as API
participant F as Platform
participant Pg as Postgres DB
participant IoT as IoT Service

F->>IoT: attempt authenticated connection to MQTT broker
IoT-->>F: confirm connection

F->>Pg: attempt authenticated connection to DB
Pg-->>F: confirm connection

loop Every 20s
    F->>Pg: request timestamp of last message pull 
    Pg-->>F: send timestamp

    F->>Pg: update start_timestamp to now

    F->>API: request messages
    API-->>F: send messages

    F->>Pg: request device ID 
    Pg-->>F: send device ID

    F->>IoT: post message to MQTT broker at topic {deviceID}/{msgId}
end

In a traditional WYSIWYG editor, this task could take some time and incur significant frustration, because many distinct GUI elements must be manually moved or re-drawn. Not the case in a text-first diagramming tool:

sequenceDiagram
 
 participant API as API
 participant F as Platform
+participant Pg as Postgres DB
 participant IoT as IoT Service
 
 F->>IoT: attempt authenticated connection to MQTT broker
 IoT-->>F: confirm connection
 
+F->>Pg: attempt authenticated connection to DB
+Pg-->>F: confirm connection
+
 loop Every 20s
+    F->>Pg: request timestamp of last message pull 
+    Pg-->>F: send timestamp
+
+    F->>Pg: update start_timestamp to now
+
     F->>API: request messages
     API-->>F: send messages
 
+    F->>Pg: request device ID 
+    Pg-->>F: send device ID
+
     F->>IoT: post message to MQTT broker at topic {deviceID}/{msgId}
 end

One new participant and a handful of new messages are all that need to be defined, and mermaid.js takes care of figuring out how the boxes and arrows should be arranged. As mentioned earlier, every highlighted line in the diff corresponds to a visible change in the diagram. That’s excellent!

Activity Diagrams

Activity diagrams are a technique to describe procedural logic, business process, and work flow.

– Fowler, Martin. UML Distilled.

Activity diagrams are similar to state diagrams, except that they model the activity of system, as opposed to the various states that a system can exist in. UML purists may cringe at the use of state diagram syntax to describe an activity diagram, but the behavior of a system can still be effectively communicated.

stateDiagram-v2
  # State Definitions
  ## Main start conditions
  Q_cache_exists : Cache exists?
  Q_checkLastRecovery : lastRecoveryAttempt > 15 mins?

  ## Composite States
  mbRecovRoutine : Mailbox Recovery Routines
  msgRetrievalRoutine : Message Retrieval Routines

  ## Mailbox Recovery Routines
  retrieveInvalidMbs : SELECT * FROM mailbox \n WHERE errorMsg IS NOT NULL
  errCorrect : Attempt error correction
  writeLog : Write to log
  deletePgError : UPDATE mailbox SET errorMsg = NULL

  ## Message Retrieval Routines
  retrieveValidMbs : SELECT * FROM mailbox \n WHERE errorMsg IS NULL \n AND updatedAt > global.lastKnownUpdatedAt
  checkMsgs : Check for new messages 
  Q_maxRetryExceed : Max retry exceeded?

  ### Success States
  retrieveMsgs : Retrieve messages from Api
  sendToMqtt : Post messages to MQTT broker

  ### Failure States
  removeMbFromCache : Remove Mailbox from local cache
  writeErrToPg : UPDATE mailbox SET errorMsg = json(error)

  # State Transitions
  ## Start state
  [*] --> Q_cache_exists

  # Mailbox Recovery Routines
  Q_cache_exists --> Q_checkLastRecovery: yes
  Q_checkLastRecovery --> retrieveInvalidMbs: yes
  retrieveInvalidMbs --> mbRecovRoutine
  state mbRecovRoutine {
    [*] --> errCorrect
    errCorrect --> writeLog : correction fails
    writeLog --> [*]
    errCorrect --> deletePgError: correction succeeds
    deletePgError --> [*]
  }

  # Message Retrieval Routines
  Q_cache_exists --> retrieveValidMbs: no
  Q_checkLastRecovery --> retrieveValidMbs : no
  retrieveValidMbs --> msgRetrievalRoutine
  mbRecovRoutine --> retrieveValidMbs
  state msgRetrievalRoutine {
    [*] --> checkMsgs
    checkMsgs --> retrieveMsgs: Mailbox connection succeeds
    checkMsgs --> Q_maxRetryExceed  : Mailbox connection fails
    Q_maxRetryExceed --> checkMsgs : no
    Q_maxRetryExceed --> removeMbFromCache : yes
    removeMbFromCache --> writeErrToPg
    writeErrToPg --> [*]: sleep 15s
    retrieveMsgs --> sendToMqtt
    sendToMqtt --> [*]: sleep 15s
  }

Imagine editing this diagram in a WYSIWYG editor. Not fun. In a text-based diagramming tool, the task is a breeze - this entire diagram can be defined in less than 75 lines of code, including comments for clarity:

stateDiagram-v2
  # State Definitions
  ## Main start conditions
  Q_cache_exists : Cache exists?
  Q_checkLastRecovery : lastRecoveryAttempt > 15 mins?

  ## Composite States
  mbRecovRoutine : Mailbox Recovery Routines
  msgRetrievalRoutine : Message Retrieval Routines

  ## Mailbox Recovery Routines
  retrieveInvalidMbs : SELECT * FROM mailbox \n WHERE errorMsg IS NOT NULL
  errCorrect : Attempt error correction
  writeLog : Write to log
  deletePgError : UPDATE mailbox SET errorMsg = NULL

  ## Message Retrieval Routines
  retrieveValidMbs : SELECT * FROM mailbox \n WHERE errorMsg IS NULL \n AND updatedAt > global.lastKnownUpdatedAt
  checkMsgs : Check for new messages 
  Q_maxRetryExceed : Max retry exceeded?

  ### Success States
  retrieveMsgs : Retrieve messages from Api
  sendToMqtt : Post messages to MQTT broker

  ### Failure States
  removeMbFromCache : Remove Mailbox from local cache
  writeErrToPg : UPDATE mailbox SET errorMsg = json(error)

  # State Transitions
  ## Start state
  [*] --> Q_cache_exists

  ## Mailbox Recovery Routines
  Q_cache_exists --> Q_checkLastRecovery: yes
  Q_checkLastRecovery --> retrieveInvalidMbs: yes
  retrieveInvalidMbs --> mbRecovRoutine
  state mbRecovRoutine {
    [*] --> errCorrect
    errCorrect --> writeLog : correction fails
    writeLog --> [*]
    errCorrect --> deletePgError: correction succeeds
    deletePgError --> [*]
  }

  ## Message Retrieval Routines
  Q_cache_exists --> retrieveValidMbs: no
  Q_checkLastRecovery --> retrieveValidMbs : no
  retrieveValidMbs --> msgRetrievalRoutine
  mbRecovRoutine --> retrieveValidMbs
  state msgRetrievalRoutine {
    [*] --> checkMsgs
    checkMsgs --> retrieveMsgs: Mailbox connection succeeds
    checkMsgs --> Q_maxRetryExceed  : Mailbox connection fails
    Q_maxRetryExceed --> checkMsgs : no
    Q_maxRetryExceed --> removeMbFromCache : yes
    removeMbFromCache --> writeErrToPg
    writeErrToPg --> [*]: sleep 15s
    retrieveMsgs --> sendToMqtt
    sendToMqtt --> [*]: sleep 15s
  }

Conclusion

Prefer diagrams as code.

It makes developers want to work on documentation, because it looks like (and is) code.
It allows one to take advantage of powerful open-source version-control tools, such as git.
It helps documentation stay up-to-date and remain useful for new team members.