Ashwin Sundar

Diagrams as Code

A major pain point in the process of maintaining documentation is that, while a product is in development, documentation tends to go stale quickly. This can occur for a number of reasons:

The first problem is a large challenge. Learning how to write good documentation is an entire course. Learning how to create good diagrams is an entire course.

Fortunately, the last two problems can be partially addressed relatively easily - by saving diagrams as code.

Definitions

Put that all together:

Why save diagrams as code?

1. Engineering diagrams aren’t a form of artistic expression

Picture this scenario - you construct a perfect symmetrical system diagram, arranging subsystem components in rounded boxes at the vertices of an equilateral pentagon. It is beautiful; it is pristine.

And then someone decides to add a subsystem.

The solution is simple:

2. Diagrams should be committed to version-control

Many WYSIWYG/visual-first tools have poor internal implementations of “version-control”. These tools typically allow a user to “checkpoint” an image manually. However, the checkpoints often have cryptic names, such as “v.203”. If a mistake is made, there is no easy way to figure out the last “good” state of a diagram.

The solution here is to use a text-based diagramming tool, so that one can take advantage of a fully-featured version-control system such as git. Mistakes can be traced with git bisect. The commit history can easily be searched from the command line. All of the powerful capabilities of git can be used to track changes to a diagram.
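To make the bisect idea concrete, here is a minimal sketch of the binary search that git bisect performs, run over an in-memory list instead of a real repository. The function name, the revision history, and the "bad" check are all hypothetical and exist only for illustration; in practice git itself walks the commits for you.

```python
def first_bad_revision(revisions, is_bad):
    """Return the index of the first revision for which is_bad() is true.

    Assumes revisions[0] is good, revisions[-1] is bad, and that every
    revision after the first bad one is also bad (the same assumption
    that git bisect relies on).
    """
    good, bad = 0, len(revisions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(revisions[mid]):
            bad = mid   # the mistake was introduced at mid or earlier
        else:
            good = mid  # the mistake was introduced after mid
    return bad


# Hypothetical history of a diagram file, oldest revision first.
history = [
    "participant API\nparticipant F",
    "participant API\nparticipant F\nparticipant IoT",
    "participant API\nparticipant IoT",  # Platform (F) removed by mistake
    "participant API\nparticipant IoT\nparticipant Pg",
]


def platform_missing(text):
    """Hypothetical check: the Platform participant has disappeared."""
    return "participant F" not in text


culprit = first_bad_revision(history, platform_missing)
print(culprit)
```

Because the diagram is plain text, the "is this revision bad?" check can be as simple as a substring search, which is not possible with a binary image checkpoint.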

3. Diagrams should be useful to new team members

Last, and most importantly - diagrams must be useful to new team members. Imagine that a new member joins the team, and needs to understand the architecture of a codebase. Naturally, they will reach for documentation. But they discover that the documentation is out of date.

Stale documentation can be worse than no documentation. New team members cannot distinguish stale from up-to-date documentation, and will develop an incorrect mental model of the system. This can be very difficult to correct once the misunderstanding is complete.

The solution is to keep documentation as close to code as possible. Ideally, it should live in the same repo as the code. Every pull request should include a task to review relevant documentation and update it if needed. This extra work in each PR is small compared to the time it takes to re-explain how a system works to a team member who has learned the wrong information.

How does one actually create a “diagram as code” diagram?

There has recently been a renaissance of “diagram as code” tools. With support from GitHub (including native rendering in repositories), mermaid.js appears to be leading the pack. Other popular options include ZenUML and PlantUML.

But what about tools like LucidChart, diagrams.net, and Microsoft Visio? These tools are popular for remote whiteboarding sessions. Why can’t the outputs of those tools simply be committed to version control?

Tool                                  Can be VC’d in e.g. git   Text -> Image   Addressable in PR
mermaid.js                            Yes                       Yes             Yes
ZenUML                                Yes                       Yes             Yes
PlantUML                              Yes                       Yes             Yes
draw.io/diagrams.net                  Yes                       No              No
LucidChart                            Yes                       No              No
MS Visio                              Yes                       No              No
Cell phone pictures of whiteboards    Yes                       No              No

In the table above, I compare the tools by the following criteria:

In theory, one may commit any file type to version control. In practice, there is limited value in using version control to track changes to a .svg or .jpeg file (formats used for vector graphics and raster images, respectively). A .svg file is mostly made up of information that describes what a graphic looks like, which adds no value when reviewing a change. In other words, the signal-to-noise ratio in a diff’d image file is extremely low.

On the other hand, diff’d text files have a much higher signal-to-noise ratio. Each diff’d character corresponds to a visible change in the generated output of the diagramming tool.
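As a rough illustration of that signal-to-noise claim, the sketch below uses Python's standard difflib module to diff two revisions of a small diagram definition. The file names before.mmd and after.mmd and the helper changed_lines are assumptions made up for this example; in a real repository, git diff would do this work.

```python
import difflib

# Two revisions of a small diagram definition; the second adds one participant.
before = """sequenceDiagram
participant API as API
participant F as Platform
participant IoT as IoT Service
""".splitlines()

after = """sequenceDiagram
participant API as API
participant F as Platform
participant Pg as Postgres DB
participant IoT as IoT Service
""".splitlines()


def changed_lines(old, new):
    """Return only the added and removed lines of a unified diff."""
    diff = list(difflib.unified_diff(
        old, new, fromfile="before.mmd", tofile="after.mmd", lineterm=""))
    body = diff[2:]  # skip the two file header lines ("---" and "+++")
    return [line for line in body if line.startswith(("+", "-"))]


changes = changed_lines(before, after)
print(changes)
```

The diff contains a single added line, and that line maps directly to one new box in the rendered diagram.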

Examples

Enough pedantry. Let’s take a look at a couple of examples. I have taken a liking to a tool called mermaid.js lately, so all of the following examples will use that tool.

Sequence Diagrams

…a sequence diagram captures the behavior of a single scenario. The diagram shows a number of example objects and the messages that are passed between these objects within the use case.

Fowler, Martin. UML Distilled: A Brief Guide to the Standard Object Modeling Language. 3rd ed., 2003

As the textbook definition alludes to, a sequence diagram can be used to describe any set of systems that share messages. To keep things concrete, let’s look at an example of a real message transit service.

Let’s consider a system composed of an API subsystem, Platform subsystem, and IoT Service subsystem. The API is responsible for handling the external interface. The Platform is responsible for handling “business logic”. The IoT Service is responsible for hosting the MQTT messaging service.

sequenceDiagram

participant API as API
participant F as Platform
participant IoT as IoT Service

F->>IoT: attempt authenticated connection to MQTT broker
IoT-->>F: confirm connection

loop Every 20s
    F->>API: request messages
    API-->>F: send messages

    F->>IoT: post message to MQTT broker at topic {deviceID}/{msgId}
end

A minimalist, clean, and informative diagram (such as the one above) is created with the following mermaid.js code:

sequenceDiagram

participant API as API
participant F as Platform
participant IoT as IoT Service

F->>IoT: attempt authenticated connection to MQTT broker
IoT-->>F: confirm connection

loop Every 20s
    F->>API: request messages
    API-->>F: send messages

    F->>IoT: post message to MQTT broker at topic {deviceID}/{msgId}
end

What happens if one would like to add a new database service to the diagram, perhaps in-between the Platform and IoT Service subsystems?

sequenceDiagram

participant API as API
participant F as Platform
participant Pg as Postgres DB
participant IoT as IoT Service

F->>IoT: attempt authenticated connection to MQTT broker
IoT-->>F: confirm connection

F->>Pg: attempt authenticated connection to DB
Pg-->>F: confirm connection

loop Every 20s
    F->>Pg: request timestamp of last message pull 
    Pg-->>F: send timestamp

    F->>Pg: update start_timestamp to now

    F->>API: request messages
    API-->>F: send messages

    F->>Pg: request device ID 
    Pg-->>F: send device ID

    F->>IoT: post message to MQTT broker at topic {deviceID}/{msgId}
end

In a traditional WYSIWYG editor, this task could take some time and incur significant frustration, because many distinct GUI elements must be manually moved or re-drawn. Not the case in a text-first diagramming tool:

sequenceDiagram
 
 participant API as API
 participant F as Platform
+participant Pg as Postgres DB
 participant IoT as IoT Service
 
 F->>IoT: attempt authenticated connection to MQTT broker
 IoT-->>F: confirm connection
 
+F->>Pg: attempt authenticated connection to DB
+Pg-->>F: confirm connection
+
 loop Every 20s
+    F->>Pg: request timestamp of last message pull 
+    Pg-->>F: send timestamp
+
+    F->>Pg: update start_timestamp to now
+
     F->>API: request messages
     API-->>F: send messages
 
+    F->>Pg: request device ID 
+    Pg-->>F: send device ID
+
     F->>IoT: post message to MQTT broker at topic {deviceID}/{msgId}
 end

One new participant and a handful of new messages are all that need to be defined, and mermaid.js takes care of figuring out how the boxes and arrows should be arranged. As mentioned earlier, every highlighted line in the diff corresponds to a visible change in the diagram. That’s excellent!

Activity Diagrams

Activity diagrams are a technique to describe procedural logic, business process, and work flow.

Fowler, Martin. UML Distilled.

Activity diagrams are similar to state diagrams, except that they model the activity of a system, as opposed to the various states that a system can exist in. UML purists may cringe at the use of state diagram syntax to describe an activity diagram, but the behavior of a system can still be effectively communicated.

stateDiagram-v2
  # State Definitions
  ## Main start conditions
  Q_cache_exists : Cache exists?
  Q_checkLastRecovery : lastRecoveryAttempt > 15 mins?

  ## Composite States
  mbRecovRoutine : Mailbox Recovery Routines
  msgRetrievalRoutine : Message Retrieval Routines

  ## Mailbox Recovery Routines
  retrieveInvalidMbs : SELECT * FROM mailbox \n WHERE errorMsg IS NOT NULL
  errCorrect : Attempt error correction
  writeLog : Write to log
  deletePgError : UPDATE mailbox SET errorMsg = NULL

  ## Message Retrieval Routines
  retrieveValidMbs : SELECT * FROM mailbox \n WHERE errorMsg IS NULL \n AND updatedAt > global.lastKnownUpdatedAt
  checkMsgs : Check for new messages 
  Q_maxRetryExceed : Max retry exceeded?

  ### Success States
  retrieveMsgs : Retrieve messages from Api
  sendToMqtt : Post messages to MQTT broker

  ### Failure States
  removeMbFromCache : Remove Mailbox from local cache
  writeErrToPg : UPDATE mailbox SET errorMsg = json(error)

  # State Transitions
  ## Start state
  [*] --> Q_cache_exists

  ## Mailbox Recovery Routines
  Q_cache_exists --> Q_checkLastRecovery: yes
  Q_checkLastRecovery --> retrieveInvalidMbs: yes
  retrieveInvalidMbs --> mbRecovRoutine
  state mbRecovRoutine {
    [*] --> errCorrect
    errCorrect --> writeLog : correction fails
    writeLog --> [*]
    errCorrect --> deletePgError: correction succeeds
    deletePgError --> [*]
  }

  ## Message Retrieval Routines
  Q_cache_exists --> retrieveValidMbs: no
  Q_checkLastRecovery --> retrieveValidMbs : no
  retrieveValidMbs --> msgRetrievalRoutine
  mbRecovRoutine --> retrieveValidMbs
  state msgRetrievalRoutine {
    [*] --> checkMsgs
    checkMsgs --> retrieveMsgs: Mailbox connection succeeds
    checkMsgs --> Q_maxRetryExceed  : Mailbox connection fails
    Q_maxRetryExceed --> checkMsgs : no
    Q_maxRetryExceed --> removeMbFromCache : yes
    removeMbFromCache --> writeErrToPg
    writeErrToPg --> [*]: sleep 15s
    retrieveMsgs --> sendToMqtt
    sendToMqtt --> [*]: sleep 15s
  }

Imagine editing this diagram in a WYSIWYG editor. Not fun. In a text-based diagramming tool, the task is a breeze - this entire diagram can be defined in fewer than 75 lines of code, including comments for clarity:

stateDiagram-v2
  # State Definitions
  ## Main start conditions
  Q_cache_exists : Cache exists?
  Q_checkLastRecovery : lastRecoveryAttempt > 15 mins?

  ## Composite States
  mbRecovRoutine : Mailbox Recovery Routines
  msgRetrievalRoutine : Message Retrieval Routines

  ## Mailbox Recovery Routines
  retrieveInvalidMbs : SELECT * FROM mailbox \n WHERE errorMsg IS NOT NULL
  errCorrect : Attempt error correction
  writeLog : Write to log
  deletePgError : UPDATE mailbox SET errorMsg = NULL

  ## Message Retrieval Routines
  retrieveValidMbs : SELECT * FROM mailbox \n WHERE errorMsg IS NULL \n AND updatedAt > global.lastKnownUpdatedAt
  checkMsgs : Check for new messages 
  Q_maxRetryExceed : Max retry exceeded?

  ### Success States
  retrieveMsgs : Retrieve messages from Api
  sendToMqtt : Post messages to MQTT broker

  ### Failure States
  removeMbFromCache : Remove Mailbox from local cache
  writeErrToPg : UPDATE mailbox SET errorMsg = json(error)

  # State Transitions
  ## Start state
  [*] --> Q_cache_exists

  ## Mailbox Recovery Routines
  Q_cache_exists --> Q_checkLastRecovery: yes
  Q_checkLastRecovery --> retrieveInvalidMbs: yes
  retrieveInvalidMbs --> mbRecovRoutine
  state mbRecovRoutine {
    [*] --> errCorrect
    errCorrect --> writeLog : correction fails
    writeLog --> [*]
    errCorrect --> deletePgError: correction succeeds
    deletePgError --> [*]
  }

  ## Message Retrieval Routines
  Q_cache_exists --> retrieveValidMbs: no
  Q_checkLastRecovery --> retrieveValidMbs : no
  retrieveValidMbs --> msgRetrievalRoutine
  mbRecovRoutine --> retrieveValidMbs
  state msgRetrievalRoutine {
    [*] --> checkMsgs
    checkMsgs --> retrieveMsgs: Mailbox connection succeeds
    checkMsgs --> Q_maxRetryExceed  : Mailbox connection fails
    Q_maxRetryExceed --> checkMsgs : no
    Q_maxRetryExceed --> removeMbFromCache : yes
    removeMbFromCache --> writeErrToPg
    writeErrToPg --> [*]: sleep 15s
    retrieveMsgs --> sendToMqtt
    sendToMqtt --> [*]: sleep 15s
  }

Conclusion

Prefer diagrams as code.