Ashwin Sundar

Web App Performance Investigation

Problem

A Django web application is experiencing performance issues. Baseline tests show that the average latency to access an authenticated page is approximately 1500 milliseconds. This is not acceptable performance for a user interface1.

Approach

When faced with an engineering problem, a great place to start is by formulating tests that bisect the problem space. This allows one to apply binary search2 to the space, which is the same method that git bisect3 uses to assist a programmer in identifying the commit that introduced a bug.

Problem Space

The developer believes that one of the following are the primary contributor to the high latency of the application:

The first round of testing was focused around answering the following question:

Does the database configuration contribute significantly to application latency?

The decision to target the database was driven by an observation that latency seemed to have increased significantly after a code commit which added authentication. The auth implementation required new queries of the auth database each time a page was loaded, to determine whether a user is permitted to see a page. It appears that these additional queries were slowing down the performance of the application.

What to measure

The problem is specific to web page loading. Modern browsers offer a suite of developer tools for profiling and analyzing website performance. The Network tab in Firefox was chosen as the measurement tool, with the Timings column chosen to calculate the dependent variable in each test.

Parameters

All tests were run under the same fixed conditions - from within a Docker container created from an image that was built off the same commit. The Docker container was run from the developer’s local machine. The version of the database was held constant as well (Postgres 16).

The Django application is hosted on a cloud provider (AWS, GCP, etc…). The cloud provider allows the following parameters for the database to be modified:

To minimize the number of tests which needed to be run, values for storage and RAM were selected which allowed the dependent variable (latency) to be interpolated at multiple points. In other words, these values were selected near the edges of the response surface.

Thus, the following tests were developed:

Test Name CPU + disk type Storage RAM
Test 1 (baseline) Shared Regular + SSD 10 GB 1 GB
Test 2 Shared Regular + SSD 40 GB 2 GB
Test 3 Shared AMD + NVMe 10 GB 1 GB
Test 4 Shared AMD + NVMe 40 GB 2 GB
Test 5 Shared Intel + NVMe 10 GB 1 GB
Test 6 Shared Intel + NVMe 40 GB 2 GB

Results

The best-laid schemes o’ mice an’ men

Gang aft agley

– Robert Burns, To a Mouse (1785)

Unbeknownst to the developer, the cloud provider does not permit a database to be “down-provisioned” - the storage capacity can only be increased, not decreased. As a result, the tests could not be executed in the order planned.

After several tests with minimal impact on latency, the developer made a key observation: the database server and application server did not reside in the same location. The database resided in the NYC3 datacenter, while the app server resided in the SFO3 datacenter. These locations are separated by nearly 3000 miles. Any time the app server needed to make a database query, a request was sent in one direction across the country, and the response traveled all the way back across the country in the other direction.

The developer decided to pivot, and investigate a different parameter of the database - the location of the server. Thus, the following tests were run:

Test Name CPU + disk type Storage RAM database location web server location
Baseline 1 Shared Regular + SSD 10 GB 1 GB NYC3 Colorado
Baseline 2 Shared AMD + NVMe 45 GB 1 GB NYC3 SFO3
Test 1 Shared Regular + SSD 40 GB 2 GB NYC3 Colorado
Test 2 Shared AMD + NVMe 45 GB 1 GB NYC3 Colorado
Test 3 Shared AMD + NVMe 45 GB 1 GB SFO3 Colorado
Test 4 Shared AMD + NVMe 45 GB 1 GB SFO3 SFO3

Data

Each test was run 10 times each, on multiple pages in the Django application. Each page that was loaded required a user to be authenticated in to the system. For baseline 1, only data on the homepage could be collected (due to a mistake by the developer). The following is the summarized results:

Baseline 1

Page Load time +/- 2σ (milliseconds)
Homepage -
View 1 1495 +/- 57 ms
View 2 -
View 3 -
View 4 -

Baseline 2

Page Load time
Homepage 1080 ms +/- 112 ms
View 1 1871 ms +/- 102 ms
View 2 2331 +/- 83 ms
View 3 2165 +/- 75 ms
View 4 2059 +/- 72 ms

Test 1

Page Load time
Homepage 812 +/- 22 ms
View 1 1484 +/- 104 ms
View 2 1840 +/- 42 ms
View 3 1704 +/- 93 ms
View 4 1656 +/- 76 ms

Test 2

Page Load time
Homepage 833 +/- 72 ms
View 1 1455 +/- 31 ms
View 2 1830 +/- 61 ms
View 3 1692 +/- 30 ms
View 4 1657 +/- 69 ms

Test 3

Page Load time
Homepage 577 +/- 25 ms
View 1 1027 +/- 72 ms
View 2 1278 +/- 39 ms
View 3 1192 +/- 50 ms
View 4 1141 +/- 33 ms

Test 4

Page Load time
Homepage 94 +/- 27 ms
View 1 160 +/- 25 ms
View 2 193 +/- 89 ms
View 3 170 +/- 47 ms
View 4 123 +/- 33 ms

Conclusion

The primary determinant of page load latency was proximity of database to web server. In test 3, the database was migrated from the NYC3 datacenter to the SFO3 datacenter, which is approximately 40% closer to the web server in Colorado. The latency to load VIew 1 also declined by approximately 30%, indicating a negative correlation for distance between servers and page latency.

In test 4, latency declined by nearly an order of magnitude compared to, Baseline 1. The difference in this test is that the database server and web server were co-located in the SFO3 datacenter.

This exercise of systematically changing factors that may contribute to database performance was productive, as it helped isolate a (monetarily) free change that could be made - which is to migrate the database to the same datacenter as the web server. It also helped confirm that with the current application setup and demands, the paid upgrades (RAM, storage, disk type) were not worth investing in at this time. Finally, this set of tests can be repeated in the future if application latency becomes an issue again.


  1. https://www.nngroup.com/articles/response-times-3-important-limits/ ↩︎

  2. https://en.wikipedia.org/wiki/Binary_search_algorithm ↩︎

  3. https://git-scm.com/docs/git-bisect ↩︎