It's Always the Database
Note: This was (unintentionally) part of the Production vs Tutorial Code Series.
Story Time
Today our website was a bit slow, caused (as usual) by an N+1
query.
Unfortunately, a version of this query runs on every single request
And a specific subset of users with an average N
of ~60
started hitting our website with twice our normal load
And the query in question was running in a separate rails microservice, against a separate database that apparently doesn’t appreciate a ton of quick small reads.
This caused that service’s response times to grow to 20x their normal levels.
This caused our app’s response times to ratchet up to 10x our normal levels.
This caused our requests to back up in our puma worker queues.
This caused our health check requests to be considered failed due to timeouts.
This caused our load balancer to cycle our various instances constantly.
This caused additional load per instance, and longer request queues.
Eventually, this subset of users went to lunch and our traffic died down enough to be manageable.
Then they came back from lunch, and we were cycling again.
However, over lunch, I had found and patched the N+1
query using ActiveRecord’s preload
method.
So once that deployed (sadly almost 4 hours after the incident started) we went from cycling instances to normal response times under identical load conditions.
🎉
Explanation
N+1
is a term for the kind of performance bottleneck you find when you do work on each item in a list.
The 1
is the work to get the list itself, and there are N
items in the list, so you do N+1
units of work (roughly).
Most often in websites, “work” looks like a DB query or API call.
# controller
def items
@items ||= Item.all # Runs 1 query
end
<%# view %>
<ul>
<% items.each do |item| %>
<li id="<%= item.id %>">
<%# Runs 1 query per item (N total) %>
<%= item.owner.full_name %>
</li>
<% end %>
</ul>
In this example, we do one query in the controller and then we iterate over the results in the view.
This looks safe enough, but item.owner
is an association, so it triggers a new query for each item in the list.
The solution here is to use something like preload
to load all the owners in a single query and cache the associations on the items.
# controller
def items
# Runs 2 queries, one for items, and one for owners of those items
@items ||= Item.all.preload(:owner)
end
Application
A common starting point when trying to figure out why a particular page is slow is to assume “it’s always the database” until proven wrong. Look for loops, especially in your view code, and especially when you pull lots of data off items in that loop.
Especially bad is when you need to chain calls through multiple layers of associations:
<%# view %>
<ul>
<% items.each do |item| %>
<li id="<%= item.id %>">
<%= item.owner.full_name %><br />
From <%= item.owner.address.city %>, <%= item.owner.address.state_shortcode %>
</li>
<% end %>
</ul>
But thankfully preload
still handles this.
def items
@items ||= Item.all.preload(:owner).preload(owner: :address)
end
Tool Recommendation
You can use the bullet gem to help find your N+1 queries while running your code locally or in tests/pipelines.
This can be super useful in a codebase you are unfamiliar with or if you are new to performance optimization.
Caveat
It is important to point out… this was an exceptional case.
Normally N
has to be in the 1000s or you have to have deeply nested loops with many associations before you get slowdowns like we saw.
With an N
less than 100, I never would have dreamed there would be an issue.
But as soon as that query fix went out, even though the traffic actually continued increasing for a while, we had zero issues.
This is why good practices are important. You never know when or how the stars will align and reveal performance bottlenecks.