H12 errors, blocked dynos and Heroku website claims Money Trees (Rap Genius Response to Heroku) by LEMON 12

Read the thread yourself

Tim also wrote a blog post documenting the experience:

The admin section of the app I recently moved over to Heroku is used daily by 20 or so employees. Their workflow has them making a few longer-running requests to the app for report generation, sending emails, and file uploads. Most of these requests don’t take longer than 5-10 seconds and that was never a problem, but now it is. If the app has 5 dynos and one request takes 15 seconds, in the first second 20% of the requests to the app will have a 15 second delay. The next second, 20% of the app’s requests will have a 14 second delay and so on. The other 4 dynos may be available, but that one dyno will have a large and growing backlog. A simple request to the front page of the site that should take ~200ms could take over 15s.
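The arithmetic in Tim's example can be checked with a rough sketch. It assumes the router picks a dyno uniformly at random, that dyno 0 is tied up by the 15-second request, and it ignores the extra backlog that queued requests add themselves (so real delays would be at least this large). The function names and the arrival rate are made up for illustration.

```python
import random


def blocked_delay(arrival_s, long_request_s=15.0):
    """Seconds a request waits if it lands behind the long-running request."""
    return max(0.0, long_request_s - arrival_s)


def simulate(num_dynos=5, long_request_s=15.0, arrivals_per_s=100, seed=0):
    """Route arrivals uniformly at random; dyno 0 is busy until long_request_s.

    Returns one (second, fraction_sent_to_busy_dyno, delay_s) tuple per whole
    second while the long request is still running.
    """
    rng = random.Random(seed)
    rows = []
    for second in range(int(long_request_s)):
        # Count arrivals in this second that the router sends to the busy dyno.
        hits = sum(
            1 for _ in range(arrivals_per_s) if rng.randrange(num_dynos) == 0
        )
        rows.append(
            (second, hits / arrivals_per_s, blocked_delay(second, long_request_s))
        )
    return rows


if __name__ == "__main__":
    for second, fraction, delay in simulate():
        print(f"t={second:2d}s  {fraction:5.1%} of requests wait about {delay:.0f}s")
```

With 5 dynos the fraction printed each second should hover around 20%, while the delay counts down from 15 seconds, which matches the description above.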

At this point I have 3 options if I want to remain on Heroku: (1) optimize these report generators to the point they all take less than 1s (easier said than done); (2) have the request send the report to Delayed::Job, which saves the report output to S3 (introduces more lag for the employee); (3) duplicate the app on Heroku and send all admin requests to this second app that the public never hits.

Heroku is a great service and the purpose of this post is not to speak badly of them, but to highlight the current backlog queue situation and give an explanation to anyone else researching the same strange behavior. I do hope, though, that this may encourage Heroku to update their documentation and focus on getting a new backlog queue in place.


Heroku claims that though they received reports of unexplained latency over the past couple of years, they weren’t able to figure out the request queuing issue until they read the Rap Genius article Money Trees (Rap Genius Response to Heroku) by LEMON 12

They say:

Over the past couple of years Heroku customers have occasionally reported unexplained latency on Heroku. There are many causes of latency—some of them have nothing to do with Heroku—but until this week, we failed to see a common thread among these reports


Working to better support concurrent-request Rails apps on Cedar Routing Performance Update by Jesper Joergensen (Ft. Heroku & LEMON) 9

Why isn’t Unicorn the default webserver on Cedar if Cedar was explicitly designed with Unicorn-like webservers in mind?

One thought: dynos only have 512MB of RAM, so depending on your app’s memory footprint you might not even be able to run 2 Unicorn workers per dyno (let alone the 4 you would need to get reasonable throughput).
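To make the memory constraint concrete, here is a back-of-the-envelope sketch. The 512MB dyno size comes from the note above; the per-worker footprints and the `master_mb` overhead parameter are hypothetical inputs, not measurements from any real app.

```python
def max_unicorn_workers(dyno_mb=512, worker_mb=300, master_mb=0):
    """How many worker processes of a given size fit in one dyno's RAM.

    worker_mb and master_mb are assumed numbers; measure your own app.
    """
    if worker_mb <= 0:
        raise ValueError("worker_mb must be positive")
    usable_mb = max(0, dyno_mb - master_mb)
    return usable_mb // worker_mb


if __name__ == "__main__":
    for footprint in (120, 200, 300):
        workers = max_unicorn_workers(worker_mb=footprint)
        print(f"{footprint}MB per worker: {workers} worker(s) per dyno")
```

Under these assumptions, an app whose workers each need about 300MB fits only one per dyno, so switching to Unicorn would not buy it any concurrency.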


UPDATE 2/17: Rap Genius responds to Heroku's 2nd apology Bamboo Routing Performance by Oren Teich (Ft. Heroku & LEMON) 4


UPDATE 2/17: Rap Genius responds to Heroku's 2nd apology Heroku's Ugly Secret by James Somers (Ft. Andrew Warner, ATodd, Chrissy & LEMON) 41


Specifically, our router logs captured the service time and the depth of the per app request queue and present that to customers, who in turn were relying on these metrics to determine scaling needs. However, as the cluster grew, the time-and-depth metric for an individual router was no longer a relevant way to determine latency in your app. Routing Performance Update by Jesper Joergensen (Ft. Heroku & LEMON) 9

E.g., New Relic, which costs Rap Genius $8,000 / month, always reported 0ms spent in queue:

even though we were actually spending more time in queue than processing requests:

(we got these new measurements by installing our new gem, heroku-true-relic. See http://rapgenius.com/1506509 for more)


Adding metrics that let customers determine queuing impact on application response times Routing Performance Update by Jesper Joergensen (Ft. Heroku & LEMON) 9

Rap Genius has released a gem called heroku-true-relic to patch New Relic to display the actual request queue times

These new accurate queue numbers confirm the results of our simulations: we are currently running 250 dynos (monthly bill: $27,000) with an average throughput of ~11000 requests per minute

A simulation with those numbers estimates that the average queue time should be around 290ms, which is very close to the 324ms average New Relic now reports. We don’t have a ton of data yet with the accurate request queueing, but that’s pretty close!
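For readers who want to try this kind of estimate themselves, below is a minimal sketch of a random-routing queue simulation. It is not the simulation Rap Genius ran. The dyno count and throughput come from the text above, but the Poisson arrivals, the exponential service times, and the 300ms mean service time are all assumptions, so it should not be expected to reproduce the 290ms figure.

```python
import random


def mean_queue_time(num_dynos=250, requests_per_min=11000,
                    mean_service_s=0.3, num_requests=50000, seed=1):
    """Average seconds a request waits in its dyno's queue under random routing.

    Each dyno handles one request at a time (a non-concurrent web server).
    Arrival and service distributions are assumed, not taken from the source.
    """
    rng = random.Random(seed)
    arrival_rate = requests_per_min / 60.0   # requests per second, whole app
    free_at = [0.0] * num_dynos              # when each dyno next becomes idle
    now = 0.0
    total_wait = 0.0
    for _ in range(num_requests):
        now += rng.expovariate(arrival_rate)     # next arrival (Poisson process)
        dyno = rng.randrange(num_dynos)          # router picks a dyno at random
        start = max(now, free_at[dyno])          # wait if that dyno is busy
        total_wait += start - now
        free_at[dyno] = start + rng.expovariate(1.0 / mean_service_s)
    return total_wait / num_requests


if __name__ == "__main__":
    wait_ms = mean_queue_time() * 1000
    print(f"mean queue time: {wait_ms:.0f} ms (under the assumed distributions)")
```

Varying `num_dynos` or `mean_service_s` shows how quickly queue time grows once random routing starts stacking requests behind busy dynos while others sit idle.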


Some frameworks, like Rails, are not concurrent in their default configurations Routing Performance Update by Jesper Joergensen (Ft. Heroku & LEMON) 9

Heroku’s default webserver for Rails apps on both Bamboo and Cedar is Thin, which is not concurrent.

This means that Heroku’s misstep affected every one of its Rails customers that didn’t change their webserver to Unicorn or Puma

(I’d be curious to know the actual percentage of Rails Cedar apps running on concurrent web servers)


Discrepancies between documented and observed behaviors Routing Performance Update by Jesper Joergensen (Ft. Heroku & LEMON) 9

Excerpts from Heroku’s docs:

The heroku.com stack only supports single threaded requests. Even if your application were to fork and support handling multiple requests at once, the routing mesh will never serve more than a single request to a dyno at a time.

See http://rapgenius.com/1501932


Mismatch between reported queuing and service time metrics and the observed reality Routing Performance Update by Jesper Joergensen (Ft. Heroku & LEMON) 9

Heroku’s logs had an entry for the time a request spent in queue and that number was always 0. See http://rapgenius.com/1501395

Similarly, New Relic had stats on time spent in queue that were always 0:
