> Here’s an exchange I had on Twitter a few months ago:
The purple account is just plain wrong. Classically, the full architecture is this (keeping in mind that all rules are sometimes broken):
* CQRS is the linchpin.
* You generally only queue commands (writes). A few hundred ms of latency on those typically won't be noticed by users.
* Reads happen from either a read replica or cache.
The problems the author faces are caused by cherry-picking bits of the full picture.
A queue is a load-smoothing operator. Things are going to go bad one way or another if you exceed capacity; a queue at least guarantees progress (up to a point). Queue depth is also a great metric for scaling your worker count.
> What will you do when your queue is full
If your queue fills up you need to start rejecting requests. If you have a public-facing API, there's a good chance there will be badly behaved clients that don't back off correctly - so you'll need a way to IP-ban them until things calm down. AWS's API Gateway and Azure's APIM can help with this.
If you're separating commands and queries you should _typically_ see more headroom.
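A minimal sketch of the "reject when full" behavior described above, using Python's stdlib queue (the names and queue depth are made up for illustration):

```python
import queue

# Hypothetical bounded command queue that sheds load instead of
# blocking when it fills up.
commands = queue.Queue(maxsize=3)

def submit(command):
    """Enqueue a command, or reject it (e.g. HTTP 429) when full."""
    try:
        commands.put_nowait(command)
        return "accepted"
    except queue.Full:
        return "rejected"  # caller should back off and retry

results = [submit(f"cmd-{i}") for i in range(5)]
print(results)  # first 3 accepted, the rest rejected
```

The rejection path is where you'd hang the rate limiting / IP-ban logic for clients that don't back off.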
mhawthorne 9 hours ago
Agree that CQRS seems like a useful way to partition writes from reads, aka slower requests from faster requests, to avoid many fast requests waiting in line behind a few slow ones.
But even if you shifted reads to one or more caches or read replicas, wouldn't those also have queues that will fill up when you are under-provisioned?
Note that I'm using the term "queue" pretty loosely, to include things like Redis' maxclients or tcp-backlog, or client-side queues when all connections are in use.
zamalek 7 hours ago
Absolutely. That's typically a good problem to have :). Hopefully you would have had gradual enough growth to implement elastic scaling before this is an issue, but you're definitely eventually screwed and have to outright copy what the likes of FAANG do - your startup is a unicorn at that point, so you'd probably already have the talent hired.
mrngm 4 days ago
That reminds me of this talk[0] by Gil Tene called "How NOT to Measure Latency" at the Strangeloop conference in 2015 (or read this blog post[1] that contains the most important points).
Author here. That was a great article, thanks for sharing. Especially the part about how your probability of experiencing a p99 latency is much higher than you'd intuit.
I don't agree with all of it, but definitely a few points made directly or indirectly hit home, such as:
- there is no single metric that can accurately represent "latency"
- most of our metrics are misleading in what they unconsciously include or exclude
I can remember once looking at a graph of requests/second and wishing I could see a distribution of requests per millisecond within an individual second. That level of detail is hard to come by, so in the meanwhile, we do what we can with the data we have.
mrngm 7 hours ago
If you have individual request logs with timing information, you could construct that afterwards. It does take some effort to find an effective way of displaying these metrics. Where would you put an individual request that took 532ms and started at t=34.682s? Would you align all requests that started in the 34th second at t=34s, or look at completion time (i.e. within t=35s)?
Would you rather see "number of requests started at this ms" (you seem to suggest this), or is something else more interesting?
I think a sort of Gantt chart that plots duration of requests as well as starting time within the time span (e.g. a second or more) might be very informative. Each individual request on a different position on the Y axis, time on the X axis. Perhaps you have some bound on requests in flight, that could be the height of the Y axis, so you can easily see calm or busy periods.
At least our observability stack doesn't show this level of detail, but it would be very interesting to have it. (We do have calculated heatmaps based on maximum request time in Grafana, which is at least better than plots of average request times)
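A minimal sketch of reconstructing that per-millisecond view from request logs, with both bucketing choices from the question above (the log format here is made up; a real log would need parsing first):

```python
from collections import Counter

# Hypothetical request log: (start_time_in_seconds, duration_ms).
logs = [(34.682, 532), (34.683, 12), (34.683, 7), (34.999, 250)]

# Bucket by the millisecond in which each request *started* ...
starts_per_ms = Counter(round(start * 1000) for start, _ in logs)

# ... or by the millisecond in which it *completed*.
ends_per_ms = Counter(round(start * 1000) + dur for start, dur in logs)

print(starts_per_ms[34683])  # 2 requests started in that millisecond
```

Either Counter is exactly the distribution of requests per millisecond within a second that the parent wished for; the Gantt-style plot would instead keep each `(start, duration)` pair as a horizontal bar.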
avidiax 1 day ago
When I give system design interviews, candidates that start adding queues reflexively to the design always do poorly.
Queueing is only useful for a few cases, IMO:
* The request is expensive to reject. For example, the inputs to the rejected request also came from expensive requests or operations (like a file upload). So rejecting the request because of load will multiply the load on other parts of the system. You still need backpressure or forwardpressure (autoscaling).
* Losing a request is expensive, delaying the result is not. Usually you want a suitably configured durable queueing system (e.g. Kafka) if you have this scenario.
* A very short queue is acceptable if it's necessary that downstream resources are kept 100% busy. A good example of this is in a router, the output to a slower link might queue 1-2 packets so that there is always something to send, which maximizes throughput.
* If you have very bursty traffic, you can smooth the bursts to fit in your capacity. But this runs the danger of having the queue always full, which you have to manage with load shedding (either automated or manual).
----
An underappreciated queue type is LIFO (last-in, first-out). It sounds unfair, but it holds the median response time steady at the cost of the maximum response time, and it behaves well when full. It fails over into either responding quickly or just rejecting requests when full, so it works well for dealing with bursty traffic.
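A minimal sketch of that LIFO behavior, using a bounded deque (the capacity and request names are made up; shedding from the oldest end is one possible full-buffer policy, not the only one):

```python
from collections import deque

# Hypothetical bounded LIFO buffer: newest requests are served
# first, and the oldest are shed when the buffer overflows.
stack = deque(maxlen=3)  # deque with maxlen drops from the left (oldest)

for r in ["r1", "r2", "r3", "r4", "r5"]:  # burst beyond capacity
    stack.append(r)  # "r1" and "r2" fall off the left, silently shed

served = stack.pop()  # LIFO: serve the newest request first
print(served, list(stack))  # r5 ['r3', 'r4']
```

Note the shed requests here just vanish; a real service should explicitly fail them (timeout or error response) rather than leave clients hanging.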
bluGill 14 hours ago
A short queue is different from a long one. Toyota keeps a box of bolts on the line - a type of queue - instead of ordering them individually as needed. However there should be just enough queue - if it backs up that is a problem.
avidiax 4 hours ago
That sort of queue is fundamentally different from network queues. The individual bolts are not impatient waiting to be installed. There is no P99 bolt-install latency measured from when the bolt arrives in the factory.
You therefore only need enough bolts at each station that they won't run out before the restocker completes a lap, and such that there aren't so many that they get in the way.
thaumasiotes 24 hours ago
> An underappreciated queue type is LIFO (last-in, first-out). It sounds unfair, but it holds the median response time steady at the cost of the maximum response time
Why is that beneficial?
ajb 23 hours ago
Suppose you are a building contractor. You have given start dates for future jobs, but your current job is going to run over the expected time. You can choose between:
1. Slip every job, annoying all of the customers whose jobs are queued up. You get a bad reputation.
2. Move on to the next job on time, and gradually complete the stalled job in the background by sending workers back to it when you have spare capacity (which you should have, because in general you must overestimate or things will go badly wrong).
That customer will now suffer because their job is going to take a multiple of the expected time, but all of the other customers are happy, so your reputation is good.
mhawthorne 13 hours ago
Author here. It's a great analogy.
I cut a section from the post about how optimizing queue selection started out as a technical problem, but transformed into more of a business and ethical problem the more I pondered it.
You're effectively deciding how to distribute suffering across a large group of people.
Comes up in any situation where large metric gains can be accomplished by optimizing for specific groups - recommender and personalization systems are another example.
mankyd 1 day ago
Use a stack? LIFO.
As long as you have capacity to keep it mostly empty, it's fine. When requests back up, at least some people will still get quick responses, instead of making everyone suffer.
thaumasiotes 24 hours ago
For a queue, a backup means that every request (from "now" on, until the end of time) is delayed.
For a stack, a backup means that some requests are informally forgotten, and although they still appear to be open, they will not complete until the end of time.
That's worse. It's a better match to the behavior you want, except for the part where the old requests still appear to be open. You need to actually close them.
You might also want to consider how requesting behavior will change when requests are stacked instead of queued. As soon as people have learned that you keep requests in a stack, the correct way to make a request is to make it, wait for a very small amount of time, and then, if your request hasn't already succeeded, repeat it.
Guess what will happen then?
avidiax 3 hours ago
> You might also want to consider how requesting behavior will change when requests are stacked instead of queued. As soon as people have learned that you keep requests in a stack, the correct way to make a request is to make it, wait for a very small amount of time, and then, if your request hasn't already succeeded, repeat it.
It would be very hard to learn this so long as the queue is a very small fraction of the total throughput. If the queue depth is 100 and you receive 10,000 qps but process 9,900 qps, the queue will fill, and roughly 100 calls per second will go unanswered. Ideally you should have another mechanism to time these out, which most systems do. Whatever queue type you pick, you are going to reject ~1% of the inbound, but with a FIFO queue you will also delay 100% of the responses. Do that at several layers, and you can even end up with the client timing out even though its request was never rejected at any stage.
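A quick per-second sketch of that arithmetic (the 10-second horizon is arbitrary, and this ignores sub-second dynamics; it only checks the steady state):

```python
# Arrivals exceed service by 100 qps, so a depth-100 queue fills in
# about a second, after which ~1% of inbound is rejected each second.
ARRIVALS, SERVICE, DEPTH = 10_000, 9_900, 100

queue_len, rejected = 0, 0
for second in range(10):
    space = DEPTH - queue_len
    excess = ARRIVALS - SERVICE          # 100 requests/s can't be served
    absorbed = min(excess, space)        # queue soaks up what it can
    rejected += excess - absorbed        # the rest is shed
    queue_len += absorbed

print(queue_len, rejected)  # 100 900: ~1% rejection once the queue is full
```

The FIFO delay penalty is the part this sketch doesn't show: every accepted request now waits behind the full 100-deep queue before being served.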
pamcake 20 hours ago
> Guess what will happen then?
All metrics up! Will fit nicely in my promo packet.
andrewstuart 1 day ago
The author speculates about ways to deal with an overloaded queue.
Kingman's formula says that as you approach 100% utilization, waiting times explode.
The correct way to deal with this is bounded queue lengths and backpressure. I.e., don't deal with an overloaded queue; don't allow an overloaded queue in the first place.
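For reference, Kingman's G/G/1 approximation makes the explosion concrete; a tiny sketch (with coefficients of variation and mean service time normalized to 1, so the output is in units of mean service time):

```python
# Kingman's approximation for mean queueing delay in a G/G/1 queue:
#   E[W] ~= (rho / (1 - rho)) * ((ca**2 + cs**2) / 2) * tau
# rho: utilization; ca, cs: coefficients of variation of interarrival
# and service times; tau: mean service time.
def kingman_wait(rho, ca=1.0, cs=1.0, tau=1.0):
    return (rho / (1 - rho)) * ((ca ** 2 + cs ** 2) / 2) * tau

for rho in (0.5, 0.9, 0.99):
    print(rho, kingman_wait(rho))
# Roughly 1, 9, and 99 mean-service-times of waiting: the 1/(1 - rho)
# term is what blows up as utilization approaches 100%.
```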
bluGill 1 day ago
Which is easy to say. I've been trying to debug an overloaded queue for over a week now. (It used to work until I discovered some serious race conditions causing one-in-a-million crashes, and every fix for them so far has cost performance. At least I can detect overload and I'm allowed to toss things from the queue - but we were handling this load before I put the fixes in, and people don't like it when I now reject things from the queue. They want the old performance back, without the races.)
avidiax 1 day ago
I feel you may be adding your critical sections at too high a layer (either in the code or the data structure) if they are severely affecting performance. Look up sharded locks, and totally order them if you must acquire two or more at once.
You may also want to implement reader/writer locks if your load has many more reads than writes.
Unfortunately, nobody really teaches you these things in a really clear way, and plenty of engineers don't fully understand it either.
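A minimal sketch of both suggestions combined - sharded locks plus a total acquisition order (the shard count, key names, and transfer example are all made up for illustration):

```python
import threading

# Hash each key to one of N locks so unrelated keys don't contend
# on a single global lock.
N_SHARDS = 16
locks = [threading.Lock() for _ in range(N_SHARDS)]

def with_two_keys(key_a, key_b, fn):
    """Run fn holding both keys' shard locks, acquired in a total
    order (ascending shard index) so two callers can never deadlock."""
    i, j = sorted((hash(key_a) % N_SHARDS, hash(key_b) % N_SHARDS))
    if i == j:              # both keys map to the same shard
        with locks[i]:
            return fn()
    with locks[i], locks[j]:
        return fn()

balances = {"alice": 10, "bob": 0}

def move_funds():
    balances["alice"] -= 5
    balances["bob"] += 5

with_two_keys("alice", "bob", move_funds)
print(balances)  # {'alice': 5, 'bob': 5}
```

The same-shard branch matters: without it, two keys hashing to the same lock would deadlock on re-acquisition.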
bluGill 14 hours ago
I didn't give nearly enough details for you to speculate like that. What you said applies to some very different queues, but not mine. (What I have is currently lock-free, though this is my third redesign, with different locking strategies for each.)
andrewstuart 1 day ago
Is your queue bounded?
Does it reject entries when service times are too high?
Your debugging effort may become more predictable if the system measures how long workers take to complete each item.
I note you say it used to work while overloaded. I would argue it was probably having hidden problems. Perhaps ask those people what the acceptable service time is, and lock it in by refusing new entries when it is exceeded.
If they want both unbounded queue length and consistently acceptable service times, then you must add enough worker resources to deliver that.
bluGill 16 hours ago
The queue is and was bounded; if it gets too large we log an error and stop processing. It is currently lock-free, but the old version had locks (I've tried several versions with and without locks). The bounds didn't change, but before, it was processing in time even under heavy load; now it isn't.
throwaway290 23 hours ago
Hit with machine-generated art, so awful. Is the rest of it also generated?
[0] https://www.youtube.com/watch?v=lJ8ydIuPFeU
[1] https://bravenewgeek.com/everything-you-know-about-latency-i...