#engineering · Started by Elena Rodriguez
I noticed the webhook retry logic is hitting a race condition under high load. Working on a fix now.
Good catch. I saw some flaky behavior in the retry queue yesterday too. Let me know if you need a second pair of eyes on the fix.
I can reproduce it consistently with 50+ concurrent requests. The mutex is not being released properly on timeout.
Found it. The issue is in the dequeue step: we need to use a distributed lock instead of an in-memory mutex. PR coming up.
Makes sense. Redis-based lock with TTL should work. We already have the client set up for the cache layer.
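For reference, a rough sketch of what a Redis-based lock with TTL could look like, assuming the client is redis-py (or anything exposing the same `set` and `eval` calls). The helper names `acquire_lock` and `release_lock`, the default TTL, and the polling interval are all placeholders for illustration, not taken from the actual fix.

```python
import time
import uuid

# Lua script: delete the key only if it still holds our token, so a worker
# whose lock already expired cannot release a lock now held by someone else.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


def acquire_lock(client, key, ttl_ms=5000, wait_s=2.0, poll_s=0.05):
    """Try to take the lock; return an ownership token, or None on timeout.

    `client` is assumed to behave like redis.Redis. The defaults here are
    illustrative and would need tuning for the real dequeue step.
    """
    token = uuid.uuid4().hex
    deadline = time.monotonic() + wait_s
    while True:
        # SET key token NX PX ttl: atomically create the key only if it is
        # absent, with an expiry so a crashed holder cannot block forever.
        if client.set(key, token, nx=True, px=ttl_ms):
            return token
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_s)


def release_lock(client, key, token):
    """Release the lock only if we still own it; return True if released."""
    return client.eval(RELEASE_SCRIPT, 1, key, token) == 1
```

The caller would wrap the dequeue in `try` / `finally` so `release_lock` runs even when the work times out, and the TTL covers the case where the process dies before it gets there. The TTL needs to be comfortably longer than the slowest expected dequeue, otherwise two workers can end up holding the lock at once.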
Agreed. I will add a Grafana panel to track lock acquisition latency once you merge.
PR #852 is up. Added tests for the concurrent case. Can one of you review?