The slowdowns and 504 errors on some queries originated in our proxy layer.
Proxy response times increased because the proxies were running short of memory.
We do not yet have enough information to explain the sudden increase in memory usage.
It began around 18:40 and ended around 00:40.
Our current monitoring did not catch the problem because it monitors processes as a group, per machine.
However, only a few of the processes on a machine were affected by the problem.
The second factor is that the affected processes were not impacted to the same degree or at the same time.
We will make our monitoring more reliable so that it can better detect an individual faulty process within the cluster.
Once a faulty process is detected, we can then exclude it from the cluster.
In addition, we will check for memory leaks at the proxy level.
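
To illustrate why group-level monitoring can miss this kind of problem, here is a minimal sketch in Python. It is not our actual monitoring code. The process names, the memory figures, the 90% threshold, and the assumption that the group check works on a per-machine average are all hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical alert threshold: fraction of a process's memory limit in use.
THRESHOLD = 0.90


def group_alert(usages, threshold=THRESHOLD):
    """Group-level check: alert only if the machine-wide average is too high."""
    return mean(usages.values()) > threshold


def faulty_processes(usages, threshold=THRESHOLD):
    """Per-process check: return the names of processes above the threshold."""
    return sorted(name for name, used in usages.items() if used > threshold)


def healthy_processes(usages, threshold=THRESHOLD):
    """Processes that would stay in the cluster once faulty ones are excluded."""
    faulty = set(faulty_processes(usages, threshold))
    return sorted(name for name in usages if name not in faulty)


# Made-up memory usage for the proxy processes of one machine.
sample = {"proxy-1": 0.35, "proxy-2": 0.97, "proxy-3": 0.40, "proxy-4": 0.38}

if __name__ == "__main__":
    print("group alert:", group_alert(sample))
    print("faulty:", faulty_processes(sample))
    print("kept in cluster:", healthy_processes(sample))
```

With this sample, one process is close to its memory limit while the machine-wide average stays well below the threshold, so the group-level check stays silent and only the per-process check flags the process to exclude.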