Thanks for getting back to me, Henrick. All of our tentacles are polling, even in our data center. We found that disabling UAC and rebooting them made the health checks and deployments go much faster, though I’m not sure why UAC would matter.
We saw the problem in our deployments to EC2 instances in the eu-west-1 and ap-southeast-2 regions. We’ve been deploying to EC2 instances in us-east-2 for several months without any problems.
We had tried simply rebooting one of the slow tentacles, but it didn’t seem to help. So it does look like UAC was involved, but again, I’m not sure why.
To confirm, does EUPRDSBOTENG02 still have UAC enabled, or did you disable it on that server, after which the health checks ran quickly again? If that server still has UAC enabled, could you disable it, reboot the server, then re-run a health check on it and confirm whether the health check runs quickly again?
I've been unable to replicate the issue with a new EC2 setup (but this doesn't mean much as I don't know your exact configurations).
We’re not able to modify the production servers outside of maintenance windows, so I can’t disable UAC on EUPRDSBOTENG02. I can, however, create an AMI from it, and when I launch an instance from that AMI, the clone responds to health checks nearly instantly. That suggests the problem is not UAC-related, but it just raises more questions.
I installed Sysinternals Process Monitor and recorded activity from the tentacle. From there I saw it writing a bootstrap.ps1 file containing the health check PowerShell code. When I copied that code and ran it myself, it ran instantly, so I don’t think the Calamari version check is slow. It also looked like the health check spawned powershell.exe, which exited quickly. So it seems there’s a huge delay between when the health check code finishes and when the result is returned to the server, even though I can see communication happening between the tentacle and the server during that time.
Is there a way to increase the logging the tentacle does to see what's going on during the health check? Or could this somehow be an issue on the server?
No problem, I understand that changes to production servers need to be scheduled.
This page shows how you can increase the amount of logging the Tentacle writes to its own log file (as opposed to the task log of the health check). This might tell us a bit more about what it's waiting on that is causing the delay.
I enabled Trace level logging on both tentacles: the one with the delay and the clone that is fast. I didn't see anything obvious, other than the slow one logging more IScriptService::GetStatus messages. Anything else I can try?
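For anyone wanting to make the same comparison between two Trace logs, here is a rough sketch in Python. It only counts how many lines in each log mention a given message name. The IScriptService::GetStatus name comes from the logs described above; the function names and the idea of passing in two log files are assumptions for illustration, not anything Tentacle provides.

```python
# The message name is taken from the thread; everything else is hypothetical.
DEFAULT_MARKERS = ("IScriptService::GetStatus",)


def count_markers(lines, markers=DEFAULT_MARKERS):
    """Return {marker: number of lines that contain it}."""
    counts = {marker: 0 for marker in markers}
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


def compare_logs(slow_lines, fast_lines, markers=DEFAULT_MARKERS):
    """Return {marker: (count_in_slow_log, count_in_fast_log)}."""
    slow = count_markers(slow_lines, markers)
    fast = count_markers(fast_lines, markers)
    return {marker: (slow[marker], fast[marker]) for marker in markers}


def read_lines(path):
    """Read a log file, replacing undecodable bytes instead of failing."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.readlines()
```

Calling compare_logs(read_lines(slow_path), read_lines(fast_path)) with the paths of the two log files gives one pair of counts per message name. A much larger first number would be consistent with the server polling the slow tentacle for status many more times, though it would not say why.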
I set the fast tentacle to use the proxy, and it was still fast. So, I don't think that was it.
Next, I wondered if the underlying hardware might be too old or slow, so I took a slow tentacle out of its load balancer, shut it down, then started it up again. I believe this usually causes it to launch on different hardware. Once the tentacle had started up, I tested it and it passed the health check instantly! I did the same for the other slow tentacles, and now they're all running the health check quickly. I can't say for sure that the underlying hardware was to blame (the OS didn't indicate any performance issues), but I have no other guess. These instances hadn't been restarted since June 2016, so perhaps they were due for a restart anyway.
I'd like to suggest adding more diagnostics to the health check on the tentacle. It would have helped to be able to narrow the problem down to a particular part of the health check.
Thanks for your help with this. I really appreciate it!