Slow Health Check

Jeff Evans

10 Nov, 2017 04:50 PM

What happens in a health check between the "Running Tentacle" message and the Calamari message? On most, but not all, of our tentacles the health check takes a very long time. These machines are also very slow at executing steps during a deployment. We can see on the server's tentacle page that there is active connectivity, but for some reason the tentacles take their time responding to commands. These are all polling tentacles. The slow ones are all in AWS; we noticed no delay with the tentacles running in our data center.

03:14:37 Verbose | Performing health check on machine
03:14:40 Info | Host Name: EUPRDSBOTENG02
03:14:40 Info | Running As: eu\SYSTEM (Local Administrator: True)
03:14:40 Info | Running Tentacle version 3.8.8
03:18:14 Info | Running latest version of Calamari: 3.6.43
03:18:19 Info | Drive C: has 27 GB available
03:18:19 Info | Drive D: has 10 GB available
03:18:19 Info | Drive P: has 991 MB available
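
To make the delay in the log above easy to spot, here is a minimal Python sketch that finds the largest gap between consecutive log lines. It assumes the `HH:MM:SS Level | Message` layout shown above and that the log does not cross midnight; the function names `parse_line` and `largest_gap` are made up for this illustration and are not part of Octopus or Tentacle.

```python
from datetime import datetime

# The health check output quoted above, copied verbatim.
LOG = """\
03:14:37 Verbose | Performing health check on machine
03:14:40 Info | Host Name: EUPRDSBOTENG02
03:14:40 Info | Running As: eu\\SYSTEM (Local Administrator: True)
03:14:40 Info | Running Tentacle version 3.8.8
03:18:14 Info | Running latest version of Calamari: 3.6.43
03:18:19 Info | Drive C: has 27 GB available
03:18:19 Info | Drive D: has 10 GB available
03:18:19 Info | Drive P: has 991 MB available
"""


def parse_line(line):
    """Split 'HH:MM:SS Level | Message' into (time, level, message)."""
    stamp, _, rest = line.partition(" ")
    level, _, message = rest.partition("|")
    return datetime.strptime(stamp, "%H:%M:%S"), level.strip(), message.strip()


def largest_gap(log_text):
    """Return (seconds, message_before, message_after) for the biggest gap.

    Assumption: timestamps are in order and do not wrap past midnight.
    """
    entries = [parse_line(l) for l in log_text.splitlines() if l.strip()]
    best = None
    for (t0, _, m0), (t1, _, m1) in zip(entries, entries[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, m0, m1)
    return best


if __name__ == "__main__":
    seconds, before, after = largest_gap(LOG)
    print(f"{seconds:.0f}s between '{before}' and '{after}'")
```

For this log the largest gap is the 3 minutes 34 seconds between the "Running Tentacle" line and the Calamari line.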

  1. Support Staff 1 Posted by Henrik Andersso... on 12 Nov, 2017 11:35 PM

    Hi Jeff,

    Thanks for getting in touch.

    All that happens between those two output lines is that we search through the Calamari folder to check if the server is running the latest version of Calamari.

    As these are polling Tentacles, it could be that the search takes that little bit more time (depending on disk IO) and it misses one of the polling intervals.

    The Tentacles that are in your own data center, are they listening Tentacles (I'm assuming so)?

    I hope that helps.

    Thank you and best regards,
    Henrik

  2. 2 Posted by Evans, Jeff on 13 Nov, 2017 02:49 AM

    Thanks for getting back to me, Henrik. All of our tentacles are polling, even in our data center. We found that disabling UAC and rebooting them made the health checks and deployments go much faster, though I’m not sure why UAC would matter.

    We saw the problem in our deployments to EC2 instances in the eu-west-1 and ap-southeast-2 regions. We’ve been deploying to EC2 instances in us-east-2 for several months without any problems.

    We had tried simply rebooting one of the slow tentacles, but it didn’t seem to help. So it does look like UAC was involved but again I’m not sure why.

  3. Support Staff 3 Posted by Henrik Andersso... on 13 Nov, 2017 06:58 AM

    To confirm, does EUPRDSBOTENG02 still have UAC enabled, or did you disable it on that server and then the health checks ran quickly again? If that server still has UAC enabled, could you disable it, reboot the server, re-run a health check on it, and confirm whether the health check runs quickly again?

    I've been unable to replicate the issue with a new EC2 setup (but this doesn't mean much as I don't know your exact configurations).

    Thank you and best regards,
    Henrik

  4. 4 Posted by Jeff Evans on 14 Nov, 2017 06:33 PM

    We’re not able to modify the production servers outside of maintenance windows, so I can’t disable UAC on EUPRDSBOTENG02. I can, however, create an AMI from it, and when I launch an instance from that AMI, the clone responds to health checks nearly instantly. That suggests the problem is not UAC related, but it raises more questions.

    I installed Sysinternals Process Monitor and recorded activity from the tentacle. I saw it write a bootstrap.ps1 file containing the health check PowerShell code. When I copied that code and ran it myself, it ran instantly, so I don’t think the Calamari version check is slow. It also looked like the health check spawned powershell.exe, which exited quickly. So it seems there’s a huge delay between when the health check code finishes and when the result is returned to the server, even though I can see communication happening between the tentacle and the server during that time.

    Is there a way to increase the logging the tentacle does to see what's going on during the health check? Or could this somehow be an issue on the server?

  5. Support Staff 5 Posted by Henrik Andersso... on 14 Nov, 2017 11:20 PM

    Hi Jeff,

    No problem, I understand that making changes to production servers needs to be scheduled.

    This page shows how you can increase the amount of logging the Tentacle does to its log file (not logging to the task log of the health check), and this might tell us a bit more about what it's waiting on that is causing the delay.

    Thank you,
    Henrik

  6. 6 Posted by Jeff Evans on 14 Nov, 2017 11:51 PM

    Hey Henrik,

    I enabled Trace level logging on both tentacles: the one with the delay and the clone that is fast. I didn't see anything obvious, other than the slow one logging more IScriptService::GetStatus messages. Anything else I can try?

    I've attached one file containing both log files.
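
One way to put a number on the difference described above is to count the `IScriptService::GetStatus` lines in each Trace log. The Python sketch below is only an illustration: the helper names and the sample lines in the usage block are invented for the example, and the real log layout may differ from what is assumed here (one message per line, with the marker text appearing somewhere in the line).

```python
def count_marker(log_lines, marker="IScriptService::GetStatus"):
    """Count the lines that mention the marker text."""
    return sum(1 for line in log_lines if marker in line)


def compare_logs(slow_lines, fast_lines, marker="IScriptService::GetStatus"):
    """Return the marker count for each log so the two can be compared."""
    return {
        "slow": count_marker(slow_lines, marker),
        "fast": count_marker(fast_lines, marker),
    }


if __name__ == "__main__":
    # Hypothetical sample lines; real Tentacle Trace logs will look different.
    slow = [
        "TRACE IScriptService::GetStatus",
        "TRACE IScriptService::GetStatus",
        "TRACE IScriptService::GetStatus",
        "TRACE IScriptService::CompleteScript",
    ]
    fast = [
        "TRACE IScriptService::GetStatus",
        "TRACE IScriptService::CompleteScript",
    ]
    print(compare_logs(slow, fast))
```

To use it on real files, read each log with `open(path, encoding="utf-8")` and pass the file objects in place of the lists.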

  7. Support Staff 7 Posted by Henrik Andersso... on 15 Nov, 2017 03:37 AM

    Hey Jeff,

    Thanks for sending through the extra information.

    It looks like the slow Tentacle is configured to use a proxy (while the fast Tentacle isn't).

    2017-10-17 19:03:29.8812      7  INFO  Agent configured to use the system proxy, but no system proxy is configured for https://deploy.ncrsmb.com:10943/
    

    I wonder if this could have something to do with the slowness of that Tentacle?

    Thanks,
    Henrik

  8. 8 Posted by Jeff Evans on 15 Nov, 2017 09:13 PM

    Hi Henrik,

    I set the fast tentacle to use the proxy, and it was still fast. So, I don't think that was it.

    Next, I wondered if the underlying hardware might be too old or slow, so I took a slow tentacle out of its load balancer, shut it down, and then started it up again. I believe this usually causes it to launch on different hardware. Once the tentacle had started up, I tested it and it passed the health check instantly! I did the same for the other slow tentacles, and now they're all running the health check quickly. I can't say for sure that the underlying hardware was to blame (since the OS didn't seem to indicate it had any perf issues), but I have no other guess. These instances hadn't been restarted since June 2016, so perhaps they were due for a restart anyway.

    I'd like to suggest adding more diagnostics to the health check on the tentacle. It may have helped to be able to narrow down the problem to a particular part of the health check.

    Thanks for your help with this. I really appreciate it!
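
The stop-then-start cycle described above can be scripted. The following is a minimal Python sketch, assuming the boto3 EC2 client and an EBS-backed instance; the function name and the instance ID in the usage comment are placeholders, and removing the instance from its load balancer first (as was done above) is not shown.

```python
def stop_start_instance(ec2_client, instance_id):
    """Fully stop an instance, then start it again.

    A full stop followed by a start (as opposed to an OS-level reboot)
    is what gives the instance a chance to come up on different
    underlying hardware.
    """
    ec2_client.stop_instances(InstanceIds=[instance_id])
    # Block until the instance reports the 'stopped' state.
    ec2_client.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2_client.start_instances(InstanceIds=[instance_id])
    # Block until the instance reports the 'running' state.
    ec2_client.get_waiter("instance_running").wait(InstanceIds=[instance_id])


# Usage (requires boto3 and AWS credentials; the ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="eu-west-1")
#   stop_start_instance(ec2, "i-0123456789abcdef0")
```

Taking the client as a parameter keeps the function easy to exercise with a stub before pointing it at production instances.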

  9. Support Staff 9 Posted by Henrik Andersso... on 16 Nov, 2017 08:57 AM

    Hi Jeff,

    Good to hear that you managed to sort out the slow performance and thanks for letting me know.

    Thank you and best regards,
    Henrik
