Unreliable web site
Andrew Hodgkinson (6) 465 posts |
Sorry for the frequent downtime the web site has experienced lately. The authorisation component of the site (which handles user login) seems to be failing on a regular basis, but it isn’t yet clear why. I’m investigating the problem, but in the meantime all I can do is restart the service whenever I notice that it has fallen over! Please bear with us… |
Andrew Hodgkinson (6) 465 posts |
At 5:00am GMT each day, the web site services are restarted as part of automated log rotation. Web statistics show that this is roughly the time of lowest traffic for the site. This has worked for months, but it now looks as though something goes wrong with the restart attempt and the Hub DRb server doesn’t come up properly. I don’t have the precise cause or, therefore, a solution, but at least I’m narrowing down the problem. If this is indeed the case then, until I have a real fix, the site will be available for the majority of the day so long as I remember to restart things manually each morning. |
Andrew Hodgkinson (6) 465 posts |
The problem is hopefully now solved, though I won’t know for sure until the server conducts its 5am log rotation tomorrow morning. The scheduled job for the log rotation was originally kicked off from an SSH command line. Our service provider’s server is very reliable and had been running 24/7 for several months, but it was rebooted recently. This meant that the scheduled job was restarted by the ROOL account startup script instead. That script didn’t take into account a few important environment variables that had been set at the SSH prompt. The DRb service relied on these variables and consequently failed. The startup script has been modified to include the relevant variables, though this cannot be tested in anger until the next server restart, which may be several months away. In the meantime the scheduled job has been restarted at the SSH command line, so it ought to function as well as it did originally. |
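For anyone who hits the same kind of failure, here is a minimal sketch of one way to make a missing variable fail loudly at launch time rather than hours later. It is not the actual ROOL startup script. The variable names `HUB_HOME` and `HUB_ENV` and the helpers `missing_vars` and `launch` are hypothetical, since which variables the DRb service really needed is not stated here.

```python
import os
import subprocess

# Hypothetical names: stand-ins for whatever variables the service expects.
REQUIRED_VARS = ("HUB_HOME", "HUB_ENV")


def missing_vars(env, required=REQUIRED_VARS):
    """Return the required variable names that are absent or empty in env."""
    return [name for name in required if not env.get(name)]


def launch(command, env=None, required=REQUIRED_VARS):
    """Start command only if every required variable is present in env.

    Raises RuntimeError naming the missing variables instead of letting
    the child process start in an incomplete environment.
    """
    env = dict(os.environ if env is None else env)
    missing = missing_vars(env, required)
    if missing:
        raise RuntimeError(
            "missing environment variables: " + ", ".join(missing)
        )
    return subprocess.Popen(command, env=env)
```

A startup script could call `launch([...])` with the command for the scheduled job. If a reboot leaves the environment short of something that was only ever set at an interactive prompt, the error names the missing variable straight away.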
Andrew Hodgkinson (6) 465 posts |
The site has survived 21 hours unassisted, including one full log rotation, so it looks like things are back to normal now. |