Server restarts and failures
Andrew Hodgkinson (6) 465 posts |
Today’s ROOL site is a lot busier than it was back in 2006, but at its core the same 2006-era applications and service frameworks are still in use. These are now causing us a lot of trouble: memory leaks that have long since been fixed in newer versions, possible security issues and instability under load for the single sign-on mechanism (which turns out to be a particularly weak link in the chain). Other general application issues have surfaced over time too, though I think only the Wiki is particularly problematic.

Statistics back in 2006 showed that 5am UTC was by far the quietest time for access. This holds true today, so we most likely have a (relatively unsurprising) UK / Europe access bias. As a result, the site has always shut down, rotated logs and restarted from clean at 5am every day. This is no longer sufficient and we sometimes see extra failures during the evening, requiring manual restarts to get things going again.

Serious issues with excessive RAM usage are surfacing for the service provider – as I write this, I’m about to restart the web server (very sorry to those people I can see logged on right now!) as the Wiki alone is using over 11% of the entire physical RAM of the shared host hardware.

Updating the applications making up the site is a time-consuming and fiddly process. Site templates have to be modified to fall into the common site style (sidebar structure along with CSS changes); per-application sign-in and user account mechanisms have to be replaced with the single sign-on system, so that users do not have to log in to every site section individually; and data must be migrated from the old application. Then there’s a bunch of testing to do. It’s a task which is rising up the priority list due to the now obvious urgency, but I still don’t have time to fix things just yet.

At this point I can only apologise for the increasingly obvious site stability problems, thank you all for your continued patience and ask you to bear with us for the time being. Rest assured that plans are afoot to sort things out; it’s just a question of finding enough time to make the necessary changes. |
Andrew Hodgkinson (6) 465 posts |
I’ve just finished coordinating with Arachsys to move the ROOL hosted account to a new piece of hardware. It won’t stop the software issues we have with the ageing Rails stack underneath the web site, but it should speed things up a bit. |
Trevor Johnson (329) 1645 posts |
Sorry it’s proving to be a headache but thanks very much for keeping people informed. |