Web site work underway
Pages: 1 2
Andrew Hodgkinson (6) 465 posts |
Related to Web site faults: I’m doing work on the software architecture that supports our web site. At ROOL we have two parallel “trees” for this. One is our live tree, that you’re looking at now. The other is a duplicate copy for development. All testing usually takes place there, without disturbing the live site. Unfortunately the nature of our current site problems produces a severe resource shortage on the server side. This means that I can’t run much of the development tree and live tree concurrently, as it completely exhausts server resources and Very Bad Things Happen. As a result, I’m sometimes forced to bring down the live site while I test development stuff. You will see sporadic outages. In general these are likely to be in the late evening/overnight/early morning UK time, as I’m doing this work from a +12 hour time zone. Apologies for any inconvenience. If all goes well, the replacement architecture should be up and running live in a few days – there are unlikely to be any obvious user-facing changes, apart from generally better reliability, maybe a bit more speed, photos ;-) and better request capacity overall. |
Andrew Hodgkinson (6) 465 posts |
As part of this I’m about to bring down the bug tracker, bounty section and source code viewer, but I’ll leave other stuff intact as much as I can. UK time as I write this is about 11pm. |
Colin (478) 2433 posts |
Darn there goes the bedtime reading. Suppose I’ll have to resort to counting sheep. 1… 2… 236843… 236844… |
Rick Murray (539) 13851 posts |
…and something that looks like Textile but, like, works? |
Andrew Hodgkinson (6) 465 posts |
An ironic comment, given that it includes a quote and italics. Textile is glitchy but works OK; and, more to the point, it’s deeply embedded in every Rails component we use. The source code is all there in SVN – once that part of the site is back up, feel free to download it all, work out and test patches to replace Textile with something else, add migrations for all our data to convert from Textile to something else or include backwards compatibility with existing Textile data – then submit patches. OTOH I suspect we can both find better things to do with the kind of time that’d take! |
Steve Pampling (1551) 8172 posts |
https://www.riscosopen.org/viewer/revisions 404 not found |
Dave Higton (1515) 3534 posts |
What would be needed to solve the resource shortage? Would it be anything that would benefit from crowd funding, e.g. a bounty or a kickstarter-equivalent? |
Dave Higton (1515) 3534 posts |
Rick, I suggest taking this over to the Wish List. And when you do, be specific about what features you want. You may be the one who has to rewrite the parser to add the features. |
Chris Gransden (337) 1207 posts |
The CVS viewer is one of the things that is disabled. If you go to https://www.riscosopen.org/viewer then it redirects. |
Chris Hall (132) 3559 posts |
The CVS viewer is one of the things that is disabled. How long for please? |
Andrew Hodgkinson (6) 465 posts |
Unhelpfully the best I can say is “as long as it takes”; I don’t want to commit to anything as there could be all manner of last-minute problems. With that disclaimer out of the way – I’m at the stage where the development tree seems to be functioning correctly and I’m preparing to move it over into live deployment tomorrow. By Friday morning UK time, it’ll either be running or reverted! There’s simply no way to know how it’ll behave under load without making it live and loading it. You can use direct CVS access if you need to look at source code.
The problem is one of spawning excessive processes. There are two sides to it; one is just the way that Apache, Passenger and PostgreSQL interact in more recent incarnations to significantly increase the number of process slots they take up. We could ask Arachsys for a wider allocation and I’m pretty sure we’d get it, but the other issue is occasional uncontrolled spawning of httpd instances. This would seem to be an Apache bug but diagnosis has proved difficult/impossible and there’s no fix available TTBOMK at present. Basically, when we reach a certain load threshold (which isn’t a constant) with all Rails applications up, Apache goes nuts and spawns httpd instances until the server keels over. We can’t even SSH in to fix it because there aren’t enough process slots left, so we have to go cap-in-hand to the overall sysadmin.
Given that Apache is very heavyweight, but our original server, LigHTTPd, was too lightweight, I’ve decided to move to Nginx (“Engine X”). This is small, fast and supported by Passenger for theoretically robust Rails hosting but it has a number of shortcomings, not least its lack of support for standard CGI. Only FCGI is possible. I’ve set up an FCGI<->CGI wrapper process with a couple of worker threads and that should prove sufficient but it’s an unknown quantity amongst many other unknown quantities. It’ll be invoked whenever anyone’s using the web CVS viewer front-end or when ROOL staff are looking at site statistics; the former means it could get quite heavily hammered during automated web crawls or fuzzing attacks.
I’d have reasonable confidence in the new infrastructure were it not for a couple of occasions in development when Passenger seems to have suffered a run-away consumption of file handles leading to file handle exhaustion. This is very odd.
There were complaints about such behaviour in the version 4 release candidates but the bug in question was supposed to be fixed and it’s now at version 4.0.14, a long way from that first major new version release, with no other reports of recurrence. I’ve only seen it doing that kind of thing at the point where concurrently trying to run the live and development trees (even with half of live “switched off”) was causing process slot shortages though, so I’m hopeful that the two are related and once we’re up and running as a live-only deployment, it should operate correctly. All a bit wing-and-prayer but that seems to be the state of open source these days, especially when compiling on a rather heavily customised virtual host with all manner of quirks compared to more mainstream Linux distributions around today. |
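To illustrate why a wrapper with only “a couple of worker threads” keeps process usage bounded, here is a minimal Python sketch of a fixed-size worker pool that runs CGI-style programs. It is not the wrapper used on the site and does not speak the FastCGI protocol; the names `run_cgi`, `serve_all` and `MAX_WORKERS` are made up for this example. The point is only that the pool size caps how many child processes exist at once, however many requests queue up.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumption for illustration: two workers, echoing "a couple of worker threads".
MAX_WORKERS = 2


def run_cgi(argv, timeout=30):
    """Run one CGI-style program and return (exit code, stdout bytes).

    Raises subprocess.TimeoutExpired if the program runs too long.
    """
    result = subprocess.run(argv, capture_output=True, timeout=timeout)
    return result.returncode, result.stdout


def serve_all(requests, max_workers=MAX_WORKERS):
    """Run every request, with at most max_workers child processes alive at once.

    `requests` is a list of argv lists. Results come back in request order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_cgi, requests))
```

Under a heavy crawl the extra requests simply wait in the pool’s queue rather than each spawning a process, which trades latency for a predictable ceiling on process slots.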
Rick Murray (539) 13851 posts |
Dave:
Probably the same as everybody else – for the parser to act in a consistent manner. Textile isn’t a bad thing, but its behaviour depends so much on phase-of-the-moon situations. Plus, the bug reporter seems to have a different concept of Textile commands than this forum (I used the code highlight in my report, epic fail).
Oh, don’t hold your breath. I did download the forum code that I was pointed to, but I gave up trying to work out how it links together. I’m a guy that expects main() followed by clearly defined functions that get called. Or, BASIC-like, begin at the beginning and call functions. All I saw with the forum code was a bunch of disparate functions that didn’t seem to tie in to anything else [I wanted to see if I could fix that edit-returns-to-page-one thing].
Andrew:
I’ve heard of this before. My long-time-ago host used to keep an active ssh session running to tidy up the mess that happened when Apache would go crazy and make loads of processes that were like a plague of zombies. I don’t know if he ever tracked down why. There was no Rails, it was a fairly standard lamp setup.
It’s a shame it isn’t really feasible to write a reject filter that will only serve a page to an IP address every ‘x’ seconds, where ‘x’ increases the more you look at without a gap. I wrote something for my blog that did this (and stopped Yandex.ru in its tracks, hehe) but the management of IP addresses and times started to get to the stage where I imagined the processing of that would be taking more time than just serving up the content. But, then, I wasn’t using MySQL or anything of that nature…
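A minimal sketch of the kind of reject filter described above, in Python and kept entirely in memory. Everything here is an assumption for illustration: the class name `BackoffFilter`, the doubling factor, the delay limits and the `quiet_period` after which an address is forgiven. The `prune` method is one possible answer to the bookkeeping worry, since it throws away addresses that have gone quiet.

```python
import time


class BackoffFilter:
    """Per-IP filter: the gap required between pages grows with each page served."""

    def __init__(self, base_delay=1.0, factor=2.0, max_delay=60.0,
                 quiet_period=300.0, clock=time.monotonic):
        self.base_delay = base_delay      # first required gap, in seconds
        self.factor = factor              # growth of the gap per page served
        self.max_delay = max_delay        # upper bound on the required gap
        self.quiet_period = quiet_period  # idle time after which history is forgotten
        self._clock = clock
        self._state = {}                  # ip -> (time last served, required gap)

    def allow(self, ip):
        """Return True if a page may be served to this IP right now."""
        now = self._clock()
        entry = self._state.get(ip)
        if entry is None or now - entry[0] >= self.quiet_period:
            # New visitor, or one who left a long gap: start from the base delay.
            self._state[ip] = (now, self.base_delay)
            return True
        last, delay = entry
        if now - last >= delay:
            # Waited long enough: serve, but demand a longer gap next time.
            self._state[ip] = (now, min(delay * self.factor, self.max_delay))
            return True
        return False  # Too soon: reject and leave the state untouched.

    def prune(self):
        """Forget IPs idle for at least quiet_period; return how many were dropped."""
        now = self._clock()
        stale = [ip for ip, (last, _) in self._state.items()
                 if now - last >= self.quiet_period]
        for ip in stale:
            del self._state[ip]
        return len(stale)
```

A human reader who pauses between pages never notices the growing gap, while a crawler fetching back to back is soon limited to one page per `max_delay` seconds. Whether the lookup really costs less than serving the page would still need measuring on the real server.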
As opposed to the lovely days of running a personal server with Microsoft’s mickey-mouse server on Windows95? Oh, you left a pointer to some files served off a CD-ROM? The CD isn’t there? Well, that server is dead. Windows just blue-screened because the ATAPI device was not ready… At least with open source you can look for the fault yourself, or hope it bites the ass of somebody who has the resources to get it sorted. This is surely better than a mega corp ignoring your first twenty messages, and then saying “you’re an idiot, there is no problem with our product”. Or, maybe, “you’re holding it wrong”… :-) |
Andrew Hodgkinson (6) 465 posts |
“It’s open source. When it breaks, you get to keep both pieces”. Nginx is up – it was a struggle to get it going – fingers firmly crossed. I am aware of the SSL error on the home page right now; this will be fixed soon (I need to make code changes to support SSL certificate chains in our back-end, now that the site is always forced to HTTPS). Some users may be annoyed at the HTTPS insistence but it made life MUCH easier in the Nginx configuration and is altogether more secure. It also solves the “Hub says I’m not logged in” confusion that can occur when accidentally falling through to non-HTTPS URLs. |
Andrew Hodgkinson (6) 465 posts |
Now resolved. If all goes well, there should be no site glitches / error messages, so please use this thread or the bug tracker to report things that go wrong which are “new”. Subversion has been updated and the static web site packages are being exported as I type. |
Colin (478) 2433 posts |
The pages don’t finish resolving. When I read a page, it appears – and the display looks ok – but the progress bar that shows the page being fetched stops part-way and only completes a while later. It’s as if there is a fetch on every page which times out. |
Frank de Bruijn (160) 228 posts |
I’m not seeing that here (using Firefox on Linux). Everything seems quite a bit faster as well. |
Andrew Hodgkinson (6) 465 posts |
Try Reload in your browser if you see issues. There may be stray bits of cached data floating around. Incidentally I’ve just rolled out a couple of long-requested changes to the forum – just minor tweaks but nice to have – you can select the number of posts per page (25/50/100) and I’ve finally got around to fixing that really annoying “editing your post throws you back into the first page of a thread” fault. Both are only evident/useful when the current posts-per-page setting means that more than one page is available. Otherwise, per-page stuff is hidden. |
Trevor Johnson (329) 1645 posts |
Both[1] work a treat. Thanks, Andrew :-)
[1] Post-note… except if you’ve selected to show, e.g. 100 posts for a multi-page thread and then edit your post on that particular page: you’re dropped back to page 1 when it reverts to 25 posts :-| (This may also cause an issue if, e.g. new post no. 26 (by A.N.Other) increases the no. of pages during the time that a different previous post is being edited – but this is a minor issue and it’s all still a huge improvement.) |
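The page a post lands on after an edit depends on both its position in the thread and the posts-per-page setting in force, so the two have to travel together. A minimal Python sketch of that arithmetic follows; the function names are hypothetical and this is not the forum’s actual Rails code. Only the 25/50/100 choices and the default of 25 come from the thread.

```python
import math

ALLOWED_PER_PAGE = (25, 50, 100)  # the choices mentioned in the thread
DEFAULT_PER_PAGE = 25


def normalise_per_page(value):
    """Turn a raw request parameter into one of the allowed page sizes."""
    try:
        n = int(value)
    except (TypeError, ValueError):
        return DEFAULT_PER_PAGE
    return n if n in ALLOWED_PER_PAGE else DEFAULT_PER_PAGE


def page_for_post(position, per_page):
    """Page number (1-based) holding the post at 1-based `position` in the thread."""
    if position < 1:
        raise ValueError("position is 1-based")
    return (position - 1) // per_page + 1


def page_count(total_posts, per_page):
    """Number of pages in a thread; an empty thread still has one page."""
    return max(1, math.ceil(total_posts / per_page))
```

Working the page out from the post’s position when the redirect is built, rather than remembering the page number from before the edit, would also cope with the case where someone else’s new post changes the page count in the meantime.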
Colin (478) 2433 posts |
The only browser I have a progress bar on is Safari on my iPad, so I only notice it there – the page appears but the progress bar pauses for a while (what seems like a timeout) before finally completing. I’ve cleared its data but it made no difference. It may be a clue to something if you have problems. |
Andrew Hodgkinson (6) 465 posts |
I don’t see that but I can check on the iPad to compare later. It doesn’t sound like you have any issues with specific errors though, which is something. |
Steve Pampling (1551) 8172 posts |
“Entertaining” quirk: 1. Click the next page link at the bottom of the 50 post page – you move to page 2 (of 3) and the 50 post setting is retained. Clicking next again takes you to page 3 of 3 2. Click on the page 2 link and you are on page 2 of 6 with the 25 post default setting. Clearly the code for moving forward differs between the two links. Hope that’s clear enough to assist the diagnosis. |
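One way to rule out this class of mismatch is to build every pager link, “next” and numbered alike, through a single helper that always carries the posts-per-page setting. A minimal Python sketch follows; the query parameter names `page` and `per_page`, the path used in the usage note and both function names are assumptions for illustration, not taken from the forum source.

```python
from urllib.parse import urlencode


def topic_page_url(base_path, page, per_page, default_per_page=25):
    """Build the URL for one page of a thread, keeping a non-default page size."""
    params = {}
    if page > 1:
        params["page"] = page
    if per_page != default_per_page:
        params["per_page"] = per_page
    query = urlencode(params)
    return f"{base_path}?{query}" if query else base_path


def pager_links(base_path, current, total_pages, per_page):
    """Return previous, next and numbered links, all built by the same helper."""
    def url(n):
        return topic_page_url(base_path, n, per_page)

    return {
        "prev": url(current - 1) if current > 1 else None,
        "next": url(current + 1) if current < total_pages else None,
        "pages": [url(n) for n in range(1, total_pages + 1)],
    }
```

For example, `pager_links("/forum/topics/1", 2, 3, 50)` gives numbered links and a “next” link that all include `per_page=50`, so clicking either kind keeps the reader on the 50-post view.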
Rick Murray (539) 13851 posts |
First, a quirk. Steve posted this: “1 “Mostly” being some abitary value between 1% and 99%”. In the line above is the footnote link. Following it in “Recent posts” takes you to an entirely different post. :-)
Andrew…
Okay, I get the same thing. The base page content loads, but it will pause for around sixty seconds before finishing and displaying the Log in / Account link. Whatever happens between the content and the account links is what is taking time. This does appear, however, to be a Safari specific issue. Dolphin, on the same iPad, does not pause. So maybe Safari is looking for some Apple specific bumph? |
Andrew Hodgkinson (6) 465 posts |
So:
Ticket #353 is annoying; the forum isn’t very well written really. Little in the way of centralised routines to stop all this sort of every-bit-of-code-requires-review-on-every-change fault cropping up. If you spot other issues then gold stars may be awarded for bugs submitted in the tracker (Severity – your call, Part: “Web site: Beast…”, Release: “3rd public site release”, Milestone: “3rd public site release completed”), with a link to the bug mentioned here – thanks. Not essential, but saves me the bother of entering the faults into the database myself. Open tickets on that milestone are at: |
Rick Murray (539) 13851 posts |
Your message begins: |
Steve Pampling (1551) 8172 posts |
When Rick mentioned it, it occurred to me that my inserting a link reference of 1 in brackets isn’t really distinguishable from Rick inserting the same reference number. Users manually adding their own prefix couldn’t be relied upon and the forum software probably doesn’t allow for an auto add of the posting reference. What was that reference about open source, broken and you get to keep both bits? |