RISC OS Search Engine
Rebecca Shalfield (2257) 18 posts |
Hi All, A very happy 2014 to one and all! A while back, someone mentioned their frustration with generic search engines when searching for RISC OS software with generic names such as ‘!Internet’. Even the arrival of all those new users as a result of RISC OS appearing for the Raspberry Pi is not going to make it any easier for us to search for such-named software via any of the generic search engines. In the RISC OS world today, there are a few RISC OS software links database web sites remaining. Where a site is only ever updated by its developer, the data is long since out of date and the site all but abandoned. My own RISC OS Products Directory web site from 2001 is a prime example and if I had any control over it, it would have disappeared long since. Those web sites that are updated via form submission have proved more successful but even these are not updated by all developers all the time, and, to the best of my knowledge, none of these sites actively encourage their data to be shared. During the last decade, a few people, including myself, have tried to create the ultimate RISC OS software database/directory/search engine based on one of two approaches – either the distributed XML file method or by analysing the directory structure of RISC OS software, be it installed locally or in an archive file downloadable via the Web. Unbeknown to me until years later, the earliest suggestion for the distributed XML file method (codenamed “Poogle”) was by our very own Jeffrey Lee way back in May 2004! Drobe webmaster, Ian Hawkins, attempted the latter with The RISC OS File Repository in August 2005, although I never got to see it as it disappeared soon after. The first question people always asked was why the developer was bothering when we already had a number of adequate systems, and all new ones were seen as so advanced that their aims would never be reached and they would surely disappear in the next two years anyway. 
My early attempts at implementing such systems in 2005-2007 (namely “RiscOsNet”, “Peer-To-Peer RISC OS Software Sharing” and “The Framework for a Distributed RISC OS Encyclopedia” for those that don’t remember) were met with total apathy by most and mild hostility, some of it totally justified I might add, by a few, and such implementations probably did not even reach their first anniversary. Sometimes an announcement was accompanied by a request for assistance or collaboration but no such help ever came. Back then, I eventually learnt how not to implement such a system, but the complexity of what I personally wanted to achieve meant I could not just take an off-the-shelf content management system and an existing XML framework and simply customise it for use with RISC OS. In retrospect, it seems it was perfectly OK to suggest such a system, but you had to be ready for the onslaught should you actually produce a prototype. I guess I learnt the hard way that what was a prototype or proof of concept to me was perceived as a fully polished system by others. Along the way I dabbled with an alternative approach, semantic databases, which unfortunately proved a total disaster as RONWUG will testify; in fact, that system was never publicly announced; only vanity caused me to mention its existence! Having secured a better job and sensing very little outward evidence of anything actively going on in the RISC OS world at that time, I basically walked away from RISC OS circa 2008-2009. Whilst away, licking my wounds, I improved my skills in the areas where they were lacking, so much so that six months ago I decided to resurrect both of the above ideas by rescuing a small amount of code from my earlier attempts, amalgamating them into just the one system and totally reimplementing everything else that was required. In fact, it would be more correct to say that I needed to wait until now for suitable technology to come my way. 
The arrival of RISC OS on the Raspberry Pi helped enormously in my decision to return. The RISC OS Search Engine, as my latest project has become known, appeared on the web some six months back (and was commented on by Theo Markettos in mid-August) and has been steadily gathering information ever since. The spidering and analysis of RISC OS software within archive files implemented in The RISC OS Search Engine has already automatically located 1423 distinct applications from 6638 .zip files, with over 1.2 million URLs still to be processed. A video of the RISC OS Search Engine as it was on 9th November, when it was basically only capable of spidering the Web and analysing .zip files with none of its XML capabilities, can be viewed at: One mistake I made back in 2007 was to just devise an XML structure and then fail to implement anything that actually processed such user XML code. The RISC OS Search Engine has, in the last few months, evolved into a full implementation of “riscosnet – The Framework for a Distributed RISC OS Encyclopedia” I arrogantly devised way back in January 2007. Although I suspect that the XML structure will not initially be complete or to everyone’s liking (in fact it isn’t even the same as I suggested back then), please do realise that it is the best starting point for such a system we have ever had and it should not prove too problematic to tweak at this early stage. The most up-to-date implementation (and yes it is still a prototype hosted on my home PC) can be seen running at: The beauty of the RISC OS Distributed Information Model and RISC OS Markup Language approach is that any RISC OS and/or web developer is free to extract whatever information they require from the various riscos.xml files scattered around the Internet. In addition to web sites utilising such technology to populate themselves and hopefully never become out of date, I am sure such data could be put to a number of other uses such as developing the PRMs. 
In order to facilitate easier adoption of the RISC OS Markup Language, all you need to do is search for a record similar to one you wish to create yourself and then simply steal the displayed XML code and modify it as required. You then just need to place such code in a riscos.xml file on your own web site and make its existence known to the RISC OS Search Engine. You’ll need to log on to submit a URL, although the engine should be able to ascertain this for itself if other pages on the same web site have already been scanned. In the future, I am hopeful that there will be a number of Python-based web sites and Desktop applications based entirely or partially on the RISC OS Search Engine source code. For this project to succeed, we need others in the RISC OS Community to take up the gauntlet and create a network of such sites sharing not only data between them but also the enormous task of spidering. With the Raspberry Pi, RISC OS has been handed a second bite of the cherry (to totally mix up my fruits), so I sincerely hope that the RISC OS Distributed Information Model and RISC OS Markup Language can become a major contributor to our success this time around. As the above text has evolved into more of a history lesson, the overall aims of this latest project can be read at: Being just a prototype running on a home PC via broadband, the spidering task alone is proving too much so don’t expect too much in the way of coverage or speed. The system is having to visit way more URLs than actually end up in the database. The introduction page gives on-the-fly statistics for when the site expects to have finished processing the 1.2 million URLs still outstanding and it’s suggesting years rather than months. I am sure by now that you have searched for your own or your favourite piece of RISC OS software and the system has failed to return any results. This can be overcome if enough people adopt the RISC OS Distributed Information Model or come forward with offers of help. 
On one hand, it is just an exercise for me in developing such a system, on the other it could prove invaluable. If and when the time comes, I’ll be more than happy to clone it onto a Virtual Private Server. Just in case things go horribly wrong, I have placed the source code up at GitHub (https://github.com/RebeccaShalfield/RISCOSSearchEngine) should apathy or annoyance prove too much that I abandon the RISC OS Community again! I guess this is the point where Jeffrey Lee assassinates me and steals my ideas (assuming I didn’t actually steal his in the first place), although I hope that this is seen as a noteworthy attempt by someone other than Jeffrey Lee to develop such a system and that we’ll all be happy to work together on developing it further. Note to Jeffrey Lee: I look forward to seeing Poogle on a web server sometime real soon! Rebecca Shalfield P.S. Did I mention the free advertising? |
Jeffrey Lee (213) 6048 posts |
I remembered the idea, but I forgot the name – probably for the best. What you’ve created so far looks very impressive – keep up the good work! |
Steve Pampling (1551) 8172 posts |
Surely that’s a case of talking to Aaron and/or Dave, as one or the other (or both) have the key to the door on riscos.com IIRC. It would certainly help – I’ve seen posts in Pi related forums as well as Linux ones asking about a repository site for RISC OS software and bemoaning dead links. |
Rick Murray (539) 13850 posts |
Part of the problem is Google tends to strip out punctuation. Plus the increasing tendency to spam search results with rubbish (or am I the only person who has noticed Google’s results getting less and less pertinent?). What sometimes helps is to write a phrase like: risc os internet configuration Compare: https://www.google.co.uk/#q=!internet useless!
Yup. I wonder if what needs to be set up is a public wiki, where people can add their own products. Having said that – there is an alternative approach underway. That of “packaging”, projects such as !Store. This has a long way to go, but the ultimate aim would appear to be to have all the software right there with no need to go searching. Ambitious, but not impossible…
And therein lies a big factor. If data was shareable, it could be supplemented and augmented rather than everybody reinventing the wheel each time.
Why do we bother maintaining RISC OS when we could easily just switch to running Debian? Maybe because we want to ? Maybe because the developer can see something in their offering that they don’t believe the current ones do, or do correctly? Does there need to be a specific reason? ;-)
Ah, I have a “blip” between early 2002 and late 2010 where I just “wasn’t” on-line. At all. Well, okay, I lie. I had about half an hour a week, sometimes, at a local library when their temperamental ISDN was working (and the librarian didn’t kick us, paying subscribers, off so her son could play games on the computer…).
Nothing new there then. Just read some of the back postings here about the change of DisplayManager – to either get rid of it, or worse to put it on the other side of the icon bar. :-P
If I might suggest – do not make public announcements too early. Get your prototype running and then make some vague-ish postings and see if anybody shows interest. If so, try to get them involved in private to evolve and expand the system. Alternatively, plaster a gigantic “proof of concept” emblem all over the… er… proof of concept. For those who still don’t get it. :-)
Mmm, I’m hoping you will be one of many to feel that way. While RISC OS is lacking compared to modern systems, it has its own attributes that make it enjoyable to use and blindingly quick on modern machines.
Just out of interest, can it support Spark and ArcFS archives? You probably won’t find much modern ArcFS, but Spark may still exist. Just out of interest #2, is it capable of determining if a product is likely to be 32 bit safe? For modules and absolutes, those can be determined by simply parsing the header. This might help separate between new software and older stuff that may present more problem with the newer generation of machines.
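For modules, the header check suggested above might look something like the following. This is only a sketch, assuming the extended module header layout in which the word at offset &30 holds the offset of a flags word and bit 0 of that word marks the module as 32-bit compatible; the function name is made up for illustration and is not taken from the search engine's source.

```python
import struct

# Offset within the module header of the word that points at the flags word
# (assumed layout: the 13th header word, i.e. &30)
FLAGS_OFFSET_FIELD = 0x30


def module_claims_32bit(data: bytes) -> bool:
    """Return True if a relocatable module's header flags it as 32-bit compatible.

    This only reports what the header claims; it does not inspect the code.
    """
    if len(data) < FLAGS_OFFSET_FIELD + 4:
        return False  # header too short to carry the extended field
    (flags_offset,) = struct.unpack_from("<I", data, FLAGS_OFFSET_FIELD)
    # The offset must be non-zero, word-aligned and inside the file
    if flags_offset == 0 or flags_offset % 4 or flags_offset + 4 > len(data):
        return False
    (flags,) = struct.unpack_from("<I", data, flags_offset)
    return bool(flags & 1)
```

A result of True would at best mean "likely to be 32 bit safe": the flag says what the author declared, not what the code actually does.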
The beauty of XML is that it is extensible. So, for instance, if you like my “this appears to be 32 bit okay” suggestion, it can be wired in without needing to rewrite loads of stuff and throwing away all the old data.
Does the XML file have a datestamp within? If not, could it be added? This way, if there are multiple locations for the riscos.xml file, one can determine the “age” of the file automatically.
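To illustrate the datestamp idea, here is a rough sketch of how a reader might pick the freshest of several copies of a riscos.xml file. The element name `date`, the root element name and the example URLs are all invented for the example; the real RISC OS Markup Language may well name and place this differently. Where no date is embedded, the sketch falls back to a date supplied by the caller, such as the file's own timestamp.

```python
import xml.etree.ElementTree as ET
from datetime import date


def record_date(xml_text: str, fallback: date) -> date:
    """Return the date embedded in the XML, or the fallback if none is usable."""
    root = ET.fromstring(xml_text)
    stamp = root.findtext("date")  # hypothetical element name
    if stamp:
        try:
            return date.fromisoformat(stamp.strip())
        except ValueError:
            pass  # unparseable date: treat as absent
    return fallback


def newest(copies):
    """copies: iterable of (url, xml_text, file_date); return the freshest URL."""
    return max(copies, key=lambda c: record_date(c[1], c[2]))[0]
```

An embedded date survives mirroring and re-uploading, which a file timestamp does not, so it seems the safer of the two to trust when both are present.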
Just something I noticed. There appear to be a number of dupes. I searched for “teletext” and was pleased that my site came out on top (how do you order results, incidentally?).
“We are averaging 625 URL scans a day; it will therefore take us 1959 days to plough through the backlog!” – wow, your connection must be even slower than my 2mbit! ;-) Alternatively, that is approximately one every two minutes. Are you doing this because your ISP has a dumb poorly disclosed FUP/AUP?
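The quoted figures can be checked with a few lines of arithmetic. This is just a worked calculation using the two numbers from the site's statistics; nothing else is assumed.

```python
SCANS_PER_DAY = 625   # from the site's quoted statistics
DAYS_QUOTED = 1959    # from the site's quoted statistics

# Size of the backlog implied by the two figures
backlog = SCANS_PER_DAY * DAYS_QUOTED

# Average gap between scans, in seconds
seconds_per_scan = 24 * 60 * 60 / SCANS_PER_DAY

# The same backlog expressed in years
years = DAYS_QUOTED / 365.25

print(f"backlog: {backlog} URLs")
print(f"one scan every {seconds_per_scan / 60:.1f} minutes")
print(f"about {years:.1f} years at the current rate")
```

The implied backlog is a little over 1.2 million URLs, which matches the figure given in the opening post, and the average gap works out at a bit over two minutes per scan.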
Actually, it has. “teletext” worked, and “ovationpro” worked. I’m not sure, however, that I like linking directly to zip files. It might be more useful to link to the page that links to the zip file. In that way, people can read the description, licence (if described), and any important notes (like this software needs such and such). I feel that direct linking to archives might be useful for power users that know what they are looking for and can spot it when they see it; however for less experienced users, attempting to install random things might be less useful. Furthermore – direct linking to archives is useless for people who embed version numbers in archive filenames. You will be linking to a static entity which may (or may not) be the latest version. I don’t tend to look for things like this any more (less free time to do so) so there may be many I have failed to encounter which will be returning 404s (because I don’t tend to keep more than the current version and the previous one on my site, the rest are deleted). Even though VeroDes has a really slow update cycle (the latest is from April 2011), you can see that a site linking directly is already out of date! [but, it seems that this is now superfluous, CircuitSector has gone and its domain is now parked – never mind, it is only a line of PHP…]
I think he is somewhat busy with the GraphicsV modifications right now. Still, maybe the assassination/theft task could be outsourced…? (^_^) Best wishes for 2014, and good luck with this project! |
WPB (1391) 352 posts |
Rebecca, this looks great.
Could you elaborate on this a little? Having had the briefest of looks at the website, I couldn’t find any small example xml files on which I might be able to base one of my own. Also, how do you submit a URL? (Sorry if it’s blindingly obvious – I’m a bit tired!) Happy 2014! |
Steve Pampling (1551) 8172 posts |
The learning algorithm in Google learnt already that people tend to be dumb and only look at the first few links, so Google put adverts there. You can of course do more generic queries, repeatedly, and then repeatedly select the same set of links that you know go where you want. It learns and promotes your favoured links and ones like them.
The author probably didn’t know the other “wheel” existed. VNC: Peter N vs. Thomas M and friends. Number 1 upset and removes his version. Now we have no development on VNC and an unstable instance with variable behaviour on different systems and setups.
For most people the ROL/Castle “war” was a bit much, in my case it coincided with a major buildings move and decade of frenetic work by an under-resourced group (currently just me, but the wife pulls me away from work these days. Forces me to visit pubs etc – dreadful it is…)
:-P and :-P Rebecca:
Some people are born with a tact bypass and their constructive critique is phrased badly.
Or someone put on a 32 bit style header without properly checking the innards. I could put a new 32-bit flagged header onto the ATAFSDriver module but that wouldn’t deal with the 51 obvious instances of non-32 bit safe code. Might make starting the RPC (number 1) a bit interesting.
Or the parsing is taking a while for each page/page set. Rebecca did comment on it running on a PC rather than a dedicated server. |
GavinWraith (26) 1563 posts |
Welcome back Rebecca. I hope that you have read “Del Rigor en la Ciencia” or, failing that, |
Rick Murray (539) 13850 posts |
…assuming that you didn’t make a habit of logging out and running a utility that auto-obfuscates tracking cookies. I am under no illusions that Google doesn’t know all about me; but I am not planning on making it easy. The breadcrumbs get swept up, and blacklisting helps the analytics stay out of the picture. By way of incidental aside, I looked for “Cherisha Software” as I vaguely remember the name. It works better putting the phrase in quotes, but the non-quoted example goes to show how skewed searching for stuff is getting these days. Okay, it found the website, but it took rather more than it should to find anything about Cherisha. Oh, and looking for “review of 2013” found so much superficial bollocks it wasn’t funny. Best and worst celebrities of 2013, 2013 in rugby, NHL bests of 2013, best cars of 2013, what we liked on telly in 2013.
Why? [note the emphasis]
Well, yes, I heard bits of that. I would ask who “won” the pointless war, but I’m not certain there is much point. We have a less glossy version of RISC OS that is seeing active development, plus a glossy version that sort of died and is only being kept on minimal life support to aid in sales of a related product, that being the emulator. Logic would say that we need him onboard. Partly so we can benefit from some of the snazzy stuff that RISC OS 4/6 already does (if licence permits it, that is?) and partly so that he can benefit from the continuing advances in development of RISC OS 5. The alternative? Drift further and further apart. Or, to put it another way – if I am developing something and RISC OS 5 offers an easier way to do it, I will use that and not pay much attention to previous versions of RISC OS if doing so presents an obstacle. RISC OS 5 is available for IOMD machines. Anything older is prehistoric. Funny thing is, compromise is a hard concept to sell.
That’s why I emphasised likely to be safe. Without making the engine able to understand ARM code (including working out what is data and what is code), it will only be a guideline. But given that this is a great big obstacle, the 26 bit vs 32 bit – some of which might be really hard to explain to newbies – some indication might be helpful. Certainly, a huge chunk of older code likely plain won’t work. As in… http://arcade.demon.co.uk/filepages/findex.htm
Probably we have different ideas about the interpretation of those words. To me, a dedicated server is a box in a room with a hundred others. Like the co-lo box serving my site. Probably like the one running this site. If I was developing a search engine on a PC, I’d set it up, hook it to the router, then leave it to get on with it…using a different machine for my own personal use. |
Rebecca Shalfield (2257) 18 posts |
It is certainly a good idea to ask 4QD to take down the RISC OS Products Directory but only once this new project has proved successful. In all the time reimplementing this RISC OS Search Engine, I followed development of the RISC OS packaging project and heard from those in favour of it and those against. If a developer is against it, and simply wishes to place their software on their web site as they have always done, how does the central packaging server ever know about such software? In the Linux world, even some of the big American software companies refuse to use the standard package manager available for a particular Linux distribution; perhaps it is because there are so many such package managers that it is easier to simply ignore them all. I’ve learnt my lesson about not making public announcements too early, hence the reason why I emailed RISC OS Open/Theo Markettos way back in August. I am using a Python library to unzip the .zip files. I am not aware of such a library for Spark and ArcFS files so even though I have collected knowledge of such files, the files themselves are not currently being analysed. It would certainly be possible in the future to develop another part of the overall system to read such files on RISC OS itself and repackage them as zip files if this proves the best method. Although the program checks the flag of a relocatable module to see if it’s 26-bit or 32-bit, it doesn’t currently do so for anything else. There is already an The default date assigned when reading in an XML file is the timestamp of the file itself unless overridden by something (e.g. The RISC OS Search Engine tries very hard to prevent duplicates being added to the system in the first place. A huge number of duplicates appeared around 23/24 Nov when I had to rescan all apps due to a change in the MongoDB data structure and I attempted to speed up the spidering by running multiple copies of the spidering script and their randomness in what to scan next was not at all random. 
The duplicates should naturally disappear after a year when the URL is rescanned. Normally the records are displayed in the order MongoDB returns them to me. Sometimes, the date of the record may be taken into account. There are a few records in the RISC OS Search Engine that are nothing to do with RISC OS. They have almost certainly been indexed because they contain one or more specific keywords (e.g. “RISC OS”, “Iyonix” etc.). A branch in the spidering is abandoned upon scanning a URL without one of the specific keywords. The spidering is slow because my home PC only runs at 1.3 GHz and performs database housekeeping duties during the day, scanning with all its might only during the night when my ISP allows unlimited traffic. The spidering script is also running on the same machine as that hosting the web server. Although I do link directly to zip files, a hyperlink to the parent URL if known is always shown alongside. To assist myself in developing the site, whenever a search returns just the one record, its XML and JSON code are displayed underneath. If more than one record is returned, you can click on its ‘XML’ button to display the corresponding XML code. The URL submission occurs in the footer and is disabled until you log on. |
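The branch-abandoning rule described above could be sketched roughly as follows. This is not the search engine's actual code: the keyword list contains only the two examples named in the post, `fetch` is a hypothetical callable standing in for whatever downloads and parses a page, and the real spider presumably picks its next URL quite differently.

```python
from collections import deque

# Only the two example keywords given in the post; the real list is longer
KEYWORDS = ("risc os", "iyonix")


def is_relevant(text: str) -> bool:
    """True if the page text mentions at least one of the keywords."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def crawl(start, fetch):
    """Breadth-first crawl; fetch(url) must return (page_text, list_of_links).

    Returns the URLs judged relevant, in the order they were visited.
    """
    seen = {start}
    kept = []
    queue = deque([start])
    while queue:
        url = queue.popleft()
        text, links = fetch(url)
        if not is_relevant(text):
            continue  # abandon this branch: its links are never queued
        kept.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept
```

A rule like this would also explain the unrelated records: any page that happens to mention one of the keywords is kept and followed, whatever it is really about.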
Steve Pampling (1551) 8172 posts |
Worth a question to David Pilling I would have thought, he works on non-RO stuff as well so the libraries may be available. Note the presence of the read-only SparkFS in RO disc content. I’d be tempted to buy the full version if I didn’t already have it. |
Steve Pampling (1551) 8172 posts |
Rebecca – typing FAT into your search offers !FatPaint, but clicking search gives a “no records found”. Have I misunderstood something? |
Steve Pampling (1551) 8172 posts |
It learns and promotes your favoured links and ones like them.
You are confusing the “drop the tracker on your machine” element with the “stored in Google” element. I’ve deliberately searched for items I needed and repeatedly selected the most promising from several IP sources (the connections at my desk are quite interesting – I can appear on two different BT internet lines, NHSNet and Warwick University without moving).
Cherisha software with no quotes gave me the list you quoted. I ignore the adverts anyway. Although clicking on the link gave a blocked redirect (ghostery on my laptop) VNC: Peter N vs. Thomas M and friends. Number 1 upset and removes his version. Er, if I just say Peter and leave it at that? For most people the ROL/Castle “war” was a bit much, No one wins wars, there are losers and casualties. In this case it seems rather like two losers and multiple casualties.
Strangely I think Castle did pretty much the best they could to calm things by stepping away.
Have you rummaged through that stuff and checked how much needs no alteration or minimal alteration?
Me too. Then if I needed help financing a suitable server or would prefer someone else to host the whole thing I’d mention it could do with a proper server to run on… oh… :) |
WPB (1391) 352 posts |
I’ve yet to find a result with an XML button – can you give a few examples? Most of them are obviously non-XML spider results. |
Martin Avison (27) 1494 posts |
There is no obvious XML button that I can see. But overall, the site shows great promise! |
Steve Pampling (1551) 8172 posts |
The search results buttons are in the far right column. Rebecca: you might want to look at that, and also the column order – placing things like the buttons early in the list. |
Martin Avison (27) 1494 posts |
What display are you looking at? After a search I am on the ‘Generic Search’ page, with no columns or XML buttons, so I seem to be looking in the wrong place. A description of how to set up one of these mysterious XML files would be useful, when time becomes available! |
WPB (1391) 352 posts |
Wot Martin said. ;) I don’t see any columns, either, let alone juicy XML buttons. @Martin – I wonder if !Rover could be updated to talk to Rebecca’s system. Seems like the two would go well together. |
Steve Pampling (1551) 8172 posts |
I was about to do a walk through and the setup seems to have changed. |
Rebecca Shalfield (2257) 18 posts |
The program is currently showing results in ‘Report’ mode even though the filter states ‘Table’ mode should be the default. Try setting the filter’s View attribute to ‘Table’ mode (and then submitting) whereupon the results should be displayed as a table complete with an XML button. The default was a change I made recently and obviously I needed to view my own site from work to see what was going wrong. My apologies. |
Steve Pampling (1551) 8172 posts |
Er, “It is a test” No apologies due I think. |
WPB (1391) 352 posts |
Yes, that works, and wow – totally different perspective on the database. It’s phenomenal. I like the idea of grabbing !Help text. I wonder if there’s a clever way to follow !Help files that are really Obey files that run an HTML file or StrongHelp file, etc. That would be really neat. Again, great job! |
Steve Revill (20) 1361 posts |
@Rebecca: ROOL has a cloud server running 24/7 which is, for the most part, not doing anything. It spends maybe 33% of its time doing the nightly autobuilds. If installing your spidering stuff on that is fairly straightforward, maybe we could do that and accelerate that process a bit? |
Steve Revill (20) 1361 posts |
I like the whole search engine but I’ve not yet thought about the XML/markup side of it. The aesthetics could do with some work :) but that’s probably bottom of the pile right now. On a personal note (with my 7th software hat on), it doesn’t do a very good job of processing MoreDesk, presumably because the zipfile just contains an installer application. Within that are the various bits and bobs (other applications, including !MoreDesk itself) that get installed. If your processing recursed into the subdirectories of the zipfile to find applications it might help – I’ve a sneaking suspicion you are doing that but stopping at the first application; that might explain why the MoreDesk search returns !ConfiX. I don’t know if you’re already spidering from whatever the database !PackMan uses, but that might be a good place to get URIs from. Another source might be to ask for a list from RComp (!Store) – they may say “no” for some reason but it seems like it’d be worth asking to me. |
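The recursion suggested above might be sketched like this, using Python's standard zipfile module. It is a guess at an approach, not the engine's real logic: it treats any directory whose name starts with "!" as an application, at whatever depth it sits in the archive, and the archive layout in the usage note is invented for illustration.

```python
import io
import zipfile


def find_applications(zip_bytes: bytes) -> list:
    """Return the path of every '!App' directory in the archive, at any depth."""
    apps = set()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            parts = name.split("/")
            # The last part is a file name (or '' for a directory entry),
            # so only the parts before it can be directories
            for depth, part in enumerate(parts[:-1]):
                if part.startswith("!"):
                    apps.add("/".join(parts[: depth + 1]))
    return sorted(apps)
```

For an archive containing, say, `!Install/Apps/!MoreDesk/!Run` and `!Install/Apps/!ConfiX/!Run`, this reports the installer and both nested applications rather than stopping at the first one found.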
Steve Pampling (1551) 8172 posts |
R-Comp might ask to be able to use the database to populate some element of Store, and PackMan could usefully pick up info from the database. Two way street… |
Steve Revill (20) 1361 posts |
Bump. |