RISC OS Search Engine
Pages: 1 2
Rebecca Shalfield (2257) 18 posts |
Hi All, Work on the RISC OS Search Engine has been continuing apace throughout January. With 1.2 million URLs still to be processed, I have endeavoured to get a handle on what these URLs actually are, how many domains they derive from and how many have strikes against them. The Spidering page now features an Activity section to monitor what is being added to each collection, which has allowed me to get to the bottom of the duplicate insertions. I now also assign a priority between 0 and 99 to each newly added URL so that they are processed in a sensible order, with .zip files, riscos.xml files, syndicated feeds and software-specific ones gaining 0, 1, 2 and 3 respectively. To assist with separating genuine RISC OS-related URLs from the ‘noise’, I have added what I am calling the Crowd Spidering (after Crowd Sourcing) page, where the priority of all pages from a common domain can be promoted/demoted en masse. The code extracting the hrefs from HTML has been rewritten and now uses a Python library. Due to a separation of housekeeping and spidering duties, the number of URLs scanned per day has doubled, although this has had the side effect of increasing the overall unprocessed count. The books page’s search results have been altered to start moving away from the standard table format. Also, ‘edition’ has been added to the XML code for books. The generic search has been experimentally updated to allow “…” “…” searches. I have added an alternative way to search should autocomplete not work in your chosen browser. The autocomplete code has also been rationalised. The display format for dates has been improved. I have also taken ‘ownership’ of a Virtual Private Server onto which I wish to clone the entire site, and to get to grips with separating the spidering duties between two machines as a precursor to taking Steve up on his kind offer. That’s all for now. I’ll report back in another month. Rebecca |
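For anyone curious how the priority scheme and href extraction described in the post above might fit together, here is a minimal sketch. It is not the Search Engine's actual code: the post does not name the Python library used, so the standard library's `html.parser` stands in for it, and the function names, the feed and software tests, and the default priority of 50 are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

DEFAULT_PRIORITY = 50  # assumption: the post gives no default, only the 0-99 range


def assign_priority(url):
    """Return a priority from 0 (process first) to 99, following the post's ordering."""
    path = urlparse(url).path.lower()
    if path.endswith(".zip"):
        return 0
    if path.endswith("riscos.xml"):
        return 1
    # Assumption: a crude test for syndicated feeds.
    if path.endswith((".rss", ".atom")) or "/feed" in path:
        return 2
    # Assumption: "software-specific" is guessed from the path alone.
    if "software" in path:
        return 3
    return DEFAULT_PRIORITY


class HrefExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page's own URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(urljoin(self.base_url, value))


def extract_hrefs(html, base_url):
    parser = HrefExtractor(base_url)
    parser.feed(html)
    return parser.hrefs


def queue_order(urls):
    """Sort URLs so that the lowest priority number is processed first."""
    return sorted(urls, key=assign_priority)
```

Calling `queue_order(extract_hrefs(page_html, page_url))` would then give the links found on a page in the order they would be spidered under these assumptions.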
Rebecca Shalfield (2257) 18 posts |
Hi All, Well, another month has been and gone and the RISC OS Search Engine has now been cloned onto my VPS at http://185.30.213.186. An Application Programmer’s Interface has also been implemented of which a part, the JSON-formatted synchronisation mechanism, is working successfully, happily copying data between the two mirrors. Although http://www.shalfield.com/riscos is now pointing to the VPS, where only limited data is currently available, the mirror on my home machine at http://84.92.157.78/riscos, where you’ll find details of all 1471 RISC OS applications discovered so far, can still be reached via the menu at the top. Rebecca |
Dave Higton (1515) 3497 posts |
Hi Rebecca, Thank you for your efforts with your RISC OS Search Engine. I think it will be a very valuable resource. There’s an oddity, though, that I think you might like to take a look at. It struck me when I read the headline “Portsmouth show reminder” for “next Saturday, the 28th of September” that you tag the items with the date you put them onto your page (presumably), but there is no mention of the original date of the posting. So we have no way to see how out of date the information is, short of clicking the link you give. |
Rebecca Shalfield (2257) 18 posts |
Hi All, I have been away from RISC OS for a while (4 years) whilst I developed a Kanban-based project management system named Kanbanara using the same underlying technology (Python, CherryPy, MongoDB) as the RISC OS Search Engine. I parked the RISC OS Search Engine back in 2014 because I was very dissatisfied with its spidering – it seemed to suffer from “telomeres-in-reverse”, where URLs would just grow longer and longer, effectively clogging up the spidering process and “attacking” RISC OS websites with unnecessary requests. Since building my new RISC OS computer based around a Titanium board, and with the open sourcing of the OS itself, I have a renewed interest in RISC OS. As a result, I resurrected the RISC OS Search Engine a couple of months ago. The code has been rewritten to work with Python 3 rather than Python 2. The spidering has been examined in greater detail and now works a whole lot better. It now has details of 2397 RISC OS applications. I am currently reworking the web UI component. I am not yet in a position to officially announce anything except to reassure others that such a scheme is being actively worked upon. |
Stevyn Gadd (2272) 63 posts |
Good to hear you’ve been tempted back to the world of RISC OS. Looking forward to trying it. |
Rebecca Shalfield (2257) 18 posts |
Hi All, Looking at the entries in the RISC OS Products Directory, which I manually assembled back in 2003-ish, you will see that it featured details of 2923 RISC OS products. I am pleased to announce that the new-improved RISC OS Search Engine spidering process has now exceeded that figure, despite some 436 thousand URLs still to be processed. The RISC OS Search Engine currently knows about 3167 RISC OS applications, many of which can be downloaded from multiple sources throughout the world wide web. As the web-based front end is still under re-development, I shall email out today a static web page containing details of these 3167 RISC OS applications to the various RISC OS portals in the hope that such information can be disseminated as widely as possible. |
Rebecca Shalfield (2257) 18 posts |
Hi All, I was hoping to give you an update on the RISC OS Search Engine project a few weeks ago, but the failure of the power supply on my home Windows PC put paid to that and to any thoughts of working on its web site over Christmas. Anyway, with the power supply now replaced, my home PC is back up and running. The development of a RISC OS Search Engine is proving something of a challenge, but at least the spidering is now under better control; so much so that I now want to rewrite the spidering process to work in quite a different way. Basically, I have two database collections, one holding the successful RISC OS-related URLs and the other a backlog of URLs still to be processed. When a RISC OS-related entry reaches its scheduled rescan time, it is simply removed from the RISC OS database and placed in the backlog of URLs, where it may remain unrescanned for some time depending on the length of the backlog (on my home PC, the backlog is 173382 in length); as a result, such valid entries are no longer visible to the end user. As I am quite likely to break what I currently have in the course of such a rewrite, and possibly would not have a working system for a few months, I thought you might like to now see, and benefit from, what I have achieved so far. This evening, I set up a fresh RISC OS Search Engine installation on a spare VPS, seeded it with all the successful RISC OS-related URLs I currently have, then set it in motion to observe what might happen. On my home PC, the database is reporting 3498 RISC OS applications. Hopefully, this new installation will be up to that number in a few days. The UI is still very much incomplete but heading in what I hope is the right direction. If viewed from a current RISC OS browser such as NetSurf, it is going to look terrible, being written in HTML5, CSS3 and using d3.js graphics. Hopefully, it looks and performs better on one of the new but unreleased RISC OS browsers, to which I do not have access.
The results can be seen at http://151.80.169.242:22676/ Rebecca |
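To illustrate the rescan problem described in the post above, here is a small sketch using plain Python dicts and lists in place of the two MongoDB collections. The first function mimics the behaviour as described (the entry is moved to the backlog and so disappears from view); the second is only one possible alternative, not necessarily the rewrite Rebecca has in mind. All names, including the `rescan_pending` flag, are hypothetical.

```python
def requeue_by_moving(found, backlog, url):
    """Behaviour described in the post: the entry leaves the visible
    collection and waits in the backlog until it is rescanned."""
    entry = found.pop(url)
    backlog.append(url)
    return entry


def requeue_in_place(found, backlog, url):
    """Hypothetical alternative: queue the URL for rescanning but keep the
    existing entry visible, merely flagged as awaiting a rescan."""
    found[url]["rescan_pending"] = True
    backlog.append(url)


def visible_urls(found):
    """URLs an end user could currently find."""
    return sorted(found)
```

With the second approach, the length of the backlog only affects how stale an entry might be, not whether it can be found at all.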
Paolo Fabio Zaino (28) 1855 posts |
Nice job! :) |
Andrew McCarthy (3688) 605 posts |
What a nice surprise for 2021! I’m impressed with how you’ve handled problematic websites that might turn up in the “Latest Web Pages” by providing a “Goto” button that you can hover over to see where it might take you. :) |
Andreas Skyman (8677) 170 posts |
Very nice! I’m not sure how the backend works, but there may be some tags on GitHub that your spiders might enjoy munching on, e.g. https://github.com/topics/riscos and https://github.com/topics/riscos-ci |
Paul M (4167) 12 posts |
That’s a really impressive piece of work. A great idea and looks to be coming along really well. Great to see something like this being available to the RISC OS community. |
Rebecca Shalfield (2257) 18 posts |
The spidering seems to have progressed far quicker than I ever expected, which is great. We are already up to 3487 apps on the new installation at http://151.80.169.242:22676/. Only 23 more required until it is at the same level as my home installation. Since the weekend, I have tidied up a few UI features such as module listings related to apps: one of the benefits of working from home! I have also added a new “Module Dependencies” page that displays all the apps dependent on a given module – try this yourself with “UtilityModule”. I am wondering if I also need to be able to give the module’s version, such as “3.10” or “3.50”. |
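As a rough illustration of the “Module Dependencies” idea in the post above, a reverse index from module name to dependent apps could look something like the sketch below. The data layout, the function names and the optional exact-match version filter are assumptions for illustration only; the real page may well work quite differently.

```python
from collections import defaultdict


def build_dependents_index(apps):
    """Map each module name to the (app, version) pairs that depend on it.

    `apps` maps an app name to a list of (module_name, version) tuples;
    this layout is hypothetical.
    """
    index = defaultdict(list)
    for app_name, modules in apps.items():
        for module_name, version in modules:
            index[module_name].append((app_name, version))
    return index


def dependents(index, module, version=None):
    """Return the apps depending on `module`, optionally only those that
    name exactly the given version string (e.g. "3.10")."""
    hits = index.get(module, [])
    if version is not None:
        hits = [hit for hit in hits if hit[1] == version]
    return sorted(name for name, _ in hits)
```

Leaving `version` as `None` gives the behaviour the page has now; passing a version narrows the list, which is roughly what adding a version field to the search would amount to under these assumptions.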
John Rickman (71) 645 posts |
First class! Best resource finder since Paul Vigay’s stuff. Don’t know what category but you should do well in the Riscository awards. |
Raik (463) 2059 posts |
Nice project :-) I can’t find any German websites, such as www.gag.de, www.riscos.berlin, http://forum.acorn.de… A German user reports “Strange activity in my area …” … so it looks like the content was collected, but I could not find it in your engine. Thanks a lot. |
Rebecca Shalfield (2257) 18 posts |
Things were running so well that something had to go wrong. Basically, my VPS had managed to both terminate the Python scripts and lock me out, rendering me unable to restart them. My VPS provider felt the best solution to resolve the situation was to give me a fresh VPS. Unfortunately, this is running “Chad Valley OS” otherwise known as Windows Server 2012 R2, which I utterly refuse to struggle with, not to mention that it is too old to install MongoDB. I have raised a helpdesk request as I am additionally unable to change the ISO image for the OS to something later. In the meantime, I shall continue to develop my home installation of the RISC OS Search Engine. Hopefully, a VPS will be up and running in a few days. Having 3518 RISC OS apps, I was intrigued to know which year had the most releases or updates. Assuming the following image appears at your end, you’ll be able to see the fruits of my labour. |
Rick Murray (539) 13806 posts |
Sorry, you’ve linked to a Windows filename… If you need to drop an image in a hurry and you aren’t worried about copyright, try imgur.com.
Well, there’s a hard call. I think that the year with the most app development is likely to be something like 1994 or 1995 when the RiscPC was shiny and new and the RISC OS market was vibrant enough to have a lot of software updated to run on the machine. However, this is pre-Internet. It may be that, for you, you peg 2012 as a year with a lot of activity. Hope you get your VPS sorted soon. |
Rebecca Shalfield (2257) 18 posts |
The web-based UI of the RISC OS Search Engine is nearing completion. Unfortunately, I am unable to show you an experimental version on my VPS as we have now parted company. Basically, the original VPS corrupted itself and could not be revived. I initially rejected its replacement due to its OS – the awful Windows Server 2012 R2 – but managed to get it set up with an older version of the MongoDB software I needed to use. The VPS was running fine late last night but had been suspended by this morning due to CPU abuse. As the spidering and UI are basically complete, are there any ideas from RISCOSOpen or RISCOSDev going forward? It would be a shame for the RISC OS Community to lose this resource, but it is now obvious to me that although I can run it quite successfully on my home PC and domestic broadband account, a beefy-enough VPS is going to fall outside my budget range. |
Stefan Fröhling (7826) 167 posts |
Hello Rebecca, |
Rebecca Shalfield (2257) 18 posts |
Hi Stefan, |
Steve Revill (20) 1361 posts |
Hi Rebecca. I’ll raise this topic next time ROOL and RISC OS Developments have one of their regular coordination calls. |