RISC OS Open: Forum: RISC OS Search Engine

Feb 1, 2014 12:01pm

Rebecca Shalfield (2257) 18 posts

Hi All,

Work on the RISC OS Search Engine has been continuing apace throughout January.

With 1.2 million URLs still to be processed, I have endeavoured to get a handle on what these URLs actually are, how many domains they derive from and how many have strikes against them. The Spidering page now features an Activity section to monitor what is being added to each collection, which has allowed me to get to the bottom of the duplicate insertions. I now also assign a priority between 0 and 99 to each newly added URL so as to process them in a sensible order, with .zip files, riscos.xml files, syndicated feeds and software-specific ones gaining 0, 1, 2 and 3 respectively. To assist with separating genuine RISC OS-related URLs from the ‘noise’, I have added what I am calling the Crowd Spidering (after Crowd Sourcing) page, where the priority of all pages from a common domain can be promoted or demoted en masse. The code extracting the hrefs from HTML has been rewritten and now uses a Python library. Due to a separation of housekeeping and spidering duties, the number of URLs scanned per day has doubled, although this has had the side effect of increasing the overall unprocessed count.

The books page’s search results have been altered to start moving away from the standard table format. Also, ‘edition’ has been added to the XML code for books. The generic search has been experimentally updated to allow “…” “…” searches. I have added an alternative way to search should autocomplete not work in your chosen browser. The autocomplete code has also been rationalised. The display format for dates has been improved.

I have also taken ‘ownership’ of a Virtual Private Server onto which I wish to clone the entire site and to get to grips with separating the spidering duties between two machines as a precursor to taking Steve up on his kind offer.

That’s all for now. I’ll report back in another month.

Rebecca
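
The priority scheme and the domain-wide promote/demote described in the post above could look roughly like the following. This is only a minimal sketch, not the RISC OS Search Engine's actual code. The post fixes only the values 0 to 3 and the 0 to 99 range; the default of 50, the keyword hints, the idea that a lower number is processed sooner, and the names `assign_priority` and `adjust_domain` are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Assumed values: only priorities 0-3 and the 0..99 range come from the post.
DEFAULT_PRIORITY = 50
FEED_HINTS = ("rss", "atom", "feed")               # hypothetical feed markers
SOFTWARE_HINTS = ("software", "download", "apps")  # hypothetical markers


def assign_priority(url: str) -> int:
    """Return a priority in 0..99 (assumed: lower is processed sooner)."""
    path = urlparse(url).path.lower()
    if path.endswith(".zip"):
        return 0
    if path.endswith("riscos.xml"):
        return 1
    if any(hint in path for hint in FEED_HINTS):
        return 2
    if any(hint in path for hint in SOFTWARE_HINTS):
        return 3
    return DEFAULT_PRIORITY


def adjust_domain(backlog: list, domain: str, delta: int) -> int:
    """Shift the priority of every backlog entry from one domain.

    A negative delta promotes, a positive delta demotes; the result is
    clamped to 0..99. Returns the number of entries changed.
    """
    changed = 0
    for entry in backlog:
        if urlparse(entry["url"]).netloc == domain:
            entry["priority"] = max(0, min(99, entry["priority"] + delta))
            changed += 1
    return changed


backlog = [{"url": u, "priority": assign_priority(u)} for u in (
    "http://example.com/about.html",
    "http://example.org/news/rss",
    "http://example.org/riscos.xml",
    "http://example.org/files/editor.zip",
)]
adjust_domain(backlog, "example.com", +20)   # demote a 'noisy' domain
backlog.sort(key=lambda e: e["priority"])    # process lowest number first
```

Keeping the per-URL rule and the per-domain adjustment separate means a whole domain can be moved up or down without touching the rules that classify individual URLs.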

Mar 1, 2014 10:12pm

Rebecca Shalfield (2257) 18 posts

Hi All,

Well, another month has been and gone and the RISC OS Search Engine has now been cloned onto my VPS at http://185.30.213.186. An Application Programming Interface has also been implemented, of which one part, the JSON-formatted synchronisation mechanism, is working successfully, happily copying data between the two mirrors. Although http://www.shalfield.com/riscos is now pointing to the VPS, where only limited data is currently available, the mirror on my home machine at http://84.92.157.78/riscos, where you’ll find details of all 1471 RISC OS applications discovered so far, can still be reached via the menu at the top.

Rebecca
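
For readers wondering what a JSON-formatted synchronisation between two mirrors might involve, here is a minimal sketch under stated assumptions. The post does not describe the actual API, so the record fields (`url`, `application`, `last_modified`), the "newer record wins" rule and the function names are hypothetical, and plain dictionaries stand in for the real database.

```python
import json


def export_records(store: dict) -> str:
    """Serialise every record in a mirror's store to a JSON string."""
    return json.dumps(list(store.values()), sort_keys=True)


def import_records(store: dict, payload: str) -> int:
    """Merge a JSON payload into a store keyed by URL.

    A record is written only if it is new to this mirror or carries a
    later 'last_modified' value than the local copy (assumed rule).
    Returns the number of records written.
    """
    written = 0
    for record in json.loads(payload):
        local = store.get(record["url"])
        if local is None or record["last_modified"] > local["last_modified"]:
            store[record["url"]] = record
            written += 1
    return written


# Hypothetical data: one record on the home mirror, none on the VPS.
home = {
    "http://example.org/a.zip": {
        "url": "http://example.org/a.zip",
        "application": "ExampleApp",
        "last_modified": 1393600000,
    },
}
vps = {}
copied = import_records(vps, export_records(home))
```

Running the import a second time with the same payload writes nothing, which is the property that lets two mirrors exchange data repeatedly without creating duplicates.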

Mar 2, 2014 1:54pm

Dave Higton (1515) 3534 posts

Hi Rebecca,

Thank you for your efforts with your RISC OS Search Engine. I think it will be a very valuable resource.

There’s an oddity, though, that I think you might like to take a look at. It struck me when I read the headline “Portsmouth show reminder” for “next Saturday, the 28th of September” that you tag the items with the date you put them onto your page (presumably), but there is no mention of the original date of the posting. So we have no way to see how out of date the information is, short of clicking the link you give.

Jan 5, 2019 12:06pm

Rebecca Shalfield (2257) 18 posts

Hi All,

I have been away from RISC OS for a while (4 years) whilst I developed a Kanban-based project management system named Kanbanara using the same underlying technology (Python, CherryPy, MongoDB) as the RISC OS Search Engine. I parked the RISC OS Search Engine back in 2014 because I was very dissatisfied with its spidering – it seemed to suffer from “telomeres-in-reverse”, where URLs would just grow longer and longer, effectively clogging up the spidering process and “attacking” RISC OS websites with unnecessary requests. Since building my new RISC OS computer based around a Titanium board, and with the open sourcing of the OS itself, I have a renewed interest in RISC OS. As a result, I resurrected the RISC OS Search Engine a couple of months ago. The code has been rewritten to work with Python 3 rather than Python 2. The spidering has been examined in greater detail and now works a whole lot better. It now has details of 2397 RISC OS applications. I am currently reworking the web UI component. I am not yet in a position to officially announce anything except to reassure others that such a scheme is being actively worked upon.
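
The post does not say how the ever-lengthening URLs were eventually tamed. One common guard against this kind of spider trap, shown here purely as an illustrative sketch, is to reject URLs that are overlong or that repeat the same path segment several times. The two limits and the function name `looks_like_spider_trap` are assumptions, not values from the project.

```python
from collections import Counter
from urllib.parse import urlparse

MAX_URL_LENGTH = 200      # assumed limit, not taken from the post
MAX_SEGMENT_REPEATS = 2   # assumed limit, not taken from the post


def looks_like_spider_trap(url: str) -> bool:
    """Heuristic check for URLs that keep growing as they are re-spidered.

    Returns True if the URL is longer than MAX_URL_LENGTH or if any single
    path segment occurs more than MAX_SEGMENT_REPEATS times.
    """
    if len(url) > MAX_URL_LENGTH:
        return True
    segments = [s for s in urlparse(url).path.split("/") if s]
    counts = Counter(segments)
    return any(n > MAX_SEGMENT_REPEATS for n in counts.values())


ok = looks_like_spider_trap("http://example.org/apps/editor/index.html")
bad = looks_like_spider_trap("http://example.org/apps/apps/apps/apps/index.html")
```

A check like this would run before a newly discovered URL is added to the backlog, so a trapped URL is never requested at all.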

Jan 7, 2019 6:55pm

Stevyn Gadd (2272) 63 posts

Good to hear you’ve been tempted back to the world of RISC OS. Looking forward to trying it.

Jan 13, 2019 12:24pm

Rebecca Shalfield (2257) 18 posts

Hi All,

Looking at the entries in the RISC OS Products Directory, which I manually assembled back in 2003-ish, you will see that it featured details of 2923 RISC OS products. I am pleased to announce that the new, improved RISC OS Search Engine spidering process has now exceeded that figure, despite some 436 thousand URLs still to be processed. The RISC OS Search Engine currently knows about 3167 RISC OS applications, many of which can be downloaded from multiple sources throughout the world wide web. As the web-based front end is still under re-development, I shall today email a static web page containing details of these 3167 RISC OS applications to the various RISC OS portals in the hope that such information can be disseminated as widely as possible.

Jan 18, 2021 11:22pm

Rebecca Shalfield (2257) 18 posts

Hi All,

I was hoping to give you an update on the RISC OS Search Engine project a few weeks ago, but the failure of the power supply on my home Windows PC put paid to that and to any thoughts of working on its web site over Christmas. Anyway, with the power supply now replaced, my home PC is back up and running. The development of a RISC OS Search Engine is proving something of a challenge, but at least the spidering is now under better control; so much so that I now want to rewrite the spidering process to work in quite a different way. Basically, I have two database collections, one holding the successful RISC OS-related URLs and the other a backlog of URLs still to be processed. When a RISC OS-related entry reaches its scheduled rescan time, it is simply removed from the RISC OS database and placed in the backlog of URLs, where it may remain unrescanned for some time depending on the length of the backlog (on my home PC, the backlog is 173382 in length); as a result, such valid entries are no longer visible to the end user. As I am quite likely to break what I currently have in the course of such a rewrite, and possibly would not have a working system for a few months, I thought you might like to see now, and benefit from, what I have achieved so far. This evening, I set up a fresh RISC OS Search Engine installation on a spare VPS, seeded it with all the successful RISC OS-related URLs I currently have, then set it in motion to observe what might happen. On my home PC, the database is reporting 3498 RISC OS applications; hopefully, this new installation will be up to that number in a few days. The UI is still very much incomplete but heading in what I hope is the right direction. If viewed from a current RISC OS browser such as NetSurf, it is going to look terrible, being written in HTML5 and CSS3 and using d3.js graphics. Hopefully, it looks and performs better on one of the new but unreleased RISC OS browsers, to which I do not have access.

The results can be seen at http://151.80.169.242:22676/

Rebecca
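
The two-collection behaviour described in the post above can be sketched as follows, using plain Python containers in place of the real database collections. `requeue_by_moving` follows the post's description; `requeue_in_place` is one hypothetical alternative and not necessarily the rewrite the author has in mind. The field names `rescan_at` and `rescan_queued` are assumptions.

```python
def requeue_by_moving(live: dict, backlog: list, now: float) -> None:
    """As described in the post: entries that are due for a rescan leave
    the live collection and wait in the backlog, so users cannot see them."""
    due = [url for url, entry in live.items() if entry["rescan_at"] <= now]
    for url in due:
        backlog.append(live.pop(url))


def requeue_in_place(live: dict, backlog: list, now: float) -> None:
    """Hypothetical alternative: queue a rescan request but keep the last
    known-good entry visible until the rescan replaces it."""
    for url, entry in live.items():
        if entry["rescan_at"] <= now and not entry.get("rescan_queued"):
            entry["rescan_queued"] = True
            backlog.append({"url": url})


def make_live() -> dict:
    """Two made-up entries: one due for rescan at t=100, one at t=900."""
    return {
        "http://example.org/a.zip": {"url": "http://example.org/a.zip",
                                     "rescan_at": 100.0},
        "http://example.org/b.zip": {"url": "http://example.org/b.zip",
                                     "rescan_at": 900.0},
    }


live_a, backlog_a = make_live(), []
requeue_by_moving(live_a, backlog_a, now=500.0)   # a.zip vanishes from view

live_b, backlog_b = make_live(), []
requeue_in_place(live_b, backlog_b, now=500.0)    # a.zip stays visible
```

With the first function the visible count drops whenever the backlog is long; with the second, the cost is that users may briefly see slightly stale details for an entry awaiting its rescan.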

Jan 19, 2021 3:02am

Paolo Fabio Zaino (28) 1882 posts

Nice job! :)

Jan 19, 2021 9:43am

Andrew McCarthy (3688) 605 posts

What a nice surprise for 2021! I’m impressed with how you’ve handled problematic websites that might turn up in the “Latest Web Pages” by providing a “Goto” button that you can hover over to see where it might take you. :)

Jan 19, 2021 10:59am

Andreas Skyman (8677) 170 posts

Very nice! I’m not sure how the backend works, but there may be some tags on GitHub that your spiders might enjoy munching on, e.g. riscos and riscos-ci ⁰.

⁰ https://github.com/topics/riscos and https://github.com/topics/riscos-ci

Jan 19, 2021 1:55pm

Paul M (4167) 12 posts

That’s a really impressive piece of work. A great idea and looks to be coming along really well. Great to see something like this being available to the RISC OS community.

Jan 21, 2021 8:47pm

Rebecca Shalfield (2257) 18 posts

The spidering seems to have progressed far quicker than I ever expected, which is great. We are already up to 3487 apps on the new installation at http://151.80.169.242:22676/. Only 23 more required until it is at the same level as my home installation. Since the weekend, I have tidied up a few UI features such as module listings related to apps; one of the benefits of working from home! I have also added a new “Module Dependencies” page that displays all the apps dependent on a given module – try this yourself with “UtilityModule”. I am wondering if I need to also be able to give the module’s version, such as “3.10” or “3.50”.
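
A "Module Dependencies" lookup of the kind described above amounts to a reverse index from module name to applications. The sketch below is an illustration only: the application names and the version numbers attached to them are invented, and the optional version filter is just one possible reading of the question about giving the module's version.

```python
from collections import defaultdict

# Hypothetical data: app name -> {module name: minimum version required}.
APPS = {
    "ExampleEdit": {"UtilityModule": "3.10", "WindowManager": "3.98"},
    "ExampleDraw": {"UtilityModule": "3.50"},
    "ExampleCalc": {"SharedCLibrary": "5.17"},
}


def parse_version(text: str) -> tuple:
    """Turn '3.10' into (3, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def build_index(apps: dict) -> dict:
    """Invert the app -> modules mapping into module -> {app: version}."""
    index = defaultdict(dict)
    for app, modules in apps.items():
        for module, version in modules.items():
            index[module][app] = version
    return index


def dependents(index: dict, module: str, available: str = None) -> list:
    """Return the sorted names of apps that depend on `module`.

    If `available` is given, keep only apps whose required version is no
    newer than it, i.e. apps that the available module version satisfies.
    """
    found = index.get(module, {})
    if available is None:
        return sorted(found)
    limit = parse_version(available)
    return sorted(a for a, v in found.items() if parse_version(v) <= limit)


index = build_index(APPS)
all_users = dependents(index, "UtilityModule")
users_310 = dependents(index, "UtilityModule", available="3.10")
```

Storing the required version alongside each dependency keeps the plain lookup unchanged while leaving room for a version filter later.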

Jan 22, 2021 1:04pm

John Rickman (71) 646 posts

First class! Best resource finder since Paul Vigay’s stuff. Don’t know what category but you should do well in the Riscository awards.

Jan 22, 2021 1:33pm

Raik (463) 2061 posts

Nice project :-)

I’m missing any German websites. www.gag.de, www.riscos.berlin, http://forum.acorn.de…
RISC OS on www.a4com.de has been gone since last year.

A German user reports “Strange activity in my area …” … it looks like the content was collected, but I could not find it in your engine.

Thanks a lot.

Jan 26, 2021 8:15pm

Rebecca Shalfield (2257) 18 posts

Things were running so well that something had to go wrong. Basically, my VPS had managed to both terminate the Python scripts and lock me out, rendering me unable to restart them. My VPS provider felt the best solution to resolve the situation was to give me a fresh VPS. Unfortunately, this is running “Chad Valley OS” otherwise known as Windows Server 2012 R2, which I utterly refuse to struggle with, not to mention that it is too old to install MongoDB. I have raised a helpdesk request as I am additionally unable to change the ISO image for the OS to something later. In the meantime, I shall continue to develop my home installation of the RISC OS Search Engine. Hopefully, a VPS will be up and running in a few days. Having 3518 RISC OS apps, I was intrigued to know which year had the most releases or updates. Assuming the following image appears at your end, you’ll be able to see the fruits of my labour.

Jan 26, 2021 8:21pm

Rick Murray (539) 13850 posts

“Assuming the following image appears at your end”

Sorry, you’ve linked to a Windows filename…
c:\Temp\riscos_apps_released_updated_per_year.png :-)

If you need to drop an image in a hurry and you aren’t worried about copyright, try imgur.com.

“I was intrigued to know which year had the most releases or updates.”

Well, there’s a hard call. I think that the year with the most app development is likely to be something like 1994 or 1995 when the RiscPC was shiny and new and the RISC OS market was vibrant enough to have a lot of software updated to run on the machine.

However, this is pre-Internet. It may be that, for you, you peg 2012 as a year with a lot of activity.
Why?
Two phrases. “RISC OS” and “RaspberryPi”. :-)

Hope you get your VPS sorted soon.

Feb 8, 2021 7:45am

Rebecca Shalfield (2257) 18 posts

The web-based UI of the RISC OS Search Engine is nearing completion. Unfortunately, I am unable to show you an experimental version on my VPS as we have now parted company. Basically, the original VPS corrupted itself and could not be revived. I initially rejected its replacement due to its OS – the awful Windows Server 2012 R2 – but managed to get it set up with an older version of the MongoDB software I needed to use. The VPS was running fine late last night but had been suspended by this morning due to CPU abuse. As the spidering and UI are basically complete, any ideas from RISCOSOpen or RISCOSDev going forward? It would be a shame for the RISC OS Community to lose this resource, but it is now obvious to me that although I can run it quite successfully on my home PC and domestic broadband account, a beefy-enough VPS is going to fall outside my budget range.

Mar 5, 2021 3:40pm

Stefan Fröhling (7826) 167 posts

Hello Rebecca,
Interesting project. I was also thinking about something like it as an alternative to Google.
Sadly, the link that you showed timed out, so I was not able to test it out…
Did you solve your server problems?
We might be able to offer you server time/space if needed.
regards Stefan

Mar 27, 2021 9:27am

Rebecca Shalfield (2257) 18 posts

Hi Stefan,
I have now come to the conclusion that hosting the RISC OS Search Engine myself is not really a viable option, from a financial point of view, due to its spidering requirements. I have now placed the entire source code to the RISC OS Search Engine version 2.0 up at https://sourceforge.net/projects/risc-os-search-engine/ for yourself and others to see if this project’s aims can finally see the light of day. I have added some basic installation instructions. Let me know if these prove inadequate to get everything set up on a new server.
Rebecca

Mar 28, 2021 12:48pm

Steve Revill (20) 1361 posts

Hi Rebecca. I’ll raise this topic next time ROOL and RISC OS Developments have one of their regular coordination calls.
