Networking
Jeffrey Lee (213) 6048 posts
Use a web crawler which populates a searchable database. Anyone who wants to can host the search service. I’m assuming that the core server components will be open-source (like Rebecca’s code), so there could easily be multiple active sites offering the service, providing some much-needed redundancy for when they inevitably go “pop”.
Rick Murray (539) 13857 posts
In this day and age, surely that would be a necessity?
If this is the case, then it might also be a useful way to distribute the scanning/spidering work amongst multiple sites (to reduce the workload, the data consumption, and the time taken)?
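As a rough illustration of how that sharing-out could work – a minimal sketch, not anything agreed for the project – each participating site could take responsibility for whichever hosts hash into its slot:

```python
import hashlib
from urllib.parse import urlparse

def assigned_to_me(url, my_slot, total_slots):
    """Return True if this crawler instance is responsible for the URL's host.

    Hashing the hostname (rather than the full URL) keeps every page of a
    site with the same crawler, which makes per-site politeness delays and
    robots.txt caching straightforward.
    """
    host = urlparse(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % total_slots == my_slot

# Example: instance 1 of 3 only fetches URLs whose host hashes to slot 1.
urls = [
    "https://www.riscosopen.org/forum/",
    "https://www.heyrick.co.uk/blog/index.php?diary=20180800",
]
mine = [u for u in urls if assigned_to_me(u, my_slot=1, total_slots=3)]
```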
Jeffrey Lee (213) 6048 posts
Googling for “web crawler framework” suggests that there are more open-source crawlers available than you can shake a stick at, so it shouldn’t be too hard to find one that meets most of our needs (robots.txt compliance, acceptable licensing terms, a sensible programming language, etc.). So the hard part will be writing the code that analyses the content – which is what a fair amount of Rebecca’s code focused on (examining zip files to identify programs, etc.).
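For the robots.txt side of that, Python’s standard library already does most of the work; a minimal sketch (the user-agent name is just a placeholder, not an agreed identifier):

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "RISCOSSearchBot"  # placeholder name

def allowed_by_robots(url):
    """Check a site's robots.txt before fetching one of its pages."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetches and parses the site's robots.txt
    return rp.can_fetch(USER_AGENT, url)

# A real crawler would cache one parser per host rather than re-fetching
# robots.txt for every URL, e.g.:
# if allowed_by_robots("https://www.riscosopen.org/forum/"):
#     ...enqueue the URL for fetching...
```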
Yes. It’s easy to detect software, and it’s easy to detect identical copies of software. But extracting meaningful version information (or other attributes) from software can be a lot harder.

For modules, if you assume that all module names are registered, it can easily build a database of all known modules and their version numbers, providing a human-friendly result. But things get a bit unfriendly again when you realise that there could be multiple builds of the same version of a module (e.g. optimised for different CPU architectures, or manually patched modules). In some cases extra analysis might be able to help – e.g. if the module is in a !System distribution, it could attempt to determine OS/CPU compatibility from the folder the module is in, or it could use the timestamp of the module to determine which is newest. But ultimately it might be down to the user to decide which version of the module they want (e.g. the user could look at the websites which contain the different builds and conclude “oh, I want this one because this text here says it’s been patched for ARMv7”).

Non-modules are a bit trickier because there’s no standard way of indicating the version of a program. In some cases it might be possible to extract the information from help/readme files. We’d probably have to start with a simple algorithm, then improve it based upon the list of programs that it fails to identify version information for. When software ownership has been passed around between multiple people, or where multiple pieces of software share the same name, analysis of the text in readme files might be able to determine whether program A is a relative of program X or program Y.

There are lots of “fun” things you can do, and the small RISC OS market means that you won’t find yourself drowning in terabytes of data, making it easy to experiment with different approaches once you’ve done the initial work of identifying the interesting websites/zip files.
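To illustrate the module side of that analysis – purely a sketch, assuming the standard module header layout from the PRMs (the words at +0x10 and +0x14 are offsets to the NUL-terminated title and help strings) and the usual “Name<TAB>v.vv (date)” help-string convention:

```python
import re
import struct

def module_version(data: bytes):
    """Best-effort extraction of a RISC OS module's name, version and date."""
    if len(data) < 0x18:
        return None
    title_off, help_off = struct.unpack_from("<II", data, 0x10)

    def cstring(offset):
        if not 0 < offset < len(data):
            return ""
        end = data.find(b"\0", offset)
        if end == -1:
            end = len(data)
        return data[offset:end].decode("latin-1", "replace")

    help_str = cstring(help_off)
    # e.g. "SomeModule\t1.23 (01 Jan 2020)" -> version 1.23, date 01 Jan 2020
    match = re.search(r"(\d+\.\d+)\s*(?:\((\d{2} \w{3} \d{4})\))?", help_str)
    if not match:
        return None
    return {"title": cstring(title_off),
            "version": match.group(1),
            "date": match.group(2)}

# Example (hypothetical filename; ",ffa" is the usual Module filetype suffix
# when RISC OS files are stored on foreign filesystems):
# info = module_version(open("SomeModule,ffa", "rb").read())
```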
Peter Scheele (2290) 178 posts
Is there something developers/web site owners can do? For instance, by adding specific descriptions of programs, modules etc. to their web sites? I know it’s a bit early to bring this up, but maybe it would make it a lot easier to design both the database and the crawlers. There’s one thing Rick mentioned: adding a rool.html to the web site.

@Jeffrey: what ‘fun’ things do you mean?
Martin Avison (27) 1495 posts
Anything that relies on web site changes will only work with sites that are still being actively maintained. That, I think, would be a major limitation.
Rick Murray (539) 13857 posts
Not necessarily. Site changes can be used to guide and inform the spider. It’s rather like Google, really. You can make a sitemap to help it, or just let it discover stuff for itself, though it may not be as good.
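For sites that do offer a sitemap, consuming it is straightforward; a minimal sketch using the standard sitemap namespace to seed the crawl queue:

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url):
    """Return the <loc> entries from a sitemap, ready to seed the crawl queue."""
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    return [loc.text.strip()
            for loc in tree.findall(".//sm:loc", SITEMAP_NS)
            if loc.text]

# Example (hypothetical URL):
# for url in sitemap_urls("https://www.example.org/sitemap.xml"):
#     enqueue(url)
```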
Rick Murray (539) 13857 posts
One request I have – please make the spider not try to degrade URLs. I’ve had to teach Google that https://www.heyrick.co.uk/blog/index.php?diary=20180800 and https://www.heyrick.co.uk/blog/index.php?diary=20180700 (and the other 800-odd) are not the same; it’s a bit more hit and miss with the wayback engine (works sporadically).

Leaving off the specifier will 301 redirect to the most recent entry, which means /blog/ should be disregarded, as it’ll point to a different thing each time. But then, if you’re following the redirect you’ll have a proper full URI, so you should be using that.

Oh, here’s a fun one. Pick http or https. They are both supported, but plain http will not redirect to SSL. It’s intentional. Google… gets confused by this. ;-)

Don’t panic too much. My site is still
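A minimal sketch of the kind of URL handling being asked for here – keep the query string intact, follow redirects, and index the final URL rather than the one originally requested (urllib is used purely for illustration):

```python
import urllib.parse
import urllib.request

def canonicalise(url):
    """Normalise a URL without throwing information away.

    Lower-cases the scheme and host and drops any #fragment, but leaves the
    path and query string untouched, so ...?diary=20180800 and
    ...?diary=20180700 remain distinct entries.
    """
    parts = urllib.parse.urlsplit(url)
    return urllib.parse.urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )

def fetch(url):
    """Fetch a page, following redirects, and return (final_url, body).

    Recording response.geturl() means a URL like /blog/ that 301-redirects to
    the latest entry is indexed under the page it actually landed on.
    """
    with urllib.request.urlopen(canonicalise(url)) as response:
        return response.geturl(), response.read()
```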
Jeffrey Lee (213) 6048 posts
Just all the different types of analysis you can do on the data to try and extract useful information from it.
Rebecca Shalfield (2257) 18 posts
Hi All,

I have been away from RISC OS for a while (4 years) whilst I developed a Kanban-based project management system named Kanbanara, using the same underlying technology (Python, CherryPy, MongoDB) as the RISC OS Search Engine.

I parked the RISC OS Search Engine back in 2014 because I was very dissatisfied with its spidering – it seemed to suffer from “telomeres-in-reverse”, where URLs would just grow longer and longer, effectively clogging up the spidering process and “attacking” RISC OS websites with unnecessary requests.

Since building my new RISC OS computer based around a Titanium board, and with the open sourcing of the OS itself, I have a renewed interest in RISC OS. As a result, I resurrected the RISC OS Search Engine a couple of months ago. The code has been rewritten to work with Python 3 rather than Python 2, and the spidering has been examined in greater detail and now works a whole lot better. It now has details of 2397 RISC OS applications. I am currently working on reworking the web UI component.

I am not yet in a position to officially announce anything, except to reassure others that such a scheme is being actively worked upon.
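For what it’s worth, the “ever-growing URL” failure mode usually comes from resolving relative links against the wrong base (or not de-duplicating the results); a minimal sketch of the defensive link handling involved – in a real spider the seen-set would live in MongoDB rather than in memory:

```python
from urllib.parse import urljoin, urldefrag, urlsplit

seen = set()  # in the real service this would be a MongoDB collection

def resolve_links(page_url, hrefs, max_path_segments=20):
    """Turn raw href values into absolute, de-duplicated URLs to enqueue.

    urljoin() resolves relative links against the page they came from, and
    urldefrag() strips #fragments, which stops the same page being queued
    again and again under ever-longer paths.
    """
    new_urls = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        # Crude safety valve against pathological link loops.
        if len(urlsplit(absolute).path.split("/")) > max_path_segments:
            continue
        if absolute not in seen:
            seen.add(absolute)
            new_urls.append(absolute)
    return new_urls
```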