Networking
Jeffrey Lee (213) 6048 posts
Use a web crawler which populates a searchable database. Anyone who wants to can host the search service. I’m assuming that the core server components will be open-source (like Rebecca’s code), so there could easily be multiple active sites offering the service, providing some much-needed redundancy for when they inevitably go “pop”.
Rick Murray (539) 13857 posts
In this day and age, surely that would be a necessity?
If this is the case, then it might also be a useful way to distribute the scanning/spidering work amongst multiple sites (to reduce the workload, the data consumption, and the time taken)?
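As a rough illustration of how that sharing-out could work – a minimal sketch, not anything agreed for the project – each participating site could take responsibility for whichever hosts hash into its slot:

```python
import hashlib
from urllib.parse import urlparse

def assigned_to_me(url, my_slot, total_slots):
    """Return True if this crawler instance is responsible for the URL's host.

    Hashing the hostname (rather than the full URL) keeps every page of a
    site with the same crawler, which makes per-site politeness delays and
    robots.txt caching straightforward.
    """
    host = urlparse(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % total_slots == my_slot

# Example: instance 1 of 3 only fetches URLs whose host hashes to slot 1.
urls = [
    "https://www.riscosopen.org/forum/",
    "https://www.heyrick.co.uk/blog/index.php?diary=20180800",
]
mine = [u for u in urls if assigned_to_me(u, my_slot=1, total_slots=3)]
```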
Jeffrey Lee (213) 6048 posts
Googling for “web crawler framework” suggests that there are more open-source crawlers available than you can shake a stick at, so it shouldn’t be too hard to find one that meets most of our needs (robots.txt compliance, acceptable licensing terms, a sensible programming language, etc.). So the hard part will be writing the code that analyses the content – which is what a fair amount of Rebecca’s code focused on (examining zip files to identify programs, etc.).
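For the robots.txt side of that, Python’s standard library already does most of the work; a minimal sketch (the user-agent name is just a placeholder, not an agreed identifier):

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "RISCOSSearchBot"  # placeholder name

def allowed_by_robots(url):
    """Check a site's robots.txt before fetching one of its pages."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetches and parses the site's robots.txt
    return rp.can_fetch(USER_AGENT, url)

# A real crawler would cache one parser per host rather than re-fetching
# robots.txt for every URL, e.g.:
# if allowed_by_robots("https://www.riscosopen.org/forum/"):
#     ...enqueue the URL for fetching...
```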
Yes. It’s easy to detect software, and it’s easy to detect identical copies of software. But extracting meaningful version information (or other attributes) from software can be a lot harder.

For modules, if you assume that all module names are registered, it can easily build a database of all known modules and their version numbers, providing a human-friendly result. But things get a bit unfriendly again when you realise that there could be multiple builds of the same version of a module (e.g. optimised for different CPU architectures, or manually patched modules). In some cases extra analysis might be able to help – e.g. if the module is in a !System distribution, it could attempt to determine OS/CPU compatibility from the folder the module is in, or it could use the timestamp of the module to determine which is newest. But ultimately it might be down to the user to decide which version of the module they want (e.g. the user could look at the websites which contain the different builds and conclude “oh, I want this one because this text here says it’s been patched for ARMv7”).

Non-modules are a bit trickier because there’s no standard way of indicating the version of a program. In some cases it might be possible to extract the information from help/readme files. We’d probably have to start with a simple algorithm, then improve it based upon the list of programs that it fails to identify version information for. When software ownership has been passed around between multiple people, or where multiple pieces of software share the same name, analysis of the text in readme files might be able to determine whether program A is a relative of program X or program Y.

There are lots of “fun” things you can do, and the small RISC OS market means that you won’t find yourself drowning in terabytes of data, making it easy to experiment with different approaches once you’ve done the initial work of identifying the interesting websites/zip files.
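To illustrate the module side of that analysis – purely a sketch, assuming the standard module header layout from the PRMs (the words at +0x10 and +0x14 are offsets to the NUL-terminated title and help strings) and the usual “Name<TAB>v.vv (date)” help-string convention:

```python
import re
import struct

def module_version(data: bytes):
    """Best-effort extraction of a RISC OS module's name, version and date."""
    if len(data) < 0x18:
        return None
    title_off, help_off = struct.unpack_from("<II", data, 0x10)

    def cstring(offset):
        if not 0 < offset < len(data):
            return ""
        end = data.find(b"\0", offset)
        if end == -1:
            end = len(data)
        return data[offset:end].decode("latin-1", "replace")

    help_str = cstring(help_off)
    # e.g. "SomeModule\t1.23 (01 Jan 2020)" -> version 1.23, date 01 Jan 2020
    match = re.search(r"(\d+\.\d+)\s*(?:\((\d{2} \w{3} \d{4})\))?", help_str)
    if not match:
        return None
    return {"title": cstring(title_off),
            "version": match.group(1),
            "date": match.group(2)}

# Example (hypothetical filename; ",ffa" is the usual Module filetype suffix
# when RISC OS files are stored on foreign filesystems):
# info = module_version(open("SomeModule,ffa", "rb").read())
```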
Peter Scheele (2290) 178 posts
Is there something developers/web site owners can do? For instance, by adding specific descriptions of programs, modules etc. to their web sites? I know it’s a bit early to bring this up, but maybe it would make it a lot easier to design both the database and the crawlers. There’s one thing Rick mentioned: adding a rool.html to the web site.

@Jeffrey: what ‘fun’ things do you mean?
Martin Avison (27) 1495 posts
Anything that relies on web site changes will only work with sites that are still being actively maintained. That, I think, would be a major limitation.
Rick Murray (539) 13857 posts
Not necessarily. Site changes can be used to guide and inform the spider. It’s rather like Google, really. You can make a sitemap to help it, or just let it discover stuff for itself, though it may not be as good.
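For sites that do offer a sitemap, consuming it is straightforward; a minimal sketch using the standard sitemap namespace to seed the crawl queue:

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url):
    """Return the <loc> entries from a sitemap, ready to seed the crawl queue."""
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    return [loc.text.strip()
            for loc in tree.findall(".//sm:loc", SITEMAP_NS)
            if loc.text]

# Example (hypothetical URL):
# for url in sitemap_urls("https://www.example.org/sitemap.xml"):
#     enqueue(url)
```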
Rick Murray (539) 13857 posts
One request I have – please make the spider not try to degrade URLs. I’ve had to teach Google that https://www.heyrick.co.uk/blog/index.php?diary=20180800 and https://www.heyrick.co.uk/blog/index.php?diary=20180700 (and the other 800-odd) are not the same; it’s a bit more hit and miss with the wayback engine (works sporadically).

Leaving off the specifier will 301 redirect to the most recent entry, which means /blog/ should be disregarded, as it’ll point to a different thing each time. But then, if you’re following the redirect you’ll have a proper full URI, so you should be using that.

Oh, here’s a fun one. Pick http or https. They are both supported, but plain http will not redirect to SSL. It’s intentional. Google… gets confused by this. ;-)

Don’t panic too much. My site is still
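A minimal sketch of the kind of URL handling being asked for here – keep the query string intact, follow redirects, and index the final URL rather than the one originally requested (urllib is used purely for illustration):

```python
import urllib.parse
import urllib.request

def canonicalise(url):
    """Normalise a URL without throwing information away.

    Lower-cases the scheme and host and drops any #fragment, but leaves the
    path and query string untouched, so ...?diary=20180800 and
    ...?diary=20180700 remain distinct entries.
    """
    parts = urllib.parse.urlsplit(url)
    return urllib.parse.urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )

def fetch(url):
    """Fetch a page, following redirects, and return (final_url, body).

    Recording response.geturl() means a URL like /blog/ that 301-redirects to
    the latest entry is indexed under the page it actually landed on.
    """
    with urllib.request.urlopen(canonicalise(url)) as response:
        return response.geturl(), response.read()
```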
Jeffrey Lee (213) 6048 posts
Just all the different types of analysis you can do on the data to try and extract useful information from it.
Rebecca Shalfield (2257) 18 posts
Hi All,

I have been away from RISC OS for a while (4 years) whilst I developed a Kanban-based project management system named Kanbanara, using the same underlying technology (Python, CherryPy, MongoDB) as the RISC OS Search Engine.

I parked the RISC OS Search Engine back in 2014 because I was very dissatisfied with its spidering – it seemed to suffer from “telomeres-in-reverse”, where URLs would just grow longer and longer, effectively clogging up the spidering process and “attacking” RISC OS websites with unnecessary requests.

Since building my new RISC OS computer based around a Titanium board, and with the open sourcing of the OS itself, I have a renewed interest in RISC OS. As a result, I resurrected the RISC OS Search Engine a couple of months ago. The code has been rewritten to work with Python 3 rather than Python 2, and the spidering has been examined in greater detail and now works a whole lot better. It now has details of 2397 RISC OS applications. I am currently working on reworking the web UI component.

I am not yet in a position to officially announce anything, except to reassure others that such a scheme is being actively worked upon.
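For what it’s worth, the “ever-growing URL” failure mode usually comes from resolving relative links against the wrong base (or not de-duplicating the results); a minimal sketch of the defensive link handling involved – in a real spider the seen-set would live in MongoDB rather than in memory:

```python
from urllib.parse import urljoin, urldefrag, urlsplit

seen = set()  # in the real service this would be a MongoDB collection

def resolve_links(page_url, hrefs, max_path_segments=20):
    """Turn raw href values into absolute, de-duplicated URLs to enqueue.

    urljoin() resolves relative links against the page they came from, and
    urldefrag() strips #fragments, which stops the same page being queued
    again and again under ever-longer paths.
    """
    new_urls = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        # Crude safety valve against pathological link loops.
        if len(urlsplit(absolute).path.split("/")) > max_path_segments:
            continue
        if absolute not in seen:
            seen.add(absolute)
            new_urls.append(absolute)
    return new_urls
```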