Compression of ELF files and shared libraries
Andrew Rawnsley (492) 1445 posts |
One of the things I’m seeing with the browser developments is that the shared libraries are pretty huge. Iris and OBrowser are both using 40-100 MB (or more) of shared libraries, which have to be loaded during first startup. Whilst this isn’t too bad on a SATA machine (let’s say 1 second), on a 10 MB/sec SD-card-based system that’s potentially 10 or so seconds of just loading libraries. It struck me that Acorn solved this (sort of) by offering !Squeeze and ModSqueeze in the dev tools. These crunched the executable down and decompressed it on load (I assume). Is there any scope for something similar for ELF files and shared libraries? The amount of compute power available to decompress is likely to far exceed the performance of the SD bus, making it a worthwhile trade-off, I think? Or is something like that already done, and there’s no scope to reduce it? |
Jeffrey Lee (213) 6048 posts |
Correct. I’d imagine that the default for Unix/Windows/etc. is to not compress libraries, because it’d prevent them from being (efficiently) used in a memory-mapped manner. After all, there’s no point loading a 100MB binary into memory when starting an app if only 10% of that code is actually needed for the app to become ready for use. Apparently there is a generic mechanism for compressing ELF files (via the SHF_COMPRESSED section header flag), but it’s relatively new (2015-ish?), so some extra work may be needed to add the relevant support to our ELF tools. http://www.linker-aliens.org/blogs/ali/entry/elf_section_compression/ Or we could start looking at ways to implement proper memory mapped file support. |
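For reference, a minimal sketch of what reading such a section might look like, assuming glibc’s <elf.h> definitions (SHF_COMPRESSED, Elf32_Chdr, ELFCOMPRESS_ZLIB) and zlib; the surrounding file and section-table handling is left out:

    #include <elf.h>      /* Elf32_Shdr, Elf32_Chdr, SHF_COMPRESSED, ELFCOMPRESS_ZLIB */
    #include <zlib.h>     /* uncompress() */
    #include <stdlib.h>

    /* Inflate one section if its header carries the SHF_COMPRESSED flag.
       'raw' points at the section contents as read from the file.
       Returns a freshly malloc'd buffer of ch_size bytes, or NULL. */
    static void *inflate_section(const Elf32_Shdr *shdr, const void *raw)
    {
        if (!(shdr->sh_flags & SHF_COMPRESSED))
            return NULL;                      /* not compressed: use 'raw' as-is */

        const Elf32_Chdr *chdr = raw;         /* compression header precedes the data */
        if (chdr->ch_type != ELFCOMPRESS_ZLIB)
            return NULL;                      /* only the zlib format handled here */

        uLongf out_size = chdr->ch_size;      /* uncompressed size from the header */
        Bytef *out = malloc(out_size);
        if (!out)
            return NULL;

        const Bytef *payload = (const Bytef *)raw + sizeof(Elf32_Chdr);
        if (uncompress(out, &out_size,
                       payload, shdr->sh_size - sizeof(Elf32_Chdr)) != Z_OK) {
            free(out);
            return NULL;
        }
        return out;
    }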
Jon Abbott (1421) 2652 posts |
That would be the route to take, dynamically loading file pages as they’re touched instead of bulk loading large chunks of code into memory. This could probably be done with DAs as-is, but it might be better if the OS handled it by trapping OS_File calls that try to load files into a DA flagged for the purpose. |
Jeffrey Lee (213) 6048 posts |
The approach that’s usually used is to allow the OS to decide where in memory the file gets mapped. The program can make some high-level choices (e.g. for RISC OS there’d be the option between mapping into global address space or into app space), but then the OS handles the rest. That way the OS can load it on a page boundary, allowing the mapping to be shared between multiple processes if they all request the same file (potentially at different addresses), and allowing direct integration with the filesystem cache. Bonus points if the binaries are fully read-only and position-independent; otherwise copy-on-write is needed to allow the loader to patch them. |
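For comparison, this is roughly the model on Unix: the process only chooses the protection and sharing flags, while the kernel picks the address, keeps the mapping page-aligned and shares the backing pages between processes via the page cache. RISC OS has no equivalent call today; this POSIX sketch is just an illustration of what Jeffrey describes:

    #include <stddef.h>    /* size_t */
    #include <fcntl.h>     /* open */
    #include <sys/mman.h>  /* mmap */
    #include <sys/stat.h>  /* fstat */
    #include <unistd.h>    /* close */

    /* Map a shared library read-only/executable; the kernel picks the address
       (first argument NULL) and pages the file in on demand as it is touched. */
    void *map_library(const char *path, size_t *len_out)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;

        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return NULL; }

        void *base = mmap(NULL, st.st_size, PROT_READ | PROT_EXEC,
                          MAP_PRIVATE, fd, 0);
        close(fd);                     /* the mapping stays valid after close */
        if (base == MAP_FAILED)
            return NULL;

        *len_out = st.st_size;
        return base;
    }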
Steffen Huber (91) 1958 posts |
While memory-mapped file support would obviously be a much better solution (along with some FS caching), I think that Andrew’s pragmatic compression approach would still go a long way towards a more usable system for applications which use large amounts of shared libs. Integrating simple zlib capability in SOLoader, along with a simple “provide compressed alternatives for all shared libs” application, would probably be 100 times easier to implement. Did anyone ever profile the loading process of the ELF stuff when a large app starts? I was always surprised about the large gain when putting things on RAM disc – after all, in benchmarks, RAM disc is not THAT fast. It reminded me of effects usually seen in situations with many small files, where access times are the deciding factor while block transfer speed is not that important. |
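As a rough illustration of how little code the pragmatic route needs, here is a sketch of a loader helper built on plain zlib – the function name and the “known uncompressed size” parameter are invented for the example, not anything SOLoader actually provides:

    #include <zlib.h>     /* gzopen, gzread, gzclose */
    #include <stdlib.h>

    /* Hypothetical helper for an SOLoader-style hook: read a gzip-compressed
       shared library (e.g. a "/gz" alternative alongside the original) straight
       into a buffer of known uncompressed size, which would then be handed to
       the existing loader path. */
    unsigned char *load_compressed_so(const char *path, unsigned long orig_size)
    {
        unsigned char *buf = malloc(orig_size);
        if (!buf)
            return NULL;

        gzFile in = gzopen(path, "rb");
        if (!in) { free(buf); return NULL; }

        int got = gzread(in, buf, (unsigned)orig_size);   /* inflates as it reads */
        gzclose(in);

        if (got < 0 || (unsigned long)got != orig_size) {
            free(buf);
            return NULL;
        }
        return buf;
    }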
Andrew Rawnsley (492) 1445 posts |
If zlib is too computationally expensive, then even a simple run length encoding (RLE) would probably be better than nothing. Jeffrey’s solution is unquestionably better, but sounds like a major undertaking (albeit worthwhile). As Steffen says, I just figured that reducing the amount of stuff to load would be something that could be done in weeks rather than months. Copying to RAM disc also works, but the time it takes to copy to RAM probably isn’t much better than the time it takes to load the libraries in the first place. It is just very telling how much faster the browsers are to start on second run, when the shared libraries are already loaded in memory. |
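For the sake of argument, a minimal decoder for the kind of byte-level RLE meant here – the (count, byte) pair format is purely illustrative:

    #include <stddef.h>

    /* Decode (count, byte) pairs: 'in' holds in_len/2 pairs, each expanding to
       'count' copies of 'byte'. Returns the number of bytes written to 'out',
       or 0 if 'out' is too small. The encoding is purely illustrative. */
    size_t rle_decode(const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t out_cap)
    {
        size_t written = 0;
        for (size_t i = 0; i + 1 < in_len; i += 2) {
            unsigned count = in[i];
            unsigned char value = in[i + 1];
            if (written + count > out_cap)
                return 0;
            for (unsigned j = 0; j < count; j++)
                out[written++] = value;
        }
        return written;
    }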
Steffen Huber (91) 1958 posts |
I am not an expert in data compression, but zlib is used pretty much everywhere and is a lot faster at decompressing than compressing. Using a single core on modern PCs, it reaches 500 MB/s. I think that LZ4 is better optimized for speedy decompression; I found reported speeds of 4500 MB/s. So on a single-core ARM at 1 GHz with reasonably fast RAM, I would expect plain zlib to reach at least 50 MB/s. There seems to be quite some activity speeding up zlib on ARM, because it matters for HTTP decompression and PNG decoding. See e.g. here for “Chromium zlib”: https://events.linuxfoundation.org/wp-content/uploads/2017/11/Optimizing-Zlib-on-ARM-The-Power-of-NEON-Adenilson-Cavalcanti-ARM.pdf |
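If LZ4 were chosen instead, a round trip is two calls in the reference library (this assumes liblz4 is available on the target; the buffer sizes here are just for the demo):

    #include <lz4.h>      /* LZ4_compress_default, LZ4_decompress_safe */
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        const char src[] = "the same bytes over and over, the same bytes over and over";
        char packed[128], unpacked[128];

        /* Compress: returns the compressed size, or 0 on failure. */
        int packed_len = LZ4_compress_default(src, packed, sizeof src, sizeof packed);

        /* Decompress: needs the compressed length and a big-enough output buffer;
           returns the original size, or a negative value on corrupt input. */
        int out_len = LZ4_decompress_safe(packed, unpacked, packed_len, sizeof unpacked);

        printf("%d -> %d -> %d bytes, match=%d\n",
               (int)sizeof src, packed_len, out_len,
               out_len == (int)sizeof src && memcmp(src, unpacked, out_len) == 0);
        return 0;
    }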
nemo (145) 2569 posts |
There’s already a module for memory-mapping files isn’t there? |
Andrew Rawnsley (492) 1445 posts |
I think you’d want to be targeting a 100-150 MB/s decompression rate to avoid a noticeable performance decrease on current SATA systems, but in a sense the algorithm is slightly immaterial in that it can be tried/tested if the idea is deemed worthwhile. |
Steffen Huber (91) 1958 posts |
If you have a fast SATA system, why would you want compression in the first place? It is a solution for slow I/O machines. For a good judgement, you’d also need to investigate the compression ratio of the various compression schemes. And of course have a look at faster vs. slower CPUs – I would expect at least a factor-of-two difference between Cortex-A9 and Cortex-A15 machines, and it will get bigger once the RPi 4 is properly supported. |
Stuart Painting (5389) 714 posts |
It’s a matter of perspective. The user of a fast machine wouldn’t want compression, but the user of a slow machine would clearly benefit. The application developer has to choose whether the performance hit on faster machines outweighs the improvement on slower machines, and whether it would be necessary to distribute two versions of the application as a result. |
Andrew Rawnsley (492) 1445 posts |
Yes, as Stuart says. Essentially, if decompression is fast enough, it won’t incur a noticeable impact on fast systems, but it would seriously help slower ones. I still run my modern programs through !Squeeze (mostly by habit) because I figure it can’t hurt on SD systems and I know the decompress time can’t be meaningfully measured given the speed of modern CPUs, so effectively there’s no downside. The key is finding a good balance. |
Rick Murray (539) 13861 posts |
I think the crux of the problem may lie here – is all of that code necessary? Could the libraries be split into core components and “extras” that can be dynamically loaded if necessary?
I gave up on that as I found it was easier for debugging to look at an uncompressed executable in Zap. My apps are small enough that I doubt there’s much practical difference between load times for compressed vs uncompressed.
It’s compressed, now it’s SLOW. Come on, you know somebody is going to say that. :-) I wonder, just off the top of my head… if it might work as a stop-gap to be able to save and reload a compressed version of the entire SOManager Dynamic Area? So once the system has been set up once (and nothing updated), the whole DA can be loaded from a previously compressed file copy? |
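A very rough sketch of that stop-gap, using zlib to write a snapshot of an already-populated area to disc – the base address and size parameters stand in for whatever the real code would obtain from SOManager, which is glossed over here:

    #include <zlib.h>     /* compressBound, compress2 */
    #include <stdio.h>
    #include <stdlib.h>

    /* Dump a snapshot of an already-populated dynamic area to disc, compressed
       with zlib. 'base'/'size' stand in for whatever the real code would read
       back from SOManager - purely a sketch, only valid while nothing has been
       updated since the snapshot was taken. */
    int save_da_snapshot(const char *path, const void *base, unsigned long size)
    {
        uLongf packed_len = compressBound(size);
        Bytef *packed = malloc(packed_len);
        if (!packed)
            return -1;

        if (compress2(packed, &packed_len, base, size, Z_BEST_SPEED) != Z_OK) {
            free(packed);
            return -1;
        }

        FILE *out = fopen(path, "wb");
        if (!out) { free(packed); return -1; }

        /* Store the uncompressed size first so the loader knows how much to map. */
        int ok = fwrite(&size, sizeof size, 1, out) == 1 &&
                 fwrite(packed, 1, packed_len, out) == packed_len;
        fclose(out);
        free(packed);
        return ok ? 0 : -1;
    }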
nemo (145) 2569 posts |
Not just useful for debugging…
If your machine is slow enough to notice the decompression speed, it’s not going to have hyper-fast disc IO. So compression is always a win, and of course things can be decompressed before inspection. |