Compression of ELF files and shared libraries
Andrew Rawnsley (492) 1445 posts |
One of the things I’m seeing with the browser developments is that the shared libraries are pretty huge. Iris and OBrowser are both using 40-100 MB (or more) of shared libraries, which have to be loaded during first startup. Whilst this isn’t too bad on a SATA machine (let’s say 1 second), on a 10 MB/sec SD-card-based system that’s potentially 10 or so seconds of just loading libraries. It struck me that Acorn solved this (sort of) by offering !Squeeze and ModSqueeze in the dev tools. These crunched the executable down and decompressed it on load (I assume). Is there any scope for something similar for ELF files and shared libraries? The amount of compute power available to decompress is likely to far exceed the performance of the SD bus, making it a worthwhile trade-off, I think? Or is something like that already done, and there’s no scope to reduce it? |
Jeffrey Lee (213) 6048 posts |
Correct. I’d imagine that the default for Unix/Windows/etc. is to not compress libraries, because it’d prevent them from being (efficiently) used in a memory-mapped manner. After all, there’s no point loading a 100MB binary into memory when starting an app if only 10% of that code is actually needed for the app to become ready for use. Apparently there is a generic mechanism for compressing ELF files (via the SHF_COMPRESSED section header flag), but it’s relatively new (2015-ish?), so some extra work may be needed to add the relevant support to our ELF tools. http://www.linker-aliens.org/blogs/ali/entry/elf_section_compression/ Or we could start looking at ways to implement proper memory mapped file support. |
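For reference, a minimal sketch of what reading such a section might look like, assuming glibc’s <elf.h> definitions (SHF_COMPRESSED, Elf32_Chdr, ELFCOMPRESS_ZLIB) and zlib; the surrounding file and section-table handling is left out:

    #include <elf.h>      /* Elf32_Shdr, Elf32_Chdr, SHF_COMPRESSED, ELFCOMPRESS_ZLIB */
    #include <zlib.h>     /* uncompress() */
    #include <stdlib.h>

    /* Inflate one section if its header carries the SHF_COMPRESSED flag.
       'raw' points at the section contents as read from the file.
       Returns a freshly malloc'd buffer of ch_size bytes, or NULL. */
    static void *inflate_section(const Elf32_Shdr *shdr, const void *raw)
    {
        if (!(shdr->sh_flags & SHF_COMPRESSED))
            return NULL;                      /* not compressed: use 'raw' as-is */

        const Elf32_Chdr *chdr = raw;         /* compression header precedes the data */
        if (chdr->ch_type != ELFCOMPRESS_ZLIB)
            return NULL;                      /* only the zlib format handled here */

        uLongf out_size = chdr->ch_size;      /* uncompressed size from the header */
        Bytef *out = malloc(out_size);
        if (!out)
            return NULL;

        const Bytef *payload = (const Bytef *)raw + sizeof(Elf32_Chdr);
        if (uncompress(out, &out_size,
                       payload, shdr->sh_size - sizeof(Elf32_Chdr)) != Z_OK) {
            free(out);
            return NULL;
        }
        return out;
    }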
Jon Abbott (1421) 2652 posts |
That would be the route to take, dynamically loading file pages as they’re touched instead of bulk loading large chunks of code into memory. This could probably be done with DAs as-is, but it might be better if the OS handled it by trapping OS_File calls that try to load files into a DA flagged for the purpose. |
Jeffrey Lee (213) 6048 posts |
The approach that’s usually used is to allow the OS to decide where in memory the file gets mapped. The program can make some high-level choices (e.g. for RISC OS there’d be the option between mapping into global address space or into app space), but then the OS handles the rest. That way the OS can load it on a page boundary, allowing the mapping to be shared between multiple processes if they all request the same file (potentially at different addresses), and allowing direct integration with the filesystem cache. Bonus points if the binaries are fully read-only and position-independent; otherwise copy-on-write is needed to allow the loader to patch them. |
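For comparison, this is roughly the model on Unix: the process only chooses the protection and sharing flags, while the kernel picks the address, keeps the mapping page-aligned and shares the backing pages between processes via the page cache. RISC OS has no equivalent call today; this POSIX sketch is just an illustration of what Jeffrey describes:

    #include <stddef.h>    /* size_t */
    #include <fcntl.h>     /* open */
    #include <sys/mman.h>  /* mmap */
    #include <sys/stat.h>  /* fstat */
    #include <unistd.h>    /* close */

    /* Map a shared library read-only/executable; the kernel picks the address
       (first argument NULL) and pages the file in on demand as it is touched. */
    void *map_library(const char *path, size_t *len_out)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;

        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return NULL; }

        void *base = mmap(NULL, st.st_size, PROT_READ | PROT_EXEC,
                          MAP_PRIVATE, fd, 0);
        close(fd);                     /* the mapping stays valid after close */
        if (base == MAP_FAILED)
            return NULL;

        *len_out = st.st_size;
        return base;
    }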
Steffen Huber (91) 1958 posts |
While memory-mapped file support would obviously be a much better solution (along with some FS caching), I think that Andrew’s pragmatic compression approach would still go a long way towards a more usable system for applications which use large amounts of shared libs. Integrating simple zlib capability in SOLoader, along with a simple “provide compressed alternatives for all shared libs” application, would probably be 100 times easier to implement. Did anyone ever profile the loading process of the ELF stuff when a large app starts? I was always surprised about the large gain when putting things on RAM disc – after all, in benchmarks, RAM disc is not THAT fast. It reminded me of effects usually seen in situations with many small files, where access times are the deciding factor while block transfer speed is not that important. |
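As a rough illustration of how little code the pragmatic route needs, here is a sketch of a loader helper built on plain zlib – the function name and the “known uncompressed size” parameter are invented for the example, not anything SOLoader actually provides:

    #include <zlib.h>     /* gzopen, gzread, gzclose */
    #include <stdlib.h>

    /* Hypothetical helper for an SOLoader-style hook: read a gzip-compressed
       shared library (e.g. a "/gz" alternative alongside the original) straight
       into a buffer of known uncompressed size, which would then be handed to
       the existing loader path. */
    unsigned char *load_compressed_so(const char *path, unsigned long orig_size)
    {
        unsigned char *buf = malloc(orig_size);
        if (!buf)
            return NULL;

        gzFile in = gzopen(path, "rb");
        if (!in) { free(buf); return NULL; }

        int got = gzread(in, buf, (unsigned)orig_size);   /* inflates as it reads */
        gzclose(in);

        if (got < 0 || (unsigned long)got != orig_size) {
            free(buf);
            return NULL;
        }
        return buf;
    }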
Andrew Rawnsley (492) 1445 posts |
If zlib is too computationally expensive, then even a simple run length encoding (RLE) would probably be better than nothing. Jeffrey’s solution is unquestionably better, but sounds like a major undertaking (albeit worthwhile). As Steffen says, I just figured that reducing the amount of stuff to load would be something that could be done in weeks rather than months. Copying to RAM disc also works, but the time it takes to copy to RAM probably isn’t much better than the time it takes to load the libraries in the first place. It is just very telling how much faster the browsers are to start on second run, when the shared libraries are already loaded in memory. |
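For the sake of argument, a minimal decoder for the kind of byte-level RLE meant here – the (count, byte) pair format is purely illustrative:

    #include <stddef.h>

    /* Decode (count, byte) pairs: 'in' holds in_len/2 pairs, each expanding to
       'count' copies of 'byte'. Returns the number of bytes written to 'out',
       or 0 if 'out' is too small. The encoding is purely illustrative. */
    size_t rle_decode(const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t out_cap)
    {
        size_t written = 0;
        for (size_t i = 0; i + 1 < in_len; i += 2) {
            unsigned count = in[i];
            unsigned char value = in[i + 1];
            if (written + count > out_cap)
                return 0;
            for (unsigned j = 0; j < count; j++)
                out[written++] = value;
        }
        return written;
    }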
Steffen Huber (91) 1958 posts |
I am not an expert in data compression, but zlib is used pretty much everywhere and is a lot faster at decompressing than compressing. Using a single core on modern PCs, it reaches 500 MB/s. I think that LZ4 is better optimized for speedy decompression; I found reported speeds of 4500 MB/s. So on a single-core ARM at 1 GHz with reasonably fast RAM, I would expect plain zlib to reach at least 50 MB/s. There seems to be quite some activity speeding up zlib on ARM, because it matters for HTTP decompression and PNG decoding. See e.g. here for “Chromium zlib”: https://events.linuxfoundation.org/wp-content/uploads/2017/11/Optimizing-Zlib-on-ARM-The-Power-of-NEON-Adenilson-Cavalcanti-ARM.pdf |
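If LZ4 were chosen instead, a round trip is two calls in the reference library (this assumes liblz4 is available on the target; the buffer sizes here are just for the demo):

    #include <lz4.h>      /* LZ4_compress_default, LZ4_decompress_safe */
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        const char src[] = "the same bytes over and over, the same bytes over and over";
        char packed[128], unpacked[128];

        /* Compress: returns the compressed size, or 0 on failure. */
        int packed_len = LZ4_compress_default(src, packed, sizeof src, sizeof packed);

        /* Decompress: needs the compressed length and a big-enough output buffer;
           returns the original size, or a negative value on corrupt input. */
        int out_len = LZ4_decompress_safe(packed, unpacked, packed_len, sizeof unpacked);

        printf("%d -> %d -> %d bytes, match=%d\n",
               (int)sizeof src, packed_len, out_len,
               out_len == (int)sizeof src && memcmp(src, unpacked, out_len) == 0);
        return 0;
    }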
nemo (145) 2569 posts |
There’s already a module for memory-mapping files isn’t there? |
Andrew Rawnsley (492) 1445 posts |
I think you’d want to be targeting a 100-150 MB/s decompression rate to avoid a noticeable performance decrease on current SATA systems, but in a sense the algorithm is slightly immaterial in that it can be tried/tested if the idea is deemed worthwhile. |
Steffen Huber (91) 1958 posts |
If you have a fast SATA system, why would you want compression in the first place? It is a solution for slow I/O machines. For a good judgement, you’d also need to investigate the compression ratio of the various compression schemes. And of course have a look at faster vs. slower CPUs – I would expect at least a factor-of-two difference between Cortex-A9 and Cortex-A15 machines, and it will get bigger once the RPi 4 is properly supported. |
Stuart Painting (5389) 714 posts |
It’s a matter of perspective. The user of a fast machine wouldn’t want compression, but the user of a slow machine would clearly benefit. The application developer has to choose whether the performance hit on faster machines outweighs the improvement on slower machines, and whether it would be necessary to distribute two versions of the application as a result. |
Andrew Rawnsley (492) 1445 posts |
Yes, as Stuart says. Essentially, if decompression is fast enough, it won’t incur a noticeable impact on fast systems, but it would seriously help slower ones. I still run my modern programs through !Squeeze (mostly by habit) because I figure it can’t hurt on SD systems and I know the decompress time can’t be meaningfully measured given the speed of modern CPUs, so effectively there’s no downside. The key is finding a good balance. |
Rick Murray (539) 13861 posts |
I think the crux of the problem may lie here – is all of that code necessary? Could the libraries be split into core components and “extras” that can be dynamically loaded if necessary?
I gave up on that as I found it was easier for debugging to look at an uncompressed executable in Zap. My apps are small enough that I doubt there’s much practical difference between load times for compressed vs uncompressed.
It’s compressed, now it’s SLOW. Come on, you know somebody is going to say that. :-) I wonder, just off the top of my head… if it might work as a stop-gap to be able to save and reload a compressed version of the entire SOManager Dynamic Area? So once the system has been set up once (and nothing updated), the whole DA can be loaded from a previously compressed file copy? |
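A very rough sketch of that stop-gap, using zlib to write a snapshot of an already-populated area to disc – the base address and size parameters stand in for whatever the real code would obtain from SOManager, which is glossed over here:

    #include <zlib.h>     /* compressBound, compress2 */
    #include <stdio.h>
    #include <stdlib.h>

    /* Dump a snapshot of an already-populated dynamic area to disc, compressed
       with zlib. 'base'/'size' stand in for whatever the real code would read
       back from SOManager - purely a sketch, only valid while nothing has been
       updated since the snapshot was taken. */
    int save_da_snapshot(const char *path, const void *base, unsigned long size)
    {
        uLongf packed_len = compressBound(size);
        Bytef *packed = malloc(packed_len);
        if (!packed)
            return -1;

        if (compress2(packed, &packed_len, base, size, Z_BEST_SPEED) != Z_OK) {
            free(packed);
            return -1;
        }

        FILE *out = fopen(path, "wb");
        if (!out) { free(packed); return -1; }

        /* Store the uncompressed size first so the loader knows how much to map. */
        int ok = fwrite(&size, sizeof size, 1, out) == 1 &&
                 fwrite(packed, 1, packed_len, out) == packed_len;
        fclose(out);
        free(packed);
        return ok ? 0 : -1;
    }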
nemo (145) 2569 posts |
Not just useful for debugging…
If your machine is slow enough to notice the decompression speed, it’s not going to have hyper-fast disc IO. So compression is always a win, and of course things can be decompressed before inspection. |