GCC 4.7.4 RISC OS release 2
Theo Markettos (89) 919 posts |
The RISC OS GCCSDK developers are pleased to announce the GNU Compiler Collection version 4.7.4 RISC OS release 2, which is now available for download. The main feature of this GCC release is Vector Floating Point (VFP) support, suitable for use on ARMv6 and ARMv7 platforms. This behaves similarly to VFP on other GCC ARM platforms. See the documentation inside !GCC for more details on how to build programs with VFP support. Other changes include:
The release can be downloaded from the riscos.info repositories using !PackMan, which will fetch and install all the pieces for you, or you can follow the manual installation instructions. For more information see the GCC website. Our thanks go to all those who have contributed to this release. Theo Markettos |
David Feugey (2125) 2709 posts |
Fan-tas-tic! |
Jeffrey Lee (213) 6048 posts |
Thanks for putting this release together! Now I can go back to wishing I had the time to write software instead of working on the OS all the time :-P |
Theo Markettos (89) 919 posts |
Thanks Jeffrey for a big chunk of the VFP work ;-) |
Kuemmel (439) 384 posts |
…just to get this right: does this mean GCC is now capable of turning floating point C code into 1) “normal” VFP, 2) “vector” VFP and 3) NEON instructions? Or were 1) and 3) already there, and only 2) has been added? …any speed-ups measured on real-world apps? |
Theo Markettos (89) 919 posts |
I’d ask on the mailing list as I wasn’t involved in this work, but my understanding is that upstream GCC can generate all kinds of floating point code. The main work is in UnixLib to handle context switching (eg in the case of signal handlers, pthreads) and in dynamic linking. I assume all the available context is being saved, but I would check that. |
Jeffrey Lee (213) 6048 posts |
GCC will do 1) and 3), but I don’t think it will do 2). Remember that VFP vector mode is deprecated, and isn’t supported in hardware by most/all modern chips – in terms of RISC OS it’s only Cortex-A8 and ARM11 which support it in hardware, for Cortex-A7, A9, A15 it’ll be emulated by VFPSupport. Of course there’s nothing stopping you from writing bits of assembler that use vector mode, as long as you’re careful to save and restore the vector length on entry & exit to your routine. You can also use VFPSupport_Features 2 to check if the vector mode is supported by hardware or not. |
Kuemmel (439) 384 posts |
Hm… maybe someone more into C and GCC could compile a simple test like … and see whether the executable ends up using NEON or VFP? I’ve no clue how good the compiler is at “seeing” the opportunity to use NEON… my guess is that it won’t spot it… or that one has to prepare the C code more cleverly so that it can see it better…
|
Jeffrey Lee (213) 6048 posts |
GCC supports automatic vectorisation (via -O3 or -ftree-vectorize). However, something I didn’t realise until I tried it just now is that by default it won’t do automatic vectorisation of floats, because the NEON FP unit isn’t 100% IEEE compatible. To get it to vectorise floats you also need to specify -funsafe-math-optimizations.

Using this test program:

    typedef struct {
      float a[10][4];
    } stuff;

    stuff func(stuff a, stuff b)
    {
      stuff out;
      int i;
      for(i=0;i<10;i++)
      {
        out.a[i][0] = a.a[i][0] * b.a[i][0];
        out.a[i][1] = a.a[i][1] * b.a[i][1];
        out.a[i][2] = a.a[i][2] * b.a[i][2];
        out.a[i][3] = a.a[i][3] * b.a[i][3];
      }
      return out;
    }

We end up with the following:
|
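For anyone wanting to repeat the experiment with something smaller, here is a minimal sketch of a loop the auto-vectoriser should have an easy time with. It is not from the thread: the name mul_arrays, the pointer-based signature and the use of restrict are all assumptions made for illustration. The command line in the comment uses the upstream GCC spelling -mfpu=neon, which is also an assumption for the RISC OS port; the documentation inside !GCC is the place to check how VFP/NEON code generation is actually selected there.

```c
#include <stddef.h>

/*
 * Hypothetical test function (not from the thread): element-wise
 * multiply of two float arrays.
 *
 * Assumed way to inspect the generated code with an upstream-style GCC:
 *   gcc -std=c99 -O3 -mfpu=neon -funsafe-math-optimizations -S mul.c
 * -O3 enables -ftree-vectorize, and -funsafe-math-optimizations is what
 * allows floats to be vectorised with NEON.
 *
 * restrict promises the compiler that the arrays do not overlap,
 * which removes one obstacle to vectorisation.
 */
void mul_arrays(float *restrict out, const float *restrict a,
                const float *restrict b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}
```

Looking for vmul.f32 on q registers in the resulting .s file is a quick way to tell whether NEON was used; vmul.f32 on s registers would indicate scalar VFP instead.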
Kuemmel (439) 384 posts |
…ah, okay, I forgot about the non-IEEE thingy… could you post what the disassembly output of your example looks like? |
Jeffrey Lee (213) 6048 posts |
It’s… interesting. It’s unrolled the loop, but it’s also decided to process the entries in a strange order, so it ends up preventing itself from using VLDM/VSTM in a lot of cases.

            .file   "test.c"
            .text
            .align  2
            .global func
            .ascii  "func\000"
            .align  2
            .word   4278190088
            .type   func, %function
    func:
            @ args = 320, pretend = 12, frame = 160, outgoing = 0
            @ frame_needed = 1, uses_anonymous_args = 0
            mov     ip, sp
            sub     sp, sp, #12
            stmfd   sp!, {r4, fp, ip, lr, pc}
            sub     fp, ip, #16
            cmp     sp, sl
            bllt    __rt_stkovf_split_small
            sub     sp, sp, #160
            mov     r4, r0
            stmib   fp, {r1, r2, r3}
            vldr    d0, [fp, #164]
            vldr    d1, [fp, #172]
            mov     r1, sp
            vldr    d2, [fp, #4]
            vldr    d3, [fp, #12]
            mov     r2, #160
            vmul.f32        q2, q0, q1
            vldr    d16, [fp, #180]
            vldr    d17, [fp, #188]
            vldr    d18, [fp, #20]
            vldr    d19, [fp, #28]
            vldr    d0, [fp, #196]
            vldr    d1, [fp, #204]
            vmul.f32        q3, q8, q9
            vldr    d2, [fp, #36]
            vldr    d3, [fp, #44]
            vldr    d16, [fp, #212]
            vldr    d17, [fp, #220]
            vmul.f32        q15, q0, q1
            vldr    d18, [fp, #52]
            vldr    d19, [fp, #60]
            vldr    d0, [fp, #228]
            vldr    d1, [fp, #236]
            vldr    d2, [fp, #68]
            vldr    d3, [fp, #76]
            vmul.f32        q14, q8, q9
            vstmia  sp, {d4-d5}
            vldr    d16, [fp, #244]
            vldr    d17, [fp, #252]
            vldr    d18, [fp, #84]
            vldr    d19, [fp, #92]
            vmul.f32        q13, q0, q1
            vstr    d6, [sp, #16]
            vstr    d7, [sp, #24]
            vldr    d0, [fp, #260]
            vldr    d1, [fp, #268]
            vldr    d2, [fp, #100]
            vldr    d3, [fp, #108]
            vmul.f32        q12, q8, q9
            vstr    d30, [sp, #32]
            vstr    d31, [sp, #40]
            vldr    d16, [fp, #276]
            vldr    d17, [fp, #284]
            vldr    d18, [fp, #116]
            vldr    d19, [fp, #124]
            vmul.f32        q11, q0, q1
            vstr    d28, [sp, #48]
            vstr    d29, [sp, #56]
            vldr    d0, [fp, #292]
            vldr    d1, [fp, #300]
            vldr    d2, [fp, #132]
            vldr    d3, [fp, #140]
            vmul.f32        q10, q8, q9
            vstr    d26, [sp, #64]
            vstr    d27, [sp, #72]
            vstr    d24, [sp, #80]
            vstr    d25, [sp, #88]
            vstr    d22, [sp, #96]
            vstr    d23, [sp, #104]
            vmul.f32        q9, q0, q1
            vstr    d20, [sp, #112]
            vstr    d21, [sp, #120]
            vldr    d0, [fp, #148]
            vldr    d1, [fp, #156]
            vldr    d2, [fp, #308]
            vldr    d3, [fp, #316]
            vstr    d18, [sp, #128]
            vstr    d19, [sp, #136]
            vmul.f32        q8, q0, q1
            vstr    d16, [sp, #144]
            vstr    d17, [sp, #152]
            bl      memcpy
            mov     r0, r4
            ldmea   fp, {r4, fp, sp, pc}
            .size   func, .-func
            .ident  "GCC: (GCCSDK GCC 4.7.4 Release 2) 4.7.4"

Interestingly, I get similar results when using the NEON intrinsics. But there is one notable difference – the version which uses the intrinsics doesn’t have that silly memcpy at the end! |
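The intrinsics version itself wasn't posted, so the following is only a sketch of what such a variant might look like, not the code that was actually tested. The function name func_neon and the pointer-based signature are assumptions; the intrinsics vld1q_f32, vmulq_f32 and vst1q_f32 come from the standard arm_neon.h header. A plain C fallback is included so the same file still builds when NEON isn't enabled.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

typedef struct {
    float a[10][4];
} stuff;

/*
 * Hypothetical intrinsics variant of func() from the thread.
 * The result is written through a pointer rather than returned by
 * value (an assumption, not necessarily how the original was written).
 */
void func_neon(stuff *out, const stuff *a, const stuff *b)
{
    int i;
    for (i = 0; i < 10; i++) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        /* Load one row of four floats from each input, multiply, store. */
        float32x4_t va = vld1q_f32(a->a[i]);
        float32x4_t vb = vld1q_f32(b->a[i]);
        vst1q_f32(out->a[i], vmulq_f32(va, vb));
#else
        /* Scalar fallback for builds without NEON. */
        int j;
        for (j = 0; j < 4; j++)
            out->a[i][j] = a->a[i][j] * b->a[i][j];
#endif
    }
}
```

Each row of four floats maps onto one 128-bit q register, so the loop body is a single vector multiply per row. Writing through a pointer also means there is no large struct to copy back to the caller, which may be one way to avoid a trailing memcpy like the one in the listing above.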