RISC OS Open: Forum: GCC 4.7.4 RISC OS release 2

Sep 26, 2015 6:18pm

Theo Markettos (89) 919 posts

The RISC OS GCCSDK developers are pleased to announce the GNU Compiler Collection version 4.7.4 RISC OS release 2, which is now available for download.

The main feature of this GCC release is Vector Floating Point (VFP) support, suitable for use on ARMv6 and ARMv7 platforms. This behaves similarly to VFP on other GCC ARM platforms. See the documentation inside !GCC for more details on how to build programs with VFP support.

Other changes include:

UnixLib: Add support for __cxa_atexit (r6754,r6778,r6916).
UnixLib: Add support for clang/llvm (r6755).
UnixLib: When the size of an mmapped file is not a multiple of the page size, zero the remainder (r6772).
UnixLib: Add VFP support (r6795,r6867).
UnixLib: Add support for naming pthreads.
UnixLib: select() fix; only report a file descriptor as bad if it was given in one of the sets (r6862).
GCC: VFP fixes (r6869).
GCC: Make -mno-unaligned-access the default (r6796).
GCC: Fix handling of section anchors in module code (r6866,r6871).
GCC: Fix potential stack frame corruption (r6897).
Dynamic Linker: Fix handling of ‘:’ in path variables (r6870).
Dynamic Linker: Add VFP support (r6797).
SharedUnixLib: Add VFP support (v1.13, r6795).
SharedUnixLib: Fix memory leak (v1.14, r6905).

The release can be downloaded from the riscos.info repositories using !PackMan, which will fetch and install all the pieces for you, or you can follow the manual installation instructions. For more information, see the GCC website.

Our thanks go to all those who have contributed to this release.

Theo Markettos
on behalf of the GCCSDK developers

Sep 26, 2015 7:06pm

David Feugey (2125) 2709 posts

Fan-tas-tic!

Sep 26, 2015 9:24pm

Jeffrey Lee (213) 6048 posts

Thanks for putting this release together!

Now I can go back to wishing I had the time to write software instead of working on the OS all the time :-P

Sep 27, 2015 1:03pm

Theo Markettos (89) 919 posts

Thanks Jeffrey for a big chunk of the VFP work ;-)

Sep 27, 2015 9:07pm

Kuemmel (439) 384 posts

…just to get this right: does this mean that GCC is now capable of transforming floating point C code into 1) “normal” VFP, 2) “vector” VFP and 3) NEON instructions? Or were 1) and 3) already there and only 2) has been added?

…any speed-ups measured on real-world apps?

Sep 29, 2015 8:41pm

Theo Markettos (89) 919 posts

I’d ask on the mailing list as I wasn’t involved in this work, but my understanding is that upstream GCC can generate all kinds of floating point code. The main work is in UnixLib to handle context switching (eg in the case of signal handlers, pthreads) and in dynamic linking. I assume all the available context is being saved, but I would check that.

Sep 29, 2015 9:56pm

Jeffrey Lee (213) 6048 posts

…just to get this right: does this mean that GCC is now capable of transforming floating point C code into 1) “normal” VFP, 2) “vector” VFP and 3) NEON instructions? Or were 1) and 3) already there and only 2) has been added?

GCC will do 1) and 3), but I don’t think it will do 2). Remember that VFP vector mode is deprecated, and isn’t supported in hardware by most/all modern chips – in terms of RISC OS it’s only Cortex-A8 and ARM11 which support it in hardware. For Cortex-A7, A9 and A15 it’ll be emulated by VFPSupport.

Of course there’s nothing stopping you from writing bits of assembler that use vector mode, as long as you’re careful to save and restore the vector length on entry & exit to your routine. You can also use VFPSupport_Features 2 to check if the vector mode is supported by hardware or not.
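The vector length mentioned above lives in the LEN field of the FPSCR (bits 16 to 18, stored as length minus one), next to the STRIDE field (bits 20 to 21). Below is a minimal sketch in C of the bit manipulation a routine would need, assuming that architectural layout. The helper names are hypothetical, and the actual register access (VMRS/VMSR in assembler) is deliberately left out so the code compiles on any host.

```c
#include <stdint.h>

/* FPSCR field positions, assuming the architectural VFP layout. */
#define FPSCR_LEN_SHIFT    16u
#define FPSCR_LEN_MASK     (0x7u << FPSCR_LEN_SHIFT)
#define FPSCR_STRIDE_SHIFT 20u
#define FPSCR_STRIDE_MASK  (0x3u << FPSCR_STRIDE_SHIFT)

/* Hypothetical helper: vector length encoded in an FPSCR value.
 * The LEN field holds (length - 1), so 0 means scalar operation. */
static unsigned fpscr_vector_length(uint32_t fpscr)
{
    return ((fpscr & FPSCR_LEN_MASK) >> FPSCR_LEN_SHIFT) + 1u;
}

/* Hypothetical helper: copy of fpscr with the vector length set to
 * `length` (1..8) and the stride reset to 1. All other bits are kept. */
static uint32_t fpscr_with_vector_length(uint32_t fpscr, unsigned length)
{
    fpscr &= ~(FPSCR_LEN_MASK | FPSCR_STRIDE_MASK);
    return fpscr | (((length - 1u) & 0x7u) << FPSCR_LEN_SHIFT);
}
```

A routine using vector mode would read the FPSCR on entry and keep that value, write fpscr_with_vector_length(saved, n) for its own work, and write the saved value back before returning.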

Oct 6, 2015 9:21pm

Kuemmel (439) 384 posts

Hm…maybe someone more into C and the usage of GCC could compile a simple test like

float a[10][4], b[10][4], c[10][4];
for (int i = 0; i < 10; i++)
{
  a[i][0] = b[i][0] + c[i][0];
  a[i][1] = b[i][1] + c[i][1];
  a[i][2] = b[i][2] + c[i][2];
  a[i][3] = b[i][3] + c[i][3];
}

…and see if the output is done with NEON or VFP in the executable? I’ve no clue how good the compiler is at “seeing” the option to use NEON…my guess is it won’t get it…or one has to do some more clever preparation in C that allows it to see it better…

Oct 6, 2015 10:03pm

Jeffrey Lee (213) 6048 posts

GCC supports automatic vectorisation (via -O3 or -ftree-vectorize). However, something I didn’t realise until I tried it just now is that by default it won’t do automatic vectorisation of floats, because the NEON FP unit isn’t 100% IEEE compatible. To get it to vectorise floats you also need to specify -funsafe-math-optimizations.

Using this test program:

typedef struct
{
  float a[10][4];
} stuff;

stuff func(stuff a, stuff b)
{
  stuff out;
  int i;
  for(i=0;i<10;i++)
  {
    out.a[i][0] = a.a[i][0] * b.a[i][0];
    out.a[i][1] = a.a[i][1] * b.a[i][1];
    out.a[i][2] = a.a[i][2] * b.a[i][2];
    out.a[i][3] = a.a[i][3] * b.a[i][3];
  }
  return out;
}

We end up with the following:

gcc -O3: Uses softfloat function calls
gcc -O3 -mfpu=vfp: Uses scalar VFP
gcc -O3 -mfpu=neon: Uses scalar VFP
gcc -O3 -mfpu=neon -funsafe-math-optimizations: Uses NEON quadword ops
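For comparison, here is a sketch of the same element-wise multiply written the way auto-vectorisers usually find easiest: a flat loop over non-overlapping arrays passed by pointer. The function name and signature are made up for illustration, and I have not checked what GCC 4.7.4 generates for it; the flags to try are the ones listed above.

```c
#include <stddef.h>

/* Element-wise product: out[i] = a[i] * b[i] for i in [0, n).
 * `restrict` promises the compiler that the three arrays do not overlap,
 * which removes one common obstacle to vectorisation.
 * Note: `restrict` is a C99 keyword, so older GCCs need -std=c99
 * (or the __restrict spelling). */
void mul_arrays(float *restrict out, const float *restrict a,
                const float *restrict b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}
```

Called as mul_arrays(&out.a[0][0], &a.a[0][0], &b.a[0][0], 40) it covers the same 10x4 data as the struct version, without passing or returning the structs by value.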

Oct 6, 2015 10:09pm

Kuemmel (439) 384 posts

…ah, okay, I forgot about the non-IEEE thingy…could you post what the disassembly output of your example looks like?

Oct 6, 2015 10:54pm

Jeffrey Lee (213) 6048 posts

It’s… interesting. It’s unrolled the loop, but it’s also decided to process the entries in a strange order, so it ends up preventing itself from using VLDM/VSTM in a lot of cases.

    .file   "test.c"
    .text
    .align  2
    .global func
    .ascii  "func\000"
    .align  2
    .word   4278190088
    .type   func, %function
func:
    @ args = 320, pretend = 12, frame = 160, outgoing = 0
    @ frame_needed = 1, uses_anonymous_args = 0
    mov ip, sp
    sub sp, sp, #12
    stmfd   sp!, {r4, fp, ip, lr, pc}
    sub fp, ip, #16
    cmp sp, sl
    bllt    __rt_stkovf_split_small
    sub sp, sp, #160
    mov r4, r0
    stmib   fp, {r1, r2, r3}
    vldr    d0, [fp, #164]
    vldr    d1, [fp, #172]
    mov r1, sp
    vldr    d2, [fp, #4]
    vldr    d3, [fp, #12]
    mov r2, #160
    vmul.f32    q2, q0, q1
    vldr    d16, [fp, #180]
    vldr    d17, [fp, #188]
    vldr    d18, [fp, #20]
    vldr    d19, [fp, #28]
    vldr    d0, [fp, #196]
    vldr    d1, [fp, #204]
    vmul.f32    q3, q8, q9
    vldr    d2, [fp, #36]
    vldr    d3, [fp, #44]
    vldr    d16, [fp, #212]
    vldr    d17, [fp, #220]
    vmul.f32    q15, q0, q1
    vldr    d18, [fp, #52]
    vldr    d19, [fp, #60]
    vldr    d0, [fp, #228]
    vldr    d1, [fp, #236]
    vldr    d2, [fp, #68]
    vldr    d3, [fp, #76]
    vmul.f32    q14, q8, q9
    vstmia  sp, {d4-d5}
    vldr    d16, [fp, #244]
    vldr    d17, [fp, #252]
    vldr    d18, [fp, #84]
    vldr    d19, [fp, #92]
    vmul.f32    q13, q0, q1
    vstr    d6, [sp, #16]
    vstr    d7, [sp, #24]
    vldr    d0, [fp, #260]
    vldr    d1, [fp, #268]
    vldr    d2, [fp, #100]
    vldr    d3, [fp, #108]
    vmul.f32    q12, q8, q9
    vstr    d30, [sp, #32]
    vstr    d31, [sp, #40]
    vldr    d16, [fp, #276]
    vldr    d17, [fp, #284]
    vldr    d18, [fp, #116]
    vldr    d19, [fp, #124]
    vmul.f32    q11, q0, q1
    vstr    d28, [sp, #48]
    vstr    d29, [sp, #56]
    vldr    d0, [fp, #292]
    vldr    d1, [fp, #300]
    vldr    d2, [fp, #132]
    vldr    d3, [fp, #140]
    vmul.f32    q10, q8, q9
    vstr    d26, [sp, #64]
    vstr    d27, [sp, #72]
    vstr    d24, [sp, #80]
    vstr    d25, [sp, #88]
    vstr    d22, [sp, #96]
    vstr    d23, [sp, #104]
    vmul.f32    q9, q0, q1
    vstr    d20, [sp, #112]
    vstr    d21, [sp, #120]
    vldr    d0, [fp, #148]
    vldr    d1, [fp, #156]
    vldr    d2, [fp, #308]
    vldr    d3, [fp, #316]
    vstr    d18, [sp, #128]
    vstr    d19, [sp, #136]
    vmul.f32    q8, q0, q1
    vstr    d16, [sp, #144]
    vstr    d17, [sp, #152]
    bl  memcpy
    mov r0, r4
    ldmea   fp, {r4, fp, sp, pc}
    .size   func, .-func
    .ident  "GCC: (GCCSDK GCC 4.7.4 Release 2) 4.7.4"

Interestingly, I get similar results when using the NEON intrinsics. But there is one notable difference – the version which uses the intrinsics doesn’t have that silly memcpy at the end!
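The intrinsics version was not posted in the thread, so the following is only a sketch of what such a routine might look like, not the code that was actually tested. It uses the standard arm_neon.h intrinsics vld1q_f32, vmulq_f32 and vst1q_f32; the function name is hypothetical, and a scalar fallback is included so the file also builds where NEON is not enabled.

```c
#include <stddef.h>

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Multiply `rows` rows of four floats each:
 * out[4*r + c] = a[4*r + c] * b[4*r + c]. */
void mul_rows4(float *out, const float *a, const float *b, size_t rows)
{
    size_t r;
    for (r = 0; r < rows; r++) {
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
        /* One quadword load per operand, one multiply, one store. */
        float32x4_t va = vld1q_f32(a + 4 * r);
        float32x4_t vb = vld1q_f32(b + 4 * r);
        vst1q_f32(out + 4 * r, vmulq_f32(va, vb));
#else
        /* Scalar fallback for builds without NEON. */
        size_t c;
        for (c = 0; c < 4; c++)
            out[4 * r + c] = a[4 * r + c] * b[4 * r + c];
#endif
    }
}
```

Because the result is written through a pointer rather than returned as a struct, there is no temporary to copy back, which may be why an intrinsics version can avoid the final memcpy.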
