RISC OS Open: Forum: Sockets limit

Nov 8, 2015 10:19pm

Dave Higton (1515) 3534 posts

Why, in 2015, does RISC OS have such a miserably low limit to the number of open sockets?

There is a NetSurf issue currently being looked at where one contributing factor may be the sockets limit. (I say “may” because we haven’t diagnosed it fully yet.)

Nov 9, 2015 8:12am

Jon Abbott (1421) 2651 posts

What’s the issue in NetSurf, is there a link?

I ask, as I repeatedly get “failed to connect” issues when using FTPc and have to reboot to fix. I’ve long suspected sockets aren’t being released or are in a permanent locked state, but never looked into it.

Nov 9, 2015 9:44am

Colin (478) 2433 posts

Are your problems server-specific? Does changing to a passive connection make any difference?

Nov 9, 2015 12:37pm

Steve Pampling (1551) 8172 posts

> I’ve long suspected sockets aren’t being released or are in a permanent locked state, but never looked into it.

That’s pretty much it. You can generate the issue by crashing various items that are using sockets, as well as by trying to connect to what these days is considered a moderate number of endpoints simultaneously.
The problem has been around for years and afflicts RO4.02.

Basically the RO IP stack needs some serious attention.

Nov 9, 2015 1:07pm

Rick Murray (539) 13851 posts

> You can generate the issue by crashing various items that are using sockets

Had that in the early days of my server. Specifying SO_REUSEADDR made the lengthy timeout problems cease.

Nov 23, 2015 1:53pm

Jeffrey Lee (213) 6048 posts

Having had a look at a couple of socket-related things over the weekend, I think the problems that RISC OS is facing are two-fold:

1. SO_KEEPALIVE defaults to false. I believe this is in line with most other OSes, but it’s something that programs might overlook. Also, it’s not a magic bullet that will make stuck sockets go away (based on the experiment I did last night, I think it will make some sockets go away, but technically the socket needs to stay around so that the error can be reported the next time the socket is used – so maybe don’t trust what I’m saying here!)
2. Sockets under RISC OS are a global resource – the OS will not close them automatically when a program exits. So if a program crashes and doesn’t have an exit handler which closes all its sockets, or even if the program exits normally and forgets to clean some sockets up, those sockets will be left in the system. And because the socket is still present as far as the system is concerned, SO_KEEPALIVE won’t make it go away.
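A minimal sketch of the two habits described above, written against the POSIX/BSD sockets API. This is an assumption: a RISC OS program would use the TCPIPLibs equivalents, whose headers may differ. Every socket gets SO_KEEPALIVE and is recorded in a small table, and an `atexit()` handler closes whatever is left when the program exits normally. The names `tracked_socket`, `tracked_close`, `close_all_sockets` and the 64-entry limit are hypothetical. As noted above, keepalive alone cannot reclaim a socket the program never closes, and a crash may bypass `atexit()` entirely, so crash-safe cleanup still needs the program's own error handling (not shown).

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_TRACKED 64 /* hypothetical per-program limit for this sketch */

static int tracked[MAX_TRACKED];
static int n_tracked = 0;

/* Close every socket this program still owns. Registered with atexit(),
   so it runs on a normal exit; safe to call more than once. */
static void close_all_sockets(void)
{
    while (n_tracked > 0)
        close(tracked[--n_tracked]);
}

/* Create a TCP socket with SO_KEEPALIVE enabled and remember it. */
static int tracked_socket(void)
{
    static int registered = 0;
    if (!registered) {
        atexit(close_all_sockets);
        registered = 1;
    }
    assert(n_tracked <= MAX_TRACKED);
    if (n_tracked == MAX_TRACKED)
        return -1;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int on = 1; /* keepalive is off by default */
    (void)setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
    tracked[n_tracked++] = fd;
    return fd;
}

/* Close one tracked socket and forget it. Returns -1 if it isn't ours. */
static int tracked_close(int fd)
{
    for (int i = 0; i < n_tracked; i++) {
        if (tracked[i] == fd) {
            tracked[i] = tracked[--n_tracked];
            return close(fd);
        }
    }
    return -1;
}
```

During normal operation the program would call `tracked_close()` instead of `close()`, so that the table only ever holds sockets that are genuinely still open.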

If people are running into issues it would be nice to see what the output of ‘inetstat -a’ or ‘inetstat -an’ is – whether there are lots of sockets stuck in some state.

Nov 23, 2015 3:11pm

Rick Murray (539) 13851 posts

Would it be possible to implement *SocketClose <socket> and *SocketList commands?
I ask for the latter because inetstat doesn’t list which socket numbers are in use.

Nov 23, 2015 3:25pm

Jeffrey Lee (213) 6048 posts

Extending inetstat to list the RISC OS socket numbers would seem more sensible than adding a new command for it.

Not sure if we want/need a command built into the OS to close sockets – but I guess adding it to DebugTools would make sense. (Or, you could just drop into BASIC and SYS “Socket_Close”,<socket>)

Nov 23, 2015 4:29pm

Colin (478) 2433 posts

It must be a deliberate policy to hide the socket number; I can see no other reason for

inetstat -A

to list the active socket as an address.

Nov 23, 2015 4:49pm

Dave Higton (1515) 3534 posts

SYS “Socket_Close”, <socket> doesn’t always close a socket. It can linger on.

I’d like there to be a way to definitely kill a socket.

Nov 23, 2015 5:36pm

Colin (478) 2433 posts

It’s a shame !Socketmgr no longer works; it was useful when I was doing socket programming. Setting the socket option SO_LINGER to ‘on’ with a timeout of ‘0’ before closing the socket should abort it.
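A sketch of that suggestion using the BSD `setsockopt()` call, assuming the RISC OS stack honours SO_LINGER the way other BSD-derived stacks do. `abort_close` is a hypothetical helper name. A zero linger timeout makes `close()` discard any unsent data and reset a connected TCP socket instead of going through the normal shutdown, so nothing is left waiting in TIME_WAIT.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Abortive close: SO_LINGER on with a zero timeout, then close().
   On BSD-style stacks a connected TCP socket is reset rather than
   shut down gracefully. Returns close()'s result. */
static int abort_close(int fd)
{
    assert(fd >= 0); /* caller must pass a valid descriptor */
    struct linger lg = { .l_onoff = 1, .l_linger = 0 };
    /* If setsockopt fails we still close, just without the abort semantics. */
    (void)setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);
    return close(fd);
}
```

The cost is that the peer sees a connection reset rather than a clean close, so this suits clearing out stuck or abandoned sockets rather than routine shutdowns.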

Nov 23, 2015 9:18pm

David Feugey (2125) 2709 posts

> Sockets under RISC OS are a global resource – the OS will not close them automatically when a program exits. So if a program crashes and doesn’t have an exit handler which closes all its sockets, or even if the program exits normally and forgets to clean some sockets up, those sockets will be left in the system.

Exactly the problem with WebJames. It hangs, but it’s impossible to launch it again. No socket left…

Nov 24, 2015 6:08pm

Rick Murray (539) 13851 posts

> Exactly the problem with WebJames. It hangs, but it’s impossible to launch it again. No socket left…

What exactly was the message – do you remember? And what else was running on the machine at the time?

WebJames initialises the sockets with this:

    /* start listening */
    listen = ip_create(0);
    if (listen == socket_CLOSED) {
      webjames_writelog(LOGLEVEL_ALWAYS, "Couldn't create socket...");
      continue;
    }

    arg = 1;
    ip_setsocketopt(listen, SOCKETOPT_REUSEADDR, &arg, 4);

    ip_linger(listen, 10);

    if (!ip_bind(listen, 0, serverinfo.servers[i].port)) {
      webjames_writelog(LOGLEVEL_ALWAYS, "Couldn't bind to port %d...",
                        serverinfo.servers[i].port);
      ip_close(listen);
      continue;
    }

The important line is the one that sets the SOCKETOPT_REUSEADDR option. This tells the stack that it can bind the socket to the port even if that port is stuck in “linger” (for example, an earlier connection still in TIME_WAIT).
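For comparison, here is roughly what that start-up sequence looks like with plain BSD socket calls. This assumes the WebJames `ip_create`, `ip_setsocketopt`, `ip_linger`, `ip_bind` and `ip_close` wrappers map onto `socket()`, `setsockopt()`, `bind()` and `close()` in the obvious way; their source isn't shown here. The `open_listener` name and the `listen()` backlog of 5 are illustrative choices, not taken from WebJames.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a TCP listening socket on `port` (host byte order; 0 = any free
   port). Returns the descriptor, or -1 on failure. */
static int open_listener(unsigned short port, int linger_secs)
{
    assert(linger_secs >= 0);

    int fd = socket(AF_INET, SOCK_STREAM, 0);            /* ~ ip_create(0) */
    if (fd < 0)
        return -1;

    /* ~ SOCKETOPT_REUSEADDR: allow bind() even while an old connection
       on this port is still in TIME_WAIT. */
    int on = 1;
    /* ~ ip_linger(listen, 10): wait up to linger_secs for unsent data. */
    struct linger lg = { .l_onoff = 1, .l_linger = linger_secs };

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on) != 0 ||
        setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg) != 0 ||
        bind(fd, (struct sockaddr *)&addr, sizeof addr) != 0 ||  /* ~ ip_bind */
        listen(fd, 5) != 0) {
        close(fd);                                        /* ~ ip_close */
        return -1;
    }
    return fd;
}
```

`open_listener(80, 10)` would correspond to the excerpt above for a server on port 80; passing port 0 asks the stack for any free port, which is handy for experiments.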

My Pi has been ‘up’ for 6 days, 19 hours, and 32 minutes since the last reboot. That’ll be my giving Ovation a whirl on the ZPP build of RISC OS, then swapping back. Following Jeffrey’s ticker chain mods, my server and WebJames have both been solid. I’m sure I see a lot less traffic than you, 1070 requests since the start of the month. Mostly bots looking for stuff that doesn’t exist “POST /tmUnblock.cgi” or “GET hxxp://testp3.pospr.waw.pl/testproxy.php” (‘hxxp’ to defeat Textile) or “GET //phpMyAdmin/scripts/setup.php”, to give some examples.

However, if I now inetstat -a, I can see ten sockets in CLOSE_WAIT state, four in LAST_ACK. Half of these are on port 443 to 91.203.57.172. I have no idea what that is… ;-)
The rest are HTTP connections.
Two sockets (WebJames and my server) are in LISTEN state. And three (49152, netbios-ns, and bootpc) don’t have a state.

That’s after 6+ days.
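In TCP terms, a socket in CLOSE_WAIT is one whose peer has already closed its end while the local program has not yet called `close()`; LAST_ACK is the step after that local close, waiting for the peer's final acknowledgement. A pile of CLOSE_WAIT entries therefore usually points at a program that never closes connections the remote side has finished with. A minimal sketch of the fix, assuming BSD-style `recv()`/`close()` (the helper name `drain_and_close` is hypothetical): treat `recv()` returning 0 as the end of the connection and close straight away.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Read until the peer closes (recv() returns 0) or an error occurs, then
   close our end so the connection does not sit in CLOSE_WAIT.
   Returns the number of bytes read, or -1 on a read error. */
static long drain_and_close(int fd)
{
    assert(fd >= 0);
    char buf[512];
    long total = 0;
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof buf, 0);
        if (n > 0) {
            total += n;
            continue;
        }
        /* n == 0: orderly shutdown by the peer; n < 0: error (EINTR is
           treated as fatal here for brevity). Either way, close our end. */
        close(fd);
        return n == 0 ? total : -1;
    }
}
```

On a TCP socket, `recv()` returning 0 is how the peer's FIN shows up, so the same check applies whatever the socket is connected to.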

Nov 24, 2015 7:23pm

David Feugey (2125) 2709 posts

> What exactly was the message – do you remember? And what else was running on the machine at the time?

No socket left.

The problem was when Google tried to crawl hundreds of resources in a few minutes (every day).

I made a system to reboot the computer, but there was another problem: very slow answers after a few requests. I should try again with a Pi.

Nov 24, 2015 8:51pm

Chris Evans (457) 1614 posts

> Half of these are on port 443 to 91.203.57.172. I have no idea what that is… ;-)

If you mean you don’t know who 91.203.57.172 is, https://www.whatismyip.com/ip-whois-lookup/ reports:
    inetnum: 91.203.56.0 – 91.203.59.255
    netname: ARACHSYS-LTD
    descr: Arachsys Internet Services Ltd
    country: GB
    org: ORG-AA426-RIPE

Nov 24, 2015 8:57pm

Rick Murray (539) 13851 posts

> If you mean you don’t know who 91.203.57.172 is

Thanks Chris.

Given my propensity for sarcasm, I’m surprised you didn’t try:

    *ping www.riscosopen.org
    PING www.riscosopen.org (91.203.57.172): 56 data bytes
    64 bytes from 91.203.57.172: icmp_seq=0 ttl=54 time=173.579 ms

    --- www.riscosopen.org ping statistics ---
    1 packets transmitted, 1 packets received, 0% packet loss
    round-trip min/avg/max = 173.579/173.579/173.579 ms
    *

(^_^)

Nov 24, 2015 9:57pm

Rick Murray (539) 13851 posts

> The problem was when Google tried to crawl hundreds of resources in a few minutes (every day).

First up, your robots.txt file doesn’t make sense. It says:

    User-Agent: *

What’s the point of specifying if you don’t give further instructions? ;-)

Try:

    User-Agent: *
    Crawl-Delay: 60

This requests that the Google crawler wait 60 seconds in between successive visits to your site; so if you have thirty pages, it will take half an hour for Google to index them.
I say Google repeatedly here as not many crawlers bother to obey this.

The problem may be that your current server sends these headers:

    HTTP/1.1 200 OK
    Date: Tue, 24 Nov 2015 21:42:46 GMT
    Server: Apache/2.4.6 (CentOS)
    Last-Modified: Sun, 25 Oct 2015 13:14:35 GMT
    ETag: "988-522eda044a4c0"
    Accept-Ranges: bytes
    Content-Length: 2440
    Connection: close
    Content-Type: text/html; charset=UTF-8

WebJames, on the other hand, sends this:

    HTTP/1.0 200 OK
    Content-Length: 2367
    Content-Type: text/html
    Date: Tue, 24 Nov 2015 22:44:29
    X-Server-Info: WebJames on RISC OS on a RaspberryPi.
    Server: WebJames/0.48

Somebody probably ought to hack WebJames to support Last-Modified and Cache-Control. If it is just static content being served up, src.c.staticcontent → staticcontent_start() looks to be the place to insert these, if you fancy trying it out. (Sorry, David, it’s written in C…)
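A sketch of the two header lines suggested above, using only the standard C library. `format_http_date` and `cache_headers` are hypothetical helpers; how they would be wired into `staticcontent_start()` depends on code not shown here, and the file's modification time would have to come from whatever file-info call WebJames already makes. HTTP dates are always in GMT and end with the literal `GMT` (RFC 7231), which the WebJames `Date:` line above omits.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Format t as an HTTP-date, e.g. "Sun, 25 Oct 2015 13:14:35 GMT".
   %a and %b give English names only in the "C" locale, which is the
   default unless the program calls setlocale(). Returns 0 or -1. */
static int format_http_date(time_t t, char *buf, size_t len)
{
    struct tm *gmt = gmtime(&t);
    if (gmt == NULL)
        return -1;
    return strftime(buf, len, "%a, %d %b %Y %H:%M:%S GMT", gmt) ? 0 : -1;
}

/* Write Last-Modified and Cache-Control header lines (CRLF-terminated)
   for a static file modified at mtime. Returns 0, or -1 if out is too small. */
static int cache_headers(char *out, size_t len, time_t mtime, long max_age)
{
    assert(max_age >= 0);
    char date[40];
    if (format_http_date(mtime, date, sizeof date) != 0)
        return -1;
    int n = snprintf(out, len,
                     "Last-Modified: %s\r\nCache-Control: max-age=%ld\r\n",
                     date, max_age);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

`cache_headers(buf, sizeof buf, mtime, 3600)` would let clients and crawlers reuse a cached copy for an hour; answering `If-Modified-Since` requests with `304 Not Modified` would be a separate change.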

Nov 25, 2015 7:28am

Steve Pampling (1551) 8172 posts

> *ping www.riscosopen.org

Lots of typing

ping -a is the one I tend to use first.

nslookup also

rather like “netsh int isa set state dis” instead of “netsh interface isatap set state disabled”

or “wr” (or “copy run start”) instead of “copy running-config startup-config”

Yes, those are PC and Cisco commands.

Nov 25, 2015 7:36am

David Feugey (2125) 2709 posts

> First up, your robots.txt file doesn’t make sense. It says:
> User-Agent: *

Because some robots are overly cautious these days and don’t crawl your site if there’s no directive.

> This requests that the Google crawler wait 60 seconds in between successive visits to your site; so if you have thirty pages, it will take half an hour for Google to index them.
> I say Google repeatedly here as not many crawlers bother to obey this.

Yep, and Google will assume that your server is slow. And your position will go down on Search.
I’m not on ARM, so I have no problem today when using this directive :)
Anyway, it should just work, even on WebJames.

Anyway, WebJames stopped working correctly at some point and never came back to a normal state afterwards (PandaBoard), even with a new installation. So I just stopped using it. I’ll perhaps use it again later; I need to run some tests.

Nov 25, 2015 12:17pm

Rick Murray (539) 13851 posts

> And your position will go down on Search.

That might be a problem if you’re selling screen protectors for iPhones; but you provide RISC OS resources and information in French. That’s you and…?

Google will mark you down anyway – https://www.riscos.fr/

Nov 25, 2015 8:50pm

David Feugey (2125) 2709 posts

And a lot of old pages that used to rank better than RISC OS FR. Not the case any more.

Nov 25, 2015 11:31pm

Theo Markettos (89) 919 posts

> Why, in 2015, does RISC OS have such a miserably low limit to the number of open sockets?

Because the network stack is based on 4.3BSD which is very, very old.

Nov 26, 2015 8:06am

Dave Higton (1515) 3534 posts

“Based on” allows it to move forward.
