RISC OS Open: Forum: Weak TCP/IP stack

Apr 15, 2015 12:30am

David Feugey (2125) 2709 posts

I have just realised that the TCP/IP stack is very weak with recent ROMs.
My test: WebJames, and a few connections (from me and from the Internet).

After a few page loads, I get this request in WebJames’s log:
192.168.0.254 – - [15/Apr/2015:00:15:05 +0000] "dôådôådôådôådôådôådôåàNâØÐâ â@-éà á" 200 0 "" ""
Ignore the special characters: it’s the famous †óflå†óflå†óflå†…

Then: new connections are impossible (WebJames just doesn’t receive any requests).
After a long time, connections become possible again… or not.

Conclusion: riscos.fr is unavailable, and I have no idea when it’ll be OK again.
My ROM is not the very latest one, but it’s no more than 2 months old.

Apr 15, 2015 5:38am

David Feugey (2125) 2709 posts

A few clarifications:
- I swapped the PandaBoard, the SD card and the network cable: no effect.
- I also replaced WebJames with HTTPServ: no effect. Neither of them receives requests.
- ShareFS works well even when requests are not reaching the web servers.

Apr 15, 2015 6:51am

David Feugey (2125) 2709 posts

Note: if you have an older ROM for the Panda (end of 2014?), please send it to me. Not being able to reboot is a problem, but a broken website is simply not acceptable :)

Apr 15, 2015 6:58am

Steve Pampling (1551) 8170 posts

if you have some older ROM for Panda (end of 2014?),

Pick a date and I will do the nearest if there isn’t an exact match.

Apr 15, 2015 11:09am

David Feugey (2125) 2709 posts

Thanks: let’s first try an end-of-December release…
Address is still temp1267@riscos.fr

Apr 15, 2015 5:58pm

Steve Pampling (1551) 8170 posts

December 22nd 2014

Should be in your mailbox by the time you read this

Apr 15, 2015 6:34pm

David Feugey (2125) 2709 posts

Yep, thanks Steve. I ran a test. It seems to work much better.

The ‘sleeping socket problem’ is gone. Let me explain: with the current ROM, everything works OK, even if I load a lot of pages (of course, there is a limited number of sockets, so massive slowdowns can appear). But if I stop loading pages and wait for the server to drop to 300 MHz, then it’s over: no way to load pages any more. I just have to wait 5-10 minutes for a new period of normal use. With the 20141222 ROM, this problem is clearly not present.

Yeeeeesss.

Apr 15, 2015 6:42pm

Steve Pampling (1551) 8170 posts

With the 20141222 ROM, this problem is clearly not present.

The next task then is to look through the CVS updates for things that have changed that affect the network provision.

With the OMAP boards that would be USB as well as the logical network stack elements. If you narrow down the specific items people stand more chance of identifying the cause.

Apr 15, 2015 7:27pm

David Feugey (2125) 2709 posts

I spoke too soon. The server still has trouble waking up when the clock speed is low (300 MHz), but it seems to wake up in less time (around 30 seconds versus around 10 minutes). I suspect something is wrong with frequency management on the classic PandaBoard. The PandaBoard ES probably works better (it did, but I swapped it for a classic PandaBoard). Perhaps if I force the board to use a higher slow speed…

Apr 15, 2015 7:49pm

David Feugey (2125) 2709 posts

I tried 800 MHz as the slow speed and 1 GHz as the fast speed, to reduce timing issues linked to the change of frequency. Slowdowns are less severe, but seem much more frequent. I’ll check again tomorrow, as the Internet is a bit slow tonight (and so is my server).

Apr 17, 2015 10:36pm

Rick Murray (539) 13840 posts

Then: new connections are impossible (WebJames just doesn’t receive any requests).
After a long time, connections become possible again… or not.

I’ve noticed my server sometimes stops responding to connections and just acts dead. I have built the module with a load of tracing information to see if the problem is my code (probably!) or RISC OS (hope not!) but as is often the case, the problem doesn’t show itself when I’m looking for it!

Making changes to the module means reloading the module which means closing and re-opening the socket, which means everything will work again. Hmm! On the other hand, it does mean that if I identify this as a real problem, then a workaround while investigation is “in progress” could be to close and re-open the socket after a period of time has elapsed?
I would have thought it was a problem with my code, but your description sounds remarkably similar.
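
A minimal sketch of that close-and-re-open workaround, for anyone who wants to experiment. It assumes BSD-style socket calls and POSIX header names; the function names `open_listener` and `refresh_listener`, the backlog of 5 and the idle threshold are all made up for illustration and are not taken from the module discussed here. On RISC OS the header names and the call used to close a socket will differ from the `close()` shown.

```c
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Create a TCP socket listening on `port`. Returns the socket, or -1. */
int open_listener(unsigned short port)
{
    struct sockaddr_in addr;
    int on = 1;
    int sock = socket(AF_INET, SOCK_STREAM, 0);

    if (sock < 0)
        return -1;

    /* Allow an immediate re-bind to the port we have just released. */
    setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(sock, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(sock, 5) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}

/* If nothing has arrived for `max_idle` seconds, throw the listening
 * socket away and open a fresh one. Returns the socket to use from now
 * on: the old one if it was not idle long enough, otherwise the new one
 * (or -1 if re-opening failed, in which case the old one is gone). */
int refresh_listener(int sock, unsigned short port,
                     time_t last_activity, time_t now, time_t max_idle)
{
    if (now - last_activity < max_idle)
        return sock;
    if (sock >= 0)
        close(sock);
    return open_listener(port);
}
```

This only hides the symptom, of course. If a fresh socket reliably cures the silence, that points at per-socket state rather than at the whole stack, which is useful to know in itself.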

[Pi B, self-built ROM of 2nd November 2014 vintage]

Apr 18, 2015 6:57pm

Rick Murray (539) 13840 posts

For what it is worth, I am testing my server and I have the Livebox forwarding the port to the Pi so I can test it outside of the LAN by using my phone on GSM/3G.

Connections on port 23 from:

90.40.66.192 (Strasbourg region, France Telecom ADSL)
79.177.180.239 (Jerusalem region, Bezequint DSL)
90.212.211.10 (Norwich region, BSkyB broadband – connected twice)
192.168.1.11 (somewhere in rural France! ☺)
154.52.116.120 (Istanbul province, Goonet telekomunikasyon hizmetleri)

Connections on port 80 from… nobody that isn’t me.

Interesting. Logs show no login attempts. The person/script probably aborts as soon as it fails to look like a standard Unix login.
The BSkyB could be a friend of mine, but he lives in London. I’m not certain it was as no login was attempted.

The lesson? If you write world-facing code, it should be fairly bulletproof as there are those scanning IP ranges and I would imagine not for friendly purposes (my server has not been announced anywhere and it is dynamic IP anyway…). Still, at least I haven’t heard from Szechuan yet today…must be a quiet day in China. Or maybe our threats now come from the Middle East?

[update – the DADebug output now records the time of connection attempt]

Apr 18, 2015 7:17pm

David Feugey (2125) 2709 posts

I found where the problem is. In fact, there are two of them:
1/ The non-ES PandaBoard has some timing issues, especially with SD card accesses. Not a big deal, but it’s a bit slower than the ES.
2/ I discovered, some weeks ago, that recent ROMs have a big problem with file writes. In benchmarks, reads are very fast and writes incredibly slow. In fact, when the system is idle, the first write can take several seconds to complete. And here is the problem: WebJames writes logs.
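
One rough way to put a number on that slow first write is to time a single log-style append after leaving the machine idle. The sketch below is only an illustration: the function name `time_append` is hypothetical, it uses C11’s `timespec_get` (which the RISC OS C library may not provide), and it times open + write + close as seen by the program, not what the filing system does underneath.

```c
#include <stdio.h>
#include <time.h>

/* Append `line` to the file at `path` and return the elapsed wall-clock
 * time in seconds for open + write + close, or -1.0 on any failure. */
double time_append(const char *path, const char *line)
{
    struct timespec start, end;
    FILE *f;
    int ok;

    if (timespec_get(&start, TIME_UTC) != TIME_UTC)
        return -1.0;

    f = fopen(path, "a");
    if (f == NULL)
        return -1.0;

    ok = (fputs(line, f) >= 0);
    if (fclose(f) != 0)        /* fclose flushes the buffered data */
        ok = 0;
    if (!ok)
        return -1.0;

    if (timespec_get(&end, TIME_UTC) != TIME_UTC)
        return -1.0;

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling it once after a long idle period and then a few more times straight away would show whether only the first write is slow, as described above.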

My solution:
1/ Scrap moved to a RAM disc (with Memphis), as WebJames sometimes writes data inside Scrap.
2/ No more logging.

And voilà! All seems OK now. Please run some tests.

Note: the integrated cache does not seem to work very well in WebJames. Perhaps I’ll have more luck with HTTPServ…
It’s a bit of a pity that we cannot set a bigger read cache for FileCore in !Configure. 256 MB would be great for me :)

Apr 18, 2015 7:20pm

David Feugey (2125) 2709 posts

Grumpf. No more file write problems, but after a few minutes of inactivity the Panda still forgets to answer requests.
I’m switching to the PandaBoard ES to see if it works better.

Apr 18, 2015 7:22pm

Rick Murray (539) 13840 posts

Ah, we appear to have different problems then. Mine is specifically the socket appears to just cease responding – but as said I’ve not identified where or why this is happening. It could well be my code messing up. I really hate these “random” problems, it is a pain to try debugging that which isn’t constant. That’s why I have left the port open to the world, I wonder if some of the connection attempts are sending malformed packets or somesuch… it is grasping at straws, yes, but I can’t do diddly when it is working like it should. ;-)
Well, I could always, you know, start writing the hard bits. Hihihihi…

Apr 18, 2015 7:29pm

David Feugey (2125) 2709 posts

the socket appears to just cease responding

The same here. It seems to wake up faster on the ES.

That’s really strange. After around 10 minutes of inactivity, the first request gets a timeout. The second is OK, but very slow. Then it gets faster and faster, and after 5-10 loads, all is OK again… until the next period of inactivity. I simply suspect a sync problem when going from 1.2 GHz to 350 MHz.

So SD problem + socket problem.
Do you have a Pandaboard ES?

Apr 18, 2015 7:48pm

David Feugey (2125) 2709 posts

Update: the problem seems to vanish (almost completely) when switching to a PandaBoard ES.
I’ll need to run more tests, but it’s OK for now.

The only good news is that I now have a super optimized setup.

Apr 18, 2015 9:28pm

Rick Murray (539) 13840 posts

Finally heard from China…

42.56.234.1 (Shanghai, China UniCom Liaoning)

It is 5am over there. 19C, and raining.

Apr 19, 2015 4:36pm

David Feugey (2125) 2709 posts

The problem came back this afternoon, but the PandaBoard ES stays silent for a shorter time than the classic PandaBoard. I make some requests, the Ethernet LED blinks, but WebJames doesn’t react (and doesn’t receive the request). The Ethernet LED blinks again and again (my web browser :) ), then, suddenly, everything wakes up: the request is received, the disc is accessed, the answer is sent. Strange. Network packets or sockets seem to disappear.

Apr 19, 2015 4:47pm

David Feugey (2125) 2709 posts

I’m trying a new experiment: setting the slow and fast speeds to the same value (700 MHz). I’ll see whether the socket issue is linked to sync problems caused by changes of frequency.

Apr 19, 2015 5:15pm

David Feugey (2125) 2709 posts

(I can confirm that at 700 MHz, the problem occurs almost immediately. 920 MHz works better, but with some almost permanent lags. 1.2 GHz is really better, but not perfect.)

CORRECTION

Lags are the same at all speeds, but appear sooner at low speeds.
Ping still works when web requests are no longer passed on to WebJames.
No explanation…

Apr 19, 2015 6:04pm

David Feugey (2125) 2709 posts

The same with HTTPServ. It’s even worse. It works OK, then, after some idle time, it does not work any more. Completely. Data reaches the network card, then is lost.

Conclusion: I have no solution. I just can’t use RISC OS as a web server any more. Old ROMs did not let me reboot (hey, I cannot live next to my Panda). New ROMs have a very big issue with networking.

Apr 19, 2015 6:35pm

David Feugey (2125) 2709 posts

Same setup on a Pi Model B+, generic boot + latest RC14 ROM.
Much faster at boot. Let’s see if performance drops after a few minutes/hours/days…

Apr 19, 2015 7:12pm

David Feugey (2125) 2709 posts

Test finished. I can confirm the dead sockets problem on all tested setups: Pi Model B (wakes up after 5-10 sec), PandaBoard non-ES (does not wake up easily), PandaBoard ES (wakes up a bit faster than the non-ES, but slower than the Pi). The RC12A ROM works about the same, but slowly, because of other network issues.

Apr 19, 2015 7:43pm

Rick Murray (539) 13840 posts

Strange. Network packets or sockets seem to disappear.

I had the problem just now. What my code does is it sets a CallAfter to fire in 50cs. This then schedules a CallBack which returns whenever (close enough to not worry).

My code will then do:

   FD_ZERO(&isready);
   FD_SET(sock, &isready);
   timeout.tv_sec = 0;
   timeout.tv_usec = 0;   /* zero timeout: poll, never block */
   select((sock + 1), &isready, 0, 0, &timeout);
   if (FD_ISSET(sock, &isready))
   ...

So, essentially, the module checks the socket twice per second, so it doesn’t impact the system but responds fast enough to be covered by general network latency. ;-)
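
For anyone chasing a similar “dead socket”, here is a self-contained sketch of the same zero-timeout poll that also checks the return value of `select()`. The fragment above ignores it, so a failing `select()` would look exactly like a quiet socket. The function name `socket_is_readable` is invented for this example, and the header names assume a POSIX-style system; the RISC OS network library headers may be laid out differently.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Non-blocking readiness check.
 * Returns 1 if `sock` has data (or a pending connection) to read,
 *         0 if there is nothing waiting,
 *        -1 if the descriptor is out of range or select() failed. */
int socket_is_readable(int sock)
{
    fd_set isready;
    struct timeval timeout;

    if (sock < 0 || sock >= FD_SETSIZE)
        return -1;

    FD_ZERO(&isready);
    FD_SET(sock, &isready);
    timeout.tv_sec = 0;
    timeout.tv_usec = 0;    /* both fields zero: return at once */

    if (select(sock + 1, &isready, NULL, NULL, &timeout) < 0)
        return -1;

    return FD_ISSET(sock, &isready) ? 1 : 0;
}
```

Logging the -1 case separately from the 0 case would show whether the stack is reporting an error or simply never flagging the socket as ready.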

After each eventuality has been handled, the CallAfter is scheduled anew and life carries on.

I have a suspicion that something is causing the CallAfter to not be scheduled. I don’t see any obvious exit-without-doing-it in the code, so I have added a debug command to tell me if the module thinks a CallAfter is pending (it should always be at any point when I can issue the command). That will tell me if it was failing to be set or if something else is going on.

It may also work to take all of this out and replace it with a simple CallEvery instead. The socket handling code doesn’t take anything remotely near half a second to do its work; and anyway there is an interlock to prevent a CallBack being set if one is pending, so it should be safe to handle the CallEvery on time – if a CallBack is pending, another won’t be set.
Let’s see what happens the next time it fails to respond.
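
The interlock idea can be sketched in plain C as below. This is a simulation, not the module’s code: `on_tick` stands in for whatever the CallEvery would call, `on_callback` for the CallBack handler, and all the names and counters are invented for the example. In a real module the marked line would be the place to request the callback (presumably via OS_AddCallBack).

```c
#include <stdbool.h>

static volatile bool callback_pending = false;
static unsigned polls_done = 0;      /* how many times the socket was polled */
static unsigned ticks_skipped = 0;   /* ticks that arrived while one was pending */

/* Called on every timer tick. Requests a callback unless one is already
 * outstanding. Returns true if a callback was requested on this tick. */
bool on_tick(void)
{
    if (callback_pending) {
        ticks_skipped++;             /* worth exposing in a debug command */
        return false;
    }
    callback_pending = true;
    /* real module: request the callback here */
    return true;
}

/* Called later, when the system is in a safe state to do the work. */
void on_callback(void)
{
    polls_done++;                    /* real module: select() and handle the socket */
    callback_pending = false;        /* cleared last, so a tick during the work is skipped */
}
```

With a repeating tick, a single missed re-arm can no longer stop the polling for good; at worst one tick is skipped. If `ticks_skipped` keeps climbing while `polls_done` stands still, the callback itself is not being delivered.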

I should point out (again) that my Pi ROM is 2nd November 2014. I have not updated the sources more recently; and since I need localtime() to work for an additional CE(S)T timezone in the UK territory, it rather implies some specific code patches. ;-)

In short – David’s problem concerns me – but I don’t think the causes of our problems are the same, despite superficially looking similar.
Does anybody else run a web server on RISC OS? I have no issues here with WebJames which is running in the background and has been since some time this morning when I started the machine.
