Incomplete fetch via AcornHTTP
Matthew Phillips (473) 721 posts |
I am working on an application which uses the URL_Fetcher module, AcornHTTP and AcornSSL to fetch about 2MB of XML from an API. I am processing the data as it arrives, and because of inefficiencies in my part of the process, it can take quite a long time.

I find that after about 8 minutes of slowly fetching the data and processing it, URL_ReadData suddenly reports that the fetch is complete (R0 = 32) and that there are no bytes read and no more expected. The data received is incomplete, stopping part way through an XML tag. The previous call to URL_ReadData had the number of bytes still to read (R5) showing about 900K more.

Using a debugging version of AcornHTTP I can see that the module is fetching an uncompressed, unchunked response from the SWI AcornSSL_Recv and that call is returning zero in R0 and does not generate an error. I suspect that the server at the other end has got bored with the slow progress and closed the socket, but maybe AcornSSL has got bored. I’m not sure how to tell.

If the server closed the socket, is there any way that that could be distinguished from a completed fetch? As far as I understand it, after the server has transmitted all the bytes, it would be quite at liberty to close the socket, so the only way we could tell the transmission was incomplete would be with reference to the Content-Length header, and we cannot necessarily rely on that.

(I expect I have to solve this problem by speeding up my processing, and by trapping the XML parsing error that results from an incomplete transmission. If there could be a way for the modules to distinguish between a complete transmission and an aborted one, I suppose I could look into writing an enhancement, but after a couple of weeks messing around with AcornHTTP I would be quite glad to be told that it’s impossible.) |
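A minimal sketch of the Content-Length comparison described in this post, written in Python for brevity rather than as RISC OS module code. The function name `fetch_status` and its three return values are hypothetical, and it assumes the caller has the response headers and a count of body bytes received. It returns "unknown" whenever the header is missing or cannot be trusted as a byte count, which reflects the caveat that Content-Length cannot always be relied upon.

```python
from typing import Mapping


def fetch_status(headers: Mapping[str, str], bytes_received: int) -> str:
    """Classify a finished fetch as 'complete', 'truncated' or 'unknown'.

    Hypothetical helper: compares the body bytes actually received with
    the Content-Length header, when that header is usable.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # With a chunked or compressed body, Content-Length (if present) does
    # not describe the decoded byte count, so make no claim either way.
    if "transfer-encoding" in lowered or "content-encoding" in lowered:
        return "unknown"

    raw = lowered.get("content-length")
    if raw is None or not raw.strip().isdigit():
        return "unknown"

    expected = int(raw)
    if bytes_received < expected:
        return "truncated"
    return "complete"
```

For an uncompressed, unchunked response like the one described, a check of this kind would flag the fetch as truncated as soon as the connection ends short of the advertised length.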
Chris Mahoney (1684) 2165 posts |
Is it possible to fetch the same data using something other than AcornSSL/HTTP, such as using NetSurf? Edit: Sorry, misunderstood. That, of course, won’t be slow enough. |
Matthew Phillips (473) 721 posts |
Correct: NetSurf will fetch the complete 2MB response quite happily. So can I if I just save it to a file and don’t process as I go. The problem may seem academic, as I can probably avoid the problem in various ways, but having encountered it I would like to make the code more robust so that it detects the incomplete transfer at the earliest stage possible. |
Dave Higton (1515) 3526 posts |
Most weeks, I receive an email with 10MB or so of attachments. This is fetched securely using AcornSSL. It never fails, but it completes much more quickly than in your situation. I wonder if something times out deliberately because such a slow transfer implies a possible security problem? Maybe that’s far-fetched. But can you use the same code except not slow down to process it, i.e. just dump the data, and see if the behaviour is any different? Edit: Oops, I see you already said you tried it and it works. |
Dave Higton (1515) 3526 posts |
Get all the data to file, then parse from file? |
Paolo Fabio Zaino (28) 1882 posts |
@Matthew,

“I find that after about 8 minutes of slowly fetching the data and processing it, URL_ReadData suddenly reports that the fetch is complete (R0 = 32) and that there are no bytes read and no more expected.”

AFAIR, there can be multiple timeouts set on an HTTP server (depending on the server). Generally, the most common are two types of timeout:
- Request Timeout (generally used during the open phase)

What is possibly happening is that you are holding a connection open for longer than the maximum time allowed by that particular server. Do you have more details on the packets you are receiving from that server? You may be able to monitor the traffic between the two ends using either Wireshark (if you have a PC that could sample that connection) or WireSalmon on RISC OS itself. By capturing the traffic you’ll certainly see which of the two ends is sending the FIN or the RST packet.

Hope this helps |
Matthew Phillips (473) 721 posts |
@Paolo Thanks, WireSalmon is a good idea. As you’ll appreciate, I have got as far as finding what is happening inside AcornHTTP, a module I am now reasonably familiar with. What I don’t know is whether AcornSSL could be any better informed as to the reason for the traffic ceasing, and whether it could give better information to AcornHTTP. As I am not keen to delve into AcornSSL just yet, I’ll try the WireSalmon angle.

@Dave I agree, fetching all the data before processing it would help, and I may well need to do that anyway. It was just that, having encountered the problem and found that my application could not tell whether the transmission was complete, I thought it was worth exploring a bit to see if the modules could be improved.

The reason I am taking so long over the processing is that the API returns hundreds of objects which I want to plot on a map in RiscOSM. If I process them as they come in, to keep my memory requirements down, that involves my application sending a series of GeoData Wimp messages to RiscOSM. RiscOSM acts on these messages immediately, but as the number of objects reaches the hundreds, its redraw takes a long time, so my application does not get polled very often and hits the timeout issue. It would clearly be a good idea for me to accumulate the messages and only send them to RiscOSM after I have finished the complete fetch. Hilary is also working on improving the redraw in RiscOSM so that it is more efficient at handling the incoming messages. |
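The "fetch everything first, then process" approach discussed here can be sketched as follows. This is Python for illustration only, not the application's actual code: `parse_after_fetch`, the `<objects>` and `<node>` element names and the return convention are all assumptions. The point it shows is that buffering the whole response keeps the fetch loop fast, and that a response cut off part way through a tag surfaces as a parse error that can be trapped before anything is sent on to another application.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Optional


def parse_after_fetch(chunks: Iterable[bytes]) -> Optional[List[str]]:
    """Join all received chunks, then parse them in one go.

    Returns the tag names of the top-level objects, or None if the XML
    is not well formed (for example because the transfer was cut short).
    """
    # Accumulate first: nothing slow happens while data is arriving.
    data = b"".join(chunks)

    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # A truncated document ends inside an element or tag, so the
        # parser rejects it; report failure instead of partial results.
        return None

    # Only now would the batch of objects be handed on for plotting.
    return [child.tag for child in root]
```

The trade-off is memory: the whole response (about 2MB in the case described) is held at once rather than being processed as it arrives.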
Matthew Phillips (473) 721 posts |
Where can you find WireSalmon these days? It was written by Alex Waugh, but I cannot find his site anymore. |
Andrew Conroy (370) 740 posts |
Have you tried here? |
Matthew Phillips (473) 721 posts |
Thank you. I could not find it via Google. |
Matthew Phillips (473) 721 posts |
I’ve now had time to play with WireSalmon (on RISC OS) and look at the captured packets with WireShark (on Linux). I’ve never looked at the packet level before, so I don’t understand much about it.

Quick reminder of background. I was finding that a remote server was cutting the connection before the complete HTTPS response had come through because my application was being very slow about processing the incoming data. I have solved those issues, but I wanted to see if it might be possible to enhance the URL_Fetcher / AcornHTTP / AcornSSL modules so that the premature termination of the connection could be detected and a signal passed to the client application so that it knew that a failure had occurred, because at present the URL_ReadData calls look no different from a successful fetch.

My first test was to see what happens if I make an HTTPS request via the URL_Fetcher module and it succeeds. I fetched a fairly small amount of data to keep the number of packets down. I can see from my application logs that the HTTP response, including header, was 3355 bytes in total, but that will be after AcornHTTP has doctored it. The screenshot below covers almost all the conversation. There is some SYN ACK stuff just off the top of the screen.

I imagine that frame 18 is where my HTTP request gets transmitted to the server. The next four frames carry enough bytes to be the full response, but I do not know why 22 is described as protocol TLSv1.2 and the others only as TCP. Frame 22 has a “Secure Sockets Layer” section in WireShark which says “Length: 3398” which is consistent with what I get back at the application. I’m puzzled as to why that is the last packet, and as there isn’t compression involved, the actual response must be divided across all four packets. I don’t know the significance of the “Encrypted Alert” in frame 23. Then it looks like the socket gets closed with the “FIN, ACK” from each end.

Next I fetched a much larger quantity of data and made sure that my application deliberately waited a long time between calls to URL_ReadData in order to encourage the remote server to terminate the connection before the transmission was complete. The following screenshot shows the tail end of the conversation. You will see a lot of black frames which I think must be where the process is waiting till the RISC OS TCP/IP stack has more room available to accept more data.

Frames 921-923 look like normal transmissions, similar to frames 20 and 21 above. But then we do not get anything like frames 22 and 23, and instead get a frame (924) with flags “FIN, PSH, ACK”. After this the conversation seems to wrap up in a similar way to the successful one above.

My conclusion is that it may be possible for the operating system to tell that the response from the web server is incomplete, but I have no idea how any of this might get surfaced at the AcornSSL stage, or how the information could be passed up to AcornHTTP, the URL_Fetcher and the application. |
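One way to state the difference between the two captures is sketched below. It rests on an assumption the captures cannot confirm, because the payload is encrypted: that the “Encrypted Alert” in frame 23 is the TLS close alert a server normally sends before closing the connection. If so, a TLS layer could treat "alert, then FIN" as a clean end and "FIN with no preceding alert" as a cut-off. The event names and the `classify_close` function are hypothetical and do not correspond to anything in AcornSSL; the code is Python purely for illustration.

```python
from enum import Enum
from typing import Sequence


class CloseKind(Enum):
    CLEAN = "clean"    # peer sent a TLS alert immediately before its FIN
    RAGGED = "ragged"  # peer's FIN arrived with no TLS alert before it


def classify_close(events: Sequence[str]) -> CloseKind:
    """Classify how the peer ended the connection.

    `events` is the ordered list of things received from the peer, each
    one of the hypothetical labels 'data', 'alert' or 'fin'.
    """
    if "fin" not in events:
        raise ValueError("connection has not been closed by the peer yet")

    fin_index = list(events).index("fin")

    # Clean shutdown: the last thing before the FIN was a TLS alert.
    if fin_index > 0 and events[fin_index - 1] == "alert":
        return CloseKind.CLEAN

    # Otherwise the stream simply stopped, as in the slow-fetch capture.
    return CloseKind.RAGGED
```

Under that assumption, the first capture corresponds to `["data", "data", "data", "alert", "fin"]` and the second to a run of data followed directly by `"fin"`.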
Jeffrey Lee (213) 6048 posts |
It’s probably just a quirk of wireshark’s protocol analysers. Unless you explicitly tell it “this is TLS”, it’s probably erring on the side of caution and defaulting to “TCP” instead of assuming that everything after the TLS negotiation phase is valid TLS traffic. If you’re new to wireshark, one very useful thing you can do is right-click one of the entries and select “follow TCP stream”. That’ll set a filter so that you only see the packets from that stream. |
Paolo Fabio Zaino (28) 1882 posts |
@ Matthew
The first packets (where an encrypted protocol is agreed between the two parties) are not encrypted, so WireShark can read the full payload and tell you everything about it. The packets that come after encryption begins cannot be decrypted by WireShark (unless you know the secret key of the specific asymmetrical encryption protocol and cipher suite used), so they appear to WireShark as generic TCP packets. Remember, during an encryption handshake only public keys can be exchanged, and those are used to encrypt traffic but cannot be used to decrypt it. So the reason you see TCP in the subsequent packets is that this is as far as WireShark has been able to understand the encrypted ones.

To isolate the specific packet stream use a filter, for instance:

ip.addr == 3.9.2.66 && ip.addr == 192.168.1.17 && tcp
Your conclusion is correct, in the sense that the FIN ACK is part of ISO/OSI L4 and is therefore handled by the OS TCP stack. In a module or code that is aware of the nature of the response, you can add checks by parsing the response content, for instance, if it’s HTTP, does it have a