SWUpdate, Suricatta and Unreliable Connectivity

Many product makers choose the open-source SWUpdate software to provide an off-the-shelf method of software update for their embedded devices. For internet-connected devices this is typically configured in Suricatta daemon mode, which allows SWUpdate to poll for software updates from an update server such as the open-source Eclipse Hawkbit. This solution can provide fleet management, monitoring and roll-out of new software. But how well does it work when devices are connected via intermittent, unreliable and slow connections, for example cellular? In this blog post, we’ll explore some of the issues and lessons learnt while supporting a customer in this situation.

Our customer reported that updates on cellular-connected sites would often fail. After looking at the logs we felt it would be desirable to reproduce the problem with hardware on our desk and a local Hawkbit instance. But how would we reproduce an unreliable connection? We decided to make use of the ‘traffic control’ (tc) utility, which can be used to configure the kernel’s packet scheduler. Specifically, we used its ‘queueing discipline’ feature, which allows us to add latency and drop packets. We ran the following commands when we wanted to emulate a really, really rubbish connection, usually just after we’d seen that SWUpdate had started its download.

$ sudo tc qdisc del dev eno1 root
$ sudo tc qdisc add dev eno1 root handle 1: prio
$ sudo tc qdisc add dev eno1 parent 1:1 handle 2: netem loss 99%
$ sudo tc filter add dev eno1 parent 1:0 protocol ip prio 1 u32 match ip dst 192.168.10.188 flowid 2:1
$ # run this command to re-enable traffic: sudo tc qdisc del dev eno1 root
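
While testing, the queueing disciplines currently in place can be inspected with:

$ tc qdisc show dev eno1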

These tc commands will drop 99% of the packets directed at our Hawkbit server. In our case we used SSH port forwarding (-R) such that the device contacted Hawkbit via 127.0.0.1:8080, which was tunnelled to our Hawkbit server on a remote machine (192.168.10.188:8080). This allowed us to run the tc commands on our development machine.
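
For reference, the forward was set up along these lines (‘device’ here is a placeholder for the board’s hostname):

$ # run on the development machine; forwards the device's 127.0.0.1:8080
$ # back through the tunnel to the Hawkbit server
$ ssh -R 8080:192.168.10.188:8080 root@device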

The typical output we see from SWUpdate is as follows; unfortunately, it doesn’t tell us much:

[TRACE] : SWUPDATE running :  [channel_get_file] : Channel sleeps for 5 seconds now.
[TRACE] : SWUPDATE running :  [channel_get_file] : Channel awakened from sleep.
[TRACE] : SWUPDATE running :  [channel_get_file] : Channel sleeps for 5 seconds now.
[TRACE] : SWUPDATE running :  [channel_get_file] : Channel awakened from sleep.
[TRACE] : SWUPDATE running :  [channel_get_file] : Channel sleeps for 5 seconds now.
[TRACE] : SWUPDATE running :  [channel_get_file] : Channel awakened from sleep.
[TRACE] : SWUPDATE running :  [channel_get_file] : Channel sleeps for 5 seconds now.
[TRACE] : SWUPDATE running :  [channel_get_file] : Channel awakened from sleep.
[ERROR] : SWUPDATE failed [0] ERROR : Channel get operation aborted because of too many failed download attempts (4).
[ERROR] : SWUPDATE failed [0] ERROR : Checksum WRONG ! Computed 0x3977129011x, it should be 0x840121544x
[ERROR] : SWUPDATE failed [1] Image invalid or corrupted. Not installing …

To understand what’s going on here we need to better understand how SWUpdate handles downloads. For example, what happens when a connection is dropped? Does it retry, or start again? How many times does it retry? Let’s take a closer look.

The code of interest is in corelib/channel_curl.c in a do/while loop inside the channel_get_file function. This function is used when SWUpdate wishes to download the update and it does so via libcurl. If we reduce it to pseudo code we get something like this:

do {
    if (try_count > 4)
        print "Channel get operation aborted because of too many failed download attempts (retries)"
        break
    if (try_count > 0)
        print "Channel sleeps for 5 seconds now."
        sleep(5)
        print "Channel awakened from sleep."
        resume download with libcurl from where we left off
    else
        start download with libcurl

    try_count++
} while (download not succeeded)

As you can see, SWUpdate uses libcurl to perform the download. If libcurl returns without having downloaded the whole file then SWUpdate will go around the loop and try again, each time attempting to resume from where it left off. Eventually the download will succeed, or the number of retries will hit a configurable limit and SWUpdate will give up.

The first point to make here is that on each retry SWUpdate will attempt to resume the download. Under the hood this is achieved with HTTP range requests, which ensure that SWUpdate won’t re-download the parts of the file it already has. This allows it to make progress and avoid unnecessarily using potentially expensive cellular data.
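
To illustrate the mechanism, here is a minimal sketch of a resumable download with libcurl. This is not SWUpdate’s actual code, and the URL and filename are made up:

#include <curl/curl.h>
#include <stdio.h>

int main(void)
{
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    /* Append to the partial file and find out how many bytes we already have. */
    FILE *out = fopen("update.swu", "ab");
    if (!out)
        return 1;
    fseek(out, 0, SEEK_END);
    long already = ftell(out);

    curl_easy_setopt(curl, CURLOPT_URL, "http://192.168.10.188:8080/update.swu");
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);
    /* Resume: libcurl sends "Range: bytes=<already>-" so the server only
       returns the bytes we are missing. */
    curl_easy_setopt(curl, CURLOPT_RESUME_FROM_LARGE, (curl_off_t)already);

    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK)
        fprintf(stderr, "download failed: %s\n", curl_easy_strerror(res));

    fclose(out);
    curl_easy_cleanup(curl);
    return res == CURLE_OK ? 0 : 1;
}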

The second point is that there are many reasons why libcurl may return without having downloaded all of the requested file. SWUpdate invokes libcurl via the curl_easy_perform function and maps its error into a SWUpdate error via a call to channel_map_curl_error. By adding additional debug to print the libcurl error we found that, with our emulated bad connection, we came across the following errors:

  • -28 (CURLE_OPERATION_TIMEDOUT)
  • -56 (CURLE_RECV_ERROR)
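
The extra debug was along these lines; a sketch only, where curl_handle stands in for the channel’s CURL handle and the surrounding code in channel_get_file is elided:

/* Log the raw libcurl result code before SWUpdate maps it. */
CURLcode result_curl = curl_easy_perform(curl_handle);
ERROR("curl_easy_perform returned %d (%s)",
      result_curl, curl_easy_strerror(result_curl));
result = channel_map_curl_error(result_curl);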

Unfortunately SWUpdate doesn’t display these raw libcurl errors in the log, and it doesn’t print anything to suggest they’ve happened. As a result we see only the following debug each time libcurl has returned with one of these errors and SWUpdate has gone around the loop again:

[TRACE] : SWUPDATE running :  [channel_get_file] : Channel awakened from sleep.
[TRACE] : SWUPDATE running :  [channel_get_file] : Channel sleeps for 5 seconds now.

It does, however, print something (“Lost connection. Retrying after %d seconds”) when libcurl reports connection errors, e.g. CURLE_COULDNT_RESOLVE_PROXY (5), CURLE_COULDNT_CONNECT (7), CURLE_USE_SSL_FAILED (64), etc. Perhaps it should be updated to also cover the more common errors that we’ve seen.

It’s also interesting to note that libcurl will time out if an operation takes too long (-28). There are also related libcurl parameters, CURLOPT_LOW_SPEED_LIMIT and CURLOPT_LOW_SPEED_TIME, that can be used to tell libcurl to additionally time out if a download’s average data rate stays below a threshold for a specified period of time.
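
As a sketch, where the thresholds are arbitrary and we configure a plain libcurl handle rather than SWUpdate itself:

#include <curl/curl.h>

/* Abort a transfer that stalls: if the average speed stays below
   CURLOPT_LOW_SPEED_LIMIT bytes/s for CURLOPT_LOW_SPEED_TIME seconds,
   libcurl returns CURLE_OPERATION_TIMEDOUT. */
static void configure_slow_link_timeouts(CURL *curl)
{
    curl_easy_setopt(curl, CURLOPT_LOW_SPEED_LIMIT, 100L); /* bytes/second */
    curl_easy_setopt(curl, CURLOPT_LOW_SPEED_TIME, 60L);   /* seconds */
    curl_easy_setopt(curl, CURLOPT_TIMEOUT, 1800L);        /* hard cap on the whole transfer */
}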

What we’ve learnt thus far is that on an unreliable connection libcurl will time out often. On each timeout SWUpdate will retry and resume where it left off; however, by default the number of retries is just 4, after which any further update attempt will start afresh and re-download parts of the file that have already been downloaded. With a slow and unreliable connection you can easily see how SWUpdate might never manage to download the update.

Fortunately the number of retries in SWUpdate can be configured. This is done via the ‘retry’ parameter in the suricatta section of the /etc/swupdate.cfg file. In our view there is little harm in setting this to a large number, and this seems to be the best way of giving unreliable connections more time to download their update.

The other parameter of interest is the ‘retrywait’ parameter, also in the suricatta section. This controls how long SWUpdate waits after libcurl has returned unsuccessfully before trying again; the default value is 5 seconds.
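
For illustration, a suricatta section with both parameters tuned might look like this; the values and server details are ours, not recommendations:

# /etc/swupdate.cfg (excerpt)
suricatta :
{
    tenant    = "default";
    id        = "device-001";
    url       = "http://127.0.0.1:8080";
    retry     = 40;   /* many more resume attempts than the default 4 */
    retrywait = 30;   /* seconds between attempts */
};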

Finally there are two other solutions that ought to be considered.

The first approach is to tell SWUpdate to keep the portion of the file it has downloaded. The behaviour we’ve described above will attempt to resume the download until the retry limit is reached, at which point the update has failed. If Hawkbit then instructs another attempt at the update, the file will be downloaded afresh. It is, however, possible to tell SWUpdate to keep the file so that it can be resumed across update attempts. This is achieved by combining the -o argument of swupdate with the undocumented -2 argument of Suricatta: the former tells swupdate to keep the downloaded file and the latter tells Suricatta to resume from it. However, for various reasons we found this doesn’t work well in practice, and it also requires more storage.
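
For completeness, the combination looks along these lines. This is illustrative only: the tenant, device id and path are made up, and as noted the -2 flag is undocumented, so check it against your SWUpdate version:

$ swupdate -o /data/update.swu -u '-t default -u http://127.0.0.1:8080 -i 42 -2'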

The second approach is to use delta updates, which we’ve covered in an earlier blog post. SWUpdate provides a means of downloading only the changed parts of a software image, reducing the amount of data that needs to be transferred.

In supporting our customer we’ve gained a lot of knowledge of how SWUpdate and Hawkbit play together, of the features SWUpdate offers for tuning download behaviour, and of the alternative solutions available. We eventually found that using delta updates where possible and increasing the ‘retry’ parameter for non-delta downloads worked nicely.
