Thursday, 12 March 2015

What does it mean that a software system is fragile, robust, or antifragile? Considerations, ideas, and examples

What follows is my answer to the second question of the webinar on antifragility, "Antifragility Webinars: Practice Beyond the Rhetoric!", which I mentioned in my previous post:
 
How do I envision, and how am I actually translating Professor Taleb's antifragility into practice?

In order to answer this question I need to say a few words about systems made of software.

Software is nowadays becoming more and more complex; this is because it's becoming easier and easier to create complex applications from off-the-shelf components. Complexity can be easily manipulated, combined and recombined into ever more powerful software systems. On the other hand, said systems become more and more fragile, as there is often little or no guarantee about the quality of the "bricks" one uses for their construction.
 
Are those bricks robust? Are they fragile? Antifragile? There's no easy way to tell and no standard that helps at the moment.
 
So what we have today is gigantic castles of sand that are precariously built, in that their solidity and stability depend on a chain of assumptions: A assumes B is going to be reliable and available, B assumes C and D will work as expected, and so on and so forth.
 
(Regrettably this chain of dependencies extends beyond software. Today's critical infrastructures are based on the same principle and share the same weakness.)
 
OK, so what do you do to prevent failures? The typical answer is to use redundant resources. Instead of using a single component, you use several replicas. If one fails, you use another one. Or you use them all at once and then select the output based on some criterion, for instance a voting scheme. If a majority agrees on a result, you assume the majority is right.
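The voting scheme just described can be sketched in a few lines of Python. This is only an illustration, not the code we actually run: the function name `vote` and the convention of treating each replica as a plain callable are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def vote(replicas: Sequence[Callable[[int], int]], x: int) -> Optional[int]:
    """Run every replica on x and return the majority output, if any.

    Hypothetical helper: a replica that raises simply casts no vote.
    The result is accepted only when more than half of ALL replicas
    (including the crashed ones) agree on it; otherwise None is returned.
    """
    outputs = []
    for replica in replicas:
        try:
            outputs.append(replica(x))
        except Exception:
            continue  # a crashed replica contributes no vote

    if not outputs:
        return None

    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(replicas) / 2 else None
```

With three replicas, one wrong or crashed replica is outvoted by the other two; with two of the three wrong or crashed, no majority exists and the scheme reports failure instead of guessing.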

The key word here is redundancy. To better understand what this word means, let me describe a video game to you.
 
You play General Grant; you want to send an important message to a part of your troops so that they are informed of the next steps in your war strategy.  The message has to go through a battlefield that is under the sphere of action of your Enemy. What do you decide to do?
 
A possibility is, you send a cavalryman with your message. Of course the carrier of your message may be hit; in other words, this is a fragile scheme.
 
Grant knows better, so he sends several cavalrymen in the hope that at least one will reach the destination. For instance, he may choose to send three cavalrymen. This is a better scheme, because it tolerates up to two failures; on the other hand, it does not take into account how the situation evolves on the battlefield. You use three cavalrymen because you think that this number is big enough; but your reference is an estimate of the current conditions. In fact, conditions may vary. Say the enemy doubles in number, or is joined by an artillery team that considerably increases its firepower. What then? All three cavalrymen may be wiped out and the message lost. Compared to sending just one cavalryman, this second scheme is much more robust; and yet it is not at all sufficient to counterbalance changing conditions: conditions that mutate, possibly unexpectedly, and possibly very rapidly. The technical term typically used is turbulent environments. A simple robust scheme is one that "does not care too much" [as Prof. Taleb says] about the evolution of its environment, and because of this often ends up caring too little.
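A bit of arithmetic shows why a fixed number of cavalrymen is only as good as the estimate behind it. The sketch below assumes, purely for illustration, that each cavalryman is hit independently with the same probability; the function names and the numbers are made up for the example.

```python
def loss_probability(p_hit: float, couriers: int) -> float:
    """Chance that ALL couriers are hit, assuming independent hits."""
    return p_hit ** couriers


def couriers_needed(p_hit: float, max_loss: float, limit: int = 1000) -> int:
    """Smallest number of couriers keeping the loss chance at or below max_loss.

    Hypothetical helper; 'limit' only guards against an unreachable target.
    """
    if not 0.0 <= p_hit < 1.0:
        raise ValueError("p_hit must be in [0, 1)")
    n = 1
    while loss_probability(p_hit, n) > max_loss:
        n += 1
        if n > limit:
            raise ValueError("target not reachable within limit")
    return n
```

If each rider is hit with probability 0.3, three riders lose the message about 2.7% of the time. If the firepower grows so that each rider is hit with probability 0.8, the same three riders lose it about 51% of the time, and `couriers_needed(0.8, 0.03)` says sixteen riders would be needed to get back to a comparable risk.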

What then? Well, if one could track the environment (e.g. the firepower) and the way our current scheme matches the environment (basically, how many cavalrymen are left at any point in time), then one could have a more robust scheme: one that is resilient, namely adaptive to changing conditions. New cavalrymen could be added in dire conditions, and their number could even be reduced when conditions relax.
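As a rough sketch of such an adaptive scheme, consider a small controller that watches how many cavalrymen got through in each round and resizes the group accordingly. The class name, the thresholds, and the "patience" rule are all assumptions chosen for illustration; they are not taken from our actual system.

```python
from dataclasses import dataclass


@dataclass
class RedundancyController:
    """Hypothetical controller that grows or shrinks the replica count."""

    replicas: int = 3      # current number of cavalrymen
    minimum: int = 3       # never go below this
    maximum: int = 9       # never go above this
    patience: int = 5      # calm rounds required before shrinking
    calm_rounds: int = 0   # consecutive rounds with no losses

    def observe(self, survivors: int) -> int:
        """Record one round and return the replica count for the next one."""
        if survivors <= self.replicas // 2:
            # Dire conditions: at least half of the group was lost.
            self.replicas = min(self.maximum, self.replicas + 2)
            self.calm_rounds = 0
        elif survivors == self.replicas:
            # Calm round: shrink only after a sustained quiet period.
            self.calm_rounds += 1
            if self.calm_rounds >= self.patience:
                self.replicas = max(self.minimum, self.replicas - 2)
                self.calm_rounds = 0
        else:
            # Some losses, but not alarming: keep the size, reset the streak.
            self.calm_rounds = 0
        return self.replicas
```

Growing is immediate while shrinking waits for several quiet rounds in a row; this asymmetry is a deliberate design choice in the sketch, since removing resources too early is the costlier mistake.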

But this is still not antifragile. In fact, the system stays the same: each time you face the problem you launch the same solution — a solution that is not changed by the experience. What we are trying to do is to change this. To change the software "DNA" after each "run" while taking into account the past runs.
 
What we do in practice is, we use web services (representing our cavalrymen); the system tracks the performance of our group of cavalrymen considering both "the parts" and "the whole": each individual "cavalryman" is tracked (one checks whether he's loyal and trustworthy, and to what extent he is) and the ability of the overall group is also tracked (how close we are to failure and disasters over time). 
 
(For more information about the above schemes, and especially on Distance-To-Failure, please refer to this and this paper.)
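To give a feeling for what tracking "the parts" and "the whole" could look like, here is a minimal sketch. Both definitions are illustrative assumptions of this post, not necessarily the ones used in the papers: the distance to failure is taken to be the number of agreeing votes that could still be lost before the majority disappears, and the trust in each service is an exponentially weighted record of how often it sided with the majority.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def distance_to_failure(votes: Sequence[Hashable]) -> int:
    """Illustrative metric for "the whole".

    Returns how many of the currently agreeing votes could still be lost
    before no strict majority remains; 0 means there is no majority now.
    """
    n = len(votes)
    if n == 0:
        return 0
    _, agreeing = Counter(votes).most_common(1)[0]
    return max(0, agreeing - n // 2)


def update_trust(
    trust: Dict[str, float],
    votes_by_service: Dict[str, Hashable],
    alpha: float = 0.2,
    initial: float = 0.5,
) -> Dict[str, float]:
    """Illustrative metric for "the parts": per-service trust in [0, 1].

    Services that agree with the majority drift towards 1, dissenters
    towards 0. Without a majority nobody can be judged, so the scores
    are returned unchanged.
    """
    updated = dict(trust)
    if not votes_by_service:
        return updated
    winner, count = Counter(votes_by_service.values()).most_common(1)[0]
    if count <= len(votes_by_service) / 2:
        return updated
    for name, vote in votes_by_service.items():
        agreed = 1.0 if vote == winner else 0.0
        previous = updated.get(name, initial)
        updated[name] = (1 - alpha) * previous + alpha * agreed
    return updated
```

With three services, unanimous agreement gives a distance of 2, a single dissenter gives 1, and three different answers give 0.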
 
When performance is not satisfactory, the scheme is revised. Not just the number of cavalrymen, but even the choice of which "cavalryman" to use is constantly revised.
 
(For more information please refer to this and this paper.)
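One simple way to picture the revision of "which cavalryman to use" is the sketch below. It assumes that each web service carries a trust score between 0 and 1, that a pool of spare services is available, and that a fixed threshold decides who is dismissed; the function name, the threshold, and the default score for untested services are hypothetical choices for the example.

```python
from typing import Dict, List, Sequence


def revise_selection(
    trust: Dict[str, float],
    active: Sequence[str],
    pool: Sequence[str],
    size: int,
    threshold: float = 0.5,
    default: float = 0.5,
) -> List[str]:
    """Return the services to use in the next run (hypothetical helper).

    Active services below the threshold are dismissed and not reconsidered;
    the remaining slots are filled with the best-scored spares from the pool.
    Services never seen before get the 'default' score.
    """
    def score(name: str) -> float:
        return trust.get(name, default)

    # Keep the trusted members of the current group, best first.
    keep = sorted(
        (s for s in active if score(s) >= threshold), key=score, reverse=True
    )
    # Fresh candidates are spares that are not in the current group.
    spares = sorted(
        (s for s in pool if s not in active), key=score, reverse=True
    )
    return (keep + spares)[:size]
```

For example, with scores `{"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.7}`, an active group `["a", "b", "c"]` and spares `["d", "e"]`, a group of size three becomes `["a", "c", "d"]`: the untrusted `b` is replaced by the best available spare.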
 
The next steps will be to include machine learning schemes to tell which solution works better and best matches the foreseen next condition. And we want this match to feed back into the solution itself, and to be persisted in future runs. In other words, we want to change the "genetic code" of the solution. For instance, instead of individual cavalrymen (web services), we could learn that the scheme works well with teams of cavalrymen (web service groups). Said teams could work as a specialized "organism", with different roles within each group. Instead of working independently of one another, those teams could... team up into a fractal organization of web services functioning as a Fractal Social Organization.
 
(For more information about Fractal Social Organizations, please have a look at my ERACLIOS posts [a, b, c] and the papers here and here.)

Creative Commons License
What does it mean that a software system is fragile, robust, or antifragile? Considerations, ideas, and examples by Vincenzo De Florio is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
Permissions beyond the scope of this license may be available at vincenzo.deflorio@gmail.com.