Defensive Programming and Resilient Systems

Even if you have developed your software with TDD and your application has 100% coverage, there will probably be some bugs in the production environment, because of Murphy’s Law:

“Anything that can possibly go wrong, does”

As pragmatic programmers, we have to be critical of our code and of the way we program, but sometimes this is not enough. We also have to guard our application against production bugs, and our system should survive failures.

In my opinion, it is more important to fail fast than to leave the user or customer waiting for a response from our application that never arrives. It is also very important to detect when the system fails, so that the error can be caught and fixed. An email or another kind of alarm could be sent as well, so that somebody can start investigating what happened.

Assertions

In my experience as a developer at different companies, when you have to produce a report for some people, you have to create a file and save it in some directory. Perhaps that directory has not been created yet, or the application does not have permission to write to it, so the process fails and an issue is opened.

For this reason, unexpected conditions should be checked, and assertions (asserts) can be used for that. In our case, an assertion or another kind of verification of the directory and its permissions would have avoided the previous issue.

The book Code Complete, Second Edition lists some conditions that assertions can check and that you should take into account:

  • That an input parameter’s value falls within its expected range (or an output parameter’s value does)
  • That a file or stream is open (or closed) when a routine begins executing (or when it ends executing)
  • That a file or stream is at the beginning (or end) when a routine begins executing (or when it ends executing)
  • That a file or stream is open for read-only, write-only, or both read and write
  • That the value of an input-only variable is not changed by a routine
  • That a pointer or element is non-null
  • That an array or other container passed into a routine can contain at least X number of data elements
  • That a table has been initialized to contain real values
  • That a container is empty (or full) when a routine begins executing (or when it finishes)
  • That the results from a highly optimized, complicated routine match the results from a slower but clearly written routine

An example of a Groovy assertion would be the following:
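The sketch below shows one way to make the directory check described above. It is written in Java rather than Groovy so that it stays easy to compile and test on its own; in Groovy the same conditions could go straight into the `assert` keyword. The class and method names are hypothetical, and I am assuming that failing fast with an `IllegalStateException` is acceptable for the report job.

```java
import java.nio.file.Files;
import java.nio.file.Path;

final class ReportDirectoryCheck {

    // Fails fast with a descriptive message instead of letting the report
    // generation break later with a less obvious I/O error.
    static void assertWritableDirectory(Path dir) {
        if (!Files.isDirectory(dir)) {
            throw new IllegalStateException("Report directory does not exist: " + dir);
        }
        if (!Files.isWritable(dir)) {
            throw new IllegalStateException("No write permission on report directory: " + dir);
        }
    }
}
```

Calling `ReportDirectoryCheck.assertWritableDirectory(...)` before the report file is created turns a silent production issue into an immediate, readable error.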

 

On the one hand, if you want to use assertions in several places in your code, you can generalize your solution; this is only a specific example.

On the other hand, be careful with assertions in the production environment, because they can reduce the performance of your application.

Feature disabling

Another situation that I have seen during my career is when you deploy a particular development that has been included in a key component, it does not work properly, and you cannot deactivate it.

A good practice that you can use to avoid downtime in your system is feature disabling, although it is not the clearest solution. This technique consists of adding an if statement that executes a part of the code or skips it depending on some configuration.

Example:
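A minimal sketch of the technique in Java could look like this. The `FeatureToggle` class, the `CheckoutService` class and the `new-pricing` flag name are all hypothetical, and the in-memory map only stands in for whatever configuration store you use.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class FeatureToggle {
    // Stand-in for the real configuration store (assumption: a database table).
    private final Map<String, Boolean> flags = new ConcurrentHashMap<>();

    void set(String feature, boolean enabled) {
        flags.put(feature, enabled);
    }

    // Unknown features count as disabled, which is the safer default.
    boolean isEnabled(String feature) {
        return flags.getOrDefault(feature, false);
    }
}

final class CheckoutService {
    private final FeatureToggle toggle;

    CheckoutService(FeatureToggle toggle) {
        this.toggle = toggle;
    }

    String price(int cents) {
        // The if statement decides at runtime which code path is executed.
        if (toggle.isEnabled("new-pricing")) {
            return "new:" + cents;
        }
        return "old:" + cents;
    }
}
```

If the new code path misbehaves in production, setting the flag back to `false` restores the old behaviour without a new deployment.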

 

If you ask me, the configuration should be in the database, because you can change it without deploying the application again. The other possibility is to use a properties file, but in that case you would usually need to restart your application server, which would cause downtime.

Alert systems

Another problem that you can find in a production environment is a process that never finishes when you have no alarms or alerts. So, how can you know that the process has failed? Maybe somebody tells you that nobody has received a report, or maybe nobody does.

Hence it is very important to track whether a process should already have finished. You can add an alert or a timeout that fires when the process is still running although it should have completed successfully. Then you can quickly investigate what is happening, fix the problem, and restore the service.
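One possible way to add such a timeout in Java is to run the process through an `ExecutorService` and wait on its `Future` with a deadline. This is only a sketch: `TimedJob` and the `alert` callback are hypothetical names, and the callback stands in for whatever actually sends the email or alarm.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

final class TimedJob {

    // Runs the job and reports through `alert` when it fails or exceeds the
    // timeout. Returns true only when the job finished successfully in time.
    static boolean runWithTimeout(Callable<?> job, long timeoutMillis, Consumer<String> alert) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(job);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck job
            alert.accept("Job did not finish within " + timeoutMillis + " ms");
            return false;
        } catch (ExecutionException e) {
            alert.accept("Job failed: " + e.getCause());
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            alert.accept("Interrupted while waiting for the job");
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Note that cancelling only interrupts the worker thread, so a job that ignores interruption would keep running; the alert is still raised on time.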

In this article, we have talked about alarm, alert, and reporting systems. All of them should live outside our application so that they keep working when our system has failed; otherwise the IT team cannot know whether the system is working or not.

Logs should also have the following features:

  • Readable (by humans or machines)
  • Do not allow line breaks (the newer Log4j2 or Logback appenders can be configured to show the information on one line)
  • Do not mix different kinds of information
  • Show the important information (user IDs, emails…)
  • Split the logs by service
  • Use daily rolling
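To illustrate the first points of the list, here is a small sketch that builds a one-line, key=value log message and escapes line breaks inside the values. It is plain string formatting, not the Log4j2 or Logback API, and the `LogLine` class and the field names are hypothetical.

```java
import java.util.Map;
import java.util.stream.Collectors;

final class LogLine {

    // Builds "event=<name> key=value ..." on a single line, in map iteration order.
    static String format(String event, Map<String, String> fields) {
        String body = fields.entrySet().stream()
                .map(e -> e.getKey() + "=" + oneLine(e.getValue()))
                .collect(Collectors.joining(" "));
        return body.isEmpty() ? "event=" + event : "event=" + event + " " + body;
    }

    // Replaces real line breaks with visible escape sequences so that one
    // log entry always stays on one line.
    static String oneLine(String value) {
        if (value == null) {
            return "null";
        }
        return value.replace("\r", "\\r").replace("\n", "\\n");
    }
}
```

A message built this way stays easy to grep and to parse, and it carries the identifiers you need to investigate a failure.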

Integration with other systems

Currently, almost all applications or sites work with microservices or have to connect to other services or databases. In these cases health checks are very important, because they detect whether the provider services are failing or not. Depending on their responses, the application will be deployed or not.
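A health check can be sketched as a set of named probes, one per provider. In the Java sketch below, `HealthChecks` and the probe names are hypothetical, and each probe is just a function returning true or false; a real probe would, for example, open a connection or call the provider's status endpoint.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

final class HealthChecks {
    private final Map<String, BooleanSupplier> probes = new LinkedHashMap<>();

    void register(String name, BooleanSupplier probe) {
        probes.put(name, probe);
    }

    // Runs every probe; a probe that throws is reported as down.
    Map<String, Boolean> run() {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (Map.Entry<String, BooleanSupplier> entry : probes.entrySet()) {
            boolean up;
            try {
                up = entry.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                up = false;
            }
            results.put(entry.getKey(), up);
        }
        return results;
    }

    // True only when no registered dependency is down.
    boolean allHealthy() {
        return !run().containsValue(false);
    }
}
```

A deployment script or a monitoring endpoint could call `allHealthy()` and refuse to continue when any provider is down.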

In addition, when you are connected to a provider service or database, you are going to consume its responses, and perhaps the data it provides is not in the best shape for your application, or is simply wrong. You can create a barricade where you validate and transform the data. With the cleaned data, your system will work properly almost always.
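As a rough sketch of such a barricade in Java, the method below takes raw strings from a provider and either returns a validated, normalized object or nothing at all. The `Customer` type, its fields, and the validation rules are assumptions made up for the example.

```java
import java.util.Locale;
import java.util.Optional;

final class CustomerBarricade {

    // Internal, trusted representation used by the rest of the application.
    static final class Customer {
        final long id;
        final String email;

        Customer(long id, String email) {
            this.id = id;
            this.email = email;
        }
    }

    // Validates and transforms raw provider data; invalid input never
    // crosses the barricade.
    static Optional<Customer> fromProvider(String rawId, String rawEmail) {
        if (rawId == null || rawEmail == null) {
            return Optional.empty();
        }
        String email = rawEmail.trim().toLowerCase(Locale.ROOT);
        if (!email.contains("@")) {
            return Optional.empty();
        }
        try {
            long id = Long.parseLong(rawId.trim());
            return id > 0 ? Optional.of(new Customer(id, email)) : Optional.empty();
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }
}
```

Code behind the barricade can then work with `Customer` objects without repeating the same checks everywhere.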

If your system detects bad data, you must decide what the system does with it. In Code Complete, the author proposes some techniques that can help you decide:

  • Return a neutral value: Sometimes the best response to bad data is to continue operating  and simply return a value that’s known to be harmless.
  • Substitute the next piece of valid data: When processing a stream of data, some circumstances  call for simply returning the next valid data. If you’re reading records  from a database and encounter a corrupted record, you might simply continue reading  until you find a valid record.
  • Substitute the closest legal value: In some cases, you might choose to return the closest  legal value, as in the Velocity example earlier. This is often a reasonable approach  when taking readings from a calibrated instrument. The thermometer might be calibrated  between 0 and 100 degrees Celsius, for example.
  • Log a warning message to a file: When bad data is detected, you might choose to log a warning message to a file and then continue on. This approach can be used in conjunction with other techniques.
  • Call an error-processing routine/object: Another approach is to centralize error handling  in a global error-handling routine or error-handling object. The advantage of this  approach is that error-processing responsibility can be centralized, which can make  debugging easier.
  • Display an error message wherever the error is encountered: This approach minimizes  error-handling overhead; however, it does have the potential to spread user  interface messages through the entire application, which can create challenges when  you need to create a consistent user interface.
  • Shut down: Some systems shut down whenever they detect an error. This approach  is useful in safety-critical applications.
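Three of the techniques above can be sketched in a few lines of Java. The class and method names are hypothetical, the 0 to 100 degree range comes from the thermometer example in the list, and treating null or blank records as corrupted is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class BadDataPolicies {

    // Substitute the closest legal value: clamp a reading to the calibrated
    // range of 0 to 100 degrees Celsius. NaN readings are not handled here.
    static double closestLegalCelsius(double reading) {
        return Math.max(0.0, Math.min(100.0, reading));
    }

    // Substitute the next piece of valid data: skip corrupted records
    // (null or blank in this sketch) and keep reading.
    static List<String> validRecords(List<String> records) {
        List<String> valid = new ArrayList<>();
        for (String record : records) {
            if (record != null && !record.trim().isEmpty()) {
                valid.add(record);
            }
        }
        return valid;
    }

    // Return a neutral value: an unreadable count becomes 0.
    static int countOrZero(String raw) {
        if (raw == null) {
            return 0;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return 0;
        }
    }
}
```

Which policy is right depends on the data: a neutral value is fine for a counter on a dashboard, but it would be a poor choice for something like an invoice amount.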

In conclusion, we have to accept that there will be bugs in the production environment, but as pragmatic programmers we have to aim for a system without downtime and bugs. For these goals, the defensive programming tools are very important. However, if bugs do happen, try to solve them as soon as possible.

Even if you use defensive programming, please do not forget to keep using S.O.L.I.D, code reviews, tests, and TDD.

References:

Joaquín Engelmo – Programación defensiva y sistemas resilientes en el mundo real

8 formas de mejorar tu vida gracias a los Logs

Code Complete Second Edition