Code repair 2: Fibonacci and standard deviations homework (part 2)

This is the second part in a series of articles about improving some numerical code, see part 1 for the starting point. In part one we looked at some student code and improved it to a point that would be suitable for submitting in that context. In this part we will go beyond the scope of what was appropriate for the assignment and start to make maintainable production code.

Just like part 1 the code in this series is hosted over on my github page: code repair series repo if you are familiar with git you can check out that code when following along.

Why c-style arrays should be avoided

Before we get back to improving the code I want to cover some important points about C-arrays in order to show why we should change that particular aspect of the code. C-style arrays are basically a pointer to the first element in the block of memory that stores the array and a chunk of allocated memory. The problem is that when you pass a c-style-array to a function it decays to a pointer and you are left passing a pointer to the first element of memory for the array and nothing more. This means it's up to you to manage the extra information that actually makes those arrays usable. You need to manually keep track of how many elements are stored in the array and you also have to manually manage the lifetime of the memory.

This bug we had calculating the mean and standard deviation came about because the information about the length of the array got separated from the array we were actually using. This lack of length information is one of the main problems with the old C-style arrays. In order to use them correctly you need to be extremely careful when it comes to keeping track of the length of the array. If you get it wrong you can get segfaults and other very insidious bugs. Segfaults are actually one of the best outcomes because you very quickly get feedback that your code is broken. In this case the code might have been submitted or shipped out to customers if the bug wasn't caught and this could end up being an expensive problem to fix. Sometimes that cost isn't immediately apparent either, there's a large number of security vulnerabilities out there because of problems people have had with array bounds and pointers.

There's a whole host of other problems with C style arrays too, the c++ FAQ has a good summary of why "arrays are evil". The main point I'm was trying to illustrate via this code is that c-style arrays make it harder to write correct code than it needs to be in many cases. This extra difficulty has a fairly significant cost to productivity. If you don't specifically need to deal with things at such a low level then it's best to not use the old C style arrays at all because you can lose program safety and correctness for absolutely no benefit whatsoever. [1]

One of the large advantages of std::array is that the length of the array is always kept with the array object. The bug we just encountered is no longer possible if std::array is used correctly.

So for educational purposes I'm going to start by changing the code to use the newer std::array in the next section.

Moving beyond the scope of this assignment

Given that this was some code for a school assignment the scope of what was expected to be used, language feature wise, is quite limited. I will assume that the instructor intended to show some concepts about function overloading and arrays here. In the context of an assignment which designed for educational purposes some of the things are ok, however would be problematic in the context of professional level library code.

Looking at making this high quality production code is outside of the scope of the assignment but I do think there is value in showing the readers some of the improvements that could be made if we went beyond the original scope somewhat.

Unfortunately the specification of the assignment makes it really hard to write the best quality code possible to solve the problem. This is particularly apparent with the use of the old C style arrays. Given that I just wrote about the problems of C style arrays it makes sense to change that first. This actually takes a fairly substantial rewrite:

#include <array>
#include <cmath>
#include <iostream>

const size_t SIZE_OF_GENERIC_ARRAY = 100;
const size_t SIZE_OF_FIBONACCI_ARRAY = 20;

typedef std::array<float, SIZE_OF_GENERIC_ARRAY> seq_array_t;
typedef std::array<int, SIZE_OF_FIBONACCI_ARRAY> fib_array_t;

void fillSequentialIntegersArray(seq_array_t&);
void fillFibonacciArray(fib_array_t&);

double mean(seq_array_t);
double mean(fib_array_t);

double standardDeviation(seq_array_t);
double standardDeviation(fib_array_t);

void outputMean(seq_array_t);
void outputMean(fib_array_t);

void outputStandardDeviation(seq_array_t);
void outputStandardDeviation(fib_array_t);

int main(int argc, char* argv[])
{
   char sequenceType;
   seq_array_t SequentialNumbers;
   fib_array_t FibonacciNumbers;
   std::cout << "Would you like to generate a generic sequence or a Fibonacci sequence?"
       << std::endl
       << "\n"
       << "Type [G] + [ENTER] for a generic sequence, or;" << std::endl
       << "Type [F] + [ENTER] for a Fibonacci sequence: ";
   std::cin >> sequenceType;

   if (sequenceType == 'G' || sequenceType == 'g')
   {
       fillSequentialIntegersArray(SequentialNumbers);
       outputMean(SequentialNumbers);
       outputStandardDeviation(SequentialNumbers);
   }
   else if (sequenceType == 'F' || sequenceType == 'f')
   {
       fillFibonacciArray(FibonacciNumbers);
       outputMean(FibonacciNumbers);
       outputStandardDeviation(FibonacciNumbers);
   }
   else
   {
       std::cout << "\n"
            << "Invalid input. Please type 'F' or 'G'. Thank you.";
   }
   return 0;
}

void fillSequentialIntegersArray(seq_array_t& array_to_fill)
{
   size_t array_size = array_to_fill.size();
   for (size_t i = 0; i < array_size; i++){
       array_to_fill[i] = i + 1;
   }
}

void fillFibonacciArray(fib_array_t& array_to_fill)
{
   array_to_fill[0] = 0;
   array_to_fill[1] = 1;

   size_t array_size = array_to_fill.size();
   for (size_t i = 2; i < array_size; i++){
       array_to_fill[i] = array_to_fill[i - 1] + array_to_fill[i - 2];
   }
}

double mean(fib_array_t fib_container)
{
   double sumOfElements = 0;
   size_t container_size = fib_container.size();
   for (size_t i = 0; i < container_size; i++)
   {
       sumOfElements += fib_container[i];
   }
   return sumOfElements / container_size;
}

double mean(seq_array_t seq_container)
{
   double sumOfElements = 0;
   size_t container_size = seq_container.size();
   for (size_t i = 0; i < container_size; i++)
   {
       sumOfElements += seq_container[i];
   }
   return sumOfElements / container_size;
}

double standardDeviation(seq_array_t seq_container)
{
   double tempSum = 0;
   size_t container_size = seq_container.size();
   double calculated_mean = mean(seq_container);
   for (size_t i = 0; i < container_size; i++)
   {
       tempSum += pow((seq_container[i] - calculated_mean), 2);
   }
   return sqrt(tempSum / container_size);
}

double standardDeviation(fib_array_t fib_container)
{
   double tempSum = 0;
   size_t container_size = fib_container.size();
   double calculated_mean = mean(fib_container);
   for (size_t i = 0; i < container_size; i++)
   {
       tempSum += pow((fib_container[i] - calculated_mean), 2);
   }
   return sqrt(tempSum / container_size);
}

void outputMean(seq_array_t array_to_display)
{
   std::cout << "\n";
   std::cout << "The mean is: " << mean(array_to_display);
   std::cout << std::endl;
}

void outputMean(fib_array_t array_to_display)
{
   std::cout << "\n";
   std::cout << "The mean is: " << mean(array_to_display);
   std::cout << std::endl;
}

void outputStandardDeviation(seq_array_t array_to_display)
{
   std::cout << "\n";
   std::cout << "The standard deviation is: " << standardDeviation(array_to_display);
   std::cout << std::endl;
}

void outputStandardDeviation(fib_array_t array_to_display)
{
   std::cout << "\n";
   std::cout << "The standard deviation is: " << standardDeviation(array_to_display);
   std::cout << std::endl;
}

In order to compile this we need to allow c++11 features in the compiler options.

g++ -Wall -std=c++11 main.cpp -o Output

Now this compiles and runs so it's time to commit.

(master)$ git add compile.sh
(master)$ git add main.cpp
(master)$ git commit -m "Changed program to use std::array and updated compile script to enable c++11"
[master 7bd2e3c] Changed program to use std::array and updated compile script to enable c++11
 2 files changed, 150 insertions(+), 142 deletions(-)
 rewrite 02_fibonnaci_standard_deviation_homework/main.cpp (60%)

One of the first things that's particularly noticeable is that the code calculating the mean and the standard deviation is completely duplicated apart from the type of the parameter. The print functions are similar because we don't actually change the output based on the type passed in. Such code duplication is very often a c++ code smell as duplicate code is a prime candidate to become dead code. The template mechanism originally appeared to solve the exact problem we have here. Changing the mean requires changing the declaration and the definition as follows:

//declaration:
template< typename containerT >
double mean(containerT items);

//definition:
template< typename containerT >
double mean(containerT items)
{
   double sumOfElements = 0;
   size_t container_size = items.size();
   for (size_t i = 0; i < container_size; i++)
   {
       sumOfElements += items[i];
   }
   return sumOfElements / container_size;
}

The standard deviation function and the output functions are similar.

No compilation errors or warnings and the output is the same, so time to commit changes to the git repository:

(master)$ git add main.cpp
(master)$ git commit -m "Removed code duplication by using templates"
[master 9676bd9] Removed code duplication by using templates
 1 file changed, 23 insertions(+), 56 deletions(-)

The big benefit of this is reducing code duplication. We now take less lines of code to do the exact same thing and we cut out the possibility for a number of subtle bugs that could pop up in the future if someone updated only one function but not the other. Now all we need is a type that has a size member and random access via [].

The next thing is that c++ passes everything by value by default. This means that whenever we are using mean, standardDeviation or the output functions we have to make a copy of the entire container. Because we are only ever reading the values out of these containers and not changing them it would be nice to not have to make this copy. Instead of creating a copy we can instead pass a reference. Because the reference is much smaller than the container itself we are getting a fairly significant boost to the performance because we are passing a smaller representation of the data around.

However in order for the behaviour to be identical from the call sites we would need to gaurantee that we never changed the data that we got via the reference. The language has the const feature to help us indicate that we don't want changes.

//declaration:
template< typename containerT >
double mean(containerT const& items); //const and items are actually ignored by the compiler here

//definition:
template< typename containerT >
double mean(containerT const& items)
{
   double sumOfElements = 0;
   const size_t container_size = items.size();
   for (size_t i = 0; i < container_size; i++)
   {
       sumOfElements += items[i];
   }
   return sumOfElements / container_size;
}

//calling code:
std::cout << "The mean is: " << mean(items);

What we have used here is the pass by reference to const idiom. I've also added a const to the container size because it really shouldn't change at any point in the loop and now the compiler will warn us if it does change. It is really important to note that the calling code stayed the same.

The template change from before is now starting to show some concrete benefits. Until you actually start maintaining code the benefits from removing code duplication are somewhat intangible. Previously we would have needed to change double the number of functions to get these changes integrated. And this is a fairly tame case too, if we instead had 4 or 5 different overloaded functions then the reduction in effort making the changes would have been even more substantial.

The code compiles and runs fine so time to commit:

(master)$ git add main.cpp
(master)$ git commit -m "Passed in containers by const reference to avoid making copies"
[master 7836eb7] Passed in containers by const reference to avoid making copies
 1 file changed, 13 insertions(+), 13 deletions(-)

Something you might notice is that with each step here we are increasing the seperation of containers from algorithms. We can go further in this direction too, but first...

Intermezzo 3: creating documentation

We are starting to get some code now that is doing something a bit more substantial. Other people might even use this code at this point so we really need to take a step back and document the code now.

Code is read more than it is written

I'm a big fan of using something like doxygen to generate the documentation from the source because the documentation then stays close to the code. (Documentation going out of sync with the code it is documenting is frequently worse than having no documentation at all)

We really should document everything that has a public interface as this is the place that people will look when they are using our code:

/** Fills an array with sequential integers
 * @param reference to the array that we are filling
 */
void fillSequentialIntegersArray(seq_array_t&);

/** Fills an array with sequential fibonacci numbers
 * @param reference to the array that we are filling
 */
void fillFibonacciArray(fib_array_t&);

/** Calculate the mean for a collection of items
 *@tparam the type of the collection
 *@param the collection of items for which we are calculating the mean
 */
template< typename containerT >
double mean(containerT const& items);

/** Calculate the standard deviation for a collection of items
 *@tparam the type of the collection
 *@param the collection of items for which we are calculating the standard deviation
 */
template< typename containerT >
double standardDeviation(containerT const& items);

/** Calculates the mean of a container of items then outputs it to console
 *@param the collection of items we are calculating then displaying the mean of
 */
template< typename containerT >
void outputMean(containerT const& items);

/** Calculates the standard deviation of a container of items then outputs it to console
 *@param the collection of items we are calculating then displaying the standard deviation of
 */
template< typename containerT >
void outputStandardDeviation(containerT const& items);

Arguably this could be better. For the purposes of the article I think it illustrates the concept well enough though.

(master)$ git add main.cpp
(master)$ git commit -m "Documented publicly facing interfaces with doxygen comments"
[master 31d2f91] Documented publicly facing interfaces with doxygen comments
 1 file changed, 21 insertions(+)

Another sensible thing to do at this point is to write some unit tests. When more is added and the complexity increases unit tests become more and more valuable. While we could get by before by just checking the output on our terminal screen against the expected outputs this get less and less feasible as the size of the code grows. I'd argue that we are already at the point where it's best to write some tests before continuing.

While I would highly recommend using an actual unit testing library such as boost unit test library I think that for the purposes of keeing this article self contained we can just write a few asserts using the results from earlier. This way we have a baseline for results that any new changes should maintain.

#include <cassert>

/** Asserts that 2 floating point numbers are within epsilon of each other */
template< typename T >
void assert_floats_equality(T f1, T f2, T epsilon){
   assert((f1 < f2 + epsilon) && (f1 > f2 - epsilon));
}

bool run_unit_tests(){
   seq_array_t SequentialNumbers;
   fib_array_t FibonacciNumbers;

   fillSequentialIntegersArray(SequentialNumbers);
   fillFibonacciArray(FibonacciNumbers);

   double epsilon = 0.01;

   assert_floats_equality(mean(SequentialNumbers), 50.5, epsilon);
   assert_floats_equality(mean(FibonacciNumbers), 547.25, epsilon);

   assert_floats_equality(standardDeviation(SequentialNumbers), 28.8661, epsilon);
   assert_floats_equality(standardDeviation(FibonacciNumbers), 1055.81, epsilon);

   return true;
}

int main(int argc, char* argv[])
{
   run_unit_tests();

   //rest of main...
}

Now we have the beginnings of a unit test suite. From this base we can start to make changes with some more confidence that our code is actually correct. For example if we made the same mistake as before with the array bounds with the Fibonnaci numbers array the unit test would have failed and indicated that there was a problem with our code.

The helper function to compare floats is necessary because comparing floats with each other directly causes problems. The reasons for this are discussed in the Goldbergs paper What Every Computer Scientist Should Know About Floating-Point Arithmetic. Any decent unit testing framework will implement methods like the assert_floats_equality better that I have here, so if you are writing real code it's best to go with an established unit testing framework.

(master)$ git add main.cpp
(master)$ git commit -m "Created an ad-hoc unit testing framework and made some tests"
[master 2ff3f12] Created an ad-hoc unit testing framework and made some tests
 1 file changed, 27 insertions(+)

Making the code more generic

Now that we have a few tests and some documentation we can make some progress We can see that this code could be more generic than it currently is.

An astute reader might ask, now that we have more generic code for types what happens if someone passes a type that's not able to have a mean or standard deviation calculated for it.

Let's try it!

int main(int argc, char* argv[])
{
   std::string string_of_numbers = "0123456789";
   outputMean(string_of_numbers);
   outputStandardDeviation(string_of_numbers);
   return 0;
}

When we compile there's no warnings or errors. Running this we get:

(master)$ ./Output

The mean is: 52.5

The standard deviation is: 2.87228

At this point I'm going to commit the code so that the people following along with git can see where this is at:

(master)$ git commit -m "Passing in a std::string to out mean and standard deviation functions"
[master 793c43a] Passing in a std::string to out mean and standard deviation functions
 1 file changed, 8 insertions(+), 3 deletions(-)

When we compile we don't get any warnings because std::string is a container with all the member functions required for this code to compile. Running this gives us some values but those values might not be what you would expect.

The reason we get the mean as 52.5 is because the mean is calculated by iterating over each character and implicitly casting those to the integers. When we pass in "0123456789" we are actually finding the mean of:

{'0','1','2','3','4','5','6','7','8','9'}

You might think that the mean here should be 4.5 which would be entirely reasonable. However because the character encoding used here is ASCII this corresponds to the integer representation:

{48,49,50,51,52,53,54,55,56,57}

The mean of which is 52.5 and the std-dev is 2.87

While this result is predictable it might not be what the user really wants. At some level it does make some sense though as long as we know that the characters in a string get converted to the integers that are used to represent them.

But what about completely innapropriate types?

To avoid our functions generating code for types we don't want we can create a static assert using type-traits.

To do this we change the functions to look like this:

#include <type_traits>

template< typename containerT >
double mean(containerT const& items)
{
   static_assert(std::is_arithmetic< typename containerT::value_type >::value,
                 "This function only operates on containers with numeric types");
   double sumOfElements = 0;
   const size_t container_size = items.size();
   for (size_t i = 0; i < container_size; i++)
   {
       sumOfElements += items[i];
   }
   return sumOfElements / container_size;
}

Now let's try passing in an array of pointers.

int main(int argc, char* argv[])
{
   std::array<int*, 10> array_of_pointers;
   outputMean(array_of_pointers);
   outputStandardDeviation(array_of_pointers);
   return 0;
}

Now when we try to compile:

(master)$ ./compile.sh
main.cpp: In instantiation of ‘double mean(const containerT&) [with containerT = std::array<int*, 10ul>]’:
main.cpp:164:4:   required from ‘void outputMean(const containerT&) [with containerT = std::array<int*, 10ul>]’
main.cpp:77:32:   required from here
main.cpp:134:4: error: static assertion failed: This function only operates on containers with numeric types
main.cpp:140:8: error: invalid operands of types ‘double’ and ‘const value_type {aka int* const}’ to binary ‘operator+’
main.cpp:140:8: error:   in evaluation of ‘operator+=(double, const value_type {aka int* const})’
main.cpp: In instantiation of ‘double standardDeviation(const containerT&) [with containerT = std::array<int*, 10ul>]’:
main.cpp:172:4:   required from ‘void outputStandardDeviation(const containerT&) [with containerT = std::array<int*, 10ul>]’
main.cpp:78:45:   required from here
main.cpp:148:4: error: static assertion failed: This function only operates on containers with numeric types
main.cpp:155:8: error: invalid operands of types ‘const value_type {aka int* const}’ and ‘const double’ to binary ‘operator-’

Now we get a compile time failure when we get a non-arithmetic value type in the container. This is good. Failing as fast and as visibly as possible has a lot of benefits in this context. [2] In this case having invalid code be caught at compile time removes a category of runtime issues. . In this case we can now spot a class of bugs automatically and immediately which reduces the time we need to spend on debugging.

Let's commit this to git:

(master)$ git add main.cpp
(master)$ git commit -m "Passed in an array of pointer type to demonstrate the static_assert failing"
[master 603efc3] Passed in an array of pointer type to demonstrate the static_assert failing
 1 file changed, 8 insertions(+), 3 deletions(-)

Given that we are aiming to make the code as generic as possible the next step is to change the fill functions to take generic container types.

/** Fills a container with sequential integers
 * @param reference to the container that we are filling
 */
template< typename containerT >
void fillSequentialIntegers(containerT& items);

template< typename containerT >
void fillSequentialIntegers(containerT& container_to_fill)
{
   static_assert(std::is_arithmetic< typename containerT::value_type >::value,
                 "This function only operates on containers with numeric types");

   const size_t container_size = container_to_fill.size();
   for (size_t i = 0; i < container_size; i++){
       container_to_fill[i] = i + 1;
   }
}

The Fibonacci is similar.

Compiling gives no warnings and running it is fine so let's commit:

(master)$ git add main.cpp
(master)$ git commit -m "Made the fill functions generic"[master e8b93a1] Made the fill functions generic
 1 file changed, 32 insertions(+), 26 deletions(-)

Conclusion

We have attempted to make the code here more generic and also more type safe. By using std::array and templates we have removed the possibility of the array bounds bug and also the dead code section that caused us problems in part 1.

There is a bit more that could be done here but we start to reach diminishing returns fairly quickly. Say we were writing a library then we would want to push for as much as possible. When a lot of people are using the code then small improvements are important as the product of the number of users with the utility gained per user starts getting substantial.

If I get time I will attempt to make a part 3 in the series that attempts to do just that.

[1]Generally speaking the only time that you want to use raw C arrays is within memory management code. The less you use raw C arrays internally in your code the better. By wrapping the raw C arrays with the c++ language features such as smart pointers and RAII you will aid your teams productivity and create safer code. If you need to send raw arrays to some API then extract those raw arrays at the API call site. For example say an OS call requires you send a C style null terminaled string this doesn't mean you have to structure your code to internally pass around raw char* arrays. The better solution is to use std::string and convert to the C style strings using c_str() at the API call locations.
[2]When you are developing your code it's good to get the smallest feedback loop possible. The main idea is that things that would cause your program to not work as intended should fail as visibly as possible as early as possible in the development cycle. The shorter the amount of time that it takes to identify a bug the better because the longer a bug exists the harder and more expensive they tend to be to fix. Static type systems when used properly can help you find a lot of bugs. Compile time is one of the best places to try to have failures stop the process because you get the benefit of better program correctness without any adverse impact on the UI/UX for the end users. Dealing with runtime failures is more difficult because we also have to take into account UI/UX issues, see this blog post for more on that: http://blog.codinghorror.com/whats-worse-than-crashing/

blogroll

social