Science and software testing

Software testing, particularly manual software testing, is sometimes thought of as nothing more than following a script to confirm that the software does what it was designed to do. From that perspective, testing might seem like a boring and relatively mindless task. And to be honest, that is the traditional view of testing as part of the Waterfall method of software development in large organisations. Division of labour meant that there were some people who did nothing but follow scripts someone else had written, and report bugs that someone else would fix.

Science, on the other hand, is undeniably interesting and challenging. So if you share the impression that software testing is boring, you might be surprised to know that I find both engaging and worth spending my time and effort on¹. Having worked as a software tester, and having studied a scientific field (cognitive science), I’ve noticed some similarities that help explain why I’m drawn to both pursuits despite how different they appear.

Science can be defined as:

“The intellectual and practical activity encompassing the systematic study of the structure and behaviour of the physical and natural world through observation and experiment.”

That doesn’t seem to describe following test scripts at all, even if you swap “the physical and natural world” for “the software under test”, and even if you include the task of writing the scripts. But if you consider the entire process of software testing you’ll see similarities emerge. For one thing, test scripts have to be written based on something, and in today’s world of agile software development, that something is usually not requirements handed down from designers, but rather requirements explored, developed, and refined iteratively. Observation and experiment are a big part of that iterative process. This is especially the case when working on existing software that doesn’t have good documentation—how else could you figure out how the software works except through observation and experiment? Even if you have access to the code, it’s unlikely you could read it and know exactly how the software will behave. And there isn’t always someone else around to ask.

The reality of software testing involves a lot more than following a script. A more complete definition of testing is:

“Testing is the process of evaluating a product by learning about it through exploration and experimentation, which includes to some degree: questioning, study, modeling, observation, inference, etc.”

When defined that way, it’s much clearer how testing and science are similar. Questioning, study, modeling, observation, and inference are all core aspects of science and testing.

In testing, we question whether the software does what we expect it to do. We question whether it does what customers want it to do and in the way they want. We question whether a code change has unintended effects. We study how the software behaves under various conditions. We construct models of how we believe the software performs, even if they’re only mental models. We observe how the software responds to input. And ultimately we make inferences about the quality of the software.

Another similarity between science and software testing is that neither process truly has an end. There is always more to discover through science, even at the end of a project that has produced significant insights. And there is always more to learn about any but the simplest software. In both science and testing, it’s not meaningful to think of the entire endeavour as something that can be finished, but it is useful to define a reasonable milestone as the completion of a project. We don’t finish testing when there are no bugs left, because that will never happen, but we can consider testing complete when the software behaves well under a reasonable range of scenarios.

Science is often described as trying to prove things². That is not the aim of science, nor is it how science works. Science is, in part, a way of trying to better understand the world; it is also the knowledge produced by that process. The scientific method involves making a hypothesis and then gathering evidence and analysing data to draw conclusions about whether the hypothesis is supported. It’s possible to find evidence that rules out a hypothesis, but it’s not possible to find evidence that a particular hypothesis is the only explanation for the data. This is because other hypotheses might explain that evidence just as well, including hypotheses that no-one has come up with yet. But after carefully analysing the results of many experiments, a clearer understanding can begin to emerge (in the form of a theory). In that way you can think of science as showing what doesn’t work until there’s a reasonably solid explanation left. It’s not about being right; it’s about being less wrong.

Similarly, testing isn’t about proving that the software is bug free; it’s about providing evidence that you can use the software without any significant issues, so that what’s left is reasonably solid. It’s also not about proving that the software does exactly what the customer wants, but it is about helping to iteratively improve the customer’s satisfaction with the software. This is an important part of software testing that’s sometimes forgotten—the aim isn’t solely to find bugs, but also to find unexpected, unusual, or confusing behaviour.

On the other hand, there are plenty of ways in which science and testing are different. But I’ll leave that for another post.

1. Not so much manual testing specifically, but a comprehensive approach to testing that includes exploratory testing and automation.

2. Do a search for "science proves". It's enough to make a scientist or philosopher or mathematician cry.


We're neither rock stars nor impostors

Recently, Rach Smith raised some important points about how we tend to talk about impostor syndrome:

  • it minimizes the impact that this experience has on people that really do suffer from it.
  • we’re labelling what should be considered positive personality traits - humility, an acceptance that we can’t be right all the time, a desire to know more, as a “syndrome” that we need to “deal with”, “get over” or “get past”.

If you haven’t read her post yet I highly recommend you do. The issue came up again during Rach’s chat with Dave on Developer on Fire.

I can’t truly say I’ve experienced impostor syndrome, although I suspect that’s mostly because I’ve often been in small teams where everyone was similarly skilled. For example, I was once one of two novice web developers in a product development team. We really didn’t know what we were doing. I did feel unqualified, but since there was no one more experienced to compare myself against I didn’t feel like an impostor. But I did suffer from low self-confidence and a huge pile of self-doubt. Fortunately, experience and education have helped me come to grips with the limits of my knowledge and ability. I’m sure that self-awareness has contributed to better performance independently of any increase in my skills.

It all got me thinking about my experience with how jobs are advertised and how interviews are conducted, about the pressure to elevate one’s technical skills, about the growing awareness of the importance of “soft” skills, and about the rock star culture that’s promoted in some parts of the industry.

Rach noted that even highly successful senior developers sometimes experience self-doubt and an awareness of gaps in their knowledge. This is something that is all too often missing from discussions about preparing for interviews, especially for highly sought-after positions. We’re always told to prepare extensively (good advice) and to project confidence (sure, projecting a lack of confidence is understandably unhelpful), but the best advice also points out the importance of knowing the limits of one’s skills and knowledge so that those limits can be managed. Much of the advice I remember from my early days suggested I should do my best to cover up my weaknesses. I don’t believe that did anything but lead to feelings of insecurity, and inevitably to falling apart when the limits of my knowledge were revealed. Later, I received much better advice: to be able to say “I don’t know,” and then to work through the problem aloud, asking questions to fill in the gaps until I have enough understanding to give a reasonable answer. And isn’t that more or less how we work each day? If anyone actually had the supreme skills and confidence we’re naively advised to portray during interviews, I’m pretty sure they wouldn’t find the job challenging or interesting enough (and would likely inflict their arrogance and the consequences of their boredom on the rest of us).

Another topic that used to be missing from career advice, though fortunately less often these days, is the importance of soft skills. As Rach noted, “the most accomplished developers [have] constant awareness of the ‘gap’ in their knowledge and willingness to work towards closing it.” That sort of awareness is as important a soft skill as general social and communication skills. It’s a key part of metacognition. The people I’ve most enjoyed working with are those who freely admit their limitations and strive daily to eliminate them. That effort shows in contributions that go above and beyond the explicit requirements of their role. Among the worst people to work with are those who do the minimum work required, without any awareness of the opportunities for improvement that pass them by every day. Even worse are those who perform at a similar level while believing that they are in fact contributing much more, and at a much greater level of competence¹. The latter type of person is unlikely to experience anything that might be called “impostor syndrome”, although if anyone were truly an impostor, it would be them.

Beyond a growing understanding of the importance of interpersonal soft skills, there are many other non-technical skills that make a solid team member. For example, the O*NET database lists active learning near the top of the skills rated as important for programmers². And yet typical hiring practices overwhelmingly prioritise immediate technical skills. I’m confident that’s a big part of the reason “rock star” developers are those seen as having the greatest skills rather than those most able to learn and improve. Yet the former doesn’t imply the latter, especially when those skills lie in one highly specific domain: you can learn to do one thing really well without being able to generalise that skill, and without possessing other distinct but important skills. The other downsides of specialisation are a topic for another post.

Similarly, the poor attitudes and bad behaviours of some workers are accepted because of their technical skills, despite the negative impact they have on the people around them. I suspect this is a subtle contributor to feeling like an impostor: we provide a perverse incentive for people to behave in ways that no reasonable person wants to. Our industry favours those who promote themselves as the best coder, the most knowledgeable developer, the ideal technical candidate, and we (at least implicitly) discourage people from embracing their full range of skills and their ability to improve.


1. The Dunning-Kruger effect in effect, so to speak.

2. Although communication skills are apparently the #1 requirement in computing-related job ads, other soft skills and transferable technical skills are far less frequently mentioned.


Summer of Data Science 2017 - Final Update

Ok, so it’s not summer any more. My defence is that I did this work during summer but I’m only writing about it now.

To recap, I’d been working on a smart filter: a system to predict articles I’d like, based on articles I’d previously found interesting. I’m calling it my rss-thingy / smart filter / information assistant¹. I’m tempted to call it “theia”, short for “the information assistant” and a play on “the AI”, but it sounds too much like a Siri rip-off. Which it’s not.

Aaaanyway, I’d collected 660 interesting articles and 801 that I didn’t find interesting–fewer than expected, but I had to get rid of some that were too short or weren’t really articles (e.g., lists of links, or GitHub repositories). There was also a bit of manual work to make sure none of the ‘misses’ were actually ‘hits’: I didn’t want interesting articles turning up as misses, so I skimmed through all the misses to make sure they weren’t coincidentally interesting (there were a few). The hits and misses then went into separate folders, ready to be loaded by scikit-learn.
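In case it’s useful, scikit-learn can load that kind of folder layout directly. Roughly (the folder names here are just for illustration):

```python
from sklearn.datasets import load_files

# Assumed layout: articles/hits/*.txt and articles/misses/*.txt
# load_files treats each subfolder as a class label and reads every file in it.
dataset = load_files("articles", encoding="utf-8", decode_error="replace")
texts, labels = dataset.data, dataset.target  # one label (0 or 1) per folder
print(dataset.target_names, len(texts))
```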

I used scikit-learn to vectorise the documents as a tf-idf matrix, and then trained a linear support vector machine and a naive Bayes classifier. Both showed reasonable precision and recall on my first attempt, but tests on new articles showed that the classifiers tended to categorise articles as misses even when I did find them interesting. That’s not particularly surprising; most articles I’m exposed to are not particularly interesting, and such simple models trained on a relatively small dataset are unlikely to be especially accurate at picking out the exceptions. I spent a little time tuning the models without getting very far, and decided to take a step sideways before going further.
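The pipeline was, roughly, something like the sketch below (the split and vectoriser settings are illustrative, not carefully chosen):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# texts and labels as loaded above; hold out a test set for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

# tf-idf turns each document into a sparse, weighted bag-of-words vector
vectoriser = TfidfVectorizer(stop_words="english", max_df=0.95, min_df=2)
X_train_tfidf = vectoriser.fit_transform(X_train)
X_test_tfidf = vectoriser.transform(X_test)

for model in (LinearSVC(), MultinomialNB()):
    model.fit(X_train_tfidf, y_train)
    print(type(model).__name__)
    print(classification_report(y_test, model.predict(X_test_tfidf)))
```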

Eventually I’ll want to group potentially interesting articles, so I wrote up a quick topic analysis of the articles I liked, comparing non-negative matrix factorization with latent Dirichlet allocation. Both did a reasonable job of identifying common themes, including brain research, health research, science, technology, politics, testing, and, of course, data science.
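A rough sketch of that comparison (the number of topics and the vectoriser settings are illustrative, and `hits` stands for the list of interesting-article texts):

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

n_topics, n_top_words = 8, 10

def print_topics(model, feature_names):
    # show the highest-weighted terms for each discovered topic
    for i, topic in enumerate(model.components_):
        top = [feature_names[j] for j in topic.argsort()[:-n_top_words - 1:-1]]
        print(f"Topic {i}: {' '.join(top)}")

# NMF is usually run on tf-idf weights...
tfidf = TfidfVectorizer(stop_words="english", max_df=0.95, min_df=2)
tfidf_matrix = tfidf.fit_transform(hits)
nmf = NMF(n_components=n_topics, random_state=1).fit(tfidf_matrix)
print_topics(nmf, tfidf.get_feature_names_out())

# ...while LDA expects raw term counts
counts = CountVectorizer(stop_words="english", max_df=0.95, min_df=2)
count_matrix = counts.fit_transform(hits)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=1).fit(count_matrix)
print_topics(lda, counts.get_feature_names_out())
```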

You can see the code for this informal experiment on GitHub.

In my next experiment (now, not SoDS18!) I plan to refine the predictions by paying more attention to cleaning and pre-processing the data. I also need to brush up on tuning these models. And I’ll use the trained models to make ranked predictions rather than simple binary classifications. The dataset will be a little bigger this time, at around 800 interesting articles and a few thousand not-so-interesting ones.
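The ranking part is simple enough to sketch now: instead of asking for a hard hit/miss decision, ask the model for a score and sort new articles by it (`model` and `vectoriser` are the trained objects from the earlier sketch, and `new_articles` is a list of article texts):

```python
import numpy as np

X_new = vectoriser.transform(new_articles)

# LinearSVC exposes a margin via decision_function; for MultinomialNB you'd
# use predict_proba(X_new)[:, 1] instead.
scores = model.decision_function(X_new)

# Read from the most promising article down, rather than trusting a cut-off.
for idx in np.argsort(scores)[::-1][:20]:
    print(f"{scores[idx]:+.3f}  {new_articles[idx][:80]!r}")
```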

1. Given all the trouble I have naming things, I'm really glad I haven't had to do any cache-invalidation yet.


Summer of Data Science 2017 - Update 1

My dataset/corpus is coming together.

It was relatively easy to create a set of text files from the articles I’d saved to Evernote. It’s taking more time to collect a set of articles that I didn’t find interesting enough to save. I’ll make that easier in the future by automatically saving all the articles that pass through my feed reader, but for now I’m grabbing copies from CommonCrawl. That saves me the trouble of crawling dozens of different websites, but I still have to search the CommonCrawl index to pick out the articles from everything else it has captured for each site.

I created a list of all the sites I’d saved at least one article from, then downloaded the CommonCrawl index records for each site from the last two years. Next, I filtered the records to include only pages that were likely to be articles (e.g., no ‘about’ or ‘contact’ pages). I took a random sample of up to 100 of the remaining records for each site, downloaded the corresponding WARC records, and then extracted and saved each article’s text. I’ll make all the code available once I’ve polished it a little.
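Until then, here’s a rough sketch of the fetching step (the crawl collection, URLs, and filtering rules shown are illustrative):

```python
import io
import json
import requests
from warcio.archiveiterator import ArchiveIterator

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2017-26-index"  # one example crawl
DATA_URL = "https://data.commoncrawl.org/"

def index_records(domain):
    """Query the CommonCrawl index for every capture of a domain."""
    resp = requests.get(CDX_API, params={"url": f"{domain}/*", "output": "json"})
    return [json.loads(line) for line in resp.text.splitlines() if line]

def fetch_html(record):
    """Download only the byte range holding one WARC record and return its HTML."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    resp = requests.get(DATA_URL + record["filename"],
                        headers={"Range": f"bytes={start}-{end}"})
    for warc in ArchiveIterator(io.BytesIO(resp.content)):
        if warc.rec_type == "response":
            return warc.content_stream().read()  # raw HTML; text extraction comes next

# Keep only records that look like articles, e.g. drop obvious non-article pages.
records = [r for r in index_records("example.com")
           if not any(p in r["url"] for p in ("/about", "/contact"))]
```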

The next step will be to explore the dataset a little before diving into topic analysis.


Summer of Data Science 2017

Goal: To launch* my learn’ed system for coping with the information firehose

I heard about the Summer of Data Science 2017 recently and decided to join in. I like learning by doing so I chose a personal project as the focus of my efforts to enhance my data science skills.

For the past forever I’ve been stopping and starting one side-project in particular. It’s a system that searches and filters many sources of information to provide me with articles/papers/web pages relevant to my interests. It will use NLP and machine learning to model my interests and to predict whether I’m likely to find a new article worthwhile. Like a recommender system but just for me, because I’m selfish. Something like Winds. The idea is to collect all the articles I read/skim/ignore via an RSS reader, and tag those I find interesting. And to build up a Zotero collection of papers of several degrees of quality and interest. Those tagged and untagged articles and papers will comprise my datasets. There is a lot more to this project, but that’s the core of it.

My first (mis)step was to begin building an RSS reader that could automatically gather data on my reading habits, which I could use to infer interest from my behaviour: whether I clicked a link to the full article, how long I spent reading it, whether I shared it, etc. Recently I decided that was not the best use of my time, as it would be much easier to start with explicitly tagged articles–I can start gathering those without creating a new tool. So I’m doing that by saving interesting articles to Evernote. Today I have just under 900. I can use CommonCrawl to get all the articles I didn’t find interesting on the relevant sites (i.e., the articles that would have appeared in my RSS reader, but which I didn’t save).

There are many things I’ll need to do before I’m done, but all of those depend on having a dataset I can analyse. So my next step will be to turn those Evernote notes and other articles into a dataset suitable for consumption by NLP tools. Given the tools available for transforming text-based datasets from one format to another, I’m not going to spend much time choosing a particular format. I’ll start with a set of plain-text copies of each article and associated metadata, and take it from there.
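As a sketch, each article might end up as a plain-text file plus a small JSON sidecar for its metadata (the layout and field names are illustrative, not a fixed schema):

```python
import json
from pathlib import Path

def save_article(out_dir, article_id, text, metadata):
    """Write one article as a plain-text file plus a JSON metadata sidecar."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{article_id}.txt").write_text(text, encoding="utf-8")
    (out / f"{article_id}.json").write_text(json.dumps(metadata, indent=2),
                                            encoding="utf-8")

save_article("corpus/hits", "0001", "Article body goes here...",
             {"title": "Example post", "url": "https://example.com/post",
              "source": "evernote"})
```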

I’ve been less consistent in gathering research papers. I’ve been saving the best papers I’ve stumbled across, but I could do much better by approaching it as a research project, i.e., by doing a proper literature review. That’s a huge task, so I’ll focus on analysing web articles first.

*I was going to write "complete" but really, it'll always be changing and will probably never be complete. But ready for automated capture and analysis? Sure, I can make that happen.
