Dependencies and Their Socio-Technical Duality

The design and implementation of complex software systems involve a wide range of dependencies among the various constituent parts of a system. Developers, managers, and other relevant stakeholders often can’t recognize and manage all those dependencies. Such failures typically translate into lower productivity because, among other things, they require more rework and more time spent on integration and testing [Cataldo et al. 2006] [Cataldo et al. 2008] [Herbsleb and Mockus 2003]. We also tend to see an impact on software quality, where unrecognized dependencies result in a higher number of defects [Cataldo 2010] [Herbsleb et al. 2006].

A major challenge we all face in software development is identifying dependencies. This is not easy, and two interrelated dimensions are at play: a technical component and a socio-organizational component. For example, in certain cases, the technical nature of a dependency makes it hard to recognize. A classic example would be asynchronous remote procedure calls between a pair of components that can create timing, locking, and resource consumption dependencies that might become visible only when developers are faced with a particular defect that exposes such dependencies.

In other cases, dependencies fall into a socio-organizational category, stemming from the way work is organized and carried out in the development organization. For instance, it is quite common to allocate work based on the availability of development resources. Such an approach could create work dependencies, such as information sharing needs, between individuals located in different parts of the world, sometimes with minimal work time overlap (e.g., time zone difference larger than 7‒8 hours, as one finds between the USA and India). Under such circumstances, it is highly likely that those dependencies would not be addressed during the work.

To illustrate some of the challenges involved in identifying the relevant dependencies in software projects, consider the following example described by Bass et al., which is representative of many large development projects [Bass et al. 2007]. System 1 was a software platform designed to support a family of real-time embedded products. Given the products’ wide range of technical requirements (the features supported, memory requirements, timing requirements, etc.), new functionality was expected to be added regularly over the lifetime of the platform.

The architecture team designed System 1 utilizing several approaches that had proved successful in past projects. It was developed as a component-based solution using a component integration framework, where the coupling among components was minimal and based on well-defined interfaces, allowing for development by independent teams. The development teams were to ensure that they developed their particular modules against the interface specification.

The architecture team addressed the performance and resource utilization requirements by building on past experience and applying the principle of separation of concerns. For example, the RAM was divided into several logical partitions, each assigned to a particular part of the system, and a priority-based scheduler was used to satisfy the performance requirements. In other words, the overall design followed well-established principles and practices.

On the organizational side, the project involved a total of four development sites, two in Germany and two in France. There was a central architecture team made up of the most qualified representatives from each site, and each site had responsibility for the design and development of one or more subsystems. The project was managed from one of the development sites in Germany, and the project manager travelled quite frequently to all the development sites. Each site had about the same number of engineers.

Nevertheless, Bass et al. described several serious problems encountered by the project. I will focus on two problems that provide good examples of how challenging it is to identify technical and work dependencies in large distributed projects.

First, after the initial implementation, the system had serious performance problems, particularly during system startup. Based on the experience from past projects, the architecture team assumed that the prioritization rules implemented in the scheduler were adequate to allow the teams to work independently. However, they failed to recognize that a significant amount of tweaking of the prioritization rules from previous systems was required. Resolving the performance conflicts resulted in a significant coordination overhead.

Second, the organizational structure inhibited coordination around the performance issues. The architecture team deferred many of the design decisions required for each component or subsystem to the corresponding site. The impact of the local decisions was not recognized until late in the development life cycle. Once a dependency was recognized, it was difficult to coordinate effectively because the teams were distributed and were only vaguely familiar with the activities of the other sites.

This example shows that the technical and organizational dimensions of a project do not operate in isolation from each other. Quite the contrary: each dimension influences the other. In the following section, I discuss the various types of technical dependencies, how they can impact productivity and quality when they are not recognized and coordinated appropriately, and the traditional approaches used to discover such dependencies. I then present an overview of traditional types of work dependencies along similar lines. Following that, I explore the socio-technical duality of dependencies and its implications for productivity and software quality in the context of global software development (GSD).

The Technical Dimension

Technical dependencies are relationships among the various software entities that constitute a software system. During architectural or detail design, such entities can be components, modules, or classes. On the other hand, during implementation, the entities of interest are source code files.

Software engineers commonly think of technical dependencies as syntactic relationships; that is, they manifest themselves in terms of programming language constructs such as data structures or function and method calls. A different way of thinking about technical dependencies focuses on a logical or semantic relationship between software entities. For instance, two components that implement part of a requirement are logically related.

Publisher/subscriber systems represent another example of semantic or logical dependencies, where the relationship between software entities is not explicitly articulated by a call, for instance, from one module to the other. Our discussion focuses primarily on the distinction between syntactic and logical dependencies, because this distinction points directly to the issue of the ease with which the dependency can be identified.

The idea of syntactic dependencies has its origins in compiler optimizations, where the main goal was to understand control and dataflow relationships across statements. Most approaches in this line of work extract relational information, typically from source code or some sort of intermediate representation of the code, such as bytecodes or abstract syntax trees. Then, they analyze units such as statements, functions, or methods to identify relationships that can reveal data-related dependencies (e.g., a particular data structure modified by a function and used in another function) or functional dependencies (e.g., method A calls method B).
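As a minimal sketch of this kind of extraction, the following Python snippet parses source text into an abstract syntax tree with the standard `ast` module and records which functions each top-level function calls by name. The sample source and the function name `call_dependencies` are hypothetical, chosen only for illustration; real analyzers handle far more constructs.

```python
import ast

# Hypothetical source text to analyze; any Python module would do.
SOURCE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def main(path):
    return parse(load(path))
'''

def call_dependencies(source):
    """Map each top-level function to the set of plain names it calls."""
    tree = ast.parse(source)
    deps = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            called = set()
            for inner in ast.walk(node):
                # Only direct calls by name are recorded. Method calls such
                # as text.split() and calls through variables are skipped,
                # one reason syntactic analysis can be incomplete.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    called.add(inner.func.id)
            deps[node.name] = called
    return deps

deps = call_dependencies(SOURCE)
```

Here `deps["main"]` contains `parse` and `load`, a functional dependency of the "method A calls method B" kind.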

Syntactic dependency analyses are used in a wide variety of applications, ranging from static analysis to detect defects to tools that assist developers in understanding and debugging code. Unfortunately, this type of relational information has some problems. In certain cases, it tends not to be accurate. For instance, in programming languages such as C and C++, function pointers and conditional compilation directives tend to create a lot of difficulties for syntactic analyzers based on source code. In the case of object-oriented languages, polymorphism makes the identification of a relationship between two classes almost impossible prior to runtime. In other cases, syntactic dependency information is overwhelming, uncovering tens or hundreds of relationships between source code files and making the process of identifying the relevant relationships quite challenging.

An alternative mechanism for identifying technical dependencies consists of examining the set of source code files that are modified together as part of some unit of development work, such as the development of a new feature or fixing a defect. Gall et al. suggested that files that changed together (for instance, in a version control commit transaction) share some sort of dependency, which they called logical [Gall et al. 1998]. Certainly, logical dependencies are not completely different from syntactic dependencies. In fact, they can range from syntactic relationships (e.g., the commit was due to a change in the number of parameters to a function that is implemented in file A and called from file B) to more complex semantic dependencies where the computations done in one file affect the behavior of another file.
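The co-change idea can be sketched in a few lines of Python. The commit data below is invented, and counting every pair of files in a commit is just one simple way to operationalize "changed together"; it is not necessarily the procedure used in the cited study.

```python
from collections import Counter
from itertools import combinations

# Hypothetical commit history: each commit is the set of files it touched.
commits = [
    {"a.c", "b.c"},
    {"a.c", "b.c", "c.c"},
    {"c.c"},
    {"a.c", "b.c"},
]

def logical_dependencies(commits):
    """Count how often each pair of files was modified in the same commit."""
    counts = Counter()
    for files in commits:
        # Sorting gives every pair a canonical (alphabetical) order.
        for pair in combinations(sorted(files), 2):
            counts[pair] += 1
    return counts

co_change = logical_dependencies(commits)
```

In this toy history, `a.c` and `b.c` changed together in three commits, which suggests a stronger logical dependency than either file has with `c.c`.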

One attractive characteristic of this way of thinking about technical dependencies is that it provides a better estimate for semantic dependencies relative to call graphs or data graphs, because it does not rely on language constructs to establish the dependency relationship between source code files. For instance, in the case of remote procedure calls (RPCs), a syntactic dependency approach would provide the necessary information to relate a pair of modules. However, such information would be embedded in a long path of connections from the RPC caller through the RPC stubs and all the way to the RPC server module. Alternatively, when the module invoking the RPC and the module implementing the RPC server are changed together, a logical dependency is created, showing a direct dependency between the affected source code files.

The logical dependency approach is even more valuable in cases such as publisher/subscriber or event-based systems. Here, the call-graph approach would fail to relate the interdependent modules because no syntactically visible dependency would exist between, for instance, a module that generates an event and a module that registers to receive such an event. Therefore, the logical dependency approach has the potential to identify important dependencies not visible in syntactic code analyses.
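To make the point concrete, here is a small, hypothetical event bus in Python. The class `EventBus` and the two functions are illustrative names only. The publisher never mentions the subscriber in its code, so a call-graph analysis of `read_sensor` would not link the two.

```python
from collections import defaultdict

class EventBus:
    """A toy publish/subscribe hub keyed by topic name."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self._handlers[topic]:
            handler(payload)

bus = EventBus()
received = []

def log_temperature(value):
    # Subscriber: reacts to an event, unaware of who produced it.
    received.append(value)

def read_sensor():
    # Publisher: emits an event, unaware of who consumes it.
    bus.publish("temperature", 21.5)

bus.subscribe("temperature", log_temperature)
read_sensor()

# The names referenced by the publisher's code do not include the subscriber.
publisher_mentions_subscriber = "log_temperature" in read_sensor.__code__.co_names
```

The two functions are coupled at runtime through the topic string alone. If they live in different files that tend to be committed together, co-change data can surface that link even though the call graph does not.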

Moreover, logical dependencies can filter out syntactic dependencies that may not be relevant in terms of the information needs of developers. For example, the syntactic dependency approach highlights relationships among basic libraries (e.g., memory management, printing functionality, etc.) because they contain highly coupled files. Yet those libraries tend to be very stable and unlikely to fail, despite this high level of coupling. Because such stable files are rarely modified, they seldom show up as logical dependencies.

However, the logical view of technical dependencies also has its problems. The main challenge is that the dependency information is extracted from historical data such as that provided by version control systems. Without such a resource, there is no efficient way to determine the logical dependencies among source code files.

Despite the limitations of the two ways of characterizing technical dependencies, both the syntactic and the logical approaches provide useful and complementary information that can support, in multiple ways, the identification and management of dependencies in software development projects. The following sections discuss the implications of both types of dependencies for productivity and quality in software development projects.

Syntactic dependencies and their impact on productivity and quality

Syntactic dependencies are a simple vehicle to understand how traditional software engineering concepts such as coupling and cohesion affect the ability of an organization to efficiently develop high-quality software. Unfortunately, there is a lot more evidence about the effect of syntactic-based coupling and cohesion on software quality than on development productivity.

Banker et al. found that the more syntactic dependencies a software component had, the higher the maintenance effort associated with the component [Banker et al. 1998]. The increase in maintenance effort stems, primarily, from the challenge of understanding the higher levels of complexity associated with higher numbers of syntactic dependencies.

Ever since the ideas of coupling and cohesion were introduced in the mid-1970s, researchers have focused a lot of attention on the relationship between these concepts and software quality. This line of work has contributed a large collection of metrics, ranging from simple quantification of the syntactic dependencies of a software entity (e.g., the number of data type references or number of function calls made in a source code file) to more complex measures that attempt to capture the structural characteristics of coupling (e.g., the depth of the inheritance tree). Some of these measures are explored in Chapter 8, Beyond Lines of Code: Do We Need More Complexity Metrics?, by Israel Herraiz and Ahmed E. Hassan in this book.

The simplest way to summarize the extensive body of research is to say, “the higher the number of syntactic dependencies a software entity (e.g., a file, module or component) has, the more defects it will have.” However, the relationship between syntactic dependencies and software quality is not that simple. Recent research [Cataldo et al. 2009] has shown that data-related syntactic dependencies (e.g., data type references) are more likely to lead to defects than functional dependencies (e.g., function A calls function B). One possible reason for such a difference is that data-related dependencies tend to require more abstract thinking than functional relationships in order to understand how their content changes as a program executes.

The structure of the syntactic dependencies also matters. For instance, Zimmermann and Nagappan applied graph theoretic lenses to calculate network measures on the syntactic dependencies extracted from Windows binaries, and found that such measures, when combined with more traditional metrics (e.g., churn metrics), were useful for predicting post-release defects [Zimmermann and Nagappan 2008].
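The text does not list the specific network measures used in that study. As an assumption for illustration, the sketch below computes in-degree and out-degree, perhaps the simplest such measures, on a hypothetical dependency graph.

```python
# Hypothetical dependency edges, written as (dependent, dependency).
edges = [("ui", "core"), ("net", "core"), ("core", "util"), ("ui", "util")]

def degree_measures(edges):
    """Return in-degree and out-degree for every node in a directed graph."""
    nodes = sorted({name for edge in edges for name in edge})
    in_deg = {name: 0 for name in nodes}
    out_deg = {name: 0 for name in nodes}
    for src, dst in edges:
        out_deg[src] += 1   # src depends on one more entity
        in_deg[dst] += 1    # dst is depended upon by one more entity
    return {name: {"in": in_deg[name], "out": out_deg[name]} for name in nodes}

measures = degree_measures(edges)
```

An entity with a high in-degree, such as `core` here, is one that many others rely on. Per-entity values like these are the sort of feature that could be fed into a defect-prediction model.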

Logical dependencies and their impact on productivity and quality

Thinking of technical dependencies as logical relationships is a more recent idea than syntactic dependencies, which means, unfortunately, that we know a lot less about how logical dependencies affect software quality and development productivity. Earlier research results suggested that logical dependencies affected quality in ways similar to syntactic dependencies. However, recent findings suggest more interesting implications.

My colleagues and I [Cataldo et al. 2009] studied two large-scale systems from two distinct firms and found that the number of logical dependencies was a much better predictor of failures than syntactic dependencies. In fact, the impact of syntactic dependencies vanished when logical dependencies were considered. These results have important implications for developers because they suggest that the effort to identify and understand dependencies should focus on those less-explicit relationships rather than on the obvious and explicit syntactic dependencies.

A second useful finding is that the structure of the logical dependencies also affects quality. For instance, my colleagues and I, studying two large-scale systems [Cataldo et al. 2009], found that the likelihood of failures associated with a source code file, e.g., file A, decreases as the density of the logical dependencies among the files dependent on file A increases. Such results suggest that developers should become more cognizant of logical dependencies, how tight those relationships are among a set of files, and where to look to make sure that changes to one part of the system do not introduce problems elsewhere. More importantly, we know that the structure of logical dependencies matters more than the structure of syntactic dependencies. For instance, Nambiar and I studied a large-scale multinational development organization and found that the density of logical dependencies among architectural components was one of the more important factors affecting the quality outcomes of GSD projects, whereas no evidence of such effects was found for syntactic dependencies [Cataldo and Nambiar 2010].
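The density among the files related to file A can be illustrated with a short Python sketch. It assumes undirected logical dependencies and defines density as the share of possible pairs among A's neighbours that are themselves linked. The exact definition used in the cited studies is not given here, so treat this as one plausible reading.

```python
from itertools import combinations

# Hypothetical undirected logical dependencies between files.
logical = {("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")}

def neighbour_density(target, pairs):
    """Fraction of possible links among target's neighbours that exist."""
    linked = {frozenset(pair) for pair in pairs}
    neighbours = sorted(
        name for pair in linked if target in pair for name in pair if name != target
    )
    possible = len(neighbours) * (len(neighbours) - 1) // 2
    if possible == 0:
        return 0.0  # fewer than two neighbours: density is undefined, use 0
    present = sum(
        1 for pair in combinations(neighbours, 2) if frozenset(pair) in linked
    )
    return present / possible

density_a = neighbour_density("A", logical)
```

File A has three neighbours (B, C, D), and only one of the three possible pairs among them (B and C) is linked, so the density is one third.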

The Socio-Organizational Dimension

The social side of dependencies in software development has to do with the communication, information sharing, and coordination needs that emerge from the set of tasks that need to be performed in a particular software development project. We refer to these dependencies as work dependencies. It will be relatively obvious to those who have been involved in development projects, in particular large-scale ones, that numerous factors, such as experience, organizational structure, geographic distribution, and schedule pressure, introduce barriers that constrain project members’ ability to efficiently and effectively identify and manage all the relevant work dependencies.

For instance, when developers have limited experience of a system, they tend not to understand all the potential implications of the changes they make to the system as part of their tasks (e.g., satisfying pre-invocation and post-invocation assumptions of a method or function call). Those knowledge gaps, typically, result in lower productivity in the form of rework or in poorer quality.

However, experience can be important beyond the categories of technical or system know-how. Familiarity in working together with other team members could result in important improvements in productivity and quality because knowing the people you work with facilitates information sharing and coordination. For instance, engineers tend to develop implicit ways of coordinating with each other, learning for instance how and when to interrupt a colleague to ask for or share information.

An important and sometimes overlooked factor related to the identification and management of work dependencies is the structure of projects and the development organization. Such structures encompass several elements, such as the set of organizational units (e.g., teams, departments) involved, their geographical location, their formal reporting paths, their administrative and development processes, and even their incentive mechanisms. All those elements together play an important role in shaping the way project members interact, coordinate, and collaborate.

Geographically distributed software development organizations are at a disadvantage in terms of information sharing and integration. For instance, we know that developers share a lot of technical information related to their current activities through short, informal conversations that might take place in a hallway or around the coffee machine. When developers are physically separated, they no longer can have such impromptu communication. In turn, project members have a lot more difficulty staying aware of other people’s activities, decisions, and difficulties. The end result is an increase in coordination breakdowns, integration problems, and, ultimately, decreases in productivity and quality [de Souza et al. 2004] [Grinter et al. 1999].

However, being geographically distributed has other drawbacks besides the elimination of impromptu conversation. Large time zone differences (more than six hours) significantly reduce possibilities for real-time problem solving activities that require synchronous interaction, such as phone conversations or video conferencing. Typically, project members tend to opt for an asynchronous means of communication, such as email. Unfortunately, as Espinosa et al. have demonstrated, asynchronous communication tends to create a lot of misunderstanding and mistakes stemming from the complexity associated with adequate management of the flow of information in this type of communication [Espinosa et al. 2007].

Another problem with geographically distributed projects is that members are less likely to know each other. We tend to share more information with collocated colleagues for whom we have a certain level of rapport than with colleagues located in other offices or locations. In such cases, and when coworkers don’t know each other, information requests are not likely to be addressed in a timely fashion and may even be completely ignored. The end result is that project members are likely to fail to identify relevant work dependencies, particularly when unanticipated changes take place.

Finally, the schedule pressure of a software project can have serious implications for the ability of project members to identify and manage work dependencies. An increase in schedule pressure is typically manifested as an increase in the number of concurrent, and potentially interdependent, development tasks. Imagine a particular functionality B that was planned to follow the development of functionality A, on which it depends. Due to time pressures, both functionalities may have to be developed concurrently. To make that possible, developers face a new and more complex set of coordination needs.

For instance, interfaces between the two functionalities might evolve as the work progresses, creating the need for constant coordination to avoid integration problems and hard-to-find defects. In addition, there might be a need for special “glue code” to incrementally test the code under development. In other words, successful completion of these tasks depends on a collection of appropriate coordination mechanisms that allow developers to identify the relevant dependencies and deal with them appropriately.

The following section discusses the various types of work dependencies and their impacts on development productivity and quality.

Different types of work dependencies and their impacts on productivity and quality

The most common way to think about work dependencies is to model the temporal relationships between tasks. These dependencies focus on the temporal precedence of tasks (e.g., task A needs to be completed before task B). Projects focused on this way of thinking use numerous approaches, ranging from analytical and graphical methods such as Gantt and PERT diagrams to tool support such as workflow-based tools, to identify and manage the dependencies.

We would all agree that if we fail to recognize these dependencies and manage accordingly, we end up with longer development times. However, more interesting analyses are also possible with this type of information. My colleagues and I [Cataldo et al. 2009] considered all the temporal dependencies among the development tasks in one release and applied a social network type of analysis to a workflow dependency graph, where the nodes were members of the development organization and the edges represented the handover of a particular development task from one individual to another. Such people-to-people relationships were examined through the lenses of social network analysis. We found that a high number of relationships require a significant effort by the individuals involved to maintain the relationships.

This point is particularly important in the context of workflow dependencies because it suggests that centrally located project members are more likely to be overloaded because of the extra effort associated with managing the work dependencies, increasing the likelihood for communication breakdowns and risking diminished quality in the software produced. In fact, our results supported this argument, showing that source code files were more likely to have failures when highly interdependent individuals modified them.
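A workflow graph of this kind can be approximated with a short Python sketch. The task histories and names below are invented, and counting distinct handover partners is only a simple stand-in for the richer social network measures mentioned above.

```python
from collections import defaultdict

# Hypothetical task histories: the ordered list of people who held each task.
handovers = {
    "T1": ["ana", "bo", "cy"],
    "T2": ["dee", "bo"],
    "T3": ["bo", "eli"],
}

def partner_counts(handovers):
    """Count the distinct people each person exchanged a task with."""
    links = defaultdict(set)
    for owners in handovers.values():
        # Each consecutive pair of owners is one handover edge.
        for giver, receiver in zip(owners, owners[1:]):
            if giver != receiver:
                links[giver].add(receiver)
                links[receiver].add(giver)
    return {person: len(others) for person, others in links.items()}

counts = partner_counts(handovers)
most_connected = max(counts, key=counts.get)
```

In this toy data `bo` has four handover partners while everyone else has one, which marks `bo` as the centrally located member whose workload and touched files deserve extra attention.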

A variant of temporal or workflow types of dependencies relates to temporal work relationships, where the dependency is centered on information needs rather than completion or handover of the task. For instance, if we have two interdependent tasks—develop module A and develop module B, where module B invokes functionality in module A—the temporal dependency resides in the information related to how to materialize the call from module B to module A. Given this information, developers working on module A can define an interface and provide the information to the developers working on module B. Then, the tasks can proceed in parallel.

Concurrent engineering, a line of research focused on ways to manage interdependent and concurrent development tasks, proposes that appropriate coordination mechanisms can be put in place to deal with the information-related dependencies among overlapping development tasks when we can determine those information needs a priori. Unfortunately, the reality of a typical software development project is not that simple. Tens or hundreds of tasks tend to overlap over the life of a project. In many cases, the temporal precedence of those tasks as well as the information-related dependencies among the tasks is not completely understood or known a priori. In fact, as requirements become known and better understood, the dependencies among those tasks might change.

One of my recent studies [Cataldo 2010] examined 209 distributed projects in a large multi-national development organization and found that higher levels of overlap among development tasks, as represented in a task-tracking system, were associated with lower levels of software quality. More importantly, the impact of task overlap was consistent along the life cycle of the project. That is, the impact on quality was not conditional on being close, for instance, to a project milestone, where we typically see a surge in the amount of concurrent development work. These results are useful because they show the importance of keeping track of the type of dependencies, so managers or other stakeholders can act to address their negative impact on a particular project.
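The study's exact overlap measure is not described here. As a rough sketch under that caveat, the following Python code assumes each task in a tracking system has a start and end day, and lists the pairs of tasks whose intervals overlap.

```python
from itertools import combinations

# Hypothetical tasks as (name, start_day, end_day), with end_day exclusive.
tasks = [("A", 0, 10), ("B", 5, 15), ("C", 15, 20), ("D", 8, 18)]

def overlapping_pairs(tasks):
    """Return the pairs of tasks whose time intervals intersect."""
    pairs = []
    for (name1, start1, end1), (name2, start2, end2) in combinations(tasks, 2):
        # Two half-open intervals overlap when each starts before the other ends.
        if start1 < end2 and start2 < end1:
            pairs.append((name1, name2))
    return pairs

overlaps = overlapping_pairs(tasks)
```

Tracking a count like `len(overlaps)` over the life of a project is one way a manager could keep an eye on how much concurrent, potentially interdependent work is under way.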

A third type of work dependencies has its roots in the role of the organizational structure in establishing useful communication and coordination paths to address information-sharing needs. Nagappan et al. constructed a collection of metrics based on the information available in a traditional organizational chart and studied their relationships to failures in Windows components and programs [Nagappan et al. 2008]. Although their measures do not specifically capture work dependencies, they represent good proxies for numerous organizational phenomena, including issues of work dependencies. Their analyses showed some interesting results.

For instance, the higher the number of departmental units involved in the development of a component or a binary, the lower the quality of the component or binary. In addition, the distance in the organizational hierarchy among the individuals who developed or modified a component also had negative effects on software quality. These results highlight the difficulties associated with communication and coordination across organizational boundaries (e.g., teams, departments, divisions, locations, etc.) and the negative consequences those barriers can have on software quality.
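Distance in an organizational hierarchy can be illustrated with a small Python sketch. The org chart is invented, and measuring distance as the number of reporting links between two people through their closest common manager is an assumption for illustration, not the definition used by Nagappan et al.

```python
# Hypothetical org chart: each person maps to their manager (None at the top).
manager = {
    "vp": None,
    "m1": "vp",
    "m2": "vp",
    "ann": "m1",
    "bob": "m1",
    "cat": "m2",
}

def chain(person):
    """Return the reporting chain from a person up to the top of the chart."""
    path = [person]
    while manager[person] is not None:
        person = manager[person]
        path.append(person)
    return path

def org_distance(a, b):
    """Number of reporting links between a and b via their closest common manager."""
    up_a, up_b = chain(a), chain(b)
    steps_in_b = {name: steps for steps, name in enumerate(up_b)}
    for steps_a, name in enumerate(up_a):
        if name in steps_in_b:
            return steps_a + steps_in_b[name]
    raise ValueError("no common manager")

same_team = org_distance("ann", "bob")
cross_team = org_distance("ann", "cat")
```

Two engineers under the same manager are two links apart, while engineers in different departments are four links apart in this chart. Larger values indicate more organizational boundaries to cross when coordinating.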

The development of a software system consists of a collection of design decisions, either at the architectural level or at the implementation level. Those design decisions introduce constraints that might establish new dependencies among the various parts of the system, modify existing ones, or even eliminate dependencies. The changes in dependencies can generate new coordination needs that are typically quite difficult to identify a priori.

Imagine, for instance, a task that requires the modification of the memory allocation policies of RPCs in one component in order to improve its performance. Such a change could affect the timing of RPC exchanges with other components, which in turn might break certain assumptions made by users of those RPCs. In order to better understand how to capture such dynamic dependencies, my colleagues and I [Cataldo et al. 2006] [Cataldo et al. 2008] proposed a socio-technical framework for examining the relationship between the logical software dependencies and the structure of the development work used to construct such systems.

Coordination requirements, one of the elements of that framework, is a measure of the extent to which the work of each project member depends on the work of other project members, given a set of development tasks and the technical dependencies among the parts of the system that those tasks affect. One important finding from this line of work is that the higher the number of coordination needs a developer is faced with, the lower the quality of the source code files modified by the developer. In other words, you are more likely to introduce bugs in the system the more that coordination needs emerge from the logical dependencies of the system as development work progresses.
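The text above does not spell out a formula. One formulation, assumed here for illustration, multiplies a developer-by-file assignment matrix by a file-by-file dependency matrix and then by the transpose of the assignment matrix, giving a developer-by-developer matrix of coordination needs. All data and helper names in this Python sketch are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Rows are developers dev0..dev2, columns are files f0..f2.
# assignment[i][k] == 1 means developer i modified file k.
assignment = [
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
]

# dependency[k][l] == 1 means files k and l are logically dependent.
# The diagonal is left at zero, so sharing a file is not counted here.
dependency = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

# coordination[i][j] > 0 means developer i's work depends on developer j's.
coordination = matmul(matmul(assignment, dependency), transpose(assignment))
```

In this example dev0 and dev1 need to coordinate because f0 and f1 are dependent, while dev2, who only touched the independent file f2, has no coordination requirement.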

The Socio-Technical Dimension

Software development, and product development in general, involves technical and socio-organizational elements. So far we have discussed dependencies in the context of each individual dimension. However, the technical and the socio-organizational dimensions are intertwined, and considering them in isolation from each other does not allow us to consider the whole picture.

The general idea behind the socio-technical perspective of dependencies in software development is quite simple. Development productivity and software quality improve when the coordination needs established by the dependencies among development tasks are matched by the actual coordination activities carried out by the engineers. However, the important contribution of the socio-technical perspective is identifying and tracking the dynamic relationship between social and technical dependencies, focusing on a fine-grained level of analysis of different types of technical dependencies (e.g., syntactic or logical software dependencies) and examining the coordination needs that such technical dependencies create over the life cycle of a development project.

For instance, my colleagues and I [Cataldo et al. 2006] [Cataldo et al. 2008] [Cataldo and Herbsleb 2010] used data from multiple software repositories (version control systems, defect tracking systems, etc.) from two large-scale projects to show that when engineers identified and managed the relevant coordination needs, development productivity and software quality improved. A key insight of this work is that the relevant work-related coordination needs, in fact, tend to stem from logical dependencies instead of from syntactic dependencies. Logical dependencies tend to capture relationships that are more semantic and tacit, unlike syntactic relationships, which are explicit in nature (engineers can easily identify these relationships by looking at a piece of software code).

For example, logical dependencies could represent a publisher/subscriber relationship or a particular timing relationship between two different components of a software system. In such cases, syntactic language constructs tend not to provide all the necessary information to identify such dependencies.

A related finding is that a misalignment between dependencies and coordination activity is a key factor affecting development productivity and software quality, when we examine the right set of software dependencies that determine the relevant work dependencies.
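The alignment between coordination needs and coordination activity can be expressed as a simple ratio. The Python sketch below assumes two developer-by-developer matrices, one of required coordination and one of observed coordination (for example, derived from communication records), and reports the fraction of required links that were actually matched. The matrices and the function name `match_ratio` are hypothetical.

```python
def match_ratio(required, actual):
    """Fraction of required coordination links matched by actual coordination."""
    needed = matched = 0
    size = len(required)
    for i in range(size):
        for j in range(size):
            # Ignore the diagonal: nobody needs to coordinate with themselves.
            if i != j and required[i][j] > 0:
                needed += 1
                if actual[i][j] > 0:
                    matched += 1
    # With no requirements at all, treat the project as fully aligned.
    return matched / needed if needed else 1.0

# Hypothetical data for three developers.
required = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
actual = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

ratio = match_ratio(required, actual)
```

A ratio of 0.5 here says that half of the coordination needs went unaddressed. Low values point to the kind of misalignment that the findings above associate with lower productivity and quality.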

Considering the technical and the socio-organizational dimensions together also allows us to understand better how we can improve large-scale development that involves multiple organizational boundaries. For instance, architectural dependencies that cross the boundaries of projects tend to be associated with lower levels of software quality, as reported in my past study of 209 projects in a large development organization [Cataldo 2010]. Such a result suggests the importance of coordination and awareness beyond the traditional small organizational entity, such as the team, and the importance of providing support at a larger scale within the development organization.

In addition, Nambiar and I explored how technical coupling affects quality in geographically distributed software development projects [Cataldo and Nambiar 2010]. Our main finding was that the higher the number of technical dependencies that are external to a project (e.g., a component A interfaces with component B, but component B was not changed in the project), the lower the software quality. In particular, logical dependencies that crossed the boundaries of projects increased the predicted number of defects in a project by 50%, an impact similar in size to traditional factors such as the amount of code produced or the level of Capability Maturity Model (CMM) process maturity.
