10

Describing knowledge domains: a case study of biological ontologies

Liam Magee

In the past decade, the development of standardised and machine-processable controlled vocabularies has been a fertile field of research in the life sciences. Understandably, development of the semantic web has attracted significant attention from parts of the medical and life science community. A nexus of government, academic and corporate sources has funded the construction of semantic web ontologies for a range of biomedical and biological vocabularies, including: clinical terms (SNOMED), genetic sequencing (Gene Ontology), proteins (PRotein Ontology) and general ontology repositories (Open Biological and Biomedical Ontologies). Even several of the upper-level ontologies surveyed in the preceding chapter were the beneficiaries of bioinformatics funding. The nature and scale of the classificatory structures of the life sciences make the semantic web, and ontologies in particular, seem especially well suited to the field.

A problem arises for the operationalisation of ontologies, however, due to the tendency for classificatory practices to engage multiple systems or conceptual schemes. As an example, sociologists have explored how microbial objects are frequently analysed through competing frames of physiological and genetic characteristics in laboratories (Sommerlund 2006). While the homogenisation of vocabularies holds promise for system-level interoperability across medical and life science industries, it risks the occlusion of practiced differentiation, in which biological objects are allowed to ‘speak’ through alternate and potentially inconsistent frames (Bowker and Star 1999). Moreover, the promulgation of institutionally invested ontologies has broader societal implications: as Smart et al. (2008) have shown, social categories of race and ethnicity are increasingly filtered through genetic codes, suggesting that biomedical categories are far from being innocent epistemological constructs.

This study seeks to explore some of the trade-offs of taxonomic standardisation through a content analysis of ontologies and surrounding debates in the life science and bioinformatics communities. It begins by briefly examining some of the more successful ontologies used in biological research, including two of the most widely cited and noted: an umbrella biomedical ontology collaborative effort known as the ‘OBO Foundry’, and a particularly successful biological ontology, the Gene Ontology. These examples highlight ‘state of the art’, large-scale, interoperable biological classification systems, while also demonstrating some of the latent tensions in the move towards standardised representations of biological objects. This examination holds a secondary purpose: to consider also how new computational representations of biological objects themselves can be understood as ‘second-order’, discursive objects, reflecting the epistemological theories and methodological practices (the perspectival orbits) of their authors. This has bearing on the more general question of commensurability, posed in earlier chapters. Rather than attempt an explicit analysis of the commensurability of multiple ontologies here, however, the study discusses how some of the perspectival character of bioinformatic systems might be made more perspicuous in the process of ontology design and dissemination. This in turn can be viewed as a series of tentative suggestions for how it might be possible to reach at least partial commensurability and interoperability between ontologies and their underlying conceptualisations.

Biological ontologies

Developing formal ontologies which can be used and reused across biological research teams is a formidable undertaking, both technically and culturally. In many cases the advent of new standardised mechanisms like semantic web ontologies incurs the familiar penalty of trading cost (of migrating existing information systems to yet another format) against technological obsolescence. Moreover, ontologies are technically challenging to construct, presuming not only knowledge of the biomedical field, but also a new set of technical procedures and vocabulary.

In the biological field these challenges are partially mitigated by the kinds of advantages mentioned above—reusability, portability and interoperability of data, and the possibility of automatic inferencing across large conceptual taxonomies. Indeed, interest in ontologies has led to bioinformatic research groups funding much of the core semantic web research itself.

In the past decade there have been numerous efforts to construct ontologies and other highly structured bioinformatic taxonomies:

• The National Cancer Institute (part of the US National Institutes of Health) has developed a thesaurus and meta-thesaurus, covering a wide range of clinical, research and health administration terms.

• Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT), funded by the US Federal Government, is a healthcare terminology covering over one million clinical terms; it is now moving towards management by an international standards development organisation.

• Unified Medical Language System (UMLS) was developed by the National Library of Medicine in the United States; it is another general purpose vocabulary for biomedical, healthcare and informatics terms.

• MedicalWordNet was an experimental extension of the Princeton WordNet thesaurus to the medical domain, distinguishing medical facts (determined by experts) from beliefs (held by lay users).

• BioCyc (Cycorp) is part of a large research knowledge base operated by Cycorp Inc., a private US company; it contains databases of ‘pathways and genomes of different organisms’.

• Dumontier ontologies were developed by Michel Dumontier at Carleton University, Canada, to describe ‘biological and scientific concepts and relations’.

• The Open Biological and Biomedical Ontologies (OBO) Foundry was hosted at Berkeley; it is an umbrella hosting environment and framework for independent ontology development and collaboration.

• There have been several smaller sub-disciplinary research taxonomies and ontologies, such as OpenGalen, BioPAX and Ecocyc (Aranguren 2005).

These initiatives, with varying degrees of cost and complexity, all aim to deliver standardised vocabularies of biomedical and biological entities to researchers and practitioners. The following sections review two of these in further detail.

OBO Foundry

The largest organisation of ontologies is housed by the Open Biological and Biomedical Ontologies (OBO) Foundry, hosted by Berkeley but with participants and funding from around the United States and Europe (OBO Foundry 2010). Several quotes from the OBO Foundry website highlight its aims and methods for designing and maintaining ontologies:

The OBO Foundry is a collaborative experiment involving developers of science-based ontologies who are establishing a set of principles for ontology development with the goal of creating a suite of orthogonal interoperable reference ontologies in the biomedical domain (OBO Foundry 2010).

It is our vision that a core of these ontologies will be fully interoperable, by virtue of a common design philosophy and implementation, thereby enabling scientists and their instruments to communicate with minimum ambiguity. In this way the data generated in the course of biomedical research will form a single, consistent, cumulatively expanding, and algorithmically tractable whole.

The OBO Foundry is open, inclusive and collaborative. The contributors are those biological researchers and ontology developers who have agreed to work together on an evolving set of design principles that can foster interoperability of ontologies, and ensure a gradual improvement of quality and formal rigor in ontologies, in ways designed to meet the increasing needs of data and information integration in the biomedical domain (OBO Foundry 2010).

Experience thus far confirms that adherence to OBO principles is largely self-policing because of the positive benefits that accrue to individual members. The task of the OBO coordinators is to help to build this community (OBO Foundry 2010).

As these quotes suggest, the OBO Foundry is structured very much like an open access academic journal, open to general contributions, with a group of editors responsible for outlining and enforcing ‘a common design philosophy and implementation’ (OBO Foundry 2010).

Presently the OBO Foundry lists around 60 discrete ontologies, which relate to a wide range of biomedical fields. Some of the ontologies are anatomical, describing parts of particular organisms such as spiders, mice and fungi; others are taxonomic, covering various species of, for instance, amphibians, flies and disease; others again are methodological, describing units of measure and biological instruments; and a final set of ontologies can be described as ‘foundational’, covering general spatial and temporal concepts, as well as biological processes, functions and components. According to its website, and several publications by its editors, the ontologies gathered in the OBO Foundry are intended to interoperate ‘orthogonally’—meaning they need not fit together as an entirely consistent and comprehensive whole, but they ought not to contradict another ontology’s definitional terms and, where possible, ought also to reuse more basic conceptual axioms declared in other ontologies.

Foundational principles

In a positioning paper published in 2007, members of the OBO Consortium claim:

Our long-term goal is that the data generated through biomedical research should form a single, consistent, cumulatively expanding and algorithmically tractable whole. Our efforts to realize this goal, which are still very much in the proving stage, reflect an attempt to walk the line between the flexibility that is indispensable to scientific advance and the institution of principles that is indispensable to successful co-ordination (Smith et al. 2007).

In a set of associated publications, several of the editors have attempted to elucidate what these ‘principles’ might be. Aside from questions of naming conventions and design patterns applied to the ontologies, there are a number of what might be termed ‘epistemological’ recommendations—which pertain to just how biological objects are to be represented, understood and reasoned over within a conformant ontological representation.

The first suggestion relates to the compositing of class relationships. The key recommendation here is that ontologies ought to utilise ‘two backbone hierarchies of “is_a” and “part_of” relations’ (Smith et al. 2007). Class composition, the foundational activity of ontology modelling, should then be driven by finding relationships of specialisation–generalisation (where members of one class can be said to be members of another) and of part–whole (where members of one class can be said to be parts of members of another).
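A minimal sketch may help make the two backbone hierarchies concrete. It assumes that both relations are treated as transitive, and it uses a handful of invented toy terms rather than entries from any actual OBO ontology; the function name `ancestors` is likewise only illustrative.

```python
# Toy terms invented for illustration; they are not drawn from any OBO ontology.
IS_A = {
    "mitochondrion": ["organelle"],
    "organelle": ["cellular_entity"],
}
PART_OF = {
    "inner_membrane": ["mitochondrion"],
    "mitochondrion": ["cell"],
}


def ancestors(term, relation):
    """Collect every term reachable from `term` by repeatedly following one relation."""
    found = set()
    pending = list(relation.get(term, []))
    while pending:
        parent = pending.pop()
        if parent not in found:
            found.add(parent)
            pending.extend(relation.get(parent, []))
    return found


# Specialisation-generalisation: a mitochondrion is an organelle, hence a cellular entity.
kinds = ancestors("mitochondrion", IS_A)
# Part-whole: an inner membrane is part of a mitochondrion, hence part of a cell.
wholes = ancestors("inner_membrane", PART_OF)
print(sorted(kinds), sorted(wholes))
```

Keeping the two relations in separate tables mirrors the idea of two distinct backbones: each hierarchy can be traversed on its own, and the sketch makes no claim about how the two might be combined.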

In a series of further publications, Barry Smith (one of the foundry editors) and others have also attempted to define a so-called ‘Basic Formal Ontology’ (BFO), which describes a metaphysical theory of reality in a formal ontology. The purpose of such an ontology would be that, if successfully adopted, domain-level ontologies such as biomedical ones would have commensurate views over foundational notions such as time and space, universals and particulars, substances and qualities, and so on. In the authors’ terms, the underlying theory is one of ‘naive realism’, which eschews various conceptualised contaminants in favour of a kind of commonsense empiricism, founded on observational data furnished by the natural sciences. Perhaps its most controversial claim is a top-level distinction between ‘occurrents’ (events or processes taking place ‘in time’) and ‘continuants’ (objects which endure ‘through time’). Espousing what they call ‘ontological pluralism’, in a paper titled ‘The Cornucopia of Formal-Ontological Relations’, Smith and Grenon (2004) argue for a dualistic, and mutually incommensurable, view of objects. On the one hand, a three-dimensionalist view sees objects as occupying spatial locality within a series of temporal points; on the other, a four-dimensionalist view sees objects as mere apparitions of underlying processes which take place within a spatio-temporal continuum. Smith and Grenon (2004) further argue that neither view is independently sufficient to account for the kinds of descriptions and representations ontology modellers and designers need in practice. By co-ordinating incompatible perspectives within the same foundational ontology, multiple, equally viable, scientific descriptions can be housed in derivative ontologies, constrained only by veridicality. Moreover, such descriptions have the advantage of making explicit their perspectival orientation towards the objects described.

The BFO has been explicitly referenced by around 10 per cent of the OBO ontologies (OBO Foundry 2010), suggesting its foundation definitions have been of some practical use in biomedical ontology design. How much this reflects commitments to its underlying theory of ontological pluralism is difficult to gauge.

The Gene Ontology

Of the 60 or so ontologies housed by the OBO Foundry, the Gene Ontology is its flagship one in size and usage. It has been under active and ongoing development since 2000, having originated as an effort to unite three existing genomic databases:

• the FlyBase project (cataloguing a particular species of fruitfly, Drosophila melanogaster)

• the Mouse Genome Informatics project (cataloguing the ‘laboratory mouse’)

• the Saccharomyces Genome Database project (cataloguing the budding yeast Saccharomyces cerevisiae) (Ashburner et al. 2000).

The US National Institutes of Health provided the initial grant to consolidate these databases, which were then organised into the nascent Gene Ontology, with the much grander ambition of cataloguing general biological functions systematically. Currently the Gene Ontology is managed by the Gene Ontology Consortium, a loose group of US and Europe-based institutes, funded by a range of grants. The Gene Ontology website acknowledges several benefactors: ‘Direct support for the Gene Ontology Consortium is provided by an R01 grant from the National Human Genome Research Institute (NHGRI) [grant HG02273]; AstraZeneca; Incyte Genomics; the European Union and the UK Medical Research Council’ (The GO Consortium 2010).

The current Gene Ontology comprises approximately 22,000 concepts. These are organised into three separate—though related—sub-ontologies:

• the Biological Process: the ‘biological objective to which the gene or gene product contributes’

• the Molecular Function: the ‘biochemical activity… of a gene product’

• the Cellular Component: the ‘place in a cell where a gene product is active’.

Genes, gene products and gene product groups can each be described by one or more attributes from each of these categories. The authors of Gene Ontology note:

The relationships between a gene product (or gene-product group) to biological process, molecular function and cellular component are one-to-many, reflecting the biological reality that a particular protein may function in several processes, contain domains that carry out diverse molecular functions, and participate in multiple alternative interactions with other proteins, organelles or locations in the cell (Ashburner et al. 2000).
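The one-to-many relationship described here can be pictured as a simple data structure in which a gene product carries a set of terms for each of the three sub-ontologies. The following is an illustrative sketch only: the class `GeneProduct`, its `annotate` method and all term labels are hypothetical, and do not reproduce the Gene Ontology's actual annotation format.

```python
from dataclasses import dataclass, field

# The three sub-ontologies named in the text.
ASPECTS = ("biological_process", "molecular_function", "cellular_component")


@dataclass
class GeneProduct:
    """A gene product holding a set of terms for each of the three sub-ontologies."""
    name: str
    annotations: dict = field(default_factory=lambda: {a: set() for a in ASPECTS})

    def annotate(self, aspect, term):
        # Reject anything outside the three sub-ontologies.
        if aspect not in ASPECTS:
            raise ValueError(f"unknown sub-ontology: {aspect}")
        self.annotations[aspect].add(term)


# Placeholder product and term labels, invented for illustration only.
product = GeneProduct("example_protein")
product.annotate("biological_process", "process_a")
product.annotate("biological_process", "process_b")
product.annotate("molecular_function", "function_a")
product.annotate("cellular_component", "component_a")
```

Note that the sketch records no links between terms in different sub-ontologies; each set is populated independently.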

While these sub-ontologies are clearly related, the designers are restrained from defining any such relationships explicitly—this would complicate the overall ontological structure, lead to possible errors and hinder adoption among the bioinformatic community.

Factors in the success of the Gene Ontology

In the comparatively obscure world of biological ontologies, the Gene Ontology has been incredibly successful. Its website lists 2,829 articles that cite or make use of the ontology, while Google Scholar suggests the positioning paper introducing the ontology has been cited 4,773 times since its publication in 2000 (The GO Consortium 2010).

A study by Michael Bada et al. titled ‘A Short Study on the Success of the Gene Ontology’—some of the authors of which also collaborated on the Gene Ontology itself—concluded that seven factors contributed to its successful emergence as the de facto taxonomy for genetic coding:

1. community involvement

2. clear goals

3. limited scope

4. simple, intuitive structure

5. continuous evolution

6. active curation

7. early use (Bada et al. 2004).

Of these factors, 1, 5, 6 and 7 relate to extrinsic (or sociological) features, which help explain its adoption. Specifically, involving the bioinformatic community, whose members would ultimately use the ontology, early in its development, and continuously improving and actively maintaining the ontology, have played a significant role in its success.

Another, underrated element is that the overarching policing of the ontology has balanced classificatory quality with community input. Other ontologies—and also thesauri, taxonomies and controlled vocabularies—have suffered either by being closed to diverse community input, or by being too open to community input, with quality being corroded as a result. Related to this, the very publication of the Gene Ontology itself, alongside supporting academic materials, encourages would-be users to browse, use and potentially extend the ontology, fostering further collaboration and adoption.

Biological cultures, ontological cultures

Ontologies, in spite of their formal rigour, remain intransigently cultural artefacts, and therefore reflect the cultural biases and assumptions of their authors. Unsurprisingly, given the complexity and field of these kinds of artefacts, they tend to be funded by a combination of well-heeled Western university, government and corporate sponsors. In their purely technical form, even accompanied by academic publications, relatively little can be gleaned about the perspectival assumptions that underwrite their construction. As ontologies like the Gene Ontology emerge as de facto standards, adopted by increasing numbers of bioinformatics researchers and, further downstream, medical practitioners, it becomes increasingly vital, and yet exceedingly difficult, to understand the motivating choices behind one particular method of ‘carving nature at its joints’. As Bowker and Star (1999) illustrate in their discussion of conceptual schemes in medical contexts, the process of standardisation inevitably tends towards ossification, where objects are demarcated within increasingly uncontested categorial boundaries.

The Gene Ontology exemplifies this positing of predefined categorial assertions. While the authors have published papers that discuss and justify distinctions, these too have the appearance of organic origin, as though this and no other conceptual scheme could meet a set of idealised scientific desiderata. The context of the production and dissemination of the ontology remains obscured in these discussions; indeed such considerations would appear out of place, unnatural, within the discursive constraints of biological scientific publications.

This contrasts with the kind of picture Julie Sommerlund (2006), for instance, describes in relation to practicing scientists working with multiple microbial classification systems. She writes:

When I first visited the Molecular Microbial Ecology Group, it struck me that everybody told me stories of conflict when they introduced me to their work. They told stories of historical conflicts (about earlier forms of science that had been difficult to get past) and stories of concurrent conflicts between traditions or paradigms that were still influential. These kinds of conflicts seemed very important to the researchers, presumably because their field is interdisciplinary, combining as it does molecular microbiology with microbial ecology (Sommerlund 2006).

By contrast, presented in their idealised formal state, ontologies and even associated published collateral demonstrate no such conflict. Indeed the ‘resistible rise’ and permeation of the Gene Ontology through bioinformatic discourse seems to take place without forcible opposition, though of course this opposition may well exist, both at a paradigmatic level (in the fundamental theoretical assumptions on which the design of the ontology rests) and the piece-meal level (in the individual concepts admitted to the ontology in the course of its various iterations). But the absence of visible contestation serves only to obscure just what perspectival characteristics the informational system has inherited from its authorial sources and context.

Ontological objects

In The Order of Things, Foucault writes of the ‘great tables of the seventeenth and eighteenth centuries, when in the disciplines of biology, economics and philology the raw phenomena of experience was classified, categorised, organised and labelled’ (Foucault 1970). In the twenty-first century, bioinformatics is reorienting the description of biological objects away from tabular and hierarchical structures, towards directed acyclic informational graphs of incomparably greater scale and sophistication, operated by multidisciplinary and globally distributed teams of human and, increasingly, computational agents. The costs, expertise and resources required for developing and maintaining these structures invariably constrain the number of rival ontologies, resulting in a kind of gradual ‘merger and acquisition’ activity as localised databases are rolled into umbrella frameworks like the OBO Foundry.

In this context, bioinformational objects tend to inherit some of the very characteristics of the biological objects they represent. Like genes, they codify complex sets of instructions; like viruses, they can proliferate across the unfettered ‘body’ of the internet; like cells, they can exist in complex interwoven structures and linkages with other objects; and like all organic materials they leave traces on their (in this case, purely digital) environment. And while in a mundane sense they are products of specific research cultures, in another sense they operate increasingly like biological cultures: transmitting, consuming, adapting and evolving as if independent of the human agents responsible for them. In this sense they are active rather than passive; coupled with reasoning algorithms, they are able to grow and produce ‘new information’ or ‘new knowledge’ via deductive inferencing procedures. They are then to some extent self-explicating and self-analysing discursive artefacts, quasi-objects in the Latourian sense, cultural entities, but also possessing a peculiar nature of their own. However, the preceding analysis shows there is a risk of information objects becoming ‘naturalised’ in a different way, as the ‘cultured’ part of their nature (the particular material conditions, epistemological theories and methodological practices of the context in which they emerge) is occluded.

Towards compromise: ontologies in practice

As Bowker and Star (1999) suggest, information systems have the propensity to reify and rigidify disciplinary categories. In the case of biological ontologies, the formal rigour of these artefacts affords research groups powerful means for analysing and sharing scientific results. This comes at the cost, I would argue, of increasing difficulty of negotiating the conceptual boundaries under which biological objects are located. Moreover the pristine mathematical formalism in which these systems are described serves to obscure the messy contested practices of scientific discovery and definition, in favour of a definitive conceptualisation.

The examples of the OBO Foundry and Gene Ontology suggest some of these concerns can be mitigated by a number of policing determinations before and during the development of biological ontologies. Retaining a relatively open policy towards the ongoing development permits scientific innovations, both paradigmatic and piece-meal, to revise ontological structures, while preserving the kinds of quality characteristics that make the ontologies serviceable to a scientific community at all. Open access and publication of the ontologies encourages ongoing use and critical peer review. However, as standardised bioinformatic ontological structures become increasingly central to the practice of research, it becomes important to develop ways of describing the theoretical scaffolds that underpin these structures. This poses difficulties for the authors of ontologies themselves, and for sociological researchers who engage with ontologies from outside their disciplinary terrain, as cultured rather than naturalised objects.

Part of the difficulty arises via a second-order, meta problem of commensurability, which can be posed as follows. While the ontology commensurability framework presented in earlier chapters provides a mechanism for comparing ontologies, what emerges from the case of the Gene Ontology in particular, in the conflicting accounts and styles of the authors of biological ontologies, on the one hand, and those of sociologists of biology—such as Sommerlund, and Bowker and Star—on the other, is the potential need to navigate between the respective advocatory and critical views of standardised knowledge systems. Here the question is less, then, one of commensurability between systems themselves than between the disciplines which propagate and dissect them. This requires a respect for the efforts of ontology developers, without losing sight of potential power effects within the epistemic frames they operate within—the propensity, for instance, for standardised vocabularies to eclipse rival theories, terms and conceptualisations. Equally, it requires avoiding lapsing into naive forms of reconciliatory relativism and perspectivalism, in which all vantage points are accommodated, but in a manner that deflates them of critical resonance and impact.

A starting point towards this goal would follow the model already laid out in the commensurability framework: it involves identifying a set of salient distinctions in the respective viewpoints. These distinctions in turn can be evaluated for those viewpoints, quantified, weighted and subjected to further algorithmic treatment if necessary. This treatment fundamentally, though, would seek to develop an analysis of how ontologies are designed and used, both on their own terms (using the methodological and theoretical principles laid out by the authors themselves, where available) and within the broader epistemic fields in which they operate, using the critical evaluative apparatuses and conceptual distinctions operationalised by sociologists of science. From the point of view of users and adopters, this analysis provides a mechanism for what would at least be a preliminary evaluation of the trade-offs, assumptions and commitments entailed by usage of an ontology. This, in turn, makes for a critical rather than purely dogmatic or passive engagement with a particular set of ontological concepts, terms and corresponding practices.
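One way of picturing the step of evaluating, quantifying and weighting distinctions is sketched below. It is not the commensurability framework of the earlier chapters, whose details are not reproduced here; it simply assumes that each viewpoint can be scored between 0 and 1 on each distinction, and it uses a weighted mean of absolute differences as a stand-in measure. The distinction names, scores and weights are all invented for illustration.

```python
def weighted_distance(view_a, view_b, weights):
    """Weighted mean absolute difference between two viewpoints' scores.

    Each viewpoint maps a distinction name to a score in [0, 1]; `weights`
    gives the assumed salience of each distinction.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one distinction needs a non-zero weight")
    gap = sum(w * abs(view_a[d] - view_b[d]) for d, w in weights.items())
    return gap / total_weight


# Hypothetical scores: how strongly each viewpoint emphasises each distinction.
ontology_developers = {"formal_rigour": 1.0, "openness_to_revision": 0.5, "visible_conflict": 0.0}
sociologists_of_science = {"formal_rigour": 0.5, "openness_to_revision": 0.5, "visible_conflict": 1.0}
# Hypothetical weights: the evaluator treats visible conflict as twice as salient.
weights = {"formal_rigour": 1.0, "openness_to_revision": 1.0, "visible_conflict": 2.0}

distance = weighted_distance(ontology_developers, sociologists_of_science, weights)
print(distance)  # 0.625 with the invented scores above
```

A score of 0 would indicate identical emphases and 1 maximal divergence; the single number is only a prompt for the closer qualitative reading described above, and different weights would yield a different result.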

References

Aranguren, M.E. Ontology Design Patterns for the Formalisation of Biological Ontologies. Technical report, University of Manchester; 2005.

Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G. Gene Ontology: Tool for the Unification of Biology: The Gene Ontology Consortium. Nature Genetics. 2000; 25(1):25–29.

Bada, M., Stevens, R., Goble, C., Gil, Y., Ashburner, M., Blake, J.A., Cherry, J.M., Harris, M., Lewis, S. A Short Study on the Success of the Gene Ontology. Web Semantics: Science, Services and Agents on the World Wide Web. 2004; 1(2):235–240.

Bowker, G.C., Star, S.L. Sorting Things Out: Classification and its Consequences. Cambridge, MA: MIT Press; 1999.

Foucault, M. The Order of Things: An Archaeology of the Human Sciences. New York: Vintage Books; 1970.

The GO Consortium. The GO Consortium website. http://www.geneontology.org, 2010 (accessed 16 June 2010).

OBO Foundry. The Open Biological and Biomedical Ontologies. http://www.obofoundry.org, 2010 (accessed 16 June 2010).

Smart, A., Tutton, R., Martin, P., Ellison, G.T.H., Ashcroft, R. The Standardization of Race and Ethnicity in Biomedical Science Editorials and UK Biobanks. Social Studies of Science. 2008; 38:407.

Smith, B., Grenon, P. The Cornucopia of Formal-Ontological Relations. Dialectica. 2004; 58:279–296.

Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L.J., Eilbeck, K., Ireland, A., Mungall, C.J., The OBI Consortium, Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S.-A., Scheuermann, R.H., Shah, N., Whetzel, P.L., Lewis, S. The OBO Foundry: Coordinated Evolution of Ontologies to Support Biomedical Data Integration. Nature Biotechnology. 2007; 25:1251–1255.

Sommerlund, J. Classifying Microorganisms: The Multiplicity of Classifications and Research Practices in Molecular Microbial Ecology. Social Studies of Science. 2006; 36:909–928.
