6.5. Implementing Categories

We have emphasized the intellectual choices and challenges that arise in the design of a system of categories because, at their essence, categories are conceptual or mental constructs. We use categories in a mostly invisible way when we communicate, solve problems, or organize our kitchens and clothes closets. Sometimes categories are more apparent, as when we see signs and labels in the aisles of department or grocery stores to help us find things, when we put our socks and t-shirts in different dresser drawers, or when we create a system of folders and directories in our file cabinets or on our personal computers.

The most visible implementations of a category system are usually those for institutional categories, especially those that are embodied in the organizing systems for information resources where category membership can be verified by technology and the boundaries between categories are precise. In this final section of the chapter we briefly discuss some of the most important technologies for implementing categories, contrasting those that are appropriate for categories where membership is defined using properties with those that work for categories defined on the basis of similarity.

6.5.1. Implementing Classical Categories

The most conceptually simple and straightforward implementation of categories in technologies for organizing systems adopts the classical view of categories based on necessary and sufficient features. This approach results in prescriptive categories with explicit and clear boundaries. Classifying items into the categories is objective and deterministic and supports a well-defined notion of validation to determine unambiguously whether some instance is a member of the category.

The most direct way to implement classical categories is as a decision tree. A simple decision tree is an algorithm for determining a decision by making a sequence of logical or property tests. For example, we can classify numbers as prime or not with two tests: is it greater than 1, and does it have any divisors other than itself and 1. More complex categories need more tests. We can model the CAFE fuel economy standards (§6.2.4, “A “Categorization Continuum”) that assign vehicles to “car,” “truck,” and “light truck” categories using a decision tree whose differentiating tests are (1) the maximum number of passengers (cars have 10 or fewer), (2) off-road capability (cars do not have it), and (3) weight (trucks weigh at least 6000 pounds), and (4) function (numerous sub-tests that classify vehicles as trucks even if they weigh less than 6000 pounds).368[Cog]

[368][Cog] Even though they are not classical categories, we might also model goal-derived categories as decision trees by ordering the decisions to ensure that any sub-goals are satisfied according to their priority. We could understand the category “Things to take from a burning house” by first asking the question Are there living things in the house? because that might be the most important sub-goal. If the answer to that question is “yes,” we might proceed along a different path than if the answer is “no.” Similarly, we might put a higher priority on things that cannot be replaced (Grandma’s photos) than those that can (passport).

Because natural language embodies cultural categories, it is not the optimal representational format for formally defined institutional categories like those in the CAFE standards. Categories defined using natural language can be incomplete, inconsistent, or ambiguous, so they are sometimes specified using the controlled vocabularies and constrained syntax of "simplified writing” or “business rule” systems to create a decision tree or network that reliably classifies instances369[Cog]

[369][Cog] Institutional uses of decision trees can also sometimes be thought of as models of goal-derived categories. For example, the US Department of Health and Human Services (HHS) uses several decision trees as part of its efforts to ensure that research programs funded by the department do not harm human subjects. The chart http://www.hhs.gov/ohrp/humansubjects/guidance/decisioncharts.htm#c1 for example, is used to determine whether a program can be classified as research involving human subjects, which would mean that the program would have to be reviewed by an Institutional Review Board (IRB). (See also §3.4.3.2, “Use Controlled Vocabularies”.)

Artificial languages are a more ambitious way to enable precise specification of categories. An artificial language expresses ideas concisely by introducing new terms or symbols that represent complex ideas along with syntactic mechanisms for combining and operating on them. Mathematical notation and programming languages are familiar examples of artificial languages, and it is certainly easier to explain and understand the Pythagorean Theorem when it is efficiently expressed as “H2 = A2 + B2 than with a more verbose natural language expression: “In all triangles with an angle such that the sides forming the angle are perpendicular, the product of the length of the side opposite the angle such that the sides forming the angle are perpendicular with itself is equal to the sum of the products of the lengths of the other two sides, each with itself.”369a[Cog]

[369a][Cog] This example comes from (Perlman 1984), who introduced the idea of "natural artificial languages" as those designed to be easy to learn and use because they employ mnemonic symbols, suggestive names, and consistent syntax.

Artificial languages for defining categories have a long history in philosophy and science, including an effort by Wilkins in the seventeenth century that was satirized three centuries later by Borges. (See the sidebar, Artificial Languages for Description and Classification). However, the vast majority of institutional category systems are still specified with natural language, despite its ambiguities. Sometimes this is even intentional to allow institutional categories embodied in laws to evolve in the courts and to accommodate technological advances.370[Law]

[370][Law] When the US Congress revised copyright law in 1976 it codified a “fair use” provision to allow for some limited uses of copyrighted works, but fair use in the digital era is vastly different today; website caching to improve performance and links that return thumbnail versions of images are fair uses that were not conceivable when the law was written. A law that precisely defined fair uses using contemporary technology would have quickly become obsolete, but one written more qualitatively to enable interpretation by the courts has remained viable. See (Samuelson 2009).

Data schemas that specify data entities, elements, identifiers, attributes, and relationships in databases and XML document types on the transactional end of the Document Type Spectrum (§3.2.1) are implementations of the categories needed for the design, development and maintenance of information organization systems. Like the classical model of categorization, data schemas tend to rigidly define resources. “Rigid” might sound negative, but a rigidly defined resource is also precisely defined. Precise definition is essential when creating, capturing, and retrieving data and when information about resources in different organizing systems needs to be combined or compared.371[Com]

[371][Com] For example, in a traditional relational database, each table contains a field, or combination of fields, known as a primary key, which is used to define and restrict membership in the table. A table of email messages in a database might define an email message as a unique combination of sender address, recipient address, and date/time when the message was sent, by enforcing a primary key on a combination of these fields. Similar to category membership based on a single, monothetic set of properties, membership in this email message table is based on a single set of required criteria. An item without a recipient address cannot be admitted to the table. In categorization terms, the item is not a member of the “email message” class because it does not have all the properties necessary for membership.

The 100 or so standard document types of the Universal Business Language (UBL) (mentioned briefly in §7.1.5.2) are XML schemas that define basic level categories like orders, invoices, payments, and receipts that many people are familiar with from their personal experiences of shopping and paying bills. UBL’s vast library of information components enables the design of very specific or subordinate level transactional document types like “purchase order for industrial chemicals when buyer and seller are in different countries.” At the other end of the abstraction hierarchy are document types like “fill-in-the-blank” legal forms for any kind of contract.

In object-oriented programming languages, classes are schemas that serve as templates for the creation of objects. A class in a programming language is analogous to a database schema that specifies the structure of its member instances, in that the class definition specifies how instances of the class are constructed in terms of data types and possible values. Programming classes may also specify whether data in a member object can be accessed, and if so, how.372[Com]

[372][Com] Like data schemas, programming classes specify and enforce rules in the construction and manipulation of data. However, programming classes, like other implementations that are characterized by specificity and rule enforcement, can vary widely in the degree to which rules are specified and enforced. While some class definitions are very rigid, others are more flexible. Some languages have abstract types that have no instances but serve to provide a common ancestor for specific implemented types.

6.5.2. Implementing Categories That Do Not Conform to the Classical Theory

Unlike transactional document types, which can be prescriptively defined as classical categories because they are often produced and consumed by automated processes, narrative document types are usually descriptive in character. We do not classify something as a novel because it has some specific set of properties and content types. Instead, we have a notion of typical novels and their characteristic properties, and some things that are considered novels are far from typical in their structure and content.373[Cog]

[373][Cog] The existence of chapters might suggest that an item is a novel; however, a lack of chapters need not automatically indicate that an item is not a novel. Some novels are hypertexts that encourage readers to take alternative paths. Many of the writings by James Joyce and Samuel Beckett are “stream of consciousness” works that lack a coherent plot, yet they are widely regarded as novels.

Nevertheless, categories like narrative document types can sometimes be implemented using document schemas that impose only a few constraints on structure and content. Unlike a schema for a purchase order that uses regular expressions, strongly data typed content and enumerated code lists to validate the value of required elements that must occur in a particular order, a schema for a narrative document type would have much optionality, be flexible about order, and expect only text in its sections, paragraphs and headings. Even very lax document schemas can be useful in making content management, reuse, and formatting more efficient.

Category types that are furthest in character from the classical model are those that are not defined using properties in any explicit way. We do not use technology to help us understand cultural categories like “friend” or “game” that rely on some notion of similarity to determine category membership. However, there are technologies that can create a system of categories that uses similarity as its basis.

In particular, machine learning is a subfield of computer science that develops and applies algorithms that accomplish tasks that are not explicitly programmed; creating categories and assigning items to them is an important subset of machine learning. Two subfields of machine learning that are particularly relevant to organizing systems are supervised and unsupervised learning. In supervised learning, a machine learning program is trained by giving it sample items or documents that are labeled by category, and the program learns to assign new items to the correct categories. In unsupervised learning, the program gets the samples but has to come up with the categories on its own by discovering the underlying correlations between the items; that is why unsupervised learning is sometimes called statistical pattern recognition. This generally takes longer, since the program is not given correct answers to use in improving its performance, as it is in the supervised case. As we pointed out in §6.2.1, “Cultural Categories”, we learn most of our cultural categories without any explicit instruction about them, so it is not surprising that computational models of categorization developed by cognitive scientists often employ unsupervised statistical learning methods. We will now briefly discuss unsupervised learning and return to supervised learning in Chapter 7, “Classification: Assigning Resources to Categories.

There are far too many unsupervised learning techniques for categorization to even mention them all, let alone describe how they work. The ones that are most relevant for us are called clustering techniques, which share the same goal and three basic methods.

  • Clustering techniques share the goal of creating meaningful categories from a collection of items whose properties are hard to directly perceive and evaluate; this is especially true with large collections of heterogeneous documents, where goals might be to find categories of documents with the same topics, genre, sentiment, or other characteristic that cannot easily be reduced to specific property tests.

  • The first shared method is that clustering techniques start with an initially uncategorized set of items or documents from which some measures of inter-item similarity can be calculated.374[Com]

  • The second shared method is that categories are created by putting items that are most similar into the same category. Hierarchical clustering approaches start with every item in its own category. Other approaches, notably one called “K-means clustering,” start with a fixed number of K categories initialized with a randomly chosen item or document.

  • The third shared method is refining the system of categories by iterative similarity recalculation each time an item is added to a category. Approaches that start with every item in its own category create a hierarchical system of categories by merging the two most similar categories, recomputing the similarity between the new category and the remaining ones, and repeating this process until all the categories are merged into a single category at the root of a category tree. Techniques that start with a fixed number of categories do not create new ones but instead repeatedly recalculate the “center” of the category by adjusting its property representation to the average of all its members after a new member is added.375[Com]

[374][Com] Some approaches represent each item as a vector of property values; documents are usually represented as vectors of frequency-weighted terms; in either case the items can be represented as points in a multidimensional space of these properties. Similarity can then be calculated by measuring the distance between items in this space. A popular text on machine learning is(Witten, Frank, and Hall 2011).

Other approaches start more directly with the similarity measure, obtained either by direct judgments of the similarity of each pair of items or by indirect measures like the accuracy in deciding whether two sounds, colors, or images are the same or different. The assumption is that the confusability of two items reflects how similar they are.

[375][Com] Unlike hierarchical clustering methods that have a clear stopping rule when they create the root category, k-means clustering methods run until the centroids of the categorize stabilize. Furthermore, because the k-means algorithm is basically just hill-climbing, and the initial category “seed” items are random, it can easily get stuck in a local optimum. So it is desirable to try many different starting configurations for different choices of K.

The end result of clustering is a statistically optimal set of categories in which the similarity of all the items within a category is larger than the similarity of items that belong to different categories. This is a statistical result produced by a computer, and there is no guarantee that the categories are meaningful ones that can be named and used by people. In the end, clustering relies on the data analyst or information scientist to make sense of the clusters if they are to be used to classify resources. In many cases it is better to start with categories created by people and then teach them to computers that can use supervised learning techniques to assign new resources to the categories.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset