Categories are equivalence classes: sets or groups of things or abstract entities that we treat the same.
The size of the equivalence class is determined by the properties or characteristics we consider.
Cultural, individual, and institutional categorization share some core ideas but they emphasize different processes and purposes for creating categories.
Individual categories are created by intentional activity that usually takes place in response to a specific situation.
Institutional categories are most often created in abstract and information-intensive domains where unambiguous and precise categories are needed.
The rigorous definition of institutional categories enables classification, the systematic assignment of resources to categories in an organizing system.
Computational categories are created by computer programs when the number of resources, or the number of descriptions or observations associated with each resource, is so large that people cannot think about them effectively.
In supervised learning, a machine learning program is trained by giving it sample items or documents that are labeled by category. In unsupervised learning, the program gets the samples but has to come up with the categories on its own.
Any collection of resources with sortable identifiers (alphabetic or numeric) as an associated property can benefit from using sorting order as an organizing principle.
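As a minimal sketch of sorting order as an organizing principle, the Python below sorts a few resources by identifier and then uses that order to find one by binary search. The identifiers, titles, and the `find` helper are hypothetical and invented for illustration.

```python
import bisect

# Hypothetical resources: (identifier, title) pairs invented for illustration.
resources = [
    ("QA76.9", "Databases"),
    ("BF311", "Cognition"),
    ("Z666", "Cataloging"),
]

# Sorting on the identifier gives every resource a predictable position.
by_identifier = sorted(resources, key=lambda r: r[0])
ids = [r[0] for r in by_identifier]

def find(identifier):
    """Locate a resource by binary search, which only works on sorted data."""
    i = bisect.bisect_left(ids, identifier)
    if i < len(ids) and ids[i] == identifier:
        return by_identifier[i]
    return None
```

Because the collection is ordered, a lookup inspects only a handful of positions rather than every resource.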
If only a single property is used to distinguish among some set of resources and to create the categories in an organizing system, the choice of property is critical because different properties often lead to different categories.
A sequence of organizing decisions based on a fixed ordering of resource properties creates a hierarchy, a multi-level category system.
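The sketch below illustrates how a fixed ordering of properties yields a multi-level category system. The `build_hierarchy` function and the shirt data are assumptions made for this example, not part of any particular organizing system.

```python
from collections import defaultdict

def build_hierarchy(resources, property_order):
    """Nest resources by testing properties in a fixed order."""
    if not property_order:
        return list(resources)
    first, rest = property_order[0], property_order[1:]
    groups = defaultdict(list)
    for resource in resources:
        groups[resource[first]].append(resource)
    # Each group is subdivided again using the remaining properties.
    return {value: build_hierarchy(members, rest)
            for value, members in groups.items()}

# Hypothetical resources invented for illustration.
shirts = [
    {"name": "A", "color": "blue", "sleeve": "long"},
    {"name": "B", "color": "blue", "sleeve": "short"},
    {"name": "C", "color": "white", "sleeve": "long"},
]

by_color_then_sleeve = build_hierarchy(shirts, ["color", "sleeve"])
by_sleeve_then_color = build_hierarchy(shirts, ["sleeve", "color"])
```

The same resources and properties produce two different hierarchies; only the order in which the properties are applied has changed.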
An important implication of necessary and sufficient category definition is that every member of the category is an equally good member or example of the category.
For most purposes, the most useful property of information resources for categorizing them is their aboutness, which is not directly perceivable and which is hard to characterize.
In domains where properties lack one or more of the characteristics of separability, perceptibility, and necessity, a probabilistic or statistical view of properties is needed to define categories.
(See §7.3.5, “Probabilistic Categories and ‘Family Resemblance’”)
Sharing some but not all properties is akin to family resemblances among the category members.
(See §7.3.5, “Probabilistic Categories and ‘Family Resemblance’”)
Similarity is a measure of the resemblance between two things that share some characteristics but are not identical.
(See §7.3.6, “Similarity”)
Feature- or property-based, geometry-based, transformational, and alignment- or analogy-based approaches are psychologically motivated approaches that propose different functions for computing similarity.
(See §7.3.6, “Similarity”)
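To make two of these approaches concrete, here is a rough sketch. It uses the Jaccard coefficient as a simple stand-in for a feature-based measure and Euclidean distance as a geometry-based one; both are illustrative choices, and the feature sets are invented.

```python
import math

def jaccard_similarity(features_a, features_b):
    """Feature-based: shared features divided by all features of either item."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def euclidean_distance(point_a, point_b):
    """Geometry-based: items are points; a smaller distance means more similar."""
    return math.dist(point_a, point_b)

# Hypothetical feature sets invented for illustration.
robin = {"wings", "feathers", "flies", "sings"}
penguin = {"wings", "feathers", "swims"}

print(jaccard_similarity(robin, penguin))      # 0.4 (2 shared of 5 total)
print(euclidean_distance((0, 0), (3, 4)))      # 5.0
```

The two functions can rank the same pairs of items differently, which is one reason the choice of similarity function matters.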
Classical categories can be defined precisely with just a few necessary and sufficient properties.
Broader or coarse-grained categories increase recall, but lower precision.
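The tradeoff can be worked through with the standard definitions of precision and recall. In this sketch the document identifiers and the two category extents are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved resources."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: four resources are actually relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}

# A narrow category returns few resources, all of them relevant.
narrow = {"d1", "d2"}
# A broad category returns every relevant resource plus four irrelevant ones.
broad = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```

Widening the category raises recall from 0.5 to 1.0, while precision falls from 1.0 to 0.5.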
A simple decision tree is an algorithm for determining a decision by making a sequence of logical or property tests.
(See §7.5.2, “Implementing Categories Defined by Properties”)
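A simple decision tree can be written directly as nested property tests. The sketch below is a toy example: the parcel properties, the 20 kg threshold, and the category names are all assumptions made for illustration.

```python
def shipping_category(weight_kg, is_fragile, is_international):
    """Assign a parcel to a category by testing properties in sequence."""
    if is_international:          # first test
        return "customs"
    if is_fragile:                # second test, reached only for domestic parcels
        return "handle-with-care"
    if weight_kg > 20:            # third test, hypothetical threshold
        return "freight"
    return "standard"

print(shipping_category(2.5, False, False))   # standard
print(shipping_category(35.0, False, False))  # freight
print(shipping_category(35.0, True, True))    # customs
```

Each path from the first test to a returned value corresponds to one branch of the tree, and the order of the tests determines which property takes priority.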
The conceptually simplest implementation of categories in technologies for organizing systems adopts the classical view of categories based on necessary and sufficient features.
(See §7.5.2, “Implementing Categories Defined by Properties”)
An artificial language expresses ideas concisely by introducing new terms or symbols that represent complex ideas along with syntactic mechanisms for combining and operating on them.
(See §7.5.2, “Implementing Categories Defined by Properties”)
Naïve Bayes classifiers learn from the training data the base rate of each class and how likely it is that a member of a class has each property. They combine these estimates, treating the properties as independent, to compute the probability of each class for a new item.
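The following is a minimal sketch of a naïve Bayes classifier over binary (present or absent) properties, written from scratch rather than taken from any library. The add-one smoothing, the function names, and the training samples are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (set_of_properties, class_label) pairs."""
    class_counts = Counter(label for _, label in samples)
    prop_counts = defaultdict(Counter)   # class -> property -> count
    vocabulary = set()
    for props, label in samples:
        prop_counts[label].update(props)
        vocabulary |= set(props)
    return class_counts, prop_counts, vocabulary

def classify(props, model):
    class_counts, prop_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)      # base rate of the class
        for p in vocabulary:
            # Smoothed estimate of P(property present | class).
            p_present = (prop_counts[label][p] + 1) / (n + 2)
            score += math.log(p_present if p in props else 1 - p_present)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled samples invented for illustration.
training = [
    ({"free", "winner"}, "spam"),
    ({"free", "offer"}, "spam"),
    ({"meeting", "agenda"}, "ham"),
    ({"agenda", "offer"}, "ham"),
]
model = train(training)
print(classify({"free"}, model))               # spam
print(classify({"agenda", "meeting"}, model))  # ham
```

Summing log probabilities is equivalent to multiplying the probabilities themselves and avoids numeric underflow when there are many properties.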
Because clustering techniques are unsupervised, they create categories based on calculations of similarity between resources, maximizing the similarity of resources within a category and maximizing the differences between categories.
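One widely used clustering technique is k-means, sketched below as an illustration; it is one of many possible methods. The points and the starting centroids are hypothetical, and the starting centroids are fixed by hand so that the result is repeatable.

```python
import math

def k_means(points, centroids, iterations=10):
    """Group points around the nearest centroid, then recompute the centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                # Mean of each coordinate across the cluster's members.
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)   # keep the centroid of an empty cluster
        if new_centroids == centroids:      # assignments have stabilized
            break
        centroids = new_centroids
    return clusters, centroids

# Hypothetical two-dimensional resource descriptions invented for illustration.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centroids = k_means(points, [(1, 1), (8, 8)])
```

No category labels are supplied; the two groups emerge only from the distances between the points.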