Chapter 7. Categorization: Describing Resource Classes and Types
Robert J. Glushko
Rachelle Annechino
Jess Hemerly
Robyn Perry
Longhao Wang
Table of Contents
7.2. The What and Why of Categories
7.2.3. Institutional Categories
7.2.4. A “Categorization Continuum”
7.2.5. Computational Categories
7.3. Principles for Creating Categories
7.3.4. The Limits of Property-Based Categorization
7.3.5. Probabilistic Categories and “Family Resemblance”
7.3.7. Goal-Derived Categories
7.3.8. Theory-Based Categories
7.4. Category Design Issues and Implications
7.4.1. Category Abstraction and Granularity
7.4.2. Basic or Natural Categories
7.4.3. The Recall / Precision Tradeoff
7.4.4. Category Audience and Purpose
7.5. Implementing Categories
7.5.1. Implementing Enumerated Categories
7.5.2. Implementing Categories Defined by Properties
7.5.3. Implementing Categories Defined by Probability and Similarity
7.5.4. Implementing Goal-Based Categories
7.5.5. Implementing Theory-Based Categories
7.6. Key Points in Chapter Seven
Introduction
Many texts in library science introduce categorization via cataloging rules, a set of highly prescriptive methods for assigning resources to categories that some describe and others satirize as “mark ’em and park ’em.” Many texts in computer science discuss the process of defining the categories needed to create, process, and store information in terms of programming language constructs: “here’s how to define an abstract type, and here’s the data type system.” Machine learning and data science texts explain how categories are created through statistical analysis of the correlations among the values of features in a collection or dataset. We take a very different approach in this chapter, but all of these different perspectives will find their place in it.[386]
Navigating This Chapter
In the following sections, we discuss how and why we create categories, reviewing some important work in philosophy, linguistics, and cognitive psychology to better understand how categories are created and used in organizing systems. We discuss how the way we organize differs when we act as individuals or as members of social, cultural, or institutional groups (the section called “The What and Why of Categories”); later we share principles for creating categories (the section called “Principles for Creating Categories”), design choices (the section called “Category Design Issues and Implications”), and implementation experience (the section called “Implementing Categories”). Throughout the chapter, we compare categories created by people with those created by computer algorithms. As usual, we close the chapter with a summary of the key points (the section called “Key Points in Chapter Seven”).
The What and Why of Categories
Categories are equivalence classes, sets or groups of things or abstract entities that we treat the same. This does not mean that every instance of a category is identical, only that from some perspective, or for some purpose, we are treating them as equivalent based on what they have in common. When we consider something as a member of a category, we are making choices about which of its properties or roles we are focusing on and which ones we are ignoring. We do this automatically and unconsciously most of the time, but we can also do it in an explicit and self-aware way. When we create categories with conscious effort, we often say that we are creating a model, or just modeling. You should be familiar with the idea that a model is a set of simplified descriptions or a physical representation that removes some complexity to emphasize some features or characteristics and to de-emphasize others.[387]
All human languages and cultures divide up the world into categories. How and why this takes place has long been debated by philosophers, psychologists and anthropologists. One explanation for this differentiation is that people recognize structure in the world, and then create categories of things that “go together” or are somehow similar. An alternative view says that human minds make sense of the world by imposing structure on it, and that what goes together or seems similar is the outcome rather than a cause of categorization. Bulmer framed the contrast in a memorable way by asking which came first, the chicken (the objective facts of nature) or the egghead (the role of the human intellect).[388]
A secondary and more specialized debate going on for the last few decades among linguists, cognitive scientists, and computer scientists concerns the extent to which the cognitive mechanisms involved in category formation are specialized for that purpose rather than more general learning processes.[389]
Even before they can talk, children behave in ways that suggest they have formed categories based on shape, color, and other properties they can directly perceive in physical objects.[390] People almost effortlessly learn tens of thousands of categories embodied in the culture and language in which they grow up. People also rely on their own experiences, preferences, and goals to adapt these cultural categories or create entirely individual ones that they use to organize resources that they personally arrange. Later on, through situational training and formal education, people learn to apply systematic and logical thinking processes so that they can create and understand categories in engineering, logistics, transport, science, law, business, and other institutional contexts.
These three contexts of cultural, individual, and institutional categorization share some core ideas, but they emphasize different processes and purposes for creating categories, so the distinction among them is a useful one.[391] Cultural categorization can be understood as a natural human cognitive ability that serves as a foundation for both informal and formal organizing systems. Individual categorization tends to grow spontaneously out of our personal activities. Institutional categorization responds to the need for formal coordination and cooperation within and between companies, governments, and other goal-oriented enterprises.
Cultural Categories
Cultural categories are the archetypical form of categories upon which individual and institutional categories are usually based. Cultural categories tend to describe our everyday experiences of the world and our accumulated cultural knowledge. Such categories describe objects, events, settings, internal experiences, physical orientation, relationships between entities, and many other aspects of human experience. Cultural categories are learned primarily through children's ordinary interactions with their caregivers, with little explicit instruction; they are associated with language acquisition and language use within particular cultural contexts.
Over two thousand years ago Plato wrote that living species could be identified by “carving nature at its joints,” the natural boundaries or discontinuities between types of things where the differences are the largest or most salient. Plato’s metaphor is intuitively appealing because we can easily come up with examples of perceptible properties or behaviors of physical things that go together that make some ways of categorizing them seem more natural than others.[392]
Natural languages rely heavily on nouns to talk about categories of things because it is useful to have a shorthand way of referring to a set of properties that co-occur in predictable ways.[393] For example, in English (borrowed from Portuguese) we have a word for “banana” because a particular curved shape, greenish-yellow or yellow color, and a convenient size tend to co-occur in a familiar edible object, so it became useful to give it a name. The word “banana” brings together this configuration of highly interrelated perceptions into a unified concept so we do not have to refer to bananas by listing their properties.[394]
Languages differ a great deal in the words they contain, and also in more fundamental ways: each language requires its speakers or writers to attend to details about the world or aspects of experience that another language allows them to ignore. This idea is often described as linguistic relativity. (See the sidebar, Linguistic Relativity.)
Nevertheless, even though academic linguists have discredited strong versions of Whorf’s ideas, less deterministic versions of linguistic relativity have become influential and help us understand cultural categorization. The more moderate position was crisply characterized by Roman Jakobson, who said that “languages differ essentially in what they must convey and not in what they may convey.” In English one can say “I spent yesterday with a neighbor.” In languages with grammatical gender, one must choose a word that identifies the neighbor as male or female.[395]
For example, speakers of the Australian Aboriginal language Guugu Yimithirr do not use concepts of left and right, but rather use cardinal directions. Where in English we might say to a person facing north, “Take a step to your left,” they would use their term for west. If the person faced south, we would change our instruction to “right,” but they would still use their term for west. Imagine how difficult it would be for a speaker of Guugu Yimithirr and a speaker of English to collaborate in organizing a storage room or a closet.[396]
It is not controversial to notice that different cultures and language communities have different experiences and activities that give them contrasting knowledge about particular domains. No one would doubt that university undergraduates in Chicago would think differently about animals than inhabitants of Guatemalan rain forests, or even that different types of “tree experts” (taxonomists, landscape workers, foresters, and tree maintenance personnel) would categorize trees differently.[397]
On the other hand, despite the wide variation in the climates, environments, and cultures that produce them, at a high level “folk taxonomies” that describe natural phenomena are surprisingly consistent around the world. A century ago the sociologists Emile Durkheim and Marcel Mauss observed that the language and structure of folk taxonomies mirrors that of human family relationships (e.g., different types of trees might be “siblings,” but animals would be part of another family entirely). They suggested that framing the world in terms of familiar human relationships allowed people to understand it more easily.[398]
Anthropologist Brent Berlin, a more recent researcher, concurs with Durkheim and Mauss’s observation that kinship relations and folk taxonomies are related, but argues that humans patterned their family structures after the natural world, not the other way around.[399]
Invoking the Whorfian Hypothesis in a Clothing Ad
(Photo by R. Glushko. Taken in the Reykjavik airport.)
Individual Categories
Individual categories are created in an organizing system to satisfy the ad hoc requirements that arise from a person’s unique experiences, preferences, and resource collections. Unlike cultural categories, which usually develop slowly and last a long time, individual categories are created by intentional activity, in response to a specific situation, or to solve an emerging organizational challenge. As a consequence, the categories in individual organizing systems generally have short lifetimes and rarely outlive the person who created them.[400]
Traditionally, individual categorization systems were not visible to, or shared with, others; sharing has become increasingly common, however, for people using web-based organizing systems for pictures, music, or other personal resources. On popular websites like Flickr, Instagram, and YouTube, people typically tag their photos and videos with existing cultural categories as well as with individual ones that they invent.[401]
Institutional Categories
In contrast to cultural categories that are created and used implicitly, and to individual categories that are used by people acting alone, institutional categories are created and used explicitly, and most often by many people in coordination with each other. Institutional categories are most often created in abstract and information-intensive domains where unambiguous and precise categories are needed to regulate and systematize activity, to enable information sharing and reuse, and to reduce transaction costs. Furthermore, instead of describing the world as it is, institutional categories are usually defined to change or control the world by imposing semantic models that are more formal and arbitrary than those in cultural categories. Laws, regulations, and standards often specify institutional categories, along with decision rules for assigning resources to new categories, and behavior rules that prescribe how people must interact with them. The rigorous definition of institutional categories enables classification: the systematic assignment of resources to categories in an organizing system.[402]
Creating institutional categories by more systematic processes than cultural or individual categories does not ensure that they will be used in systematic and rational ways, because the reasoning and rationale behind institutional categories might be unknown to, or ignored by, the people who use them. Likewise, this way of creating categories does not prevent them from being biased. Indeed, the goal of institutional categories is often to impose or incentivize biases in interpretation or behavior. There is no better example of this than the practice of gerrymandering, designing the boundaries of election districts to give one political party or ethnic group an advantage.[403] (See the sidebar, Gerrymandering the Illinois 17th Congressional District.)
Gerrymandering the Illinois 17th Congressional District
(Picture from nationatlas.gov. Not protectable by copyright (17 USC Sec. 105).)
Institutional categorization stands apart from individual categorization primarily because it requires significant effort to reconcile mismatches among existing individual categories, whose embodied working or contextual knowledge is easily lost in the move to a formal institutional system.[404]
Institutional categorization efforts must also overcome the vagueness and inconsistency of cultural categories, because institutional categories must often conform to stricter logical standards to support inference and meet legal requirements. Furthermore, institutional categorization is usually a process that must be accounted for in budgets and staffing plans. While some kinds of institutional categories can be devised or discovered by computational processes, most are created through the collaboration of many individuals, typically from various parts of an organization or from different firms. For example, in the gerrymandering case we just discussed, it is important to emphasize that the inputs to these programs and the decisions about districting are controlled by people, which is why the districts are institutional categories; the programs are simply tools that make the process more efficient.[405]
Some institutional categories that initially had narrow or focused applicability have found their way into more popular use and are now considered cultural categories. A good example is the periodic table in chemistry, which Mendeleev developed in 1869 as a new system of categories for the chemical elements. The periodic table proved essential to scientists in understanding their properties and in predicting undiscovered ones. Today the periodic table is taught in elementary schools, and many things other than elements are commonly arranged using a graphical structure that resembles the periodic table of elements in chemistry, including science fiction films, desserts, and superheroes.[406]
A “Categorization Continuum”
CAFE Standards: Blurring the Lines Between Categorization Perspectives
The Corporate Average Fuel Economy (CAFE) standards sort vehicles into “passenger car” and “light truck” categories and impose higher minimum fuel efficiency requirements for cars because trucks have different typical uses.
CAFE standards have evolved over time, becoming a theater for political clashes between holistic cultural categories and formal institutional categories, which plays out in competing pressures from industry, government, and political organizations. Furthermore, CAFE standards and manufacturers’ response to them are influencing cultural categories, such that our cultural understanding of what a car looks like is changing over time as manufacturers design vehicles like the PT Cruiser with car functionality in unconventional shapes to take advantage of the CAFE light truck specifications.[407]
Computational Categories
The simplest kind of computational categories can be created using descriptive statistics (see the section called “Organizing With Descriptive Statistics”). Descriptive statistics do not identify the categories they create by giving them familiar cultural or institutional labels. Instead, they create implicit categories of items according to how much they differ from the most typical or frequent ones. For example, in any dataset where the values follow the normal distribution, statistics of central tendency and dispersion serve as standard reference measures for any observation. These statistics identify categories of items that are very different or statistically unlikely outliers, which could be signals of measurement errors, poorly calibrated equipment, employees who are inadequately trained or committing fraud, or other problems. The “Six Sigma” methodology for process improvement and quality control rests on this idea that careful and consistent collection of statistics can make any measurable operation better.
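The idea of flagging statistically unlikely observations as an implicit category can be made concrete with a minimal Python sketch. The sensor readings and the two-sigma threshold below are hypothetical illustrations, not part of the original text.

```python
import statistics

def flag_outliers(values, sigma=3.0):
    """Split observations into implicit 'typical' and 'outlier'
    categories by their distance from the mean, measured in
    standard deviations (the idea behind Six Sigma monitoring)."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    categories = {"typical": [], "outlier": []}
    for v in values:
        z = abs(v - mean) / stdev
        categories["outlier" if z > sigma else "typical"].append(v)
    return categories

# Hypothetical equipment readings: one value is far from the rest.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 25.0]
groups = flag_outliers(readings, sigma=2.0)
print(groups["outlier"])  # → [25.0]
```

Note that the categories here carry no cultural label; an observation is an “outlier” only relative to the central tendency and dispersion of this particular dataset.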
Many text processing methods and applications use simple statistics to categorize words by their frequency in a language, in a collection of documents, or in individual documents, and these categories are exploited in many information retrieval applications (see the section called “Interactions Based on Instance Properties” and the section called “Interactions Based on Collection Properties”).
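A sketch of this frequency-based categorization of words might look like the following; the thresholds and the toy document are arbitrary illustrations.

```python
from collections import Counter

def frequency_categories(text, high=0.05, low=0.01):
    """Categorize words as high-, medium-, or low-frequency by their
    share of all word occurrences in a document."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    categories = {"high": set(), "medium": set(), "low": set()}
    for word, n in counts.items():
        share = n / total
        if share >= high:
            categories["high"].add(word)
        elif share >= low:
            categories["medium"].add(word)
        else:
            categories["low"].add(word)
    return categories

doc = "the cat sat on the mat and the dog sat by the door"
cats = frequency_categories(doc, high=0.2, low=0.1)
print(sorted(cats["high"]))  # → ['the']
```

Information retrieval applications exploit exactly this kind of split, for example by discounting very high-frequency “stop words” when ranking documents.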
Supervised and Unsupervised Learning
Two subfields of machine learning that are relevant to organizing systems are supervised and unsupervised learning. In supervised learning, a machine learning program is trained with sample items or documents that are labeled by category, and the program learns to assign new items to the correct categories. In unsupervised learning, the program gets the same items but has to come up with the categories on its own by discovering the underlying correlations between the items; that is why unsupervised learning is sometimes called statistical pattern recognition.
Categories that people create and label also can be used more explicitly in computational algorithms and applications. In particular, a program that can assign an item or instance to one or more existing categories is called a classifier. The subfield of computer science known as machine learning is home to numerous techniques for creating classifiers by training them with already correctly categorized examples. This training is called supervised learning; it is supervised because it starts with instances labeled by category, and it involves learning because over time the classifier improves its performance by adjusting the weights for features that distinguish the categories. But strictly speaking, supervised learning techniques do not learn the categories; they implement and apply categories that they inherit or are given. We will further discuss the computational implementation of categories created by people in the section called “Implementing Categories”.
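A nearest-centroid classifier is one of the simplest supervised learning techniques, and a minimal sketch shows the essential pattern: labeled training examples in, category assignments out. The training data (pet weights and heights) is a hypothetical illustration.

```python
def train_centroids(labeled_items):
    """Supervised learning sketch: compute one centroid per category
    from (features, label) training pairs."""
    sums, counts = {}, {}
    for features, label in labeled_items:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]]
            for label in sums}

def classify(features, centroids):
    """Assign a new item to the category with the nearest centroid."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(features, centroids[label]))

# Hypothetical training data: (weight_kg, height_cm) labeled by category.
training = [((4.0, 25.0), "cat"), ((5.0, 28.0), "cat"),
            ((30.0, 60.0), "dog"), ((35.0, 65.0), "dog")]
centroids = train_centroids(training)
print(classify((6.0, 30.0), centroids))  # → cat
```

Note that the classifier never invents the “cat” and “dog” categories; it inherits them from the labels people supplied, which is exactly the sense in which supervised learning does not learn the categories themselves.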
In contrast, many computational techniques in machine learning can analyze a collection of resources to discover statistical regularities or correlations among the items, creating a set of categories without any labeled training data. This is called unsupervised learning or statistical pattern recognition. As we pointed out in the section called “Cultural Categories”, we learn most of our cultural categories without any explicit instruction about them, so it is not surprising that computational models of categorization developed by cognitive scientists often employ unsupervised statistical learning methods.
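By contrast, an unsupervised technique like k-means receives no labels at all; it discovers groupings from the statistical structure of the data. The following is a bare-bones one-dimensional sketch (real implementations use better initialization and convergence checks), with made-up data.

```python
def k_means(points, k, rounds=10):
    """Unsupervised learning sketch: discover k clusters of 1-D points
    with no labeled training data (plain k-means)."""
    centers = points[:k]  # naive initialization from the data itself
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
clusters = k_means(data, k=2)
print([sorted(c) for c in clusters])
```

The two clusters it finds have no names; attaching a meaningful label to a discovered cluster is still a job for a person or an institution.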
The narrow scope of many computational categories is obvious when we consider how we might describe one. “Fraudulent transaction for cardholder 4264123456780123” is not lexicalized with a one-word label as familiar cultural categories are. “Door” and “window” have broad scopes that are not tied to a single purpose. Put another way, the “door” and “window” cultural categories are highly reusable, as are institutional categories like those used to collect economic or health data that can be analyzed for many different purposes. The definitions of “door” and “window” might be a little fuzzy, but institutional categories are more precisely defined, often by law or regulation. Examples are the North American Industry Classification System (NAICS) from the US Census Bureau and the United Nations Standard Products and Services Code (UNSPSC).
Principles for Creating Categories
The section called “The What and Why of Categories” explained what categories are and the contrasting cultural, individual, and institutional contexts and purposes for which categories are created. In doing so, a number of different principles for creating categories were mentioned, mostly in passing.
The simplest principle for creating a category is enumeration; any resource in a finite or countable set can be deemed a category member by that fact alone. This principle is also known as extensional definition, and the members of the set are called the extension. Many institutional categories are defined by enumeration as a set of possible or legal values, like the 50 United States or the ISO currency codes (ISO 4217).
Enumerative categories enable membership to be unambiguously determined because a value like state name or currency code is either a member of the category or it is not. However, this clarity has a downside; it makes it hard to argue that something not explicitly mentioned in an enumeration should be considered a member of the category, which can make laws or regulations inflexible. Moreover, a category can grow so large that enumerative definition becomes impractical or inefficient, and the category must then either be sub-divided or be given a definition based on principles other than enumeration.[408]
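The all-or-nothing character of extensional definition is easy to see in code. A sketch, using a small illustrative subset of the ISO 4217 currency codes:

```python
# A small subset of ISO 4217 currency codes, for illustration only.
CURRENCY_CODES = {"USD", "EUR", "JPY", "GBP", "CHF", "CAD"}

def is_currency_code(value):
    """Extensional definition: membership is decided solely by
    presence in the enumerated set -- there is no defining property
    to test, only the list itself."""
    return value in CURRENCY_CODES

print(is_currency_code("EUR"))  # → True
print(is_currency_code("BTC"))  # → False: not in the enumeration
```

There is no way to argue “BTC” into the category; until the enumeration itself is amended, non-membership is final, which is exactly the inflexibility the text describes.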
Too Many Planets to Enumerate: Keeping up with Kepler
Kepler is a space observatory launched by NASA in 2009 to search for Earth-like planets orbiting other stars in our own Milky Way galaxy. Kepler has already discovered and verified a few thousand new planets, and these results have led to estimates that there may be at least as many planets as there are stars, a few hundred billion in the Milky Way alone. Count fast.
The International Astronomical Union (IAU) sought to resolve this category crisis by proposing a definition of planet as “a celestial body that is (a) in orbit around a star, (b) has sufficient mass for its self-gravity to overcome rigid body forces so that it assumes a hydrostatic equilibrium (nearly round) shape, and (c) has cleared the neighborhood around its orbit.” Unfortunately, Pluto does not satisfy the third requirement, so it is no longer a member of the planet category, and instead is now called a “dwarf planet.”
Changing the definition of a significant cultural category generated a great deal of controversy and angst among nonscientists. A typical headline was “Pluto’s demotion has schools spinning,” describing the outcry from elementary school students and teachers about the injustice done to Pluto and the disruption to the curriculum.[409]
It is intuitive and useful to think in terms of properties when we identify instances and when we are describing instances (as we saw in the section called “Resource Identity” and in Chapter 5, Resource Description and Metadata). Therefore, it should also be intuitive and useful to consider properties when we analyze more than one instance to compare and contrast them so we can determine which sets of instances can be treated as a category or equivalence class. Categories whose members are determined by one or more properties or rules follow the principle of intensional definition, and the defining properties are called the intension.
You might be thinking here that enumeration or extensional definition of a category is also a property test; is not “being a state” a property of California? But statehood is not a property precisely because “state” is defined by extension, which means the only way to test California for statehood is to see if it is in the list of states.[410]
Any single property of a resource can be used to create categories, and the easiest ones to use are often the intrinsic static properties. As we discussed in Chapter 5, Resource Description and Metadata, intrinsic static properties are those inherent in a resource that never change. The material of composition of natural or manufactured objects is an intrinsic and static property that can be used to arrange physical resources. For example, an organizing system for a personal collection of music that is based on the intrinsic static property of physical format might use categories for CDs, DVDs, vinyl albums, 8-track cartridges, reel-to-reel tape and tape cassettes.[411]
The name or identifier of a resource is often arbitrary but once assigned normally does not change, making it an extrinsic static property. Any collection of resources with alphabetic or numeric identifiers as an associated property can use sorting order as an organizing principle to arrange spices, books, personnel records, etc., in a completely reliable way. Some might argue whether this organizing principle creates a category system, or whether it simply exploits the ordering inherent in the identifier notation. For example, with alphabetic identifiers, we can think of alphabetic ordering as creating a recursive category system with 26 (A-Z) top-level categories, each containing the same number of second-level categories, and so on until every instance is assigned to its proper place.[412]
Some properties can have a large number of values or be continuous measures, but as long as there are explicit rules for using property values to determine category assignment, the resulting categories are still easy to understand and use. For example, we naturally categorize people we know on the basis of their current profession, the city where they live, their hobbies, or their age. Properties with a numerical dimension like “frequency of use” are often transformed into a small set of categories like “frequently used,” “occasionally used,” and “rarely used” based on the numerical property values.[413]
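Transforming a continuous property into a small set of named categories is simply a matter of choosing thresholds. A sketch, with arbitrary illustrative cutoffs:

```python
def usage_category(uses_per_month):
    """Map a continuous 'frequency of use' measure onto a small set
    of named categories; the thresholds are arbitrary illustrations
    and would be chosen to suit the organizing system's purpose."""
    if uses_per_month >= 10:
        return "frequently used"
    elif uses_per_month >= 2:
        return "occasionally used"
    else:
        return "rarely used"

print(usage_category(15))  # → frequently used
print(usage_category(5))   # → occasionally used
print(usage_category(0))   # → rarely used
```

The explicit rule makes category assignment completely predictable even though the underlying property takes continuous values.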
While there are an infinite number of logically expressible properties for any resource, most of them would not lead to categories that would be interpretable and useful for people. If people are going to use the categories, it is important to base them on properties that are psychologically or pragmatically relevant for the resource domain being categorized. Whether something weighs more or less than 5000 pounds is a poor property to apply to things in general, because it puts cats and chairs in one category, and buses and elephants in another.[414]
Multi-Level or Hierarchical Categories
If you have many shirts in your closet (and you are a bit compulsive or a “neat freak”), instead of just separating your shirts from your pants using a single property (the part of body on which the clothes are worn) you might arrange the shirts by style, and then by sleeve length, and finally by color. When all of the resources in an organizing system are arranged using the same sequence of resource properties, this creates a logical hierarchy, a multi-level category system.
Put another way, each subdivision of a category takes place when we identify or choose a property that differentiates the members of the category in a way that is important or useful for some intent or purpose. Shirts differ from pants in the value of the “part of body” property, and all the shirt subcategories share this “top part” value of that property. However, shirts differ on other properties that determine the subcategory to which they belong. Even as we pay attention to these differentiating properties, it is important to remember the other properties, the ones that members of a category at any level in the hierarchy have in common with the members of the categories that contain it. These properties are often described as “inherited” or “inferred” from the broader category.[415] For example, just as every shirt shares the “worn on top part of body” property, every item of clothing shares the “can be worn on the body” property, and every resource in the “shirts” and “pants” category inherits that property.
However, even when the lowest level categories of our shirt organizing system have more than one member, we might choose not to use additional properties to subdivide them because the differences that remain among the members do not matter to us for the interactions the organizing system needs to support. Suppose we have two long-sleeve white dress shirts from different shirt makers, but whenever we need to wear one of them, we ignore this property. Instead, we just pick one or the other, treating the shirts as completely equivalent or substitutable. When the remaining differences between members of a category do not make a difference to the users of the category, we can say that the organizing system is pragmatically or practically complete even if it is not yet logically complete. That is to say, it is complete “for all intents and purposes.” Indeed, we might argue that it is desirable to stop subdividing a system of categories while there are some small differences remaining among the items in each category because this leaves some flexibility or logical space in which to organize new items. This point might remind you of the concept of overfitting, where models with many parameters can very accurately fit their training data, but as a result generalize less well to new data. (See the section called “Resource Description for Sensemaking and Science”.)
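The way subcategories inherit properties from the broader categories that contain them maps naturally onto class inheritance in a programming language. A sketch of the clothing hierarchy (the class and property names are illustrative, not from the original text):

```python
class Clothing:
    """Top-level category: every item of clothing shares the
    'can be worn on the body' property."""
    wearable_on_body = True

class Shirt(Clothing):
    """Subcategory differentiated by the 'part of body' property;
    it inherits wearable_on_body without restating it."""
    body_part = "top"

class LongSleeveDressShirt(Shirt):
    """Further subdivision by style and sleeve length."""
    sleeve_length = "long"
    style = "dress"

shirt = LongSleeveDressShirt()
# Properties of broader categories are inherited, not restated:
print(shirt.wearable_on_body, shirt.body_part, shirt.sleeve_length)
```

Each level adds only its differentiating properties, just as each subdivision of a category is defined by the property that distinguishes its members.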
Classifying Hawaiian “Boardshorts”
The swimsuits worn by surfers, called “boardshorts,” have evolved from purely functional garments to symbols of extreme sports and the Hawaiian lifestyle. A 2012 exhibition at the Honolulu Museum of Art captured the diversity of boardshorts on three facets: their material, how they fastened around the surfer’s fly and waist, and their length.
Different Properties for Subsets of Resources
The contrasts between intrinsic and extrinsic properties, and between static and dynamic ones, are helpful in explaining this method of creating organizing categories. For example, you might organize all of your clothes using intrinsic static properties if you keep your shirts, socks, and sweaters in different drawers and arrange them by color; extrinsic static properties if you share your front hall closet with a roommate, so you each use only one side of that closet space; intrinsic dynamic properties if you arrange your clothes for ready access according to the season; and, extrinsic dynamic properties if you keep your most frequently used jacket and hat on a hook by the front door.[416]
If we relax the requirement that different subsets of resources use different organizing properties and allow any property to be used to describe any resource, the loose organizing principle we now have is often called tagging. Using any property of a resource to create a description is an uncontrolled and often unprincipled principle for creating categories, but it is increasingly popular for organizing photos, web sites, email messages in Gmail, or other web-based resources. We discuss tagging in more detail in the section called “Tagging of Web-based Resources”.
Necessary and Sufficient Properties
A large set of resources does not always require many properties and categories to organize it. Some types of categories can be defined precisely with just a few essential properties. For example, a prime number is a positive integer greater than 1 that has no divisors other than 1 and itself, and this category definition perfectly distinguishes prime and not-prime numbers no matter how many numbers are being categorized. “Positive integer greater than 1” and “divisible only by 1 and itself” are necessary or defining properties for the prime number category; every prime number must satisfy these properties. These properties are also sufficient to establish membership in the prime number category; any number that satisfies the necessary properties is a prime number. Categories defined by necessary and sufficient properties are also called monothetic. They are also sometimes called classical categories because they conform to Aristotle’s theory of how categories are used in logical deduction using syllogisms.[417] (See the sidebar, The Classical View of Categories.)
An important implication of necessary and sufficient category definition is that every member of the category is an equally good member or example of the category; every prime number is equally prime. Institutional category systems often employ necessary and sufficient properties for their conceptual simplicity and straightforward implementation in decision trees, database schemas, and programming language classes.
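The prime number category’s membership test is simple enough to sketch directly; this minimal Python function checks the necessary properties and, because they are also sufficient, its answer settles membership:

```python
def is_prime(n: int) -> bool:
    """Membership test for the monothetic category 'prime number': a positive
    integer greater than 1 whose only divisors are 1 and itself."""
    if n < 2:
        return False  # fails the necessary "integer greater than 1" property
    for divisor in range(2, int(n ** 0.5) + 1):
        if n % divisor == 0:
            return False  # a divisor other than 1 and n disqualifies it
    return True  # satisfying the necessary properties is sufficient

# Every member is an equally good member: 2 is exactly as prime as 97.
print([n for n in range(2, 20) if is_prime(n)])  # [2, 3, 5, 7, 11, 13, 17, 19]
```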
The Classical View of Categories
Consider the definition of an address as requiring a street, city, governmental region, and postal code. Anything that has all of these information components is therefore considered to be a valid address, and anything that lacks any of them will not be considered to be a valid address. If we refine the properties of an address to require the governmental region to be a state, and specifically one of the United States Postal Service’s list of official state and territory codes, we create a subcategory for US addresses that uses an enumerated category as part of its definition. Similarly, we could create a subcategory for Canadian addresses by exchanging the name “province” for state, and using an enumerated list of Canadian province and territory codes.
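A minimal sketch of this classical view of addresses; the field names and the deliberately truncated state code list are illustrative assumptions, not the USPS’s actual data model:

```python
# Hypothetical sketch: the US address subcategory defined by necessary and
# sufficient properties, one of which (state) is itself an enumerated category.
US_STATE_CODES = {"AL", "AK", "AZ", "CA", "NY", "TX"}  # truncated for illustration

REQUIRED_FIELDS = ("street", "city", "state", "postal_code")

def is_valid_us_address(address: dict) -> bool:
    # Necessary: every required information component must be present...
    if not all(address.get(field) for field in REQUIRED_FIELDS):
        return False
    # ...and the governmental region must come from the enumerated code list.
    return address["state"] in US_STATE_CODES

print(is_valid_us_address(
    {"street": "123 Main St", "city": "Berkeley", "state": "CA",
     "postal_code": "94720"}))  # True
print(is_valid_us_address({"street": "123 Main St", "city": "Berkeley"}))  # False
```

Anything that passes both tests is a valid US address, and anything that fails either is not; there are no borderline cases.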
The Limits of Property-Based Categorization
Property-based categorization works tautologically well for categories like “prime number” where the category is defined by necessary and sufficient properties. Property-based categorization also works well when properties are conceptually distinct and the value of a property is easy to perceive and examine, as they are with man-made physical resources like shirts.
Historical experience with organizing systems that need to categorize information resources has shown that basing categories on easily perceived properties is often not effective. There might be indications “on the surface” that suggest the “joints” or boundaries between types of information resources, but these are often just presentation or packaging choices. That is to say, neither the size of a book nor the color of its cover is a reliable cue for what it contains. Information resources have numerous descriptive properties like their title, author, and publisher that can be used more effectively to define categories, and these are certainly useful for some kinds of interactions, like finding all of the books written by a particular author or published by the same publisher. However, for practical purposes, the most useful property of an information resource is its aboutness, which may not be objectively perceivable and which is certainly hard to characterize.[418] Any collection of information resources in a library or document filing system is likely to be about many subjects and topics, and when an individual resource is categorized according to a limited number of its content properties, it is at the same time not being categorized using the others.
Classifying the Web: Yahoo! in 1996
Yahoo!’s goal was to manually assign every web page to a category.
(Screenshot by R. Glushko. Source: Internet Archive Wayback Machine.)
Considering every distinct word in a document stretches our notion of property to make it very different from the kinds of properties we have discussed so far, where properties were being explicitly used by people to make decisions about category membership and resource organization. It is just not possible for people to pay attention to more than a few properties at the same time even if they want to, because that is how human perceptual and cognitive machinery works. But computers have no such limitations, and algorithms for information retrieval and machine learning can use huge numbers of properties, as we will see later in this chapter and in Chapter 8, Classification: Assigning Resources to Categories and Chapter 10, Interactions with Resources.
Probabilistic Categories and “Family Resemblance”
The first is an effect of typicality or centrality that makes some members of the category better examples than others. Membership in probabilistic categories is not all or none, so even among instances that share many properties, one that has more of the characteristic properties will be judged as a better or more typical member.[419] Try to define “bird” and then ask yourself if all of the things you classify as birds are equally good examples of the category (look at the six birds in Family Resemblance and Typicality). This effect is also described as gradience in category membership and reflects the extent to which the most characteristic properties are shared.
A second consequence is that the sharing of some but not all properties creates what we call family resemblances among the category members, just as biological family members do not necessarily all share a single set of physical features but still are recognizable as members of the same family. This idea was first proposed by the 20th-century philosopher Ludwig Wittgenstein, who used “games” as an example of a category whose members resemble each other according to shifting property subsets.[420]
The third consequence, when categories do not have necessary features for membership, is that the boundaries of the category are not fixed; the category can be stretched and new members assigned as long as they resemble incumbent members. Video games and massively multiplayer online games like World of Warcraft did not exist in Wittgenstein’s time, but we have no trouble recognizing them as games, and neither would Wittgenstein, were he alive. Recall that in Chapter 1, Foundations for Organizing Systems we pointed out that the cultural category of “library” has been repeatedly extended by new properties, as when Flickr is described as a web-based photo-sharing library. Categories defined by family resemblance or multiple and shifting property sets are termed polythetic.
Ludwig Wittgenstein (1889-1951) was a philosopher who thought deeply about mathematics, the mind, and language. In 1999, his Philosophical Investigations was ranked as the most important book of 20th-century philosophy in a poll of philosophers.[421] In that book, Wittgenstein uses “game” to argue that many concepts have no defining properties, and that instead there is a “complicated network of similarities overlapping and criss-crossing: sometimes overall similarities, sometimes similarities of detail.” He contrasts board games, card games, ball games, games of skill, games of luck, games with competition, solitary games, and games for amusement. Wittgenstein notes that not all games are equally good examples of the category, and jokes about teaching children a gambling game with dice because he knows that this is not the kind of game that the parents were thinking of when they asked him to teach their children a game.[422]
We conclude that instead of using properties one at a time to assign category membership, we can use them in a composite or integrated way, where a co-occurring cluster of properties together provides evidence that contributes to a similarity calculation. Something is categorized as an A and not a B if it is more similar to A’s best or most typical member than it is to B’s.[423]
Family Resemblance and Typicality
A penguin, a pigeon, a swan, a stork, a flamingo, and a frigate bird. (Clockwise from top-left.)
Similarity
Similarity is a measure of the resemblance between two things that share some characteristics but are not identical. It is a very flexible notion whose meaning depends on the domain within which we apply it. Some people consider that the concept of similarity is itself meaningless because there must always be some basis, some unstated set of properties, for determining whether two things are similar. If we could identify those properties and how they are used, there would not be any work for a similarity mechanism to do.[424]
To make similarity a useful mechanism for categorization we have to specify how the similarity measure is determined. There are four psychologically-motivated approaches that propose different functions for computing similarity: feature- or property-based, geometry-based, transformational, and alignment- or analogy-based. The big contrast here is between models that represent items as sets of properties or discrete conceptual features, and those that assume that properties vary on a continuous metric space.[425]
Feature-based Models of Similarity
Feature-based models of similarity, such as the influential “contrast model” of psychologist Amos Tversky, compare two things in terms of three sets of features: the features that the two share, those features that the first has that the second lacks, and those features that the second has that the first lacks.
The similarity based on the shared features is reduced by the two sets of distinctive ones. The weights assigned to each set can be adjusted to explain judgments of category membership. Another commonly used feature-based similarity measure is the Jaccard coefficient, the ratio of the number of common features to the total number of features. This simple calculation equals zero if there are no overlapping features and one if all features overlap. Jaccard’s measure is often used to calculate document similarity by treating each word as a feature.[426]
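The Jaccard coefficient is easy to sketch; here each document is reduced to its set of distinct words, a deliberately crude notion of “feature”:

```python
def jaccard(a: set, b: set) -> float:
    """Ratio of shared features to total features: 0 (disjoint) to 1 (identical)."""
    if not a and not b:
        return 1.0  # edge case: two empty feature sets are vacuously identical
    return len(a & b) / len(a | b)

doc1 = set("the cat sat on the mat".split())
doc2 = set("the dog sat on the log".split())
print(jaccard(doc1, doc2))  # 3 shared words out of 7 distinct ones, about 0.43
```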
Geometric Models of Similarity
Geometric similarity functions are commonly used by search engines; if a query and document are each represented as a vector of search terms, relevance is determined by the distance between the vectors in the “term space.” The simplified diagram in the sidebar, Document Similarity, depicts four documents whose locations in the term space are determined by how many of each of three terms they contain. The document vectors are normalized to length 1, which makes it possible to use the cosine of the angle between any two documents as a measure of their similarity. Documents d1 and d2 are more similar to each other than documents d3 and d4, because the angle between the former pair (Θ) is smaller than the angle between the latter (Φ). We will discuss how this works in greater detail in Chapter 10, Interactions with Resources.
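A minimal sketch of the cosine measure; the term-count vectors below are invented for illustration and are not the documents in the sidebar diagram:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-count vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Counts of three index terms in four hypothetical documents.
d1, d2, d3, d4 = [2, 3, 0], [3, 4, 1], [0, 1, 4], [1, 0, 3]
print(round(cosine(d1, d2), 3))  # similar term mixes score near 1
print(round(cosine(d1, d3), 3))  # different term mixes score much lower
```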
The diagram in the sidebar, Geometric Distance Functions shows two different ways of calculating the distance between points 1 and 2 using the differences A and B. The Euclidean distance function takes the square root of the sum of the squared differences on each dimension; in two dimensions, this is the familiar Pythagorean Theorem to calculate the length of the hypotenuse of a right triangle, where the exponent applied to the differences is 2. In contrast, the City Block distance function, so-named because it is the natural way to measure distances in cities with “gridlike” street plans, simply adds up the differences on each dimension, which is equivalent to an exponent of 1.
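Both functions are special cases of a more general distance with an adjustable exponent; a minimal sketch:

```python
def minkowski(p1, p2, exponent):
    """Generalized distance: exponent 2 gives Euclidean, exponent 1 gives City Block."""
    return sum(abs(a - b) ** exponent for a, b in zip(p1, p2)) ** (1 / exponent)

point1, point2 = (0, 0), (3, 4)      # differences A = 3 and B = 4
print(minkowski(point1, point2, 2))  # Euclidean: sqrt(3**2 + 4**2) = 5.0
print(minkowski(point1, point2, 1))  # City Block: 3 + 4 = 7.0
```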
Transformational Models of Similarity
Transformational models assume that the similarity between two things is inversely proportional to the complexity of the transformation required to turn one into the other. The simplest transformational model of similarity counts the number of properties that would need to change their values. More generally, one way to perform the name matching task of determining when two different strings denote the same person, object, or other named entity is to calculate the “edit distance” between them; the number of changes required to transform one into the other.
The simplest calculation just counts the number of insertion, deletion, and substitution operations and is called the Levenshtein distance; for example, the distance between “bob” and “book” is two: insert “o” and change the second “b” to “k”. Two strings with a short edit distance might be variant spellings or misspellings of the same name, and transformational models that are sensitive to common typing errors like transposed or duplicated letters are very effective at spelling correction. Transformational models of similarity are also commonly used to detect plagiarism and duplicate web pages.[427]
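A standard dynamic-programming sketch of the Levenshtein distance:

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn s into t."""
    # previous[j] holds the distance from the first i-1 characters of s
    # to the first j characters of t as the rows are filled in.
    previous = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        current = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            current.append(min(previous[j] + 1,          # delete from s
                               current[j - 1] + 1,       # insert into s
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]

print(levenshtein("bob", "book"))  # 2: insert "o", substitute final "b" with "k"
```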
Alignment or Analogy Models of Similarity
This kind of analogical comparison is especially important in problem solving. You might think that experts are good at solving problems in their domain of expertise because they have organized their knowledge and experience in ways that enable efficient search for and evaluation of possible solutions. For example, it is well known that chess masters search their memories of previous winning positions and the associated moves to decide what to play. However, top chess players also organize their knowledge and select moves on the basis of abstract similarities that cannot be explained in terms of specific positions of chess pieces. This idea that experts represent and solve problems at deeper levels than novices do by using more abstract principles or domain structure has been replicated in many areas. Novices tend to focus more on surface properties and rely more on literal similarity.[428]
Goal-Derived Categories
Another psychological principle for creating categories is to organize resources that go together in order to satisfy a goal. Consider the category “Things to take from a burning house,” an example that cognitive scientist Lawrence Barsalou termed an ad hoc or goal-derived category.[429]
Theory-Based Categories
A final psychological principle for creating categories is organizing things in ways that fit a theory or story that makes a particular categorization sensible. A theory-based category can win out even if probabilistic categorization, on the basis of family resemblance or similarity with respect to visible properties, would lead to a different category assignment. For example, a theory of phase change explains why liquid water, ice, and steam are all the same chemical compound even though they share few visible properties.
Theory-based categories based on origin or causation are especially important with highly inventive and computational resources because unlike natural kinds of physical resources, little or none of what they can do or how they behave is visible on the surface (see the section called “Affordance and Capability”). Consider all of the different appearances and form factors of the resources that we categorize as “computers” —their essence is that they all compute, an invisible or theory-like principle that does not depend on their visible properties.[430]
Category Design Issues and Implications
Category Abstraction and Granularity
We can identify any resource as a unique instance or as a member of a class of resources. The size of this class—the number of resources that are treated as equivalent—is determined by the properties or characteristics we consider when we examine the resources in some domain. The way we think of a resource domain depends on context and intent, so the same resource can be thought of abstractly in some situations and very concretely in others. As we discussed in Chapter 5, Resource Description and Metadata, this influences the nature and extent of resource description, and as we have seen in this chapter, it then influences the nature and extent of categories we can create.
This tendency to use specific categories instead of broader ones is a general principle that reflects how people organize their experience when they see similar, but not identical, examples or events. This “size principle” for concept learning, as cognitive scientist Josh Tenenbaum describes it, is a preference for the most specific rules or descriptions that fit the observations. For example, if you visit a zoo and see many different species of animals, your conception of what you saw is different than if you visited a kennel that only contained dogs. You might say “I saw animals at the zoo,” but would be more likely to say “I saw dogs at the kennel” because using the broad “animal” category to describe your kennel visit conveys less of what you learned from your observations there.[431]
In the section called “Single Properties” we described an organizing system for the shirts in our closet, so let us talk about socks instead. When it comes to socks, most people think that the basic unit is a pair because they always wear two socks at a time. If you are going to need to find socks in pairs, it seems sensible to organize them into pairs when you are putting them away. Some people might further separate their dress socks from athletic ones, and then sort these socks by color or material, creating a hierarchy of sock categories analogous to the shirt categories in our previous example.
For example, how should a business system deal with a customer’s address? Printed on an envelope, “an address” typically appears as a single multi-line block of text. Inside an information system, however, an address is best stored as a set of distinctly identifiable information components. This fine-grained organization makes it easier to sort customers by city or postal code for sales and marketing purposes. Incompatibilities in the abstraction and granularity of these information components, and in the ways they are presented and reused in documents, will cause interoperability problems when businesses need to share information.[432]
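The payoff of fine-grained components is easy to see in a sketch; with hypothetical customer records stored as distinct fields, sorting by postal code is trivial, whereas a single block of address text would first have to be parsed:

```python
# Hypothetical customer records with addresses stored as distinct components.
customers = [
    {"name": "Ada",   "city": "Berkeley", "postal_code": "94720"},
    {"name": "Grace", "city": "Albany",   "postal_code": "12207"},
    {"name": "Alan",  "city": "Berkeley", "postal_code": "94704"},
]

# Fine-grained components make sorting for sales and marketing a one-liner.
by_postal = sorted(customers, key=lambda c: c["postal_code"])
print([c["name"] for c in by_postal])  # ['Grace', 'Alan', 'Ada']
```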
The Universal Business Language (UBL) (mentioned briefly in the section called “Institutional Semantics”) is a library of information components designed to enable the creation of business document models that span a range of category abstraction. UBL comes equipped with XML schemas that define document categories like orders, invoices, payments, and receipts that many people are familiar with from their personal experiences of shopping and paying bills. However, UBL can also be used to design very specific or subordinate level transactional document types like “purchase order for industrial chemicals when buyer and seller are in different countries,” or document types at the other end of the abstraction hierarchy like “fill-in-the-blank” legal forms for any kind of contract.
Bowker and Star point out that there is often a pragmatic tradeoff between precision and validity when defining categories and assigning resources to them, particularly in scientific and other highly technical domains. More granular categories make more precise classification possible in principle, but highly specialized domains might contain instances that are so complex or hard to understand that it is difficult to decide where to organize them.[433]
As an example of this real-world messiness that resists precise classification, Bowker and Star turn to medicine and the World Health Organization’s International Classification of Diseases (ICD), a system of categories for cause-of-death reporting. The ICD requires that every death be assigned to one and only one category out of thousands of possible choices, which facilitates important uses such as statistical reporting for public health research.
In practice, however, doctors often lack conclusive evidence about the cause of a particular death, or they identify a number of contributing factors, none of which could properly be described as the sole cause. In these situations, less precise categories would better accommodate the ambiguity, and the aggregate data about causes of death would have greater validity. But doctors have to use the ICD’s precise categories when they sign a death certificate, which means they sometimes record the wrong cause of death just to get their work done.
It might seem counterintuitive, but when a system of human-generated categories is too complex for people to interpret and apply reliably, computational classifiers that compute statistical similarity between new and already classified items can outperform people.[434]
Basic or Natural Categories
Psychological research suggests that some levels of abstraction in a system of categories are more basic or natural than others. Anthropologists have also observed that folk taxonomies invariably classify natural phenomena into a five- or six-level hierarchy, with one of the levels being the psychologically basic or “real” name (such as “cat” or “dog”), as opposed to more abstract names (e.g. “mammal”) that are used less in everyday life. An implication for organizing system design is that basic level categories are highly efficient in terms of the cognitive effort they take to create and use. A corollary is that classifications with many levels of abstraction may be difficult for users to navigate effectively.[435]
The Recall / Precision Tradeoff
If we translate this example into the jargon of information retrieval, we say that more fine-grained organization reduces recall, the number of resources you find or retrieve in response to a query, but increases the precision of the recalled set, the proportion of recalled items that are relevant. Broader or coarse-grained categories increase recall, but lower precision. We are all too familiar with this hard bargain when we use a web search engine; a quick one-word query results in many pages of mostly irrelevant sites, whereas a carefully crafted multi-word query pinpoints sites with the information we seek. We will discuss recall, precision, and evaluation of information retrieval more extensively in Chapter 10, Interactions with Resources.
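Using the standard ratio definitions of the two measures, the tradeoff can be sketched with invented result sets:

```python
def precision_recall(retrieved: set, relevant: set):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    hits = retrieved & relevant
    return len(hits) / len(retrieved), len(hits) / len(relevant)

relevant = {"d1", "d2", "d3", "d4"}                       # all truly relevant documents
broad = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}  # coarse, one-word query
narrow = {"d1", "d2"}                                     # precise, multi-word query

print(precision_recall(broad, relevant))   # (0.5, 1.0): full recall, low precision
print(precision_recall(narrow, relevant))  # (1.0, 0.5): full precision, low recall
```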
Category Audience and Purpose
What we mean by “English” and “Chinese” as categories can change depending on the audience we are addressing and what our purpose is, however.[436] A language learning school’s representation of “English” might depend on practical concerns such as how the school’s students are likely to use the language they learn, or which teachers are available. For the purposes of a school teaching global languages, one of the standard varieties of English (i.e., those associated with political power), or an amalgamation of several standard varieties, might be thought of as a single instance (“English”) of the category “Languages.”
Similarly, the category structure in which “Chinese” is situated can vary with context. Some schools might conceptualize “Chinese” not as a category encompassing multiple linguistic varieties but as a single instance within the “Languages” category, while another school might teach its students Mandarin, Wu, and Cantonese as dialects within the language category “Chinese” that are unified by a single standard writing system. In addition, a linguist might consider Mandarin, Wu, and Cantonese to be mutually unintelligible, making them separate languages within the broader category “Chinese” for the purpose of creating a principled language classification system.
Implementing Categories
This separation between category creation and category implementation prompts us to ask how a system of categories can be implemented. We will not discuss the implementation of categories in the literal sense of building physical or software systems that organize resources. Instead, we will take a higher-level perspective that analyzes the implementation problem to be solved for the different types of categories discussed in the section called “Principles for Creating Categories”, and then explain the logic followed to assign resources correctly to them.
Implementing Enumerated Categories
Categories defined by enumeration are easy to implement. The members or legal values in a set define the category, and testing an item for membership means looking in the set for it. Enumerated category definitions are familiar in drop-down menus and form-filling. You scroll through a list of all the countries in the world to search for the one you want in a shipping address, and whatever you select will be a valid country name, because the list is fixed until a new country is born. Enumerated categories can also be implemented with associative arrays (also known as hash tables or dictionaries). With these data structures, a test for set membership is even more efficient than searching, because it takes the same time for sets of any size (see the section called “Kinds of Structures”).
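A minimal sketch of an enumerated category implemented as a hash-based set (a Python dict behaves the same way for membership tests); the country list is truncated for illustration:

```python
# Enumerated category: the set of legal values *is* the category definition.
COUNTRIES = {"Brazil", "Canada", "Japan", "Kenya", "Norway"}  # truncated for illustration

# Membership is a single hash lookup that takes the same time
# whether the set holds five values or five thousand.
print("Canada" in COUNTRIES)    # True
print("Atlantis" in COUNTRIES)  # False: not in the enumeration, so not a member
```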
Implementing Categories Defined by Properties
The most conceptually simple and straightforward implementation of categories defined by properties adopts the classical view of categories based on necessary and sufficient features. Because such categories are prescriptive with explicit and clear boundaries, classifying items into the categories is objective and deterministic, and supports a well-defined notion of validation to determine unambiguously whether some instance is a member of the category. Items are classified by testing them to determine if they have the required properties and property values. Tests can be expressed as rules:
If instance X has property P, then X is in category Y.
This doesn’t mean the property test is always easy; validation might require special equipment or calculations, and tests for the property might differ in their cost or efficiency. But given the test results, the answer is unambiguous. The item is either a member of the category or it isn’t.[437]
A system of hierarchical categories is defined by a sequence of property tests in a particular order. The most natural way to implement multi-level category systems is with decision trees. A simple decision tree is an algorithm for determining a decision by making a sequence of logical or property tests. Suppose a bank used a sequential rule-based approach to decide whether to give someone a mortgage loan.
Otherwise, deny the loan application.
This simple decision tree is depicted in Figure 7.1, “Rule-based Decision Tree”. The rules used by the bank to classify loan applications as “Approved” or “Denied” have a clear representation in the tree. The easy interpretation of decision trees makes them a common formalism for implementing classification models.
Figure 7.1. Rule-based Decision Tree
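A hypothetical version of such a decision tree might look like the sketch below; the property tests and thresholds are invented for illustration and are not taken from Figure 7.1:

```python
def mortgage_decision(applicant: dict) -> str:
    """Hypothetical sequential property tests; each branch is a rule in the tree."""
    if applicant["annual_income"] < 30_000:
        return "Denied"
    if applicant["credit_score"] < 620:
        return "Denied"
    if applicant["down_payment_pct"] >= 20:
        return "Approved"
    # Smaller down payments are approved only with a strong credit score.
    return "Approved" if applicant["credit_score"] >= 700 else "Denied"

print(mortgage_decision({"annual_income": 85_000, "credit_score": 710,
                         "down_payment_pct": 10}))  # Approved
```

Because the tests are applied in a fixed order, every application lands in exactly one of the two categories, which is what makes such trees easy to interpret and audit.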
Artificial languages are a more ambitious way to enable precise specification of property-based categories. An artificial language expresses ideas concisely by introducing new terms or symbols that represent complex ideas along with syntactic mechanisms for combining and operating on them. Mathematical notation, programming languages, schema languages that define valid document instances (see the section called “Specifying Vocabularies and Schemas”), and regular expressions that define search and selection patterns (see the section called “Controlling Values”) are familiar examples of artificial languages. It is certainly easier to explain and understand the Pythagorean Theorem when it is efficiently expressed as “H² = A² + B²” than with a more verbose natural language expression: “In all triangles with an angle such that the sides forming the angle are perpendicular, the product of the length of the side opposite the angle such that the sides forming the angle are perpendicular with itself is equal to the sum of the products of the lengths of the other two sides, each with itself.”[438]
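Regular expressions illustrate the point well: the pattern below, a common but simplified sketch, defines the category “US ZIP code” far more concisely than a prose description could:

```python
import re

# The category "US ZIP code" expressed in the artificial language of regexes:
# five digits, optionally followed by a hyphen and four more digits (ZIP+4).
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

print(bool(ZIP_PATTERN.match("94720")))       # True
print(bool(ZIP_PATTERN.match("94720-1234")))  # True
print(bool(ZIP_PATTERN.match("9472")))        # False: fails the category test
```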
Artificial languages for defining categories have a long history in philosophy and science. (See the sidebar, Artificial Languages for Description and Classification.) However, the vast majority of institutional category systems are still specified with natural language, despite its ambiguities, because people usually understand the languages they learned naturally better than artificial ones. Sometimes this is even intentional, to allow institutional categories embodied in laws to evolve in the courts and to accommodate technological advances.[439]
Artificial Languages for Description and Classification
John Wilkins was one of the founders of the British Royal Society. In 1668 he published An Essay towards a Real Character and a Philosophical Language in which he proposed an artificial language for describing a universal taxonomy of knowledge that used symbol composition to specify a location in the category hierarchy. There were forty top level genus categories, which were further subdivided into differences within the genus, which were then subdivided into species. Each genus was a monosyllable of two letters; each difference added a consonant, and each species added a vowel.
This artificial language conveys the meaning of categories directly from the composition of the category name. For instance, zi indicates the genus of beasts, zit would be “rapacious beasts of the dog kind” whereas zid would be “cloven-footed beast.” Adding for the fourth character an a for species, indicating the second species in the difference, would give zita for dog and zida for sheep.
The writer Jorge Luis Borges satirized such universal classification schemes in his essay on Wilkins, which describes a fictitious Chinese encyclopedia, the Celestial Emporium of Benevolent Knowledge. In its remote pages it is written that the animals are divided into: (a) belonging to the emperor, (b) embalmed, (c) tame, (d) sucking pigs, (e) sirens, (f) fabulous, (g) stray dogs, (h) included in the present classification, (i) frenzied, (j) innumerable, (k) drawn with a very fine camel hair brush, (l) et cetera, (m) having just broken the water pitcher, (n) that from a long way off look like flies.
Borges compliments Wilkins for inventing names that might signify in themselves some meaning to those who know the system, but notes that “it is clear that there is no classification of the Universe not being arbitrary and full of conjectures.”[440]
Data schemas that specify data entities, elements, identifiers, attributes, and relationships in databases and XML document types on the transactional end of the Document Type Spectrum (the section called “Resource Domain”) are implementations of the categories needed for the design, development, and maintenance of information organization systems. Data schemas tend to rigidly define categories of resources.[441]
In object-oriented programming languages, classes are schemas that serve as templates for the creation of objects. A class in a programming language is analogous to a database schema that specifies the structure of its member instances, in that the class definition specifies how instances of the class are constructed in terms of data types and possible values. Programming classes may also specify whether data in a member object can be accessed, and if so, how.[442]
Unlike transactional document types, which can be prescriptively defined as classical categories because they are often produced and consumed by automated processes, narrative document types are usually descriptive in character. We do not classify something as a novel because it has some specific set of properties and content types. Instead, we have a notion of typical novels and their characteristic properties, and some things that are considered novels are far from typical in their structure and content.[443]
Nevertheless, categories like narrative document types can sometimes be implemented using document schemas that impose only a few constraints on structure and content. A schema for a purchase order is highly prescriptive; it uses regular expressions, strongly data typed content, and enumerated code lists to validate the value of required elements that must occur in a particular order. In contrast, a schema for a narrative document type would have much optionality, be flexible about order, and expect only text in its sections, paragraphs and headings. Even very lax document schemas can be useful in making content management, reuse, and formatting more efficient.
Implementing Categories Defined by Probability and Similarity
Many categories cannot be defined in terms of required properties, and instead must be defined probabilistically, where category membership is determined by properties that resources are likely to share. Consider the category “friend.” You probably consider many people to be your friends, but you have longtime friends, school friends, workplace friends, friends you see only at the gym, and friends of your parents. Each of these types of friends represents a different cluster of common properties. If someone is described to you as a potential friend or date, how accurately can you predict that the person will become a friend? (See the sidebar, Finding Friends and Dates: Lessons for Learning Categories)
Finding Friends and Dates: Lessons for Learning Categories
More generally, if you are organizing a domain where the resources are active, change their state, or are measurements of properties that vary and co-occur probabilistically, the sample you choose strongly affects the accuracy of models for classification or prediction. In The Signal and the Noise, statistician Nate Silver explains how many notable predictions failed because of poor sampling techniques. One common sampling mistake is to use too short a historical window when assembling the training dataset; this is often a corollary of a second mistake, overreliance on recent data because it is more available. For example, the collapse of housing prices and the resulting financial crisis of 2008 can be explained in part because the models that lenders used to predict mortgage foreclosures were based on data from 1980-2005, a period when house prices generally rose. As a result, when mortgage foreclosures increased rapidly, the results were “out of sample” and were initially misinterpreted, delaying responses to the crisis.
Models built from samples of dynamic and probabilistic domains inherit this variability. Unfortunately, because many forecasters want to seem authoritative, and many people do not understand probability, classifications or predictions that are inherently imprecise and have a range of outcomes are often presented with unwarranted certainty and exactness. Silver tells the story of a disastrous 1997 flood, caused when the Red River crested at 54 feet while the levees protecting the town of Grand Forks were built to withstand only 51 feet. The weather service had predicted a crest between 40 and 58 feet but emphasized the midpoint of the range, 49 feet. Unfortunately, most people interpreted this probabilistic prediction as if it were a binary classification, “flood” versus “no flood,” ignored the range of the forecast, and failed to prepare for a flood that had about a 35% chance of occurring.[444]
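To see how a forecast range translates into flood risk, here is a minimal sketch in Python (not from Silver's book). It assumes the forecast error is normally distributed and treats the 40-58 foot range as roughly a 90% confidence interval around the 49-foot midpoint; both are illustrative assumptions.

```python
import math

def prob_exceeds(threshold, mean, sigma):
    """P(X > threshold) when X is normally distributed."""
    z = (threshold - mean) / sigma
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

# Treat the 40-58 foot forecast range as roughly a 90% interval around
# the 49-foot midpoint, so sigma is about 9 / 1.645.
sigma = 9 / 1.645
p_flood = prob_exceeds(51, 49, sigma)  # levees protected up to 51 feet
```

Under these assumptions the flood probability comes out to roughly a third, in line with the approximately 35% chance Silver describes; the exact figure depends on the distribution assumed.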
Probabilistic Decision Trees
In the section called “Implementing Categories Defined by Properties”, we showed how a rule-based decision tree could be used to implement a strict property-based classification in which a bank uses tests for the properties of “annual income” and “monthly loan payment” to classify applicants as approved or denied. We can adapt that example to illustrate probabilistic decision trees, which are better suited for implementing categories in which category membership is probabilistic rather than absolute.
A data-driven bank relies upon historical data about loan repayment and defaults to train algorithms that create decision trees by repeatedly splitting the applicants into subsets that are most different in their predictions. Subsets of applicants with a high probability of repayment would be approved, and those with a high probability of default would be denied a loan. One method for selecting the property test for making each split is calculating the “information gain” (see the sidebar Using “Information Theory” to Quantify Organization). This measure captures the degree to which each subset contains a “pure” group in which every applicant is classified the same, as likely repayers or likely defaulters.
For example, consider the chart in Figure 7.2, “Historical Data: Loan Repayment Based on Interest Rate”, which is a simplified representation of the bank’s historical data on loan defaults based on the initial interest rate. The chart represents loans that were repaid with “o” and those that defaulted with “x.” Is there an interest rate that divides them into “pure” sets, one that contains only “o” loans and the other that contains only “x” loans?
Figure 7.2. Historical Data: Loan Repayment Based on Interest Rate
You can see that no interest rate divides these into pure sets. So the best that can be done is to find the interest rate that divides them so that the proportions of defaulters are most different on each side of the line.[445]
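The “information gain” calculation behind this kind of split can be sketched in a few lines of Python. The interest rates and outcomes below are made-up data for illustration; for each candidate threshold, the code measures how much splitting the loans there reduces the entropy of the repaid/defaulted mix.

```python
import math

def entropy(labels):
    """Shannon entropy of a mix of outcomes ('o' repaid, 'x' defaulted)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(rates, labels, threshold):
    """How much splitting the loans at this interest rate purifies the subsets."""
    left = [l for r, l in zip(rates, labels) if r <= threshold]
    right = [l for r, l in zip(rates, labels) if r > threshold]
    split = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - split

# Made-up historical data: each loan's interest rate and its outcome.
rates = [3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5]
labels = ["o", "o", "o", "x", "o", "o", "x", "x", "o", "x"]

# No threshold produces pure subsets, so choose the one with the highest gain.
best = max(rates[:-1], key=lambda t: information_gain(rates, labels, t))
```

For this toy data the best split is at 4.0, which isolates a pure subset of repaid loans; a real decision-tree learner repeats this calculation for every attribute at every branch.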
Figure 7.3. Probabilistic Decision Tree
This calculation is carried out for each of the attributes in the historical data set to identify the one that best divides the applicants into the repaid and defaulted categories. The attributes and the value that defines the decision rule can then be ordered to create a decision tree similar to the rule-based one we saw in the section called “Implementing Categories Defined by Properties”. In our hypothetical case, it turns out that the best order in which to test the properties is Income, Monthly Payment, and Interest Rate, as shown in Figure 7.3, “Probabilistic Decision Tree”. The end result is still a set of rules, but behind each decision in the tree are probabilities based on historical data that can more accurately predict whether an applicant will repay or default. Thus, instead of the arbitrary cutoffs at $100,000 in income and 25% for monthly payment, the bank can offer loans to people with lower incomes and remain profitable doing so, because it knows from historical data that $82,000 and 27% are the optimal decision points. Using the interest rate in their decision process is an additional test to ensure that people can afford to make loan payments even if interest rates go up.[446]
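The end result, a set of rules, can be expressed as a simple function. The income and monthly payment thresholds come from the hypothetical example above; the 6% interest-rate cutoff is an assumed value for illustration, since the text names the test but not its threshold.

```python
def classify_applicant(income, payment_percent, interest_rate):
    """Loan decision using the data-derived cutoffs from the hypothetical
    example: $82,000 income and a 27% monthly payment ratio. The 6%
    interest-rate cutoff is an assumed value, not given in the text."""
    if income < 82_000:
        return "deny"
    if payment_percent > 27:
        return "deny"
    if interest_rate > 6.0:  # assumed threshold for illustration
        return "deny"
    return "approve"
```

Behind each of these cutoffs, in the probabilistic tree, are repayment probabilities estimated from the historical data rather than arbitrary round numbers.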
Naïve Bayes Classifiers
Contradictory evidence should reduce confidence in your belief.
P(A | B) = P(B | A) × P(A) / P(B)
The left-hand side of the equation, P(A | B), is what we want to estimate but cannot measure directly: the probability that A is the correct classification for an item or observation that has property B. This is called the conditional or posterior probability because it is estimated after seeing the evidence of property B.
Using Bayes’ Theorem to Calculate Conditional Probability
P(nonfiction | ebook) = P(ebook | nonfiction) × P(nonfiction) / P(ebook)
We know: P(ebook | nonfiction) = .5, P(nonfiction) = .4, and P(ebook) = .8
Therefore: P(nonfiction | ebook) = (.5 × .4) / .8 = .25
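The calculation can be packaged as a one-line function; the probabilities are those given in the example above.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Posterior probability P(A|B) via Bayes' theorem."""
    return p_b_given_a * p_a / p_b

# The ebook example: P(nonfiction | ebook) = (.5 x .4) / .8
posterior = bayes(p_b_given_a=0.5, p_a=0.4, p_b=0.8)  # → 0.25
```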
Select Properties. We start with a set of properties, some from the message metadata like the sender’s email address or the number of recipients, and some from the message content. Every word that appears in messages can be treated as a separate property.[447]
Improve. The classifier can improve its accuracy if the user gives it feedback by reclassifying SPAM messages as HAM ones or vice versa. The most efficient learning occurs when an algorithm uses “active learning” techniques to choose its own training data by soliciting user feedback only where it is uncertain about how to classify a message. For example, the algorithm might be confident that a message with “Cheap drugs” in the subject line is SPAM, but if the message comes from a longtime correspondent, the algorithm might ask the user to confirm the classification.[448]
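A minimal sketch of such a classifier in Python, using made-up training messages: it combines the base rate of each class with the conditional probability of each word given the class. The add-one (Laplace) smoothing is a standard refinement, not discussed in the text, that avoids zero probabilities for unseen words.

```python
import math
from collections import Counter

class NaiveBayes:
    """Word-based SPAM/HAM classifier with add-one smoothing (a sketch)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, words, label):
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1

    def classify(self, words):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.doc_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            # Log prior: the base rate of the class in the training data.
            score = math.log(self.doc_counts[label] / total)
            n = sum(self.word_counts[label].values())
            for w in words:
                # Conditional probability of each word given the class.
                score += math.log((self.word_counts[label][w] + 1) / (n + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

nb = NaiveBayes()
nb.train(["cheap", "drugs", "buy", "now"], "spam")
nb.train(["cheap", "flights", "buy"], "spam")
nb.train(["meeting", "agenda", "notes"], "ham")
nb.train(["project", "notes", "draft"], "ham")
verdict = nb.classify(["cheap", "drugs"])  # classified as spam
```

Reclassifying a message and calling train again is exactly the feedback loop described above: the counts, and therefore the conditional probabilities, are revised by each new labeled example.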
Categories Created by Clustering
Clustering techniques share the goal of creating meaningful categories from a collection of items whose properties are hard to directly perceive and evaluate, which implies that category membership cannot easily be reduced to specific property tests and instead must be based on similarity. For example, with large sets of documents or behavioral data, clustering techniques can find categories of documents with the same topics, genre, or sentiment, or categories of people with similar habits and preferences.
There are many different distance-based clustering techniques, but they share three basic methods.
The first shared method is that clustering techniques start with an initially uncategorized set of items or documents that are represented in ways that enable measures of inter-item similarity to be calculated. This representation is most often a vector of property values or the probabilities of different properties, so that items can be represented in a multidimensional space and similarity calculated using a distance function like those described in the section called “Geometric Models of Similarity”.[449]
The third shared method is refining the system of categories by iterative similarity recalculation each time an item is added to a category. Approaches that start with every item in its own category create a hierarchical system of categories by merging the two most similar categories, recomputing the similarity between the new category and the remaining ones, and repeating this process until all the categories are merged into a single category at the root of a category tree. Techniques that start with a fixed number of categories do not create new ones but instead repeatedly recalculate the “centroid” of the category by adjusting its property representation to the average of all its members after a new member is added.[450]
The centroid is a prototypical or average item calculated from the properties of all the category members. However, the centroid might not correspond to any actual member (see the sidebar Median versus Average), and this can make it hard to interpret the classification.
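The merge-and-recompute loop for hierarchical clustering can be sketched as follows. The two-property items are hypothetical, and similarity is measured as Euclidean distance between cluster centroids; the centroid function makes concrete why a centroid may not match any actual member.

```python
import math

def centroid(cluster):
    """Average member: a prototype that may not match any actual item."""
    dims = range(len(cluster[0]))
    return tuple(sum(item[d] for item in cluster) / len(cluster) for d in dims)

def agglomerate(items, k):
    """Repeatedly merge the two closest clusters until only k remain."""
    clusters = [[item] for item in items]  # every item starts in its own cluster
    while len(clusters) > k:
        # Find the pair of clusters whose centroids are closest together.
        pairs = [(math.dist(centroid(a), centroid(b)), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        _, i, j = min(pairs)
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Hypothetical items with two measured properties each.
items = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8), (9.0, 1.0)]
groups = agglomerate(items, 3)
```

Letting the loop run until one cluster remains produces the full category tree described above; stopping at a fixed k resembles the fixed-number-of-categories approach.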
Neural Networks
Among the best performing classifiers for categorizing by similarity and probabilistic membership are those implemented using neural networks, and especially those employing deep learning techniques. Deep learning algorithms can learn categories from labeled training data or by using autoencoding, an unsupervised learning technique that trains a neural network to reconstruct its input data. However, instead of using the properties that are defined in the data, deep learning algorithms devise a very large number of features in hidden hierarchical layers, which makes them uninterpretable by people. The key idea that made deep learning possible is the use of “backpropagation” to adjust the weights on features by working backwards from the output (the classification the network produces) to the input. The use of deep learning to classify images was mentioned in the section called “Describing Images”.[451]
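A minimal illustration of backpropagation, assuming a tiny two-input network with one hidden layer trained on the logical AND function (chosen because it converges reliably in a short demo). Real deep learning networks have many more layers and learned features, but the idea of adjusting weights by propagating error backwards from the output is the same.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Tiny 2-2-1 network: two inputs, two hidden units, one output.
w_h = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden weights + bias
w_o = [random.uniform(-1, 1) for _ in range(3)]                      # output weights + bias
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]          # logical AND

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_h]
    y = sigmoid(w_o[0] * h[0] + w_o[1] * h[1] + w_o[2])
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

initial_loss = loss()
lr = 0.5
for _ in range(5000):
    for x, t in data:
        h, y = forward(x)
        # Error at the output, propagated backwards to adjust every weight.
        d_y = (y - t) * y * (1 - y)
        d_h = [d_y * w_o[i] * h[i] * (1 - h[i]) for i in range(2)]
        w_o = [w_o[0] - lr * d_y * h[0], w_o[1] - lr * d_y * h[1], w_o[2] - lr * d_y]
        for i in range(2):
            w_h[i] = [w_h[i][0] - lr * d_h[i] * x[0],
                      w_h[i][1] - lr * d_h[i] * x[1],
                      w_h[i][2] - lr * d_h[i]]
```

Note that the hidden units' weights, the network's internal "features," are never specified by the programmer; they emerge from the training process, which is why such features are hard for people to interpret.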
Implementing Goal-Based Categories
Implementing Theory-Based Categories
Analogy: The hydrogen atom is like our solar system
Literal Similarity: The X12 star system in the Andromeda galaxy is like our solar system
Structure Mapping theory was implemented in the Structure-Mapping Engine (SME), which both formalized the theory and offered a computationally-tractable algorithm for carrying out the process of mapping structures and drawing inferences.[452]
Key Points in Chapter Seven
7.6.1. What are categories?
7.6.2. What determines the size of the equivalence class?
7.6.3. Why do we contrast cultural, individual, and institutional categorization?
7.6.4. What distinguishes individual categories?
7.6.5. What distinguishes institutional categories?
7.6.6. What is the relation between categories and classification?
7.6.7. When is it necessary to create categories by computational methods rather than by people?
7.6.8. What is the difference between supervised and unsupervised learning?
7.6.9. Why does it matter if every resource in a collection has a sortable identifier?
7.6.10. What is the concern when only a single property is used to assign category membership?
7.6.11. What is a hierarchical category system?
7.6.12. What can one say about any member of a classical category in terms of how it represents the category?
7.6.13. What is aboutness?
7.6.14. When is it necessary to adopt a probabilistic or statistical view of properties in defining categories?
7.6.15. What is family resemblance?
7.6.16. What is similarity?
7.6.17. What are the four psychologically-motivated approaches that propose different functions for computing similarity?
7.6.18. What are so-called “classical categories”?
7.6.19. How does the breadth of a category affect the recall/precision tradeoff?
7.6.20. What is a decision tree?
7.6.21. What is the practical benefit of defining categories according to necessary and sufficient features?
7.6.22. How do artificial languages like mathematical notation and programming languages enable precise specification of categories?
7.6.23. How do Naïve Bayes classifiers learn?
7.6.24. How do clustering techniques create categories?
What are categories?
Categories are equivalence classes: sets or groups of things or abstract entities that we treat the same.

What determines the size of the equivalence class?
The size of the equivalence class is determined by the properties or characteristics we consider.

Why do we contrast cultural, individual, and institutional categorization?
Cultural, individual, and institutional categorization share some core ideas but they emphasize different processes and purposes for creating categories.

What distinguishes individual categories?
Individual categories are created by intentional activity that usually takes place in response to a specific situation.

What distinguishes institutional categories?
Institutional categories are most often created in abstract and information-intensive domains where unambiguous and precise categories are needed.

What is the relation between categories and classification?
The rigorous definition of institutional categories enables classification, the systematic assignment of resources to categories in an organizing system.

When is it necessary to create categories by computational methods rather than by people?
Computational categories are created by computer programs when the number of resources, or the number of descriptions or observations associated with each resource, is so large that people cannot think about them effectively.

What is the difference between supervised and unsupervised learning?
In supervised learning, a machine learning program is trained by giving it sample items or documents that are labeled by category. In unsupervised learning, the program gets the samples but must come up with the categories on its own.

Why does it matter if every resource in a collection has a sortable identifier?
Any collection of resources with sortable identifiers (alphabetic or numeric) as an associated property can benefit from using sorting order as an organizing principle.

What is the concern when only a single property is used to assign category membership?
If only a single property is used to distinguish among some set of resources and to create the categories in an organizing system, the choice of property is critical because different properties often lead to different categories.

What is a hierarchical category system?
A sequence of organizing decisions based on a fixed ordering of resource properties creates a hierarchy, a multi-level category system. (See the section called “Multi-Level or Hierarchical Categories”)

What can one say about any member of a classical category in terms of how it represents the category?
An important implication of necessary and sufficient category definition is that every member of the category is an equally good member or example of the category. (See the section called “Necessary and Sufficient Properties”)

What is aboutness?
For most purposes, the most useful property of information resources for categorizing them is their aboutness, which is not directly perceivable and which is hard to characterize. (See the section called “The Limits of Property-Based Categorization”)

When is it necessary to adopt a probabilistic or statistical view of properties in defining categories?
In domains where properties lack one or more of the characteristics of separability, perceptibility, and necessity, a probabilistic or statistical view of properties is needed to define categories. (See the section called “Probabilistic Categories and “Family Resemblance””)

What is family resemblance?
Sharing some but not all properties is akin to family resemblances among the category members. (See the section called “Probabilistic Categories and “Family Resemblance””)

What is similarity?
Similarity is a measure of the resemblance between two things that share some characteristics but are not identical.

What are the four psychologically-motivated approaches that propose different functions for computing similarity?
Feature- or property-based, geometry-based, transformational, and alignment- or analogy-based approaches are psychologically-motivated approaches that propose different functions for computing similarity.

What are so-called “classical categories”?
Classical categories can be defined precisely with just a few necessary and sufficient properties.

How does the breadth of a category affect the recall/precision tradeoff?
Broader or coarse-grained categories increase recall but lower precision.

What is a decision tree?
A decision tree implements categories as an ordered sequence of property tests in which the outcome of each test determines the next one, ending in a category assignment. (See the section called “Implementing Categories Defined by Properties”)

What is the practical benefit of defining categories according to necessary and sufficient features?
The most conceptually simple and straightforward implementation of categories in technologies for organizing systems adopts the classical view of categories based on necessary and sufficient features. (See the section called “Implementing Categories Defined by Properties”)

How do artificial languages like mathematical notation and programming languages enable precise specification of categories?
An artificial language expresses ideas concisely by introducing new terms or symbols that represent complex ideas along with syntactic mechanisms for combining and operating on them. (See the section called “Implementing Categories Defined by Properties”)

How do Naïve Bayes classifiers learn?
Naïve Bayes classifiers learn by revising the conditional probability of each property for making the correct classification after seeing the base rates of the class and property in the training data and how likely it is that a member of the class has the property.

How do clustering techniques create categories?
Because clustering techniques are unsupervised, they create categories based on calculations of similarity between resources, maximizing the similarity of resources within a category and maximizing the differences between categories.
[386] Cataloging and programming are important activities that need to be done well, and prescriptive advice is often essential. However, we believe that understanding how people create psychological and linguistic categories can help us appreciate that cataloging and information systems design are messier and more intellectually challenging activities than we might otherwise think.
[387] Cognitive science mostly focuses on the automatic and unconscious mechanisms for creating and using categories. This disciplinary perspective emphasizes the activation of category knowledge for the purpose of making inferences and “going beyond the information given,” to use Bruner’s classic phrase [(Bruner 1957)]. In contrast, the discipline of organizing focuses on the explicit and self-aware mechanisms for creating and using categories because by definition, organizing systems serve intentional and often highly explicit purposes. Organizing systems facilitate inferences about the resources they contain, but the more constrained purposes for which resources are described and arranged makes inference a secondary goal.
[388] However, even the way this debate has been framed is a bit controversial. Bulmer’s chicken, the “categories are in the world” position, has been described as empirical, environment-driven, bottom-up, or objectivist, and these are not synonymous. Likewise, the “egghead” position that “categories are in the mind” has been called rational, constructive, top-down, experiential, and embodied—and they are also not synonyms. See [(Bulmer 1970)]. See also [(Lakoff 1990)], [(Malt 1995)].
[389] Is there a “universal grammar” or a “language faculty” that imposes strong constraints on human language and cognition? [(Chomsky 1965)] and [(Jackendoff 1996)] think so. Such proposals imply cognitive representations in which categories are explicit structures in memory with associated instances and properties. In contrast, generalized learning theories model category formation as the adjustment of the patterns and weighting of connections in neural processing networks that are not specialized for language in any way. Computational simulations of semantic networks can reproduce the experimental and behavioral results about language acquisition and semantic judgments that have been used as evidence for explicit category representations without needing anything like them. [(Rogers and McClelland 2008)] thoroughly review the explicit category models and then show how relatively simple learning models can do without them.
[390] The debates about human category formation also extend to issues of how children learn categories and categorization methods. Most psychologists argue that category learning starts with general learning mechanisms that are very perceptually based, but they do not agree whether to characterize these changes as “stages” or as phases in a more complex dynamical system. Over time more specific learning techniques evolve that focus on correlations among perceptual properties (things with wings tend to have feathers), correlations among properties and roles (things with eyes tend to eat), and ultimately correlations among roles (things that eat tend to sleep). See [(Smith and Thelen 2003)].
[391] These three contexts were proposed by [(Glushko, Maglio, Matlock, and Barsalou 2008)], who pointed out that cognitive science has focused on cultural categorization and largely ignored individual and institutional contexts. They argue that taking a broader view of categorization highlights dimensions on which it varies that are not apparent when only cultural categories are considered. For example, institutional categories are usually designed and maintained using prescriptive methods that have no analogues with cultural categories. There is a difference between institutional categories created for people, and categories created in institutions by computers in the predictive analytics, data mining sense.
[392] This quote comes from Plato’s Phaedrus dialogue, written around 370 BCE. Contemporary philosophers and cognitive scientists commonly invoke it in discussions about whether “natural kinds” exist; for example, see [(Campbell, O’Rourke, and Slater 2011)]. [(Hutchins 2010)], [(Atran 1987)], and others have argued that the existence of perceptual discontinuities is not sufficient to account for category formation. Instead, people assume that members of a biological category must have an essence of co-occurring properties, and these guide people to focus on the salient differences, thereby creating categories. Property clusters enable inferences about causality, which then builds a framework on which additional categories can be created and refined. For example, if “having wings” and “flying” are co-occurring properties that suggest a “bird” category, wings are then inferred as the causal basis of flying, and wings become more salient.
[393] Pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, particles, and numerals and other “parts of speech” are also grammatical categories, but nouns carry most of the semantic weight.
[394] In contrast, the set of possible interactions with even a simple object like a banana is very large. We can pick, peel, slice, smash, eat, or throw a banana, so instead of capturing this complexity in the meaning of banana it gets parceled into the verbs that can act on the banana noun. Doing so requires languages to use verbs to capture a broader and more abstract type of meaning that is determined by the nouns with which they are combined. Familiar verbs like “set,” “put,” and “get” have dozens of different senses as a result because they go with so many different nouns. We set fires and we set tables, but fires and tables have little in common. The intangible character of verbs and the complexity of multiple meanings make it easier to focus instead on their associated nouns, which are often physical resources, and create organizing systems that emphasize the latter rather than the former. We create organizing systems that focus on verbs when we are categorizing actions, behaviors, or services where the resources that are involved are less visible or less directly involved in the supported interactions.
[395] Many languages have a system of grammatical gender in which all nouns must be identified as masculine or feminine using definite articles (el and la in Spanish, le and la in French, and so on) and corresponding pronouns. Languages also contrast in how they describe time, spatial relationships, and in which things are treated as countable objects (one ox, two oxen) as opposed to substances or mass nouns that do not have distinct singular and plural forms (like water or dirt). [(Deutscher 2011)] carefully reviews and discredits the strong Whorfian view and makes the case for a more nuanced perspective on linguistic relativity. He also reviews much of Lera Boroditsky’s important work in this area. George Lakoff’s book with the title Women, Fire, and Dangerous Things [(Lakoff 1990)] provocatively points out differences in gender rules among languages; in an aboriginal language called Dyirbal, many dangerous things, including fire, have feminine gender, while “fire” is masculine in Spanish (el fuego) and French (le feu).
[396] This analysis comes from [(Haviland 1998)]. More recently, Lera Boroditsky has done many interesting studies and experiments about linguistic relativity. See [(Boroditsky 2003)] for an academic summary and [(Boroditsky 2010, 2011)] for more popular treatments.
[398] This was ultimately reflected in complex mythological systems, such as Greek mythology, where genealogical relationships between gods represented category relationships among the phenomena with which they were associated. As human knowledge grew and the taxonomies became more comprehensive and complex, Durkheim and Mauss argued, they lay the groundwork for scientific classifications and shed their mythological roots. [(Durkheim 1963)].
[400] The personal archives of people who turn out to be famous or important are the exception that proves this rule. In that case, the individual’s organizing system and its categories are preserved along with their contents.
[401] The typical syntactic constraint that tags are delimited by white space encourages the creation of new categories by combining existing category names using concatenation and camel case conventions; photos that could be categorized as “Berkeley” and “Student” are sometimes tagged as “BerkeleyStudent.” Similar generative processes for creating individual category names are used with Twitter “hashtags” where tweets about events are often categorized with an ad hoc tag that combines an event name and a year identifier like “#NBAFinals16.”
[402] Consider how the cultural category of “killing a person” is refined by the legal system to distinguish manslaughter and different degrees of murder based on the amount of intentionality and planning involved (e.g., first and second degree murder) and the roles of people involved with the killing (accessory). In general, the purpose of laws is to replace coarse judgments of categorization based on overall similarity of facts with rule-based categorization based on specific dimensions or properties.
[403] The word was invented in 1812 in a newspaper article critical of Massachusetts governor Elbridge Gerry, who oversaw the creation of biased electoral districts. One such district was so contorted in shape, it was said to look like a salamander, and thus was called a Gerrymander. The practice remains widespread, but nowadays sophisticated computer programs can select voters on any number of characteristics and create boundaries that either “pack” them into a single district to concentrate their voting power or “crack” them into multiple districts to dilute it.
[404] The particularities or idiosyncrasies of individual categorization systems sometimes capture user expertise and knowledge that is not represented in the institutional categories that replace them. Many of the readers of this book are information professionals whose technological competence is central to their work and which helps them to be creative. But for a great many other people, information technology has enabled the routinization of work in offices, assembly lines, and in other jobs where new institutionalized job categories have “downskilled” or “deskilled” the nature of work, destroying competence and engendering a great deal of resistance from the affected workers.
[405] Similar technical concerns arise in within-company and multi-company standardization efforts, but the competitive and potentially anti-competitive character of the latter imposes greater complexity by introducing considerations of business strategy and politics. Credible standards-making in multi-company contexts depends on an explicit and transparent process for gathering and prioritizing requirements, negotiating specifications that satisfy them, and ensuring conformant implementations—without at any point giving any participating firm an advantage. See the OASIS Technical Committee Process for an example (https://www.oasis-open.org/policies-guidelines/tc-process) and [(Rosenthal et al. 2004)] for an analysis of best practices.
[406] Unfortunately, in this transition from science to popular culture, many of these so-called periodic tables are just ad hoc collections that ignore the essential idea that the rows and columns capture explanatory principles about resource properties that vary in a periodic manner. A notable exception is Andrew Plotkin’s Periodic Table of Dessert. See [(Suehle 2012)] and Plotkin’s table at (Periodic Table of Dessert).
[407] The Corporate Average Fuel Economy (CAFE) standards have been developed by the United States National Highway Traffic Safety Administration (http://www.nhtsa.gov/fuel-economy) since 1975. For a careful and critical assessment of CAFE, including the politics of categorization for vehicles like the PT Cruiser, see the [2002 report] from the Committee on the Effectiveness and Impact of Corporate Average Fuel Economy (CAFE) Standards, National Research Council.
[408] Legal disputes often reflect different interpretations of category membership and whether a list of category members is exhaustive or merely illustrative. The legal principle of “implied exclusion”—expressio unius est exclusio alterius —says that if you “expressly name” or “designate” an enumeration of one or more things, any thing that is not named is excluded, by implication. However, prefacing the list with “such as,” “including,” or “like” implies that it is not a strict enumeration because there might be other members.
[409] The International Astronomical Union (IAU) (iau.org) published its new definition of planet in August 2006. A public television documentary in 2011 called The Pluto Files retells the story [(Tyson 2011)].
[410] The distinction between intension and extension was introduced by Gottlob Frege, a German philosopher and mathematician [(Frege 1892)].
[411] The number of resources in each of these categories depends on the age of the collection and the collector. We could be more precise here and say “single atomic property” or otherwise more carefully define “property” in this context as a characteristic that is basic and not easily or naturally decomposable into other characteristics. It would be possible to analyze the physical format of a music resource as a composition of size, shape, weight, and material substance properties, but that is not how people normally think. Instead, they treat physical format as a single property as we do in this example.
[412] We need to think of alphabetic ordering or any other organizing principle in a logical way that does not imply any particular physical implementation. Therefore, we do not need to consider which of these alphabetic categories exist as folders, files, or other tangible partitions.
[413] Another example: rules for mailing packages might use either size or weight to calculate the shipping cost, and whether these rules are based on specific numerical values or ranges of values, the intent seems to be to create categories of packages.
[414] If you try hard, you can come up with situations in which this property is important, as when the circus is coming to the island on a ferry or when you are loading an elevator with a capacity limit of 5000 pounds, but it just is not a useful or psychologically salient property in most contexts.
[415] Many information systems, applications, and programming languages that work with hierarchical categories take advantage of this logical relationship to infer inherited properties when they are needed rather than storing them redundantly.
[416] Similarly, clothing stores use intrinsic static properties when they present merchandise arranged according to color and size; extrinsic static properties when they host branded displays of merchandise; intrinsic dynamic properties when they set aside a display for seasonal merchandise, from bathing suits to winter boots; and extrinsic dynamic properties when a display area is set aside for “Today’s Special.”
[417] Aristotle did not call them classical categories. That label was bestowed about 2300 years later by [(Smith and Medin 1981)].
[418] We all use the word “about” with ease in ordinary discourse, but “aboutness” has generated a surprising amount of theoretical commentary about its typically implicit definition, starting with [(Hutchins 1977)] and [(Maron 1977)] and relentlessly continued by [(Hjørland 1992, 2001)].
[419] Typicality and centrality effects were studied by Rosch and others in numerous highly influential experiments in the 1970s and 1980s [(Rosch 1975)]. Good summaries can be found in [(Mervis and Rosch 1981)], [(Rosch 1999)], and in Chapter 1 of [(Rogers and McClelland 2008)].
[420] An easy to find source for Wittgenstein’s discussion of “game” is [(Wittgenstein 2002)] in a collection of core readings for cognitive psychology [(Levitin 2002)].
[421] The philosopher’s poll that ranked Wittgenstein’s book #1 is reported by [(Lackey 1999)].
[422] It might be possible to define “game,” but it requires a great deal of abstraction that obscures the “necessary and sufficient” tests. “To play a game is to engage in activity directed toward bringing about a specific state of affairs, using only means permitted by specific rules, where the means permitted by the rules are more limited in scope than they would be in the absence of the rules, and where the sole reason for accepting such limitation is to make possible such activity.” [(Suits 1967)]
[423] The exact nature of the category representation to which the similarity comparison is made is a subject of ongoing debate in cognitive science. Is it a prototype, a central tendency or average of the properties shared by category members, or is it one or more exemplars, particular members that typify the category? Or is it neither, as argued by connectionist modelers who view categories as patterns of network activation without any explicitly stored category representation? Fortunately, these distinctions do not matter for our discussion here. A recent review is [(Rips, Smith, and Medin 2012)].
[424] Another situation where similarity has been described as a “mostly vacuous” explanation for categorization is with abstract categories or metaphors. Goldstone says “an unrewarding job and a relationship that cannot be ended may both be metaphorical prisons… and may seem similar in that both conjure up a feeling of being trapped… but this feature is almost as abstract as the category to be explained.” [(Goldstone 1994)], p. 149.
[425] [(Medin, Goldstone, and Gentner 1993)] and [(Tenenbaum and Griffiths 2001)].
[426] Because Tversky’s model separately considers the sets of non-overlapping features, it is possible to accurately capture similarity judgments when they are not symmetric, i.e., when A is judged more similar to B than B is to A. This framing effect is well-established in the psychological literature and many machine learning algorithms now employ asymmetric measures. [(Tversky 1974)]
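Tversky’s contrast model can be sketched in a few lines of Python. This is an illustrative toy, not Tversky’s own data: the feature sets, the weights, and the use of set size as the measure f are all assumptions made for the example. Because the two directions of comparison weight the distinctive features differently, the similarity of A to B need not equal the similarity of B to A.

```python
def tversky_similarity(a, b, theta=1.0, alpha=0.5, beta=0.25):
    """Contrast model: weighted common features minus weighted distinctive
    features of each item; set size stands in for Tversky's measure f."""
    return theta * len(a & b) - alpha * len(a - b) - beta * len(b - a)

# Invented feature sets: the "variant" has a subset of the "prototype's" features.
variant = {"country", "asian", "communist"}
prototype = {"country", "asian", "communist", "large", "populous", "ancient"}

print(tversky_similarity(variant, prototype))   # 2.25
print(tversky_similarity(prototype, variant))   # 1.5
```

The asymmetry falls out of the unequal alpha and beta weights: the prototype’s extra features count less against the comparison in one direction than the other.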
[427] For a detailed explanation of distance and transformational models of similarity, see [(Flach 2012)], Chapter 9. There are many online calculators for Levenshtein distance; http://www.let.rug.nl/kleiweg/lev/ also has a compelling visualization. The “strings” to be matched can themselves be transformations. The “soundex” function is very commonly used to determine if two words could be different spellings of the same name. It “hashes” the names into phonetic encodings that have fewer characters than the text versions. See [(Christen 2006)] and http://www.searchforancestors.com/utility/soundex.html to try it yourself.
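For concreteness, the Levenshtein distance can be computed with the standard dynamic-programming routine, sketched below (a textbook version, not the code behind the linked calculator):

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to transform string s into string t."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,       # deletion
                            curr[j - 1] + 1,   # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```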
[428] This explanation for expert-novice differences in categorization and problem solving was proposed in [(Chi et al 1981)]. See [(Linhares 2007)] for studies of abstract reasoning by chess experts.
[430] The emergence of theory-based categorization is an important event in cognitive development that has been characterized as a shift from “holistic” to “analytic” categories or from “surface properties” to “principles.” See [(Carey and Gelman 1991)] and [(Rehder and Hastie 2004)].
[431] [(Tenenbaum 2000)] argues that this preference for the most specific hypothesis that fits the data is a general principle of Bayesian learning with random samples.
[432] Consider what happens if two businesses model the concept of “address” in a customer database with different granularity. One may have a coarse “Address” field in the database, which stores a street address, city, state, and Zip code all in one block, while the other stores the components “StreetAddress,” “City,” and “PostalCode” in separate fields. The more granular model can be automatically transformed into the less granular one, but not vice versa [(Glushko and McGrath 2005)].
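A minimal sketch of the one-way transformation, using the hypothetical field names from the example (a real implementation would have to handle missing values and local formatting conventions; the reverse direction would require error-prone parsing of the free-text block):

```python
def to_coarse(record):
    """Join granular address fields into a single coarse Address block.
    Field names here are illustrative, not from any real schema."""
    return ", ".join([record["StreetAddress"], record["City"],
                      record["State"], record["PostalCode"]])

granular = {"StreetAddress": "102 South Hall", "City": "Berkeley",
            "State": "CA", "PostalCode": "94720"}
print(to_coarse(granular))  # 102 South Hall, Berkeley, CA, 94720
```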
[433] [(Bowker and Star 2000)]
[434] Statistician and baseball fan Nate Silver rejected a complex system that used twenty-six player categories for predicting baseball performance because “it required as much art as science to figure out what group a player belonged in” [(Silver 2012, p. 83)]. His improved system used the technique of “nearest neighbor” analysis to identify players whose minor league statistics were most similar to those of the minor league players being evaluated. (See the section called “Categories Created by Clustering”).
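The nearest-neighbor idea can be illustrated with a toy sketch. The player names and statistics below are invented, and a real system like Silver’s would also normalize and weight the statistics so that no single scale dominates the distance:

```python
import math

# name -> (batting average, home runs, walks); all values are invented
past_players = {
    "Veteran A": (0.310, 22, 60),
    "Veteran B": (0.250, 35, 40),
    "Veteran C": (0.305, 20, 63),
}
prospect = (0.300, 21, 62)

def distance(u, v):
    """Euclidean distance between two statistic vectors (unnormalized,
    so the counting stats dominate in this toy example)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

nearest = min(past_players,
              key=lambda name: distance(past_players[name], prospect))
print(nearest)  # Veteran C
```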
[435] [(Rosch 1999)] calls this the principle of cognitive economy, that “what one wishes to gain from one’s categories is a great deal of information about the environment while conserving finite resources as much as possible. […] It is to the organism’s advantage not to differentiate one stimulus from another when that differentiation is irrelevant to the purposes at hand.” (Pages 3-4.)
[436] For example, some linguists think of “English” as a broad category encompassing multiple languages or dialects, such as “Standard British English,” “Standard American English,” and “Appalachian English.”
[437] For example, you can test whether a number is prime by dividing it by every number smaller than its square root, but this algorithm is ridiculously impractical for any useful application. Many cryptographic systems multiply prime numbers to create encryption keys, counting on the difficulty of factoring them to protect the keys; so, proving that ever larger numbers are prime is very important. See [(Crandall and Pomerance 2006)].
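A minimal sketch of the trial-division test described above; it works fine for small numbers but is hopeless at the scale of the numbers used in cryptographic keys:

```python
import math

def is_prime(n):
    """Trial division up to sqrt(n): correct but far too slow for the
    hundreds-of-digits numbers used in cryptography."""
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True

print([n for n in range(2, 30) if is_prime(n)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```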
[438] This example comes from [(Perlman 1984)], who introduced the idea of “natural artificial languages” as those designed to be easy to learn and use because they employ mnemonic symbols, suggestive names, and consistent syntax.
[439] When the US Congress revised copyright law in 1976 it codified a “fair use” provision to allow for some limited uses of copyrighted works, but fair use in the digital era is vastly different today; website caching to improve performance and links that return thumbnail versions of images are fair uses that were not conceivable when the law was written. A law that precisely defined fair uses using contemporary technology would have quickly become obsolete, but one written more qualitatively to enable interpretation by the courts has remained viable. See [(Samuelson 2009)].
[440] [(Wilkins 1668)] and [(Borges 1952)]
[441] “Rigid” might sound negative, but a rigidly defined resource is also precisely defined. Precise definition is essential when creating, capturing, and retrieving data and when information about resources in different organizing systems needs to be combined or compared. For example, in a traditional relational database, each table contains a field, or combination of fields, known as a primary key, which is used to define and restrict membership in the table. A table of email messages in a database might define an email message as a unique combination of sender address, recipient address, and date/time when the message was sent, by enforcing a primary key on a combination of these fields. Similar to category membership based on a single, monothetic set of properties, membership in this email message table is based on a single set of required criteria. An item without a recipient address cannot be admitted to the table. In categorization terms, the item is not a member of the “email message” class because it does not have all the properties necessary for membership.
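As a sketch of this kind of rule enforcement (the table and field names are illustrative, not from any real system), a relational database with a composite primary key and NOT NULL constraints rejects an item that lacks a required property, just as a monothetic category excludes an item missing a defining feature:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# An "email message" is defined as a unique combination of sender,
# recipient, and timestamp; all three are required properties.
conn.execute("""
    CREATE TABLE email_message (
        sender    TEXT NOT NULL,
        recipient TEXT NOT NULL,
        sent_at   TEXT NOT NULL,
        subject   TEXT,
        PRIMARY KEY (sender, recipient, sent_at)
    )
""")
conn.execute("INSERT INTO email_message VALUES (?, ?, ?, ?)",
             ("a@example.com", "b@example.com", "2024-01-01T09:00", "hi"))

# An item without a recipient address cannot be admitted to the table:
# it lacks a property necessary for membership in the class.
try:
    conn.execute("INSERT INTO email_message VALUES (?, ?, ?, ?)",
                 ("a@example.com", None, "2024-01-01T09:05", "oops"))
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```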
[442] Like data schemas, programming classes specify and enforce rules in the construction and manipulation of data. However, programming classes, like other implementations that are characterized by specificity and rule enforcement, can vary widely in the degree to which rules are specified and enforced. While some class definitions are very rigid, others are more flexible. Some languages have abstract types that have no instances but serve to provide a common ancestor for specific implemented types.
[443] The existence of chapters might suggest that an item is a novel; however, a lack of chapters need not automatically indicate that an item is not a novel. Some novels are hypertexts that encourage readers to take alternative paths. Many of the writings by James Joyce and Samuel Beckett are “stream of consciousness” works that lack a coherent plot, yet they are widely regarded as novels.
[444] See [(Silver 2012)]. Overreliance on data that is readily available is a decision-making heuristic proposed by [(Tversky and Kahneman 1974)], who developed the psychological foundations for behavioral economics. (See the sidebar, Behavioral Economics.)
[445] To be precise, this “difference of proportions” calculation relies on an algorithm that uses the logarithm of the proportions to calculate entropy, a measure of the uncertainty in a probability distribution. An entropy of zero means that the outcome can be perfectly predicted, and entropy increases as outcomes become less predictable. The information gain for an attribute is how much it reduces entropy after it is used to subdivide a dataset.
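The entropy and information-gain calculations can be sketched briefly; the toy loan records and attribute names below are invented for illustration. An attribute that perfectly predicts the class drives the post-split entropy to zero, giving maximal gain:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy from subdividing the dataset on one attribute."""
    n = len(labels)
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Invented loan records: "income" predicts the decision perfectly,
# while "renter" is uninformative.
rows = [{"income": "high", "renter": "yes"}, {"income": "high", "renter": "no"},
        {"income": "low", "renter": "yes"}, {"income": "low", "renter": "no"}]
labels = ["Approved", "Approved", "Denied", "Denied"]
print(information_gain(rows, labels, "income"))  # 1.0
print(information_gain(rows, labels, "renter"))  # 0.0
```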
[446] Unfortunately, this rational data-driven process for classifying loan applications as “Approved” or “Denied” was abandoned during the “housing bubble” of the early 2000s. Because lending banks could quickly sell their mortgages to investment banks who bundled them into mortgage-backed securities, applicants were approved without any income verification for “subprime” loans that initially had very low adjustable interest rates. Of course, when the rates increased substantially a few years later, defaults and foreclosures skyrocketed. This sad story is told in an informative, entertaining, but depressing manner in “The Big Short” [(Lewis, 2010)] and in a 2015 movie with the same name.
[447] Machine learning algorithms differ in which properties they use and in how they select them. A straightforward method is to run the algorithms using different sets of properties, and select the set that yields the best result. However, it can be very computationally expensive to run algorithms multiple times, especially when the number of properties is large. A faster alternative is to select or filter features based on how well they predict the classification. The information gain calculation discussed in the section called “Probabilistic Decision Trees” is an example of a filter method.
[448] See [(Blanzieri and Bryl 2009)] for a review of the spam problem and the policy and technology methods for fighting it. [(Upsana and Chakravarty 2010)] is somewhat more recent and more narrowly focused on text classification techniques.
A very thorough yet highly readable introduction to Active Learning is [(Settles 2012)].
[449] In particular, documents are usually represented as vectors of frequency-weighted terms. Other approaches start more directly with the similarity measure, obtained either by direct judgments of the similarity of each pair of items or by indirect measures like the accuracy in deciding whether two sounds, colors, or images are the same or different. The assumption is that the confusability of two items reflects how similar they are.
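One simple version of this vector representation uses raw term frequencies and cosine similarity. The documents below are invented, and real systems typically also apply inverse-document-frequency weighting, stopword removal, and stemming:

```python
import math
from collections import Counter

def tf_vector(text):
    """Represent a document as a sparse vector of raw term frequencies."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v))

d1 = tf_vector("the cat sat on the mat")
d2 = tf_vector("the cat ate the rat")
d3 = tf_vector("stock markets fell sharply today")

# Documents sharing terms are closer in the vector space.
print(cosine(d1, d2) > cosine(d1, d3))  # True
```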
[450] Unlike hierarchical clustering methods that have a clear stopping rule when they create the root category, k-means clustering methods run until the centroids of the categories stabilize. Furthermore, because the k-means algorithm is basically just hill-climbing, and the initial category “seed” items are random, it can easily get stuck in a local optimum. So it is desirable to try many different starting configurations for different choices of K.
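A compact sketch of k-means with random restarts on 2-D points (the points are invented): each run alternates assigning points to the nearest centroid and recomputing centroids until assignments stabilize, and keeping the run with the lowest within-cluster squared error guards against bad local optima.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean(cluster):
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def kmeans(points, k, iterations=100):
    """One k-means run: assign points to the nearest centroid, recompute
    centroids, and stop when the centroids no longer move."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    sse = sum(min(dist2(p, c) for c in centroids) for p in points)
    return centroids, sse

# A single run can stall in a local optimum, so try several random
# starting configurations and keep the one with the lowest total error.
random.seed(0)
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
best_centroids, best_sse = min((kmeans(points, 2) for _ in range(10)),
                               key=lambda run: run[1])
print(best_centroids, best_sse)
```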
[451] In addition, the complex feature representations of neural networks compute very precise similarity measurements, which enable searches for specific images or for duplicates of them.
[452] Structure Mapping theory was proposed in [(Gentner 1983)], and the Structure Mapping Engine followed a few years later [(Falkenhainer et al 1989)]. The SME was criticized for relying on hand-coded knowledge representations, a limitation overcome by [(Turney 2008)], who used text processing techniques to extract the semantic relationships used by Structure Mapping.