Chapter 4. Resources in Organizing Systems

Robert J. Glushko

Daniel D. Turner

Kimra McPherson

Jess Hemerly

Table of Contents

4.1. Introduction

4.1.1. What Is a Resource?

4.1.2. Identity, Identifiers, and Names

4.2. Four Distinctions about Resources

4.2.1. Resource Domain

4.2.2. Resource Format

4.2.3. Resource Agency

4.2.4. Resource Focus

4.2.5. Resource Format x Focus

4.3. Resource Identity

4.3.1. Identity and Physical Resources

4.3.2. Identity and Bibliographic Resources

4.3.3. Identity and Information Components

4.3.4. Identity and Active Resources

4.4. Naming Resources

4.4.1. What’s in a Name?

4.4.2. The Problems of Naming

4.4.3. Choosing Good Names and Identifiers

4.5. Resources over Time

4.5.1. Persistence

4.5.2. Effectivity

4.5.3. Authenticity

4.5.4. Provenance

4.6. Key Points in Chapter Four

Introduction

This chapter builds upon the foundational concepts introduced in Chapter 1, Foundations for Organizing Systems, to explain more carefully what we mean by resource. In particular, we focus on the issue of identity (what will be treated as a separate resource) and discuss the issues and principles we need to consider when we give each resource a name or identifier.

Navigating This Chapter

In the section called “Four Distinctions about Resources” we introduce four distinctions we can make when we discuss resources: domain, format, agency, and focus. In the section called “Resource Identity” we apply these distinctions as we discuss how resource identity is determined for physical resources, bibliographic resources, resources in information systems, as well as for active resources and smart things. The section called “Naming Resources” then tackles the problems and principles for naming: once we have identified resources, how do we name and distinguish them? Finally, the section called “Resources over Time” considers issues that emerge with respect to resources over time.

What Is a Resource?

Resources are what we organize.

We introduced the concept of resource in the section called “The Concept of ‘Resource’” with its ordinary sense of “anything of value that can support goal-oriented activity” and emphasized that a group of resources can be treated as a collection in an organizing system. And what do we mean by “anything of value,” exactly? It might seem that the question of identity, of what a single resource is, should not be hard to answer. After all, we live in a world of resources, and finding, selecting, describing, arranging, and referring to them are everyday activities. And while human resources are not a primary focus of this book, it would be remiss not to explain why it makes sense to think of people that way. See the sidebar, People as Resources.

Nevertheless, even when the resources we are dealing with are tangible things, how we go about organizing them is not always obvious, or at least not obvious to each of us in the same way at all times. Recognizing something in the sense of perceiving it as a tangible thing is only the first step toward being able to organize it and other resources like it. Which properties garner our attention, and which ones we use in organizing, depend on our experiences, purposes, and context.

We add information to a resource when we name it or describe it; it then becomes more than “it.” We can describe the same resource in many different ways. At various times we can consider any given resource to be one of many members of a broad category, as one of the few members of a narrow category, or as a unique instance of a category with only one member. For example, we might recognize something as a piece of clothing, as a sock, or as the specific dirty sock with the hole worn in the heel from yesterday’s long hike.

However, even after we categorize something, we might not be careful how we talk about it; we often refer to two objects as “the same thing” when what we mean is that they are “the same type of thing.” Indeed, we could debate whether a category with only one possible member is really a category, because it blurs an important distinction between particular items or instances and the class or type to which they belong.

The issues that matter and the decisions we need to make about resource instances and resource classes and types are not completely separable. Nevertheless, we will strive to focus on the former ones in this chapter and the latter ones in Chapter 7, Categorization: Describing Resource Classes and Types.

Resources with Parts

As tricky as it can be to decide what a resource is when you are dealing with single objects, it is even more challenging when the resources are objects or systems composed of other parts. In these cases, we must focus on the entirety of the object or system and treat it as a resource, treat its constituent parts as resources, and deal with the relationships between the parts and the whole, as we do with engineering drawings and assembly procedures.

How Many Things is a Chess Set?

image

A chess set exemplifies the many different ways to decide what to count as a separate resource. Is this a chess set, two sets of chess pieces, six types of chess pieces (1 king, 1 queen, 2 rooks, 2 bishops, 2 knights, 8 pawns for each side), or 33 separate things (the 32 pieces and a board on which to play the game)?

(Photo by Emma Jane Hogbin Westby. Creative Commons CC-BY-2.0 license.)

How many things is a car? If you are imagining the car being assembled you might think of several dozen large parts like the frame, suspension, drive train, gas tank, brakes, engine, exhaust system, passenger compartment, doors, and other pre-assembled components. Of course, each of those components is itself made up of many parts; think of the engine, or even just the radio. Some sources have counted ten or fifteen thousand parts in the average car, but even at that precise granularity a lot of parts are still complex things. There are screws and wires and fasteners and on and on; really too many to count.
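The chess set and car examples can be sketched as a part-whole tree in which the number of resources depends on the level at which we choose to count. The following is a minimal illustration rather than a prescribed model; the class name Part and the helper make_side are hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Part:
    """A resource that may itself be composed of other resources."""
    name: str
    parts: List["Part"] = field(default_factory=list)

    def leaves(self) -> List["Part"]:
        """Return the finest-grained parts below (or equal to) this one."""
        if not self.parts:
            return [self]
        return [leaf for part in self.parts for leaf in part.leaves()]

# Pieces each side has, as listed in the chess set sidebar.
PIECES_PER_SIDE = {"king": 1, "queen": 1, "rook": 2,
                   "bishop": 2, "knight": 2, "pawn": 8}

def make_side(color: str) -> Part:
    pieces = [Part(f"{color} {kind} {i + 1}")
              for kind, count in PIECES_PER_SIDE.items()
              for i in range(count)]
    return Part(f"{color} pieces", pieces)

chess_set = Part("chess set",
                 [Part("board"), make_side("white"), make_side("black")])

# The same object counted at four different granularities.
as_one_set = 1
as_sets_of_pieces = len([p for p in chess_set.parts if p.parts])
as_piece_types = len(PIECES_PER_SIDE)
as_separate_things = len(chess_set.leaves())
```

None of these counts is the correct one; each reflects a decision about what to treat as a separate resource for some purpose.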

Ambiguity about the number of parts in the whole holds for information resources too; a newspaper can be considered a single resource but it might also consist of multiple sections, each of which contains separate stories, each of which has many paragraphs, and so on. Similarly, while a web page can be treated as a single resource, it can also be considered as a collection of more granular parts, each of which can be separately identified as the source or anchor of a link. Likewise, a bank’s credit card application might ask about outstanding loans, payment history, current income, and other information, or the bank might just look up your credit score, which is a statistical index that combines this financial information into a single number.
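For web pages, this choice of granularity shows up directly in the identifier: a fragment appended to a page's address names a part inside the page. The sketch below uses Python's standard urllib.parse module; the URL and the fragment name are made up for illustration.

```python
from urllib.parse import urldefrag

# Hypothetical address of a news story and of one paragraph within it.
page_url = "https://example.org/news/2024/story.html"
paragraph_url = page_url + "#paragraph-3"

# urldefrag separates the identifier of the whole page from the
# fragment that identifies a more granular part of it.
whole, fragment = urldefrag(paragraph_url)
```

Here whole identifies the page as a single resource, while fragment identifies one anchor within it.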

Bibliographic Resources, Information Components, and “Smart Things” as Resources

Information resources generally pose additional challenges in their identification and description because their most important property is usually their content, which is not easily and consistently recognizable. Organizing systems for information resources in physical form, like those for libraries, have to juggle the duality of their tangible embodiment with what is inherently an abstract information resource; that is, the printed book versus the knowledge the book contains. Here, the organizing system emphasizes description resources or surrogates, like bibliographic records that describe the information content, rather than their physical properties.

Another question about resources that is especially critical in libraries is: What set of resources should be treated as the same work because they contain essentially similar intellectual or artistic content? We may talk about Shakespeare’s play Macbeth, but what is this thing we call “Macbeth”? Is it a particular string of words, saved in a computer file or handwritten upon a folio? Is it the collection of words printed with some predetermined font and pagination? Are all the editions and printings of these words the same Macbeth? How should we organize the numerous live and recorded performances of plays and movies that share the Macbeth name? What about creations based on or inspired by Macbeth that do not share the title “Macbeth,” like the Kurosawa film Kumonosu-jo (Throne of Blood) that transposes the plot to feudal Japan? Patrick Wilson proposed a genealogical analogy, characterizing a work as “a group or family of texts,” with the idea that a creation like Shakespeare’s Macbeth is the ancestor of later members of the family.[161]

Information system designers and architects face analogous design challenges when they describe the information components in business or scientific organizing systems. Information content is intrinsically merged or confounded with structure and presentation whenever it is used in a specific instance and context. From a logical perspective, an order form contains information components for ITEM, CUSTOMER NAME, ADDRESS, and PAYMENT INFORMATION, but the arrangement of these components, their type font and size, and other non-semantic properties can vary a great deal in different order forms and even across a single information system that re-purposes these components for letters, delivery notices, mailing labels, and database entries.[162]
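One way to picture the separation of logical components from their presentation is to hold the components in a single structure and derive each rendering from it. The sketch below assumes a simplified order with the four components named above; the function names and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    """Logical information components, independent of any presentation."""
    item: str
    customer_name: str
    address: str
    payment_information: str

def mailing_label(order: Order) -> str:
    # A label re-uses only two of the components.
    return f"{order.customer_name}\n{order.address}"

def delivery_notice(order: Order) -> str:
    # A notice arranges three components in a sentence.
    return (f"Dear {order.customer_name}, your order of {order.item} "
            f"has shipped to {order.address}.")

order = Order(item="desk lamp",
              customer_name="A. Customer",
              address="12 Example Street",
              payment_information="invoice")

label = mailing_label(order)
notice = delivery_notice(order)
```

The arrangement, wording, and typography differ between the two outputs, but both are derived from the same underlying components.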

Similar questions about resource identity are posed by the emergence of ubiquitous or pervasive computing, in which information processing capability and connectivity are embedded into physical objects, in devices like smart phones, and in the surrounding environment. Equipped with sensors, radio-frequency identification (RFID) tags, GPS data, and user-contributed metadata, these smart things create a jumbled torrent of information about location and other properties that must be sorted into identified streams and then matched or associated with the original resource.

The section called “Resource Identity” discusses the issues and methods for determining “what is a resource?” for physical resources, as well as for the bibliographic resources, information components and smart things discussed here, in the section called “Resources with Parts”.

Identity, Identifiers, and Names

The answer to the question posed in the section called “What Is a Resource?” has two parts.

The first part is identity: what thing are we treating as the resource?

The second part is identification: differentiating between this single resource and other resources like it.

These problems are closely related. Once you have decided what to treat as a resource, you create a name or an identifier so that you can refer to it reliably. A name is a label for a resource that is used to distinguish one from another. An identifier is a special kind of name assigned in a controlled manner and governed by rules that define possible values and naming conventions. For a digital resource, its identifier serves as the input to the system or function that determines its location so it can be retrieved, a process called resolving the identifier or resolution.
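The difference between a name and an identifier, and the idea of resolution, can be sketched in a few lines. Everything specific here is an assumption made for illustration: the identifier rule (the prefix res: followed by six digits), the registry contents, and the example.org location are all hypothetical.

```python
import re

# Hypothetical rule governing which strings are valid identifiers.
IDENTIFIER_RULE = re.compile(r"res:\d{6}")

# Hypothetical registry mapping identifiers to current locations.
REGISTRY = {
    "res:000042": "https://repository.example.org/objects/42",
}

def is_valid_identifier(candidate: str) -> bool:
    """An identifier must conform to the controlling rule; a name need not."""
    return IDENTIFIER_RULE.fullmatch(candidate) is not None

def resolve(identifier: str) -> str:
    """Resolution: turn an identifier into the location of the resource."""
    if not is_valid_identifier(identifier):
        raise ValueError(f"not a well-formed identifier: {identifier!r}")
    if identifier not in REGISTRY:
        raise LookupError(f"no resource registered for {identifier!r}")
    return REGISTRY[identifier]

location = resolve("res:000042")
```

A label such as "my favorite book" may serve perfectly well as a name, but it fails the rule and so cannot be resolved by this system.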

Choosing names and identifiers (be it for a person, a service, a place, a trend, a work, a document, a concept, etc.) is often challenging and highly contentious. Naming is made difficult by countless factors, including the audience that will need to access, share, and use the names, the limitations of language, institutional politics, and personal and cultural biases.

A common complication arises when a resource has more than one name or identifier. When something has more than one name, each of the multiple names is a synonym or alias. A particular physical instance of a book might be called a hardcover or paperback or simply a text. George Furnas and his research collaborators called this issue of multiple names for the same resource or concept the vocabulary problem.[163]

Whether we call it a book or a text, the resource will usually have a Library of Congress catalog number as well as an ISBN as an identifier. When the book is in a carton of books being shipped from the publisher to a bookstore or library, that carton will have a bar-coded tracking number assigned by the delivery service, and a manifest or receipt document created by the publisher whose identifier associates the shipment with the customer. Each of these identifiers is unique with respect to some established scope or context.

A partial solution to the vocabulary problem is to use a controlled vocabulary. We can impose rules that standardize the way in which names and labels for resources are assigned in the first place. Alternatively, we can define mappings from terms used in our natural language to the authoritative or controlled terms. However, vocabulary control cannot remove all ambiguity. Even if a passport or national identity system requires authoritative full names rather than nicknames, there could easily be more than one Robert John Smith in the system.
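Both points in this paragraph can be sketched briefly: a mapping from everyday terms to a controlled term reduces the vocabulary problem, but controlled names alone still do not guarantee a unique match. The term list, the record layout, and the identifiers P-001 and P-002 are invented for this illustration.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from natural-language terms to a controlled term.
CONTROLLED_TERMS = {
    "book": "book",
    "hardcover": "book",
    "paperback": "book",
    "text": "book",
}

def to_controlled_term(term: str) -> Optional[str]:
    """Map a user's term to the authoritative one, or None if unknown."""
    return CONTROLLED_TERMS.get(term.strip().lower())

# Even with authoritative full names, two people can share a name,
# so each record also carries an identifier.
PEOPLE: List[Dict[str, str]] = [
    {"id": "P-001", "name": "Robert John Smith"},
    {"id": "P-002", "name": "Robert John Smith"},
]

def find_by_name(name: str) -> List[Dict[str, str]]:
    return [person for person in PEOPLE if person["name"] == name]

matches = find_by_name("Robert John Smith")
```

The lookup by name returns more than one record, which is why identification ultimately relies on identifiers that are unique within some scope.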

Controlling the language used for a particular purpose raises other questions: Who writes and enforces these rules? What happens when organizing systems that follow different rules get compared, combined, or otherwise brought together in contexts different from those for which they were originally intended?

Four Distinctions about Resources

The nature of the resource is critical for the creation and maintenance of quality organizing systems. There are four distinctions we make in discussing resources: domain, format, agency, and focus. Figure 4.1, “Resource Domain, Format, Focus and Agency.” depicts these four distinctions, perspectives or points of view on resources; because they are not independent, we cannot portray these distinctions as categories of resources.

Figure 4.1. Resource Domain, Format, Focus and Agency.

image

Four distinctions we can make when discussing resources concern their domain (their type of matter or content), format (physical or digital), agency (active or passive), and focus (primary or description).

 

Resource Domain

Resource domain is an intuitive notion that groups resources according to the set of natural or intuitive characteristics that distinguishes them from other resources. It contrasts with the idea of ad hoc or arbitrary groupings of resources that happen to be in the same place at some time.

For physical resources, domains can be coarsely distinguished according to the type of matter they are made of using easily perceived properties. The top-level classification of all things into the animal, vegetable, and mineral kingdoms by Carl Linnaeus in 1735 is deeply embedded in most languages and cultures to create a hierarchical system of domain categories.[164] Many aspects of this system of domain categories are determined by natural constraints on category membership that exist as patterns of shared and correlated properties; a resource identified as a member of one category must also be a member of another with which it shares some but not all properties. For example, a marble statue in a museum must also be a kind of material, and a fish in an aquarium must also be a kind of animal.

The Document Type Spectrum

Different domains or types of documents can be distinguished according to the extent to which their content is semantically prescribed, by the amount of internal structure, and by the correlations of their presentation and formatting to their content and structure. These three characteristics of content, structure, and presentation vary systematically from narrative document types like novels to transactional document types like invoices.

Narrative types are authored by people and are heterogeneous in structure and content, and their content is usually just prose and graphic elements. Their presentational characteristics carefully reinforce their structure and semantics; for example, the text of titles or major headings is large because the content is important, in contrast to the small text of footnotes. Transactional document types are usually created mechanically and, as a result, are homogeneous in structure and content; their content is largely “data”: strongly typed content with precise semantics that can be processed by computers.

In the middle of the spectrum are hybrid document types like textbooks, encyclopedias, and technical manuals that contain a mixture of narrative text and structured content in figures, data tables, code examples, and so on.

For information resources, easily perceived properties like a book’s color or size are less reliably correlated with resource domain, so we more often distinguish domains based on semantic properties; the definitions of the “encyclopedia,” “novel,” and “invoice” resource types distinguish them according to their typical subject matter, or the type of content, rather than according to the great variety of physical forms in which we might encounter them. Arranging books by color or size might be sensible for very small collections, or in a photo studio, but organizing according to physical properties would make it extremely impractical to find books in a large library.

We can arrange types of information resources in a hierarchy. However, because the category boundaries are not sharp it is more useful to view domains of information resources on a continuum from weakly-structured narrative content to highly structured transactional content. This framework, called the Document Type Spectrum by Glushko and McGrath, captures the idea that the boundaries between resource domains, like those between colors in the rainbow, are easy to see for colors far apart in the spectrum but hard to see for adjacent ones.[165] (See the sidebar, The Document Type Spectrum, and its corresponding depiction as Figure 4.2, “Document Type Spectrum.”)

Figure 4.2. Document Type Spectrum.

image

The Document Type Spectrum is a continuum of document types from narrative ones that are mostly text, like novels, to transactional ones with highly-structured information, like invoices. In between are hybrid types that contain both narrative and transactional content, like dictionaries and encyclopedias.

 

Resource Format

Information resources can exist in numerous formats with the most basic distinction between physical and digital ones. This distinction is most important in the implementation of a resource storage or preservation system because that is where physical properties are usually considerations, and very possibly constraints. This distinction is less important at the logical level when we design interactions with resources because digital surrogates for the physical resources can overcome the constraints posed by physical properties. When we search for cars or appliances in an online store it does not matter where the actual cars or appliances are located or how they are organized. (See the sidebar, The Three Tiers of Organizing Systems).

Many digital representations can be associated with either physical or digital resources, but it is important to know which one is the original or primary resource, especially for unique or valuable ones.

Today many resources in organizing systems are born digital, created in word processors, digital cameras, and audio and video recorders. Other digital resources are produced by sensors in smart things and by the systems that create digital resources when they interact with barcodes, QR codes, RFID tags, or other mechanisms for tracking identity and location.[166]

Other digital resources are created by digitization, the process for transforming an artifact whose original format is physical so it can be stored and manipulated by a computer. We digitize the printed word, photographs, blueprints, and record albums. Printed text, for example, can be digitized by scanning pages and using character recognition software or simply re-typing it.[167]

There are a vast number of digital formats that differ in many ways, but we can coarsely compare them on two dimensions: the degree to which they distinguish information content from presentation or rendering, and the explicitness with which content distinctions are represented. Taken together, these two dimensions allow us to compare formats on their overall “Information IQ” with the overarching principle being that “smarter” formats contain more computer-processable information, as illustrated in Figure 4.3, “Information IQ.”

Simple digital formats for “plain text” documents contain only the characters that you see on your computer keyboard. ASCII is the most commonly used simple format, but ASCII is inadequate for most languages, which have larger character sets, and it also cannot handle mathematical characters.[168] The Unicode standard was designed to overcome these limitations.[169] (ASCII and Unicode are discussed in great detail in the section called “Notations”.)
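The limitation of ASCII is easy to observe in practice. The short sketch below uses only Python's built-in string encoding; the helper name fits_in_ascii and the sample strings are chosen for illustration.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character in text is in the ASCII set."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

plain = "resource"
accented = "résumé"      # characters outside the ASCII repertoire
mathematical = "x ≤ y"   # a mathematical symbol

# The same strings can always be encoded with a Unicode encoding
# such as UTF-8, at the cost of more than one byte for some characters.
utf8_bytes = accented.encode("utf-8")
```

In this example the plain string passes the ASCII test while the other two do not, yet all three can be represented in Unicode.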

Most document formats also explicitly encode a hierarchy of structural components, such as chapters, sections or semantic components like descriptions or procedural steps, and sometimes the appearance of the rendered or printed form.[170] Another important distinction to note is whether the information is encoded as a sequence of text characters so that it is human readable as well as computer readable. Encoding character content with XML, for example, allows for layering of intentional coding or markup interwoven with the “plain text” content. Because XML processors are required to support Unicode, any character can appear in an XML document. The most complex digital formats are those for multimedia resources and multidimensional data, where the data format is highly optimized for specialized analysis or applications.[171]
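As a small illustration of markup interwoven with plain text, the sketch below parses an XML fragment with Python's standard xml.etree.ElementTree module. The element names and values are invented for this example and do not come from any particular schema.

```python
import xml.etree.ElementTree as ET

# A human-readable document in which markup labels each piece of content.
# The non-ASCII characters are legal because XML is defined over Unicode.
document = """<order>
  <item>Café table</item>
  <customerName>Zoë Müller</customerName>
  <address>12 Example Street</address>
</order>"""

root = ET.fromstring(document)

# Because the content distinctions are explicit, a program can pick out
# each component by name instead of guessing from layout.
components = {child.tag: child.text for child in root}
```

The same text is readable by a person and processable by a computer, which is the property the paragraph above attributes to character-based encodings.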

Digitization of non-text resources such as film photography, drawings, and analog audio and visual recordings raises a complicated set of choices about pixel density, color depth, sampling rate, frequency filtering, compression, and numerous other technical issues that determine the digital representation.[172]
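A rough calculation shows why these choices matter: the size of an uncompressed digital representation grows directly with sampling rate, bit depth, and pixel density. The functions below are a back-of-the-envelope sketch; the parameter values (compact disc audio at 44,100 samples per second, 16 bits, two channels, and a letter-size page scanned at 300 pixels per inch in 24-bit color) are illustrative choices, not recommendations.

```python
def audio_bytes(sample_rate_hz: int, bit_depth: int,
                channels: int, seconds: int) -> int:
    """Uncompressed audio size: samples x bits per sample x channels."""
    return sample_rate_hz * bit_depth * channels * seconds // 8

def image_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Uncompressed image size: pixels x bits per pixel."""
    return width_px * height_px * bits_per_pixel // 8

# One minute of CD-quality stereo audio.
one_minute_audio = audio_bytes(44_100, 16, 2, 60)

# An 8.5 x 11 inch page at 300 pixels per inch is 2550 x 3300 pixels.
scanned_page = image_bytes(2550, 3300, 24)

# Halving the pixel density in each direction quarters the size.
scanned_page_150 = image_bytes(1275, 1650, 24)
```

Under these assumptions one minute of audio occupies about 10.6 million bytes and one scanned page about 25 million bytes before any compression is applied.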

There may be multiple intended uses and devices for a digitized resource that could require different digitization approaches and formats. Downstream users of digitized resources need to know the format in which a digital artifact has been created, so they can reuse it as is, or process it in other ways.

Some digital formats support interactions that are qualitatively different and more powerful than those possible with physical resources. Museums are using virtual world technology to create interactive exhibits in which visitors can fly through the solar system, scan their own bodies, and change gravity so they can bounce off walls. Sophisticated digital document formats can enable interactions with annotated digital images or video, 3-D graphics or embedded datasets. The Google Art Project contains extremely high resolution photographs of famous paintings that make it possible to see details that are undetectable under the normal viewing conditions in museums.[173]

Nevertheless, digital representations of physical resources can also lose important information and capabilities. The distinctive sounds of hip hop music produced by “scratching” vinyl records on turntables cannot be produced from digital MP3 music files.[174]

Figure 4.3. Information IQ.

image

The notion of Information IQ captures the idea that document formats differ on two dimensions: the explicitness of content representation, and the separation of content and presentation. A scanned document is just a picture of a document with neither of these distinctions, so it is low on both dimensions. A database or XML document distinguishes explicitly between types of content and presentation is separately assigned, so they are high on both dimensions and have the highest Information IQ. An HTML document’s content distinctions are usually presentational and, thus, it has lower IQ. Formats with high Information IQ facilitate computer processing.

 

Copyright often presents a barrier to digitization, both as a matter of law and because digitization itself enables copyright enforcement to a degree not possible prior to the advent of digitization, by eliminating common forms of access and interactions that are inherently possible with physical printed books like the ability to give or sell them to someone else.[175]

Resource Agency

People as Resources

Earlier editions of this book sidestepped the question of people as resources to avoid complicating the Discipline of Organizing. People organize themselves in innumerable ways to coexist, share knowledge, and accomplish more than they could as individuals, and behaviors such as trust and reciprocity might be considered “organizing principles” for human society. But these organic relationships and interactions usually lack the intentional arrangement to be considered true Organizing Systems, except when the people are living in “intentional communities.”[9]

However, people do qualify as resources in Organizing Systems under our definition: just like digital and physical resources, human resources can be identified, categorized, described in terms of their attributes and relationships, and take part in interactions to create value. In businesses, people are organized to amplify their skills, knowledge, and agency. A company’s organizational chart is often a formal hierarchy in which each worker’s role is defined through his or her responsibilities and relationships to others in the company. Treating an employee abstractly as a resource with specific and predictable functions, inputs, and outputs enables employees or processes to depend upon each other without being distracted by the details of one another’s work. This so-called “black boxing” can encourage specialization and allow an organization to function more efficiently.

Nevertheless, while organizational charts are typically presented as neat hierarchies, human relationships cut across the hierarchy to create a network, and may be complicated by differing values and motivations. Conflicting incentives and lack of communication between people may cause breakdowns in the system. People are more than the sum of properties used to organize them: understanding and defining employees’ roles too narrowly could exclude the aspects of the job that they find rewarding or consider part of their professional identity, while black boxing people’s labor and treating them as “remote person calls” also risks dehumanizing them and ignoring their working conditions.

Agency is the extent to which a resource can initiate actions on its own. We can define a continuum between completely passive resources that cannot initiate any actions and active resources that can initiate actions based on information they sense from their environments or obtain through interactions with other resources. A book being read at the beach will grow warm from absorbing the sun’s energy but it has no way of measuring its temperature and is a completely passive resource. An ordinary mercury thermometer senses and displays the temperature but is not capable of communicating its own reading, whereas a digital wireless thermometer or weather station can.
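The three examples in this paragraph can be placed on the continuum by recording which capabilities each resource has. This is only a sketch; the Capability names are labels introduced here for illustration, not terms defined in the text.

```python
from enum import Flag, auto

class Capability(Flag):
    NONE = 0
    SENSE = auto()        # can measure something about its environment
    DISPLAY = auto()      # can show its reading to an observer
    COMMUNICATE = auto()  # can send its reading to another resource

RESOURCES = {
    "book at the beach": Capability.NONE,
    "mercury thermometer": Capability.SENSE | Capability.DISPLAY,
    "wireless thermometer": Capability.SENSE | Capability.COMMUNICATE,
}

def is_completely_passive(capabilities: Capability) -> bool:
    return capabilities == Capability.NONE

def can_report_reading(capabilities: Capability) -> bool:
    return Capability.COMMUNICATE in capabilities

passive = [name for name, caps in RESOURCES.items()
           if is_completely_passive(caps)]
reporting = [name for name, caps in RESOURCES.items()
             if can_report_reading(caps)]
```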

Passive resources serve as nouns or operands that are acted upon, while active resources serve as verbs or operants that cause and carry out actions. We need a concept of agency to bring resources that are active information sources, or computational in character, into the organizing system framework. This concept also lets us include living resources, or more specifically, humans, into discussions about organizing systems in a more general way that emphasizes their agency.[176]

Passive or Operand Resources

Organizing systems that contain passive or operand resources are ubiquitous for the simple reason that we live in a world of physical resources that we identify and name in order to interact with them. Passive resources are usually tangible and static and thus they become valuable only as a result of some action or interaction with them.

Most organizing systems with physical resources or those that contain resources that are digitized equivalents treat those resources as passive. A printed book on a library shelf, a digital book in an ebook reader, a statue in a museum gallery, or a case of beer in a supermarket refrigerator only create value when they are checked out, viewed, or consumed. None of these resources exhibits any agency, and none can initiate any actions to create value on its own.

Active or Operant Resources

Active resources create effects or value on their own, sometimes when they initiate interactions with passive resources. Active resources can be people, other living resources, computational agents, active information sources, web-based services, self-driving cars, robots, appliances, machines or otherwise ordinary objects like light bulbs, umbrellas, and shoes that have been made “smarter.” We can exploit computing capability, storage capacity, and communication bandwidth to create active resources that can do things and support interactions that are impossible for ordinary physical passive resources.

Active Resources: The Nest Thermostat “Ecosystem”

image

These two screenshots of the Nest iPad app show the thermostat control panel and an energy history report with a pop-up note explaining that resetting the temperature resulted in higher than average energy use on that day. The Nest thermostat serves as a hub device, communicating with lights, appliances, smoke alarms, your car, your wearable fitness sensor, and other active resources (https://nest.com/works-with-nest/).

(Screenshots by Andrea Angquist. Used with permission.)

We can analyze active resources according to five capabilities that progressively increase their agency. These capabilities build on each other to give resources and the organizing systems in which they participate more ways to create value through interactions and information exchanges.

Sensing or awareness

The minimal capability for a resource to have some agency is for it to be able to sense or be aware of some aspect of its environment or its interactions with other resources. A thermometer measures temperature, a photodetector measures light, a gauge measures the fuel left in a car’s gas tank, a GPS device computes its location after detecting and analyzing signals from satellites, a wearable fitness sensor tracks your heartbeat and how far you walk. But sensing something in itself does not create any value in an organizing system. Something needs to be done.

Actuation

A resource has the capability to actuate when it can create effects or value by initiating some action as a result of the information it senses; “actuator” is often used to describe a resource that can move or control a physical mechanism or system, while “effector” is used when the resource is a biological one. Resources can actuate by turning on lights, speakers, cameras, motors, switches, by sending a message about the state or value of a sensor, or by moving themselves around (as with robots).

A potential or latent actuation is created when a resource can display or broadcast some aspect of its state, but value is only created if another resource (possibly human) happens to see the display or hear the broadcast and then acts upon it.

For example, RFID chips, which are essentially bar codes with built-in radio transponders, can be attached to otherwise passive resources to make them active. RFID chips begin transmitting when they detect the presence of an RFID reading device. This enables automated location tracking and context sensing. RFID receivers are built into assembly lines, loading docks, parking lots, toll booths, or store shelves to detect when some RFID-tagged resource is at some meaningful location. RFID tags can be made more useful by having them record and transmit information from sensors that detect temperature, humidity, acceleration, and even biological contamination.[177]

Connectivity

For an active resource to do useful work it must be connected in some way to the actuation mechanism that manipulates or controls some other resource. This connection might be a direct and permanent one between the resource and the thing it actuates, like that of a thermostat whose temperature sensing capability has a fixed connection to a heating or cooling system that it turns off or on depending on the temperature.

An important innovation in the design of active resources is “wrapping” physical resources with software so they can be given IP addresses and make connections with Internet protocols, which allows them to send information to an application with more capability to act on it. Such resources are said to be part of the “Internet of Things.”

Smart phones are active resources that can identify and share their own location, orientation, acceleration and a growing number of other contextual parameters to enable personalization of information services. Smart phones can also run the applications that receive messages from and send messages to other smart resources to monitor and optimize how they work.

Computation or programmability

Simple active resources operate in a deterministic manner: given this sensor reading, do this. Other active resources have computational capabilities that enable them to analyze the current and historical information from their sensors, identify significant data values or patterns in these interaction resources, and then adapt their behavior accordingly.
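The contrast can be sketched in a few lines of Python. This is only an illustration: the function and class names, the thresholds, and the adaptation rule are all assumptions, not taken from any real device. The first function applies a fixed rule to each reading; the second resource compares each new reading with its own recent history before deciding what to do.

```python
from collections import deque
from statistics import mean


def fixed_rule(temperature_c, threshold_c=20.0):
    """Deterministic resource: the same reading always yields the same action."""
    return "heat_on" if temperature_c < threshold_c else "heat_off"


class AdaptiveResource:
    """Hypothetical resource that reacts to changes relative to recent readings."""

    def __init__(self, window=5, tolerance=2.0):
        self.history = deque(maxlen=window)  # keeps only the most recent readings
        self.tolerance = tolerance

    def react(self, reading):
        # The baseline is the average of the stored history (or the reading itself
        # when there is no history yet).
        baseline = mean(self.history) if self.history else reading
        self.history.append(reading)
        if reading - baseline > self.tolerance:
            return "cool"
        if baseline - reading > self.tolerance:
            return "heat"
        return "idle"
```

Given the same reading, the fixed rule always answers the same way, whereas the adaptive resource may answer differently depending on what it has sensed before.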

Many thermostats are programmable, but most people do not bother to program them so they miss out on potential energy savings. Nest Labs makes a learning thermostat that programs itself. The Nest thermostat uses sensors for temperature, humidity, motion, and light to figure out whether people are at home, and a Wi-Fi connection to get local weather data.

The Roomba vacuum cleaning robot navigates around furniture, power cords, and stairs, and optimizes its cleaning paths to go over particularly dirty places. But vacuuming is all it does. More sophisticated robots are designed to be versatile and adaptable so they can repetitively perform whatever task is needed for some manufacturing process, and their capabilities can be continually upgraded by software updates, just like the apps on your smart phone. A new generation of robots typified by one called Baxter can be trained by example; a person moves Baxter’s arms and hands to show him what to do, and when Baxter has programmed himself to repeat it, he nods.

Composability and cooperating

The “smartest” active resources can do more than analyze the information they collect and adapt what they do. In addition, they expose what they know and can do to other resources using standard or non-proprietary formats and protocols. This means that active resources that were independently designed and implemented can work together to create value.

Many organizing systems on the web consist of collections or configurations of active digital resources. Interactions among these active resources often implement information-intensive business models where value is created by exchanging, manipulating, transforming, or otherwise processing information, rather than by manipulating, transforming, or otherwise processing physical resources.

We are beginning to see the same principles of modularity and composability applied to physical resources, with open source software libraries for using sensors and micro-controllers and easy to use APIs. In essence, we are using software and physical resources in much the same way as functional building blocks, and standards will be critically important.

Service Oriented Architecture (SOA) is an emerging design discipline for organizing active software resources as functional business components that can be combined in different ways. SOA is generally implemented using web services that exchange XML documents in real-time information flows to interconnect the business service components.

A familiar design pattern for an organizing system composed from active digital resources is the “online store.” The store can be analyzed as a composition or choreography in which some web pages display catalog items, others serve as “shopping carts” to assemble the order, and then a checkout page collects the buyer’s payment and delivery information that gets passed on to other service providers who process payments and deliver the goods.

If This, Then That

IFTTT is a visual programming system that lets non-programmers connect and control active resources in the physical and digital worlds. IFTTT programs, called recipes, can take information from a growing library of Internet services (date/time, calendar, weather, news, email, social media, and many others) and use this information with simple control logic to trigger actions in other services or resources. Example recipes can copy an Instagram photo to Google Drive, add daily Fitbit data to a spreadsheet, or control lights based on time, weather, or sunset.

image

“If sunset then electrical outlet on.” The icon on the left is the trigger; the icon on the right is the action.

(Photo by R. Glushko)
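The trigger-and-action structure of a recipe can be sketched in Python. This is a minimal illustration, not IFTTT's actual interface: the Recipe class, the event dictionary, and the "outlet:on" command are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Recipe:
    """A trigger-action pair: if the trigger holds for an event, run the action."""
    name: str
    trigger: Callable[[dict], bool]
    action: Callable[[dict], str]

    def run(self, event: dict) -> Optional[str]:
        # Return the action's result when triggered, otherwise do nothing.
        return self.action(event) if self.trigger(event) else None


# Hypothetical event shape and outlet command, chosen only for illustration.
sunset_outlet = Recipe(
    name="If sunset then electrical outlet on",
    trigger=lambda event: event.get("type") == "sunset",
    action=lambda event: "outlet:on",
)
```

Calling `sunset_outlet.run({"type": "sunset"})` produces the outlet command, while any other event type produces nothing.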

Design patterns for composing organizing systems from “smart” physical resources are emerging in work on the “smart home,” “smart office building,” or “smart city.” Many experiments are underway and new products emerging that are trying out different combinations of hardware and software to understand the design tradeoffs between them to best determine where the “smarts” should go. For example, we can compare a “smart home” built around a super-intelligent hub device that communicates and coordinates with many other “not so smart” devices from the same manufacturer to one in which all of the devices are equally smart and come from different makers.

At more complex scales, a truly smart building will not just have programmable thermostats to control heating and cooling systems. It will take in weather forecasts, travel calendars, information about the cost of electricity from different sources, and other relevant information as inputs to a model of how the building heats and cools to optimize energy use and cost while keeping the rooms at appropriate temperatures.

Standard application interfaces enable active resources to interact with people to get information that might otherwise come from sensors or that enhances the value of the sensor information. A programmable thermostat that can record time-based preferences of the people who use the space controlled by the thermostat is more capable than one with just a single temperature threshold. A standard Internet protocol for communicating with the thermostat would enable it to be controlled remotely.

Open and standard data formats and communication protocols enable the aggregation and analysis of information from many instances of the same type of active resource. For example, smart phones running the Google Maps application transmit information about their speed and location. Machine learning and sophisticated optimization techniques applied to this dataset can yield collective intelligence that can then be given back to the resources from which it was derived. In this case, Google can identify traffic jams and generate alternative routes for the drivers stuck in traffic.
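A toy Python sketch of this kind of aggregation follows. It is not how Google Maps works; it simply assumes that each phone reports a hypothetical (segment_id, speed_kmh) pair and that a road segment counts as congested when enough phones report a low average speed. The threshold values are likewise assumptions.

```python
from collections import defaultdict
from statistics import mean


def find_slow_segments(reports, jam_speed_kmh=15.0, min_reports=3):
    """Group (segment_id, speed_kmh) reports and return the slow segments."""
    speeds = defaultdict(list)
    for segment_id, speed_kmh in reports:
        speeds[segment_id].append(speed_kmh)
    # A segment is flagged only if enough phones reported and their mean speed is low.
    return sorted(
        segment
        for segment, values in speeds.items()
        if len(values) >= min_reports and mean(values) < jam_speed_kmh
    )
```

No single report says much, but the combined reports from one segment can reveal a pattern that is then useful to every driver who contributed to it.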

Stop and Think: The Internet of Things

There is a great deal of hype about the Internet of Things, but there is also a great deal of innovation. If you search for the phrase “Internet of Things” along with almost any physical resource, chances are you will find something. Try “baby,” “dog,” “fork,” “lettuce,” “pajamas,” “streetlamp,” and then a few of your own.

But not everything can be done best by computers. The web has enabled the use of people as active resources to carry out tasks of short duration that can be precisely described but which cannot be done reliably by computers. These tasks often require aesthetic or subjective judgment. The people doing these web-based tasks are often called Mechanical Turks by analogy to a fake chess playing machine from the 18th century that had a human hidden inside who secretly moved the pieces.[178]

Resource Focus

A fourth contrast between types of resources distinguishes original or primary resources from resources that describe them. Any primary resource can have one or more description resources associated with it to facilitate finding, interacting with, or interpreting the primary one. Description resources are essential in organizing systems where the primary resources are not under its control and can only be accessed or interacted with through the description. Description resources are often called metadata.

The distinction between primary resources and description resources, or metadata, is deeply embedded in library science and traditional organizing systems whose collections are predominantly text resources like books, articles, or other documents. In these contexts description resources are commonly called bibliographic resources or catalogs, and each primary resource is typically associated with one or more description resources.

In business enterprises, the organizing systems for digital information resources, such as business documents, or data records created by transactions or automated processes, almost always employ resources that describe, or are associated with, large sets or classes of primary resources.[179]

The contrast between primary resources and description resources is very useful in many contexts, but when we look more broadly at organizing systems, it is often difficult to distinguish them, and determining which resources are primary and which are metadata is often just a decision about which resource is currently the focus of our attention.[180]

For example, many Twitter users treat the 140-character message body as the primary resource, while the associated metadata about the message and sender (is it a forward, reply, link, etc.) is less important. However, for firms that use Twitter metadata to measure sender and brand impact, or identify social networks and trends, the focus is the metadata, not the content.[181]

As another example, players on professional sports teams are human resources, but millions of people participate in fantasy sports leagues where teams consist of resources based on the statistics generated by the actual human players. Put another way, the associated resources in the actual sports are treated as the primary ones in the fantasy leagues.[182]

Resource Format x Focus

Applying the format contrast between physical and digital resources to the focus distinction between primary and descriptive resources yields a useful framework with four categories of resources (Figure 4.4, “Resource Format x Focus.”).

Figure 4.4. Resource Format x Focus.

image

The distinctions of resource format and resource focus combine to distinguish four categories of resources: physical resources, digital resources, physical descriptions, and digital descriptions.
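The four categories are simply the cross product of the two distinctions, which a short Python sketch can make concrete. The enum and function names here are illustrative assumptions.

```python
from enum import Enum
from itertools import product


class Format(Enum):
    PHYSICAL = "physical"
    DIGITAL = "digital"


class Focus(Enum):
    PRIMARY = "primary"
    DESCRIPTION = "description"


def category(fmt, focus):
    """Name the category for one combination of format and focus."""
    noun = "resource" if focus is Focus.PRIMARY else "description"
    return f"{fmt.value} {noun}"


# Every pairing of a format with a focus gives one of the four categories.
categories = [category(fmt, focus) for fmt, focus in product(Format, Focus)]
```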

 

Physical Description of a Primary Physical Resource

The oldest relationship between descriptive resources and physical resources is when descriptions or other information about physical resources are themselves encoded in a physical form. Nearly ten thousand years ago in Mesopotamia small clay tokens kept in clay containers served as inventory information to count units of goods or livestock. It took 5000 years for the idea of stored tokens to evolve into Cuneiform writing in which marks in clay stood for the tokens and made both the tokens and containers unnecessary.[183]

A Cuneiform Document at the Pergamon

image

The Pergamon Museum in Berlin contains a very large collection of Babylonian, Persian, and Assyrian artifacts that are nearly three thousand years old, including numerous cuneiform clay tablets like this one.

(Photo by R. Glushko.)

Printed cards served as physical description resources for books in libraries for nearly two centuries.[184]

Digital Description of a Primary Physical Resource

Here, the digital resource describes a physical resource. The most familiar example of this relationship is the online library catalog used to find the shelf location of physical library resources, which beginning in the 1960s replaced the physical cards with database records. The online catalogs for museums usually contain a digital photograph of the painting, item of sculpture, or other museum object that each catalog entry describes.

Bar Code Shopping in A Virtual Supermarket

image

Woolworth’s Australia created a “virtual supermarket” with product photos and bar codes. Scanning places an order, which is delivered from the customer’s local supermarket.

(Photo by R. Glushko.)

Digital description resources for primary physical resources are essential in supply chain management, logistics, retailing, transportation, and every business model that depends on having timely and accurate information about where things are or about their current states. This digital description resource is created as a result of an interaction with a primary physical resource like a temperature sensor or with some secondary physical resource that is already associated with the primary physical resource like an RFID tag or barcode.

Augmented reality systems combine a layer of real-time digital information about some physical object with a digital view or representation of it. The yellow “first down” lines superimposed in broadcasts of football games are a familiar example. Augmented reality techniques that superimpose identifying or descriptive metadata are used in displays to support the operation or maintenance of complex equipment, in smart phone navigation and tourist guides, in advertising, and in other domains where users might otherwise need to consult a separate information source. Advanced airplane cockpit technology includes heads-up displays that present critical data based on available instrumentation, including augmented reality runway lights when visibility is poor because of clouds or fog.

Augmented reality displays have recently been incorporated into wearable technology like Google Glass, which mounts on eyeglass frames to display information obtained from the Internet after being requested by voice commands. Some luxury car brands have incorporated similar technology to project dashboard data, traffic conditions, and directions on the driver’s windshield.

Digital Description of a Primary Digital Resource

A digital resource describes a digital resource. This is the relationship in a digital library or any web-based organizing system, making it possible to access a primary digital resource directly from the digital secondary resource.

Physical Description of a Primary Digital Resource

This is the relationship implemented when we encounter an embedded QR barcode in newspaper or magazine advertisements, on billboards, sidewalks, t-shirts, or on store shelves. Scanning the QR code with a mobile phone camera can launch a website that contains information about a product or service, place an order for one unit of the pointed-to item in a web catalog, dial a phone number, or initiate another application or service identified by the QR code.[185]

Resource Identity

Determining the identity of resources that belong in a domain, deciding which properties are important or relevant to the people or systems operating in that domain, and then specifying the principles by which those properties encapsulate or define the relationships among the resources are the essential tasks when building any organizing system. In organizing systems used by individuals or with small scope, the methods for doing these tasks are often ad hoc and unsystematic, and the organizing systems are therefore idiosyncratic and do not scale well. At the other extreme, organizing systems designed for institutional or industry-wide use, especially in information-intensive domains, require systematic design methods to determine which resources will have separate identities and how they are related to each other. These resources and their relationships are then described in conceptual models which guide the implementation of the systems that manage the resources and support interactions with them.[186]

Identity and Physical Resources

Our human visual and cognitive systems do a remarkable job at picking out objects from their backgrounds and distinguishing them from each other. In fact, we have little difficulty recognizing an object or a person even if we are seeing them from a novel distance and viewing angle or with different lighting, shading, and so on. When we watch a football game, we do not have any trouble perceiving the players moving around the field, and their contrasting uniform colors allow us to see that there are two different teams.

The perceptual mechanisms that make us see things as permanent objects with contrasting visible properties are just the prerequisite for the organizing tasks of identifying the specific object, determining the categories of objects to which it belongs, and deciding which of those categories is appropriate to emphasize. Most of the time we carry out these tasks in an automatic, unconscious way; at other times we make conscious decisions about them. For some purposes we consider a sports team as a single resource, as a collection of separate players for others, as offense and defense, as starters and reserves, and so on.[187]

Although we have many choices about how we can organize football players, all of them will include the concept of a single player as the smallest identifiable resource. We are never going to think of a football player as an intentional collection of separately identified leg, arm, head, and body resources because there are no other ways to “assemble” a human from body parts. Put more generally, there are some natural constraints on the organization of matter into parts or collections based on sizes, shapes, materials, and other properties that make us identify some things as indivisible resources in some domain.

Identity and Bibliographic Resources

Pondering the question of identity is something relatively recent in the world of librarians and catalogers. Libraries have been around for about 4000 years, but until the last few hundred years librarians created “bins” of headings and topics to organize resources without bothering to give each individual item a separate identifier or name. This meant searchers first had to make an educated guess as to which bin might house their desired information (“Histories”? “Medical and Chemical Philosophy”?) and then scour everything in the category in a quest for their desired item. The choices were ad hoc and always local; that is, each cataloger decided the bins and groupings for each catalog.[188]

The first systematic approach to dealing with the concept of identity for bibliographic resources was developed by Antonio Panizzi at the British Museum in the mid-19th century. Panizzi wondered: How do we differentiate similar objects in a library catalog? His solution was a catalog organized by author name with an index of subjects, along with his newly concocted “Rules for the Compilation of the Catalogue.” This contained 91 rules about how to identify and arrange author names and titles and what to do with anonymous works. The rules were meant to codify how to differentiate and describe each singular resource in his library. Taken together, the rules serve to group all the different editions and versions of a work together under a single identity.[189]

The concept of identity for bibliographic resources was refined in the 1950s by Lubetzky, who enlarged the concept of the work to make it a more abstract idea of an author’s intellectual or artistic creation. According to Lubetzky’s principle, an audio book, a video recording of a play, and an electronic book should be listed each as distinct items, yet still linked to the original because of their overlapping intellectual origin.[190]

The distinctions put forth by Panizzi, Lubetzky, Svenonius and other library science theorists have evolved today into a four-step abstraction hierarchy (see Figure 4.5, “The FRBR Abstraction Hierarchy.”) between the abstract work, an expression in multiple formats or genres, a particular manifestation in one of those formats or genres, and a specific physical item. The broad scope from the abstract work to the specific item is essential because organizing systems in libraries must organize tangible artifacts while expressing the conceptual structure of the domains of knowledge represented in their collections.

This hierarchy is defined in the Functional Requirements for Bibliographic Records (FRBR), published as a standard by the International Federation of Library Associations and Institutions (IFLA).[191]

If we revisit the question “What is this thing we call Macbeth?” we can see how different ways of answering fit into this abstraction hierarchy. The most specific answer is that Macbeth is a specific item, a very particular and individual resource, like that dog-eared paperback with yellow marked pages that you owned when you read Macbeth in high school. A more abstract answer is that Macbeth is an idealization called a work, a category that includes all the plays, movies, ballets, or other intellectual creations that share a recognizable amount of the plot and meaning from the original Shakespeare play.

Figure 4.5. The FRBR Abstraction Hierarchy.

image

The abstraction hierarchy for identifying resources yields four different answers about the identity of an information resource.
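One way to see how the four levels nest is to model them as linked records. The Python sketch below is an illustration under stated assumptions: the class and field names are hypothetical simplifications and are not the attributes defined by the FRBR standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    title: str  # the abstract intellectual creation


@dataclass(frozen=True)
class Expression:
    work: Work
    form: str  # how the work is realized, e.g. as text or as a performance


@dataclass(frozen=True)
class Manifestation:
    expression: Expression
    publication: str  # one published embodiment of the expression


@dataclass(frozen=True)
class Item:
    manifestation: Manifestation
    copy_note: str  # what distinguishes this single physical copy


def identities(item):
    """Return the four answers to 'what is this?', most abstract first."""
    manifestation = item.manifestation
    expression = manifestation.expression
    return [
        f"work: {expression.work.title}",
        f"expression: {expression.form}",
        f"manifestation: {manifestation.publication}",
        f"item: {item.copy_note}",
    ]
```

For the high school paperback, a single Item record leads back through its Manifestation and Expression to the Work called Macbeth, so one object in hand yields all four answers.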

 

Identity and Information Components

In information-intensive domains, documents, databases, software applications, or other explicit repositories or sources of information are ubiquitous and essential to the creation of value for the user, reader, consumer, or customer. Value is created through the comparison, compilation, coordination or transformation of information in some chain or choreography of processes operating on information flowing from one information source or process to another. These processes are employed in accounting, financial services, procurement, logistics, supply chain management, insurance underwriting and claims processing, legal and professional services, customer support, computer programming, and energy management.

The processes that create value in information-intensive domains are “glued together” by shared information components that are exchanged in documents, records, messages, or resource descriptions of some kind. Information components are the primitive and abstract resources in information-intensive domains. They are the units of meaning that serve as building blocks of composite descriptions and other information artifacts.

The value creation processes in information-intensive domains work best when their component parts come from a common controlled vocabulary for components, or when each uses a vocabulary with a granularity and semantic precision compatible with the others. For example, the value created by a personal health record emerges when information from doctors, clinics, hospitals, and insurance companies can be combined because they all share the same “patient” component as a logical piece of information.

This abstract definition of information components does not help identify them, so we will introduce some heuristic criteria: An information component can be: (1) Any piece of information that has a unique label or identifier or (2) Any piece of information that is self-contained and comprehensible on its own.[192]

These two criteria for determining the identity of information components are often easy to satisfy through observations, interviews, and task analysis because people naturally use many different types of information and talk easily about specific components and the documents that contain them. Some common components (e.g., person, location, date, item) and familiar document types (e.g., report, catalog, calendar, receipt) can be identified in almost any domain. Other components need to be more precisely defined to meet the more specific semantic requirements of narrower domains. These smaller or more fine-grained components might be viewed as refined or qualified versions of the generic components and document types, like course grade and semester components in academic transcripts, airport codes and flight numbers in travel itineraries and tickets, and drug names and dosages in prescriptions.

Decades of practical and theoretical effort in conceptual modeling, relational theory, and database design have resulted in rigorous methods for identifying information components when requirements and business rules for information can be precisely specified. For example, in the domain of business transactions, required information like item numbers, quantities, prices, payment information, and so on must be encoded as a particular type of data (integer, decimal, Unicode string, etc.) with clearly defined possible values and clear occurrence rules.[193]
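A small Python sketch shows what such a specification might look like for one line of an order. The field names, the value constraints, and the occurrence rule (an order has at least one line) are assumptions chosen for illustration, not rules taken from any particular standard.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class OrderLine:
    item_number: str     # Unicode string, required
    quantity: int        # integer, at least 1
    unit_price: Decimal  # decimal, not negative

    def __post_init__(self):
        # Enforce the data type and the permitted values of each component.
        if not isinstance(self.item_number, str) or not self.item_number:
            raise ValueError("item_number must be a non-empty string")
        if not isinstance(self.quantity, int) or self.quantity < 1:
            raise ValueError("quantity must be an integer of at least 1")
        if not isinstance(self.unit_price, Decimal) or self.unit_price < 0:
            raise ValueError("unit_price must be a non-negative Decimal")


def order_total(lines):
    """Assumed occurrence rule: an order contains one or more lines."""
    if len(lines) < 1:
        raise ValueError("an order needs at least one line")
    return sum(line.quantity * line.unit_price for line in lines)
```

Because every component has a declared type and range, an application receiving the order can reject a malformed one before doing any further processing.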

Identifying components can seem superficially easy at the transactional end of the Document Type Spectrum (see the sidebar in the section called “Resource Domain”), with orders or invoices, forms requiring data entry, or other highly-structured document types like product catalogs, where pieces of information are typically labeled and delimited by boxes, lines, white space or other presentation features that encode the distinctions between types of content. For example, the presence of ITEM, CUSTOMER NAME, ADDRESS, and PAYMENT INFORMATION labels on the fields of an online order form suggests these pieces of information are semantically distinct components in a retail application. In addition, these labels might have analogues in variable names in the source code that implements the order form, or as tags in an XML document created by the ordering application; <CustName>John Smith</CustName> and <Item>A-19</Item> in the order document can be easily identified when it is sent to the other services by the order management application.
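As a minimal sketch of how a receiving service might pick out these components, the Python below parses a small order document with the standard library's ElementTree module. The enclosing Order element is an assumption added to make the fragment well-formed; only the CustName and Item elements come from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document; the <Order> wrapper is assumed for illustration.
order_xml = "<Order><CustName>John Smith</CustName><Item>A-19</Item></Order>"

root = ET.fromstring(order_xml)

# Each tag name labels one semantically distinct component of the order.
components = {child.tag: child.text for child in root}
```

The tags do the work that boxes and labels do on a printed form: they tell the receiver which piece of content is the customer name and which is the item.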

But the theoretically grounded methods for identifying components like those of relational theory and normalization that work for structured data do not strictly apply when information requirements are more qualitative and less precise at the narrative end of the Document Type Spectrum. These information requirements are typical of narrative, unstructured and semi-structured types of documents, and information sources like those often found in law, education, and professional services. Narrative documents include technical publications, reports, policies, procedures and other less structured information, where semantic components are rarely labeled explicitly and are often surrounded by text that is more generic. Unlike transactional documents that depend on precise semantics because they are used by computers, narrative documents are used by people, who can ask if they are not sure what something means, so there is less need to explicitly define the meaning of the information components. Occasional exceptions, such as where components in narrative documents are identified with explicit labels like NOTE and WARNING, only prove the rule.

Identity and Active Resources

Active resources (the section called “Use Controlled Vocabularies”) initiate effects or create value on their own. In many cases an inherently passive physical resource like a product package or shipping pallet is transformed into an active one when associated with an RFID tag or bar code. Mobile phones contain device or subscriber IDs so that any information they communicate can be associated both with the phone and often, through indirect reference, with a particular person. If the resource has an IP address, it is said to be part of the “Internet of Things.”[194]

Organizing systems that create value from active resources often co-exist with or complement organizing systems that treat their resources as passive. In a traditional library, books sat passively on shelves and required users to read their spines to identify them. Today, some library books contain active RFID tags that make them dynamic information sources that self-identify by publishing their own locations. Similarly, a supermarket or department store might organize its goods as physical resources on shelves, treating them as passive resources; superimposed on that traditional organizing system is one that uses point-of-sale transaction information created when items are scanned at checkout counters to automatically re-order goods and replenish the inventory at the store where they were sold. In some stores the shelves contain sensors that continually “talk to the goods” and the information they gather can maintain inventory levels and even help prevent theft of valuable merchandise by tracking goods through a store or warehouse. The inventory becomes a collection of active resources, each item eager to announce its own location and ready to conduct its own sale. Another category of inanimate objects that are active resources consists of those that use Twitter to communicate their status or sensor measurements. These include bridges, rivers, and the Curiosity Rover on Mars.

Big Data Makes “Smart” Soccer Players

The German World Cup soccer team, which won the 2014 World Cup, took advantage of sophisticated data collection and analysis to optimize player skill and strategy training. German software firm SAP analyzed video data from on-field cameras that captured thousands of data points per second about player position and movement to identify improvements in passing and ball handling for German players and detect weaknesses in opponents. German sports equipment firm Adidas designed cleats with sensors that track mileage, field position, and movements. See [(Norton 2014)] and [(Reynolds 2014)].

The extent to which an active resource is “smart” depends on how much computing capability it has available to refine the data it collects and communicates. A large collection of sensors can transmit a torrent of captured data that requires substantial processing to distinguish significant events from those that reflect normal operation, and also from those that are statistical outliers with strange values caused by random noise. This challenge gets qualitatively more difficult as the amount of data grows to big data size, because a one-in-a-million event might be a statistical outlier that can be ignored, but if there are a thousand similar outliers in a billion sensor readings, this cluster of data probably reveals something important. On the other hand, giving every sensor the computing capability to refine its data so that it only communicates significant information might make the sensors too expensive to deploy.[195]
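The arithmetic behind this scale effect can be sketched in Python. The function name is hypothetical, and the sketch assumes, purely for illustration, that the unusual reading occurs on average once per million readings.

```python
def expected_count(readings, one_in=1_000_000):
    """Expected number of events that occur once per `one_in` readings."""
    return readings / one_in


# A million readings are expected to contain about one such event.
small_deployment = expected_count(1_000_000)

# A billion readings are expected to contain about a thousand of them.
big_data = expected_count(1_000_000_000)
```

A single odd reading gives nothing to compare it with, but a thousand similar ones can be compared with each other, which is what makes the cluster informative.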

Naming Resources

Determining the identity of the thing, document, information component, or data item we need is not always enough. We often need to give that resource a name, a label that will help us understand and talk about what it is. But naming is not just the simple task of assigning a sequence of characters. In this section, we will discuss why we name, some of the problems with naming, and the principles that help us name things in useful ways.

What’s in a Name?

When a child is born, its parents give it a name, often a very stressful and contentious decision. Names serve to distinguish one person from another, although names might not be unique—there are thousands of people named James Smith and Maria Garcia. Names also, intentionally or unintentionally, suggest characteristics or aspirations. The name given to us at birth is just one of the names we will be identified with during our lifetimes. We have nicknames, names we use professionally, names we use with friends, and names we use online. Our banks, our schools, and our governments will know who we are because of numbers they associate with our names. As long as it serves its purpose to identify you, your name could be anything.[196]

Resources other than people need names so we can find them, describe them, reuse them, refer or link to them, record who owns them, and otherwise interact with them. In many domains the names assigned to resources are also influenced or constrained by rules, industry practice, or technology considerations.

The Problems of Naming

Giving names to anything, from a business to a concept to an action, can be a difficult process and it is possible to do it well or do it poorly. The following section details some of the major challenges in assigning a name to a resource.

The Vocabulary Problem

Every natural language offers more than one way to express any thought, and in particular there are usually many words that can be used to refer to the same thing or concept. The words people choose to name or describe things are embodied in their experiences and context, so people will often disagree in the words they use. Moreover, people are often a bit surprised when it happens, because what seems like the natural or obvious name to one person is not natural or obvious to another. One way to avoid surprises is to have people cooperate in choosing names for resources, and information architects often use participatory design techniques of card sorting or free listing for this purpose.[197][198]

Back in the 1980s in the early days of computer user interface design, George Furnas and his colleagues at Bell Labs conducted a set of experiments to measure how much people would agree when they named some resource or function. The short answer: very little. Left to our own devices, we come up with a shockingly large number of names for a single common thing.

Unreliable Names: Knockin’ On Heaven’s Door

image

In 2008, Richard Jones, an employee of the music recommendation service Last.fm, compiled a list of the 100 most common descriptions of the Guns N’ Roses recording of Bob Dylan’s song “Knockin’ on Heaven’s Door.” The 21st most common description of the song incorrectly attributes the recording to Aerosmith.

Reprinted in Figure 1 of [(Hemerly 2011)]. Used by permission here.

In one experiment, a thousand pairs of people were asked to “write the name you would give to a program that tells about interesting activities occurring in some major metropolitan area.” Fewer than 12 pairs of people agreed on a name. Furnas called this phenomenon the vocabulary problem, concluding that no single word could ever be considered the “best” name.[199]

Homonymy, Polysemy, and False Cognates

Sometimes the same word can refer to different resources: a “bank” can be a financial institution or the side of a river. When two words are spelled the same but have different meanings they are homographs; if they are also pronounced the same they are homonyms. If the different meanings of the homographs are related, they are polysemes. Resources with homonymous and polysemous names are sometimes incorrectly identified, especially by an automated process that cannot use common sense or context to determine the correct referent. Polysemy can cause more trouble than simple homography because the overlapping meaning might obscure the misinterpretation. If one person thinks of a “shipping container” as being a cardboard box and orders some of them, while another person thinks of a “shipping container” as the large box carried by semi-trailers and stacked on cargo ships, their disagreement might not be discovered until the wrong kinds of containers arrive.[200]

Many words in different languages have common roots, and as a result are often spelled the same or nearly the same. This is especially true for technology words; for example, “computer” has been borrowed by many languages. The existence of these cognates and borrowed words makes us vulnerable to false cognates. When a word in one language has a different meaning and refers to different resources in another, the results can be embarrassing or disastrous. Gift is poison in German; pain is bread in French.

Names with Undesirable Associations

False cognates are a special category of words that make poor names, and there are many stories of product marketing mistakes in which a product name or description translates poorly into other languages or cultures and acquires undesirable associations.[201] Furthermore, these undesirable associations differ across cultures. For example, even though floor numbers have the straightforward purpose of identifying floors from lowest to highest levels, most buildings in Western cultures skip the 13th floor because many people think 13 is an unlucky number. In many East and Southeast Asian buildings, the 4th floor is skipped. In China the number 4 is dreaded because it sounds like the word for “death,” while 8 is prized because it sounds like the word for “wealth.”

While it can be tempting to dismiss unfamiliar biases and beliefs about names and identifiers as harmless superstitions and practices, their implications are ubiquitous and far from benign. Alphabetical ordering might seem like a fair and non-discriminatory arrangement of resources, but because it is easy to choose the name at the top of an alphabetical list, many firms in service businesses select names that begin with “A,” “AA,” or even “AAA” (look in any printed service directory). A consequence of this bias is that people or resources with names that begin with letters late in the alphabet are systematically discriminated against because they are often not considered, or because they are evaluated in the context created by resources earlier in the alphabet rather than on their own merit.[202]

Names that Assume Impermanent Attributes

Many resources are given names based on attributes that can be problematic later if the attribute changes in value or interpretation.

From ‘Kentucky Fried Chicken’ to ‘KFC’

image

Kentucky Fried Chicken was founded in 1930 by Harland Sanders as a tiny restaurant in a gas station storeroom in Corbin, Kentucky. It was one of the first fast-food chains to go international, and in 1987 was the first Western restaurant chain to open in China. It changed its name to KFC a few years later, no doubt in part because in Beijing, Moscow, London and other locations not anywhere near Kentucky many people have probably never heard of the place.

(Photo by Kyle Taylor. CC-BY-2.0 license.)

Web resources are often referred to using URLs that contain the domain name of the server on which the resource is located, followed by the directory path and file name on the computer running the server. This treats the current location of the resource as its name, so the name will change if the resource is moved. It also means that resources that are identical in content, like those at an archive or mirror website, will have different names than the original even though they are exact copies. An analogous problem is faced by restaurants or businesses with street names or numbers in their names if they lose their leases or want to expand.[203]

Some dynamic web resources that are generated by programs have URIs that contain information about the server technology used to create them. When the technology changes, the URIs will no longer work.[204]

Some resources have names that include page numbers, which disappear or change when the resource is accessed in a digital form. For example, the standard citation format for legal opinions uses the page number from the printed volume issued by West Publishing, which has a virtual monopoly on the publishing of court opinions and other types of legal documents.[205]

Some resources have names that contain dates, years or other time indicators, most often to point to the future. The film studio named 20th Century Fox took on that name in the 1930s to give it a progressive, forward-looking identity, but today a name with “20th Century” in it does the opposite.[206]

The Semantic Gap

The semantic gap is the difference in perspective in naming and description when resources are described by automated processes rather than by people.[207]

The semantic gap is largest when computer programs or sensors obtain and name some information in a format optimized for efficient capture, storage, decoding, or other technical criteria. The names, like IMG20268.jpg on a digital photo, might make sense for the camera as it stores consecutively taken photos, but they are not good names for people. We may prefer names that describe the content of the picture, like GoldenGateBridge.jpg.

When we try to examine the content of computer-created or sensor-captured resources, like a clip of music or a compiled software program, a text rendering of the content simply looks like nonsense. It was designed to be interpreted by a computer program, not by a person.

Semantic Gap: Name This Tune

image

The format of this MP3 recording is designed to be read by a music player, not by people.

(Screenshot by R. Glushko.)

Choosing Good Names and Identifiers

If someone tells you they are having dinner with their best friend, a cousin, someone with whom they play basketball, and their professional mentor from work, how many places at the table will be set? Anywhere from two to five; it is possible all those relational descriptions refer to a single person, or to four different people, and because “friend,” “cousin,” “basketball teammate” and “mentor” do not name specific people you will have to guess who is coming to dinner.

If instead of descriptions you are told that the dinner guests are Bob, Carol, Ted, and Alice, you can count four names and you know how many people are having dinner. But you still cannot be sure exactly which four people are involved because there are many people with those names.

The uncertainty is eliminated if we use identifiers rather than names. Identifiers are names that refer unambiguously to a specific person, place, or resource because they are assigned in a controlled way. Identifiers are often strings of numbers or letters rather than words to avoid the biases and associations that words can convey. For example, a professor might grade exams that are identified by student numbers rather than names.

Names {and, or, vs} Identifiers

People change their names for many reasons: when they get married or divorced, because their name is often mispronounced or misspelled, to make a political or ethnic statement, or because they want to stand out. A few years ago a football player with a large ego named Chad Johnson (Johnson is the second most common surname in the US) decided to change his name to his player number of 85, becoming Chad “Ochocinco.” He had an ochocinco.com website and used the ochocinco name on Facebook and Twitter. In a bit of irony, when Ochocinco wanted to put Ocho Cinco on the back of his football jersey, the football league would not let him because his legal name does not have a space in it. That surely contributed to his decision to change his name back to Chad Johnson in 2012.

When you go to coffee shops, you are often asked your name, which the cashier writes on the empty cup so that your drink can be identified after the barista makes it. They do not actually need your name; just as some establishments use a receipt number to distinguish orders, what they need is an identifier. So even if your name is Joe, you can tell them it is Thor, Wotan, Mercurio, El Greco, Clark Kent, or any other name that is likely to be a unique identifier for the minute it takes to make your beverage.[208]

The distinction between names and identifiers for people is often not appreciated. (See the sidebar, Names {and, or, vs} Identifiers.)

Make Names Informative

The most basic principle of naming is to choose names that are informative, which makes them easier to understand and remember. It is easier to tell what a computer program or XML document is doing if it uses names like “ItemCost” and “TotalCost” rather than just “I” or “T.” People will enter more consistent and reusable address information if a form asks explicitly for “Street,” “City,” and “PostalCode” instead of “Line1” and “Line2.”

Identifiers can be designed with internal structure and semantics that convey information beyond the basic aspect of pointing to a specific resource. An International Standard Book Number (ISBN) like “978-0-262-07261-8” identifies a resource (07261=“Document Engineering”) and also reveals that the resource is a book (978), in English (0), and published by The MIT Press (262).[209]
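To make the idea of a structured identifier concrete, here is a minimal Python sketch that splits the hyphenated ISBN above into its parts and verifies its final check digit. The function names and part labels are our own illustrative choices, and the sketch assumes a 13-digit ISBN written with exactly five hyphen-separated parts; the check uses the ISBN-13 rule of alternating weights of 1 and 3.

```python
def split_isbn13(isbn):
    """Split a hyphenated ISBN-13 into five labeled parts (labels are illustrative)."""
    parts = isbn.split("-")
    if len(parts) != 5:
        raise ValueError("expected five hyphen-separated parts")
    labels = ["prefix", "group", "publisher", "title", "check"]
    return dict(zip(labels, parts))


def has_valid_check_digit(isbn):
    """Verify the last digit using weights that alternate 1, 3, 1, 3, ..."""
    digits = [int(ch) for ch in isbn if ch.isdigit()]
    if len(digits) != 13:
        return False
    weighted = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits[:12]))
    return (10 - weighted % 10) % 10 == digits[12]


parts = split_isbn13("978-0-262-07261-8")
print(parts["publisher"])                          # 262
print(has_valid_check_digit("978-0-262-07261-8"))  # True
```

The check digit carries no meaning about the book itself; it exists so that a mistyped identifier is likely to be caught before it is resolved to the wrong resource.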

The navigation points that mark intersections of radial signals from ground beacons or satellites, which are crucial to aircraft pilots, used to be meaningless five-letter codes. They were changed to suggest their locations, because semantic landmark names make pilots less likely to enter the wrong names into navigation systems. For example, some of the navigation points near Orlando, Florida, the home of Disney World, are MICKI, MINEE, and GOOFY.[210]

Use Controlled Vocabularies

One way to encourage good names for a given resource domain or task is to establish a controlled vocabulary. A controlled vocabulary is like a fixed or closed dictionary that includes the terms that can be used in a particular domain. A controlled vocabulary shrinks the number of words used, reducing synonymy and homonymy, eliminating undesirable associations, and leaving behind a set of words with precisely defined meanings and rules governing their use.

A controlled vocabulary is not simply a set of allowed words; it also includes their definitions and often specifies rules by which the vocabulary terms can be used and combined. Different domains can create specific controlled vocabularies for their own purposes, but the important thing is that the vocabulary be used consistently throughout that domain.[211]

For bibliographic resources important aspects of vocabulary control include determining the authoritative forms for author names, uniform titles of works, and the set of terms by which a particular subject will be known. In library science, the process of creating and maintaining these standard names and terms is known as authority control.

When evaluating what name to use for an author, librarians typically look for the name form that is used most commonly across that author’s body of work while conforming to rules for handling prefixes, suffixes and other name parts that often cause name variations. For example, a name like that of Johann Wolfgang von Goethe might be alphabetized as either a “G” name or a “V” name, but “G” is the authoritative form. “See” and “see also” references then map the variations to the authoritative name.[212]

Official authority files are maintained for many resource domains: a gazetteer associates names and locations and tells us whether we should be referring to Bombay or Mumbai; the Domain Name System (DNS) maps human-oriented domain and host names to their IP addresses; the Chemical Abstracts Service Registry assigns unique identifiers to every chemical described in the open scientific literature; numerous institutions assign unique identifiers to different categories of animal species.[213]

In some cases, authority files are created or maintained by a community, as in the case of MusicBrainz, an “open music encyclopedia” to which users contribute information about artists, releases, tracks, and other aspects of music. Music metadata is notoriously unreliable; one study found over 100 variations in the description of the Knockin’ on Heaven’s Door song (written by Bob Dylan) as recorded by Guns N’ Roses.[214]

Allow Aliasing

Aliasing: Bad for this Fish

image

A fish once known as the Patagonian Toothfish because of its large and unattractive teeth became popular in American restaurants when a fish wholesaler began marketing it as the Chilean Sea Bass even though it is usually found farther south in cold Antarctic waters and it is not a sea bass. Unfortunately for the fish, this alias was so successful that it led to overfishing, threatening the survival of the species. Some environmentally-oriented chefs, restaurateurs, and seafood distributors organized a boycott to save the fish. [(Fabricant 2002)]

(Photo published by the United States Government. Not protectable by copyright (17 USC Sec. 105).)

A controlled vocabulary is extremely useful to people who use it, but if you are designing an organizing system for other people who do not or cannot use it, you need to accommodate the variety of words they will actually use when they seek or describe resources. The authoritative name of a certain fish species is Amphiprion ocellaris, but most people would search for it as “clownfish,” “anemone fish,” or even by its more familiar film name of Nemo.

Furnas suggests “unlimited aliasing” to connect the uncontrolled or natural vocabularies that people use with the controlled one employed by the organizing system. By this he means that there must be many alternate access routes to each word or function that a user is trying to find. For example, the birth name of the 42nd President of the United States of America is William Jefferson Clinton, but web pages that refer to him as Bill Clinton are vastly more common, and searches for the former are redirected to the latter. A related mechanism used by search engines is spelling correction, essentially treating all the incorrect spellings as aliases of the correct one (“did you mean California?” when you typed “Claifornia”).
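One simple way to picture aliasing is as a lookup table in which every alias, and the authoritative name itself, points to the authoritative name. The following Python sketch uses the examples from this section; the table layout and the function names are illustrative assumptions rather than a description of how any particular catalog or search engine works.

```python
# Each authoritative name is listed with the aliases people actually use.
AUTHORITY = {
    "Amphiprion ocellaris": ["clownfish", "anemone fish", "Nemo"],
    "Bill Clinton": ["William Jefferson Clinton"],
}


def normalize(term):
    """Ignore case and extra whitespace so trivial variations still match."""
    return " ".join(term.lower().split())


def build_lookup(authority):
    """Map every normalized alias and preferred name to the preferred name."""
    lookup = {}
    for preferred, aliases in authority.items():
        for term in [preferred, *aliases]:
            lookup[normalize(term)] = preferred
    return lookup


LOOKUP = build_lookup(AUTHORITY)


def preferred_name(term):
    """Return the authoritative name for a term, or None if it is unknown."""
    return LOOKUP.get(normalize(term))


print(preferred_name("  NEMO "))                    # Amphiprion ocellaris
print(preferred_name("William Jefferson Clinton"))  # Bill Clinton
```

Adding a new access route is then just a matter of adding another alias to the table; the controlled vocabulary itself does not change.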

Make Identifiers Unique or Qualified

Even though an identifier refers to a single resource, this does not mean that no two identifiers are identical. One military inventory system might use stock number 99 000 1111 to identify a 24-hour, cold-climate ration pack, while another inventory system could use the same number to identify an electronic radio valve. Each identifier is unique in its inventory system, but if a supply request gets sent to the wrong warehouse hungry soldiers could be sent radio valves instead of rations.[215][216]

We can prevent or reduce identifier collisions by adding information about the namespace, the domain from which the names or identifiers are selected, thus creating what are often called qualified names. There are several dozen US cities named “Springfield” and “Washington,” but adding state codes to mail addresses distinguishes them. Likewise, we can add prefixes to XML element names when we create documents that reuse components from multiple document types, distinguishing <book:Title> from <legal:Title>.
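The stock number collision described above disappears once each identifier is paired with its namespace. Here is a minimal Python sketch of that idea; the namespace labels "inventory-a" and "inventory-b" are hypothetical stand-ins for the two inventory systems, which the example does not name.

```python
from typing import NamedTuple


class QualifiedId(NamedTuple):
    namespace: str  # hypothetical label for the system that assigned the identifier
    local_id: str   # the identifier exactly as assigned within that system


# The same local identifier appears in both systems without colliding,
# because the namespace is part of the lookup key.
catalog = {
    QualifiedId("inventory-a", "99 000 1111"): "24-hour, cold-climate ration pack",
    QualifiedId("inventory-b", "99 000 1111"): "electronic radio valve",
}

request = QualifiedId("inventory-a", "99 000 1111")
print(catalog[request])  # 24-hour, cold-climate ration pack
```

A supply request that carries the qualified identifier can no longer be satisfied from the wrong warehouse by accident, because the bare stock number alone is not a valid key.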

We can fix problems like these by qualifying or extending the identifier, or by creating a globally unique identifier (GUID), one that will never be the same as another identifier in any organizing system anywhere else. One easy method to create a GUID is to use a URL you control and append a string to it, the same approach that gives every web page a unique address. GUIDs are often used to identify software objects, the resources in distributed systems, or data collections.[217]
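Both approaches can be sketched in a few lines of Python. The base URL below is a placeholder (example.org is reserved for documentation), and the function names are illustrative; the `uuid` module is part of the Python standard library.

```python
import uuid

BASE_URL = "https://example.org/resources/"  # placeholder for a URL you control


def url_identifier(local_name):
    """Append a local string to a URL you control to form a unique identifier."""
    return BASE_URL + local_name


def name_based_guid(local_name):
    """Derive a UUID from the URL; the same URL always yields the same UUID."""
    return uuid.uuid5(uuid.NAMESPACE_URL, url_identifier(local_name))


# A random UUID needs no naming authority at all; a collision is possible in
# principle but so improbable that it is ignored in practice.
random_guid = uuid.uuid4()

print(url_identifier("ration-pack"))
print(name_based_guid("ration-pack"))
print(random_guid)
```

The URL-based identifier is unique only as long as its owner never assigns the same string twice, which is a matter of governance rather than technology.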

Because they are not created by an algorithm whose results are provably unique, we do not consider fingerprints, or other biometric information, to be globally unique identifiers for people, but for all practical purposes they are.[218]

Distinguish Identifying and Resolving

Library call numbers are identifiers that do not contain any information about where the resource can be found in the library stacks or in a digital repository. This separation enables this identification system to work when there are multiple copies in different locations, in contrast to URIs that serve as both identifiers and locations much of the time. When the identifier does not contain information about resource location, it must be “resolved” to determine the location. With physical resources, resolution takes place with the aid of signs, maps, or other associated resources that describe the resource arrangement in some physical environment; for example, “You are here” maps associate each resource identifier with a coordinate or other means of finding it on the map. With digital resources, the resolver is a directory system or service that interprets an identifier and looks up its location or directly initiates resource retrieval.
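The separation between identifying and resolving can be illustrated with a small Python sketch of a resolver. The class, the identifier "CALL-0001," and the location strings are all hypothetical; the point is only that the identifier stays fixed while the locations registered for it change.

```python
class Resolver:
    """Maps location-free identifiers to the places where copies can be found."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, location):
        """Record one more location at which this resource is held."""
        self._locations.setdefault(identifier, set()).add(location)

    def relocate(self, identifier, old_location, new_location):
        """Move one copy; the identifier itself is untouched."""
        copies = self._locations[identifier]
        copies.remove(old_location)
        copies.add(new_location)

    def resolve(self, identifier):
        """Return every known location, or an empty list if none is known."""
        return sorted(self._locations.get(identifier, set()))


resolver = Resolver()
resolver.register("CALL-0001", "Main Library, floor 3")
resolver.register("CALL-0001", "Digital repository")
resolver.relocate("CALL-0001", "Main Library, floor 3", "Off-site storage")
print(resolver.resolve("CALL-0001"))  # ['Digital repository', 'Off-site storage']
```

Anything that cites the resource by its identifier keeps working after the move, because only the resolver's table had to change.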

Resources over Time

Problems of “what is the resource?” and “how do we identify it?” are complex and often require ongoing work to ensure they are properly answered as an organizing system evolves. We might need to know how a resource does or does not change over time (its persistence), whether its state and content come into play at a specified point in time (its effectivity), whether the resource is what it is said to be (its authenticity), and sometimes who has certified its authenticity over time (its provenance). A resource might have persistence, but only the provenance provided by a documented chain of custody enables questions about authenticity to be answered with authority. Effectivity describes the limits of a resource’s lifespan on the time line.

Figure 4.6. Resources over Time.

image

Four considerations that arise with respect to the maintenance of resources over time are their persistence, provenance, authenticity, and effectivity.

 

Figure 4.6, “Resources over Time.” portrays the relationships among the concepts of Persistence, Provenance, Effectivity, and Authenticity.

Persistence

The Great Sphinx at Giza

image

The Great Sphinx has persisted for over four thousand years. It has survived acts of vandalism, target practice by Napoleon’s artillery, shoulder-deep burial in desert sandstorms, and eventual excavation in the early 20th century.

(Photo by R. Glushko.)

Even if you have reached an agreement as to the meaning of a thing in your organizing system, you still face the question of the identity of the resource over time, or its persistence.

Persistent Identifiers

How long must an identifier last? Coyle gives the conventional, if unsatisfying, answer: “As long as it is needed.”[219] In some cases, the time frame is relatively short. When you order a specialty coffee and the barista asks for your name, this identifier only needs to last until you pick up your order at the end of the counter. But other time frames are much longer. For libraries and repositories of scientific, economic, census, or other data the time frame might be “forever.”

The design of a scheme for persistent identifiers must consider both the required time frame and the number of resources to be identified. When the Internet Protocol (IP) was designed in 1980, it contained a 32-bit address scheme, sufficient for over 4 billion unique addresses. But the enormous growth of the Internet and the application of IP addresses to resources of unexpected types have required a new addressing scheme with 128 bits.[220]
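The arithmetic behind these address spaces is easy to check. The short Python sketch below computes the number of distinct identifiers that 32 and 128 bits allow, and cross-checks the results against the standard library's `ipaddress` module; the variable names are our own.

```python
import ipaddress

ipv4_space = 2 ** 32    # every possible 32-bit address
ipv6_space = 2 ** 128   # every possible 128-bit address

print(f"{ipv4_space:,}")    # 4,294,967,296
print(f"{ipv6_space:.3e}")  # 3.403e+38

# Cross-check: the networks that contain every address of each kind.
assert ipaddress.ip_network("0.0.0.0/0").num_addresses == ipv4_space
assert ipaddress.ip_network("::/0").num_addresses == ipv6_space
```

Each added bit doubles the number of available identifiers, so the move from 32 to 128 bits multiplies the address space by 2 to the 96th power.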

Recognition that URIs are often not persistent as identifiers for web-based resources led the Association of American Publishers (AAP) to develop the Digital Object Identifier (DOI) system. The location and owner of a digital resource can change, but its DOI is permanent.[221]

Persistent Resources

Even though persistence often has a technology dimension, it is more important to view it as a commitment by an institution or organization to perform activities over time to ensure that a resource is available when it is needed. Put another way, preservation (the section called “Preservation”) and governance (the section called “Governance”) are activities carried out to ensure the outcome of persistence.

The subtle relationship between preservation and persistence raises some interesting questions about what it means for a resource to stay the same over time. One way to think of persistence is that a persistent resource is never changed. However, physical resources often require maintenance, repair, or restoration to keep them accessible and usable, and we might question whether at some point these activities have transformed them into different resources.[222] Likewise, digital resources require regular backup and migration to keep them available and this might include changing their digital format.

Many resources like online newspapers or blog feeds continually change their content but still have persistent identifiers. This suggests we should think of persistence more abstractly, and consider as persistent resources any that remain functionally the same to support the same interactions at any point in their lifetimes, even if their physical properties or information values change.

Active resources implemented as computational agents or web services might be re-implemented numerous times, but as long as they do not change their interfaces they can be deemed to be persistent from the perspective of other resources that use them. Similarly, the dataset that defines a user or customer model in a recommendation system should be treated as a persistent resource; it includes information like name and date of birth that is persistent in the traditional sense; but it might also include “last purchase” and “current location,” which must change frequently to maintain the accuracy and usefulness of the customer model.

Some organizing systems closely monitor their resources and every interaction with them to prevent or detect tampering or other unauthorized changes. Some organizing systems, like those for software or legal documents, explicitly maintain every changed version to satisfy expectations of persistence, because different users might not be relying on the same version. With digital resources, determining whether two resources are the same or determining how they are related or derived from one another are very challenging problems.[223]

Effectivity

In Which Country Do You Live?

Even if you always live in the same place, the answer to “what country do you live in?” can depend on when it is asked. Consider the case of an elderly woman born in 1929 in Zemun, a district in the eastern European city of Belgrade, who has never moved. The place she lives has been part of seven different countries during her lifetime: Kingdom of Yugoslavia (1929-1941); Independent State of Croatia (1941-1945); Federal People’s Republic of Yugoslavia (1945-1963); Socialist Federal Republic of Yugoslavia (1963-1992); Federal Republic of Yugoslavia (1992-2003); State Union of Serbia and Montenegro (2003-2006); Republic of Serbia (2006-present).

Many resources, or their properties, also have locative or temporal effectivity, meaning that they come into effect at a particular time or place, will almost certainly cease to be effective at some future date, and may cease to be effective in different places.

Temporal effectivity, sometimes known as “time-to-live,” is generally expressed as a range of two dates. It consists of a date on which the resource is effective, and optionally a date on which the resource ceases to be effective, or becomes stale. For some types of resources, the effective date is the moment they are created, but for others, the effective date can be a time different from the moment of creation. For example, a law passed in November may take effect on January 1 of the following year, and credit cards first need to be activated and then can no longer be used after their expiration date. An “effective date” is the counterpart of the “Best Before” date on perishable goods. That date indicates when a product goes bad, whereas an item’s effective date is when it “goes good” and the resource that it supersedes needs to be disposed of or archived.
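A date range with an optional end is straightforward to represent in code. The following Python sketch is one possible representation, not a standard one: the class and field names are illustrative, the dates in the example are hypothetical, and treating the end date as inclusive is an assumption that a real system would need to decide explicitly.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Effectivity:
    effective_from: date
    effective_until: Optional[date] = None  # None means no known end date

    def is_effective(self, on):
        """True if `on` falls within the range (end date treated as inclusive)."""
        if on < self.effective_from:
            return False
        return self.effective_until is None or on <= self.effective_until


# A law passed in November that takes effect on January 1 of the following year.
law = Effectivity(effective_from=date(2025, 1, 1))
print(law.is_effective(date(2024, 11, 15)))  # False
print(law.is_effective(date(2025, 1, 1)))    # True

# A credit card that is usable from activation until its expiration date.
card = Effectivity(date(2024, 3, 1), date(2027, 2, 28))
print(card.is_effective(date(2027, 3, 1)))   # False
```

Note that the law's creation date appears nowhere in the record; effectivity is a separate property from the moment a resource comes into existence.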

Locative effectivity considers borders, security, roadways, altitude, depth and other geographic factors. Some types of resources, such as hazardous cargo, explosives, narcotics, pharmaceuticals, alcohol, and cannabinoids, and even people, are restricted as to where they may be transported or used. Jurisdictional issues concern borders, transportation corridors, weather stations, and geographic surveys. Parachutes are altitude-sensitive and scuba diving cylinders are depth-sensitive.

Effectivity concerns sometimes intersect with authority control for names and places. Name changes for resources often are tied to particular dates, events, and locations. Laws and regulations differ across organizational and geopolitical boundaries, and those boundaries often change. Some places that have been the site of civil unrest, foreign occupation, and other political disruptions have had many different names over time, and even at the same time.[224]

Today these disputed borders cause a problem for Google Maps when it displays certain international borders. Because Google is subject to the laws of the country where its servers are located, it must present disputed borders to conform with the point of view of the host country when a country-specific Google site is used to access the map.[225]

In most cases effectivity implies persistence requirements because it is important to be able to determine and reconstruct the configuration of resources that was in effect at some prior time. A new tax might go into effect on January 1, but if the government audits your tax returns what matters is whether you followed the law that was in effect when you filed your returns.[226]
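Reconstructing what was in effect at a prior time amounts to an “as of” lookup over every version that has been kept. Here is a minimal Python sketch using a binary search from the standard library's `bisect` module; the rule labels and dates are hypothetical, and the sketch assumes the versions are stored sorted by effective date.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical versions of a tax rule, sorted by the date each took effect.
VERSIONS = [
    (date(2020, 1, 1), "rule-v1"),
    (date(2023, 1, 1), "rule-v2"),
]


def version_in_effect(versions, on):
    """Return the latest version whose effective date is not after `on`."""
    start_dates = [start for start, _ in versions]
    index = bisect_right(start_dates, on) - 1
    if index < 0:
        return None  # nothing was in effect yet
    return versions[index][1]


print(version_in_effect(VERSIONS, date(2022, 12, 31)))  # rule-v1
print(version_in_effect(VERSIONS, date(2023, 1, 1)))    # rule-v2
```

The lookup only works if superseded versions are retained rather than overwritten, which is exactly the persistence requirement that effectivity imposes.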

Authenticity

Do You Trust This?

image

Ustar.net sells photos autographed by celebrities, and each comes with a Certificate of Authenticity that includes a replica of the photo and a signature from a Ustar employee to guarantee that the autograph is an actual hand-signed one. But Ustar does not provide a certificate to guarantee that the employee signature is an authorized one.

(Screenshot by R. Glushko. Source: ustar.net.)

In ordinary use we say that something is authentic if it can be shown to be, or has come to be accepted as, what it claims to be. The importance and nuance of questions about authenticity can be seen in the many words we have to describe the relationship between “the real thing” (the “original”) and something else: copy, reproduction, replica, fake, phony, forgery, counterfeit, pretender, imposter, ringer, and so on.

It is easy to think of examples where authenticity of a resource matters: a signed legal contract, a work of art, a historical artifact, even a person’s signature.

The creator or operator of an organizing system, whether human or machine, can authenticate a newly created resource. A third party can also serve as proof of authenticity. Many professional careers are based on figuring out if a resource is authentic.[227]

There is a large body of techniques for establishing the identity of a person or physical resource. We often use judgments about the physical integrity of recorded information when we consider the integrity of its contents.

Digital authenticity is more difficult to establish. Digital resources can be reproduced at almost no cost, exist in multiple locations, carry different names on identical documents or identical names on different documents, and bring about other complications that do not arise with physical items. Technological solutions for ensuring digital authenticity include time stamps, watermarking, encryption, and digital signatures. However, while scholars generally trust technological methods, technologists are more skeptical of them because they can imagine ways for them to be circumvented or counterfeited. Even when a technologically sophisticated system for establishing authenticity is in place, we can still only assume the constancy of identity as far back as this system reaches in the chain of custody of the document.
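One building block that underlies several of these techniques is a cryptographic digest of a resource's content. The Python sketch below uses SHA-256 from the standard library's `hashlib` module to show that identical content can be recognized under different names, and different content detected under identical names. The file names and contents are invented for illustration, and a digest by itself is only a check that content is unchanged; it is not a digital signature and says nothing about who created the resource.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return the SHA-256 digest of the content as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def same_content(a: bytes, b: bytes) -> bool:
    """Compare two resources by content, ignoring whatever they are named."""
    return content_digest(a) == content_digest(b)


# Hypothetical documents: names and contents are made up for this example.
documents = {
    "contract.txt": b"The parties agree to the following terms.",
    "contract-copy.txt": b"The parties agree to the following terms.",
    "contract-edited.txt": b"The parties agree to the following term.",
}

print(same_content(documents["contract.txt"], documents["contract-copy.txt"]))    # True
print(same_content(documents["contract.txt"], documents["contract-edited.txt"]))  # False
```

Recording the digest at the time a resource enters the organizing system gives a later custodian something to compare against, but only as far back as that record reaches.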

Provenance

In the section called “Looking “Upstream” and “Downstream” to Select Resources” we recommended that you analyze any evidence or records about the use of resources as they made their way to you from their headwaters to ensure they have maintained their quality over time. The concept of provenance transforms the passive question of “what has happened to this resource?” into actions that can be taken to ensure that nothing bad can happen to a resource, or to enable it to be detected if it does.

Chinese Manuscript With Provenance Seals

image

This beautiful manuscript, preserved in the National Palace Museum in Taipei, was created by Zhao Ji, Emperor Huizong, the 8th Emperor of the Chinese Song Dynasty, about a thousand years ago. He was famous for his skills in poetry, painting, and calligraphy. There are two poems here; the one on the right describes the techniques for Chinese landscape paintings, while the left one expresses the Emperor’s appreciation of plum blossoms, which signal the onset of spring.

The red seals are those of several Ching Dynasty emperors over many generations, with the oldest being at least five hundred years after Huizong created the poems. Stamping your personalized red seal on a resource is analogous to but vastly more elegant and informative than “Liking” a web page today.

(Photo by R. Glushko.)

The idea that important documents must be created in a manner that can be authenticated and then preserved, with an unbroken chain of custody, goes back to ancient Rome. Notaries witnessed the creation of important documents, which were then protected to maintain their integrity or value as evidence. In organizing systems like museums and archives that preserve rare or culturally important objects or documents, this concern is expressed as the principle of provenance. Provenance is the history of the ownership of a collection or of the resources in it: where they have been and who has had access to them.

A uniquely Chinese technique in organizing systems is the imprinting of elaborate red seals on documents, books, and paintings that collectively record the provenance of ownership and the review and approval of the artifact by emperors or important officials.

However, it is not only art historians and custodians of critical documents who need to be concerned with provenance. If you are planning to buy a used car, it is wise to check the vehicle history (using the Vehicle Identification Number, the car’s persistent identifier) to make sure it hasn’t been wrecked, flooded, or stolen.

Key Points in Chapter Four

4.6.1. What is the relationship between a resource and a category?

We can consider a resource to be one of many members of a very broad category, as the unique instance of a category with only one member, or anywhere in between.

(See the section called “What Is a Resource?”)

4.6.2. What factors affect the size of a category?

The size of the category, that is, the number of resources that are treated as equivalent, is determined by the properties or characteristics we consider when we examine the resource.

(See the section called “What Is a Resource?”)

4.6.3. What is metadata?

Organizing systems for physical information resources emphasize description resources or surrogates like bibliographic records that describe the information content rather than their physical properties.

(See the section called “Bibliographic Resources, Information Components, and “Smart Things” as Resources”)

4.6.4. What is an identifier and what design goals must it satisfy?

An identifier is a special kind of name assigned in a controlled manner and governed by rules that define possible values and naming conventions. The design of a scheme for persistent identifiers must consider both the required time frame and the number of resources to be identified.

(See the section called “Identity, Identifiers, and Names”)

4.6.5. What are active resources?

Active resources create effects or value on their own, sometimes when they initiate interactions with passive resources. Active resources can be people, other living resources, computational agents, active information sources, web-based services, self-driving cars, robots, appliances, machines or otherwise ordinary objects like light bulbs, umbrellas, and shoes that have been made “smarter.”

(See the section called “Active or Operant Resources”)

4.6.6. What is the recall/precision tradeoff?

More fine-grained organization reduces recall, the number of resources you find or retrieve in response to a query, but increases the precision of the recalled set, the proportion of recalled items that are relevant.

(See the section called “Identity and Information Components”)
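The tradeoff can be made concrete with a small calculation. The sketch below is a minimal illustration, not a retrieval system: the document identifiers and query results are invented, and it uses the conventional information retrieval definition of recall (the proportion of relevant resources that were retrieved) rather than a raw count of what was found.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and relevant items."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: four resources are actually relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}

# Coarse-grained organization: a broad category is returned as a whole.
coarse_result = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

# Fine-grained organization: a narrow category is returned.
fine_result = {"d1", "d2"}

if __name__ == "__main__":
    print(precision_recall(coarse_result, relevant))
    print(precision_recall(fine_result, relevant))
```

In this invented case the broad category finds every relevant resource but half of what it returns is irrelevant, while the narrow category returns only relevant resources but misses half of them.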

4.6.7. What is agency?

Agency is the extent to which a resource can initiate actions on its own. We can define a continuum between completely passive resources that cannot initiate any actions and active resources that can initiate actions based on information they sense from their environments or obtain through interactions with other resources.

(See the section called “Resource Agency”)

4.6.8. What are active resources?

Resources become active resources when they contain sensing and communication capabilities.

(See the section called “Resource Agency”)

4.6.9. Is there a fundamental difference between a primary resource and metadata associated with it?

Which resources are primary and which are metadata is often just a decision about which resource is the focus of our attention.

(See the section called “Resource Focus”)

4.6.10. What is the Document Type Spectrum?

It can be useful to view domains of information resources on the Document Type Spectrum from weakly-structured narrative content to highly structured transactional content.

(See the sidebar, The Document Type Spectrum)

4.6.11. What is the FRBR four-level abstraction hierarchy?

The concept of identity for bibliographic resources has evolved into a four-level abstraction hierarchy that runs from the abstract work, through an expression in multiple formats or genres and a particular manifestation in one of those formats or genres, to a specific physical item.

(See the section called “Identity and Bibliographic Resources” and Figure 4.5, “The FRBR Abstraction Hierarchy.”)
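One way to picture the hierarchy is as nested containers, where each level holds instances of the level below it. The following is a minimal sketch under that assumption; the class and field names are hypothetical, the labels are invented, and FRBR itself defines many more attributes and relationships than are shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    label: str  # one specific physical copy


@dataclass
class Manifestation:
    label: str  # e.g., a particular print edition
    items: list = field(default_factory=list)


@dataclass
class Expression:
    label: str  # a realization of the work in some medium or notation
    manifestations: list = field(default_factory=list)


@dataclass
class Work:
    title: str  # the abstract intellectual creation
    expressions: list = field(default_factory=list)


def count_items(work: Work) -> int:
    """Count the physical items reachable from a work through all levels."""
    return sum(
        len(m.items) for e in work.expressions for m in e.manifestations
    )


# Hypothetical holdings for illustration only.
macbeth = Work("Macbeth", [
    Expression("text", [
        Manifestation("Folger Library print edition",
                      [Item("copy 1"), Item("copy 2")]),
    ]),
    Expression("film adaptation", [
        Manifestation("disc release", [Item("copy 1")]),
    ]),
])

if __name__ == "__main__":
    print(count_items(macbeth))
```

Deciding which level counts as "the resource" then becomes a choice about where in this structure a question is asked.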

4.6.12. What is the Internet of Things?

If the resource has an IP address, it is part of the “Internet of Things.”

(See the section called “Identity and Active Resources”.)

4.6.13. What is the vocabulary problem?

Every natural language offers more than one way to express any thought, and in particular there are usually many words that can be used to refer to the same thing or concept.

(See the section called “The Problems of Naming”)

4.6.14. What is a potential problem with basing names on a resource attribute?

Many resources are given names based on attributes that can be problematic later if the attribute changes in value or interpretation.

(See the section called “Names that Assume Impermanent Attributes”)

4.6.15. What is the Semantic Gap?

The semantic gap is the difference in perspective in naming and description when resources are described by automated processes rather than by people.

(See the section called “The Semantic Gap”)

4.6.16. What is the most basic principle of naming?

The most basic principle of naming is to choose names that are informative.

(See the section called “Make Names Informative”)

4.6.17. What is a controlled vocabulary?

One way to encourage good names for a given resource domain or task is to establish a controlled vocabulary. A controlled vocabulary is like a fixed or closed dictionary that includes the terms that can be used in a particular domain. A controlled vocabulary shrinks the number of words used, reducing synonymy and homonymy, eliminating undesirable associations, leaving behind a set of words with precisely defined meanings and rules governing their use.

(See the section called “Use Controlled Vocabularies”)

4.6.18. What is authority control?

For bibliographic resources important aspects of vocabulary control include determining the authoritative forms for author names, uniform titles of works, and the set of terms by which a particular subject will be known. In library science, the process of creating and maintaining these standard names and terms is known as authority control.

(See the section called “Use Controlled Vocabularies”)

4.6.19. Which organizing system activities promote persistence?

Preservation and governance are activities carried out to ensure that resources will last as long as they are needed.

(See the section called “Persistence”)

4.6.20. What is effectivity?

Many resources, or their properties, also have locative or temporal effectivity, meaning that they come into effect at a particular time and/or place, will almost certainly cease to be effective at some future date, and may cease to be effective in different places.

(See the section called “Effectivity”)

4.6.21. What guarantees the authenticity of a resource?

The only guarantee of a resource’s authenticity is having total oversight over the “chain of custody” from its creation to the present.

(See the section called “Authenticity” and the section called “Provenance”)

[161] [(Shakespeare 1623)], [(Kurosawa 1957)], and [(Wilson 1968, p. 9)].

[162] Separating information content from its structure and presentation is essential to re-purposing it for different scenarios, applications, devices, or users. The global information economy is increasingly driven by automated information exchange between business processes. When information flows efficiently from one type of document to another in this chain of related documents, the overlapping content components act as the “glue” that connects the information systems or web services that produce and consume the documents. [(Glushko and McGrath 2005)].

[163] [(Furnas, Landauer, Gomez, and Dumais 1987)].

[164] [(Linnaeus 1735)]. Linnaeus is sometimes called the father of modern taxonomy (which is unfair to Aristotle) but he certainly deserves enormous credit for the systematic approach to biological classification that he proposed in Systema Naturae, published in 1735. This seminal work contains the familiar kingdom, class, order, family, genus, species hierarchy.

[165] [(Glushko and McGrath 2005)].

[166] [(Kuniavsky 2010)].

[167] Project Gutenberg, begun in 1971, was the first large-scale effort to digitize books; its thousands of volunteers have created about 40,000 digital versions of classic printed works. Systematic research in digital libraries began in the 1990s when the US National Science Foundation (NSF), the Advanced Research Projects Agency (ARPA), and NASA launched a Digital Library Initiative that emphasized the enabling technologies and infrastructure. At about the same time numerous pragmatic efforts to digitize library collections began, characterized by some as a race against time as old books in libraries were literally disintegrating and turning into dust. The Internet Archive, started in 1996, now has a collection of over 3 million texts and has estimated the cost of digitizing to be about $30 for the average book. Multiply this by the scores of millions of books held in the world’s research libraries and it is easy to see why many libraries endorsed Google’s offer to digitize their collections.

[168] The ASCII scheme was standardized in the 1960s when computer memory was expensive and most computing was in English-speaking countries, so it is minimal and distinguishes only 128 characters. [(Cerf 1969)] American Standard Code for Information Interchange (ASCII) is an ANSI specification. (See http://en.wikipedia.org/wiki/ASCII.)

[169] Unicode 6.0 (http://www.unicode.org/) encodes 109,449 characters for all the writing systems in the world, so a single standard can represent the characters of every existing language, even “dead” ones like Sumerian and Hittite. Unicode encodes the scripts used in languages, rather than languages per se, so there only needs to be one representation of the Latin, Cyrillic, Arabic, and other scripts that are used for writing multiple languages. Unicode also distinguishes characters from glyphs, the different forms for the same character, enabling different fonts to be identified as the same character.

[170] Encoding of structure in documents is valuable because titles, sections, links and other structural elements can be leveraged to enhance the user interface and navigational interactions with the digital document and enable more precise information retrieval. Some uses of documents require formats that preserve their printed appearance. “Presentational fidelity” is essential if we imagine a banker or customs inspector carefully comparing a printed document with a computer-generated one to ensure they are identical.

[171] Text encoding specs are well-documented; see (http://www.wotsit.org/list.asp?fc=10).

[172] [(Chapman and Chapman 2009)].

[173] The ambitious use of virtual world technology to create novel forms of interaction described by [(Rothfarb and Doherty 2007)] reflects the highly interactive character of its host museum, the Exploratorium in San Francisco (http://www.exploratorium.edu/). Similarly, the Google Art Project (http://googleartproject.com) is notable for its goal of complementing and extending, rather than merely imitating, the museum visitor’s encounter with artwork [(Proctor 2011)]. A feature that lets people create a “personal art collection” is very popular, enabling a fan of Vincent Van Gogh to bring together paintings that hang in different museums.

[174] However, scratching can be simulated using a smart phone or tablet app called djay. See http://www.algoriddim.com/djay.

[175] As a result, digital books are somewhat controversial and problematic for libraries, whose access models were created based on the economics of print publication and the social contract of the copyright first sale doctrine that allowed libraries to lend printed books. Digital books change the economics and first sale is not as well-established for digital works, which are licensed rather than sold [(Aufderheide and Jaszi 2011)]. To protect their business models, many publishers are limiting the number of times ebooks can be lent before they “self-destruct.” Some librarians have called for boycotts of publishers in response (http://boycottharpercollins.com).

[176] The opposing categories of operands and operants have their roots in debates in political economics about the nature of work and the creation of value [(Vargo and Lusch 2004)] and have more recently played a central role in the development of modern thinking about service design [(Constantin and Lusch 1994)], [(Maglio et al. 2009)].

[177] See [(Allmendinger and Lombreglia 2005)], [(Want 2006)], and [(Crawford and Johnson 2012)].

[178] Luis von Ahn [(von Ahn 2004)] was the first to use the web to get people to perform “microwork” or “human computation” tasks when he released what he called the “ESP game” that randomly paired people trying to agree on labeling an image. Not long afterward Amazon created the MTurk platform (http://www.mturk.com) that lets people propose microwork and others sign up to do it, and today there are both hundreds of thousands of tasks offered and hundreds of thousands of people offering to be paid to do them.

[179] For semi-structured or more narrative documents these descriptions might be authoring templates used in word processors or other office applications, document schemas in XML applications, style sheets, or other kinds of transformations that change one resource representation into another one. Primary resources that are highly and regularly structured are invariably organized in databases or enterprise information management systems in which a data schema specifies the arrangement and type of data contained in each field or component of the resource.

[180] Describing information as “metadata” suggests that it is of secondary importance, not as essential or informative as the resource being described. This is surely the reason why the US National Security Agency and those of other governments, whose unauthorized surveillance of global communications was revealed in 2013 by Edward Snowden, often stressed that they were only collecting message metadata, not its content. Of course, information about who you communicate with and when you do so defines your social network, information that is potentially very valuable, and the NSA knows this just as Facebook and Twitter do.

[181] There are a large number of third-party Twitter apps. See http://twitter.pbworks.com/w/page/1779726/Apps. For a scholarly analysis see [(Efron 2011)].

[182] The basic idea behind fantasy sports is quite simple. You select a team of existing players in any sport, and then compare their statistical performance against other teams similarly selected by other people. Fantasy sports appeal mostly to die-hard fans who study player statistics carefully before “drafting” their players. The global fantasy sports business for companies who organize and operate fantasy leagues is estimated as between 1 and 2 billion US dollars annually [(Montague 2010)].

[183] [(Schmandt-Besserat 1997)] and [(McGrath 2015)].

[184] The oldest known lists of books were created about 4000 years ago in Sumeria. The first use of cards in library catalogs was literal; when the revolutionary government of France seized private book collections, an inventory was created starting in 1791 using the blank backs of playing cards. 110 years later the US Library of Congress began selling pre-printed catalog cards to libraries, but in the mid-1960s the creation of the Machine-Readable Cataloging (MARC) format marked the beginning of the end of printed cards. See [(Strout 1956)]. The MARC standards are at http://www.loc.gov/marc/.

[185] We treat resource format and resource focus as distinct dimensions, so there are four categories here. This contrasts with David Weinberger’s three “orders of order” that he proposes in the first chapter of a book called Everything is Miscellaneous [(Weinberger 2007)]. Weinberger starts with the assumption that physical resources are inherently the primary ones, so the first “order of order” emerges when physical resources are arranged. The second “order of order” emerges when physical description resources are arranged, and the third “order of order” emerges when digital description resources for physical resources are arranged. Later in the book Weinberger mentions the use of bar codes associated with websites, a physical description of a digital resource, but because he started with the assumption that physical resources define the “first order” this example does not fit into his orders of order.

[186] These methods go by different names in different disciplines, including “data modeling,” “systems analysis,” and “document engineering” (e.g., [(Kent 2012)], [(Silverston 2000)], [(Glushko and McGrath 2005)]). What they have in common is that they produce conceptual models of a domain that specify their components or parts and the relationships among these components or parts. These conceptual models are called “schemas” or “domain ontologies” in some modeling approaches, and are typically implemented in models that are optimized for particular technologies or applications.

[187] Specifically, an NFL football team needs to be considered a single resource for games through the season and in playoffs, and 53 individual players for other situations, like the NFL draft or play-calling. The team and the team’s roster can be thought of as resources, and the team’s individual players are also resources that make up the whole team.

[188] [(Denton 2007)] is a highly readable retelling of the history of cataloging that follows four themes (the use of axioms, user requirements, the work, and standardization and internationalization), culminating with their synthesis in the Functional Requirements for Bibliographic Records (FRBR).

[189] This was a surprisingly controversial activity. Many opposed Panizzi’s efforts as a waste of time and effort because they assumed that “building a catalog was a simple matter of writing down a list of titles” [(Denton 2007, p. 38)].

[190] Seymour Lubetzky worked for the US Library of Congress from 1943 to 1960, where he tirelessly sought to simplify the proliferating mass of special case cataloging rules proposed by the American Library Association, because at the time the LOC had the task of applying those rules and making the catalog cards other libraries used. Lubetzky’s book on Cataloguing Rules and Principles [(Lubetzky 1953)] bluntly asks “Is this rule necessary?” and was a turning point in cataloging.

[191] In between the abstraction of the work and the specific single item are two additional levels in the FRBR abstraction hierarchy. An expression denotes the multiple realizations of a work in some particular medium or notation, where it can actually be perceived. There are many editions and translations of Macbeth, but they are all the same expression, and they are a different expression from all of the film adaptations of Macbeth. A manifestation is the set of physical artifacts with the same expression. All of the copies of the Folger Library print edition of Macbeth are the same manifestation.

[192] This kind of advice can be found in many data or conceptual modeling texts, but this particular statement comes from [(Glushko, Weaver, Coonan, and Lincoln 1988)]. Similar advice can also be found in the information science literature: “A unit of information…would have to be…correctly interpretable outside any context” [(Wilson 1968, p. 18)].

[193] A group of techniques collectively called “normalization” produces a set of tightly defined information components that have minimal redundancy and ambiguity. Imagine that a business keeps information about customer orders using a “spreadsheet” style of organization in which a row contains cells that record the date, order number, customer name, customer address, item ID, item description, quantity, unit price, and total price. If an order contains multiple products, these would be recorded on additional rows, as would subsequent orders from the same customer. All of this information is important to the business, but this way of organizing it has a great deal of redundancy and inefficiency. For example, the customer address recurs in every order, and the customer address field merges street, city, state and zip code into a large unstructured field rather than separating them as atomic components of different types of information with potentially varying uses. Similar redundancy exists for the products and prices. Canceling an order might result in the business deleting all the information it has about a particular customer or product.

Normalization divides this large body of information into four separate tables, one for customers, one for customer orders, one for the items contained in each order, and one for item information. This normalized information model encodes all of the information in the “spreadsheet style” model, but eliminates the redundancy and avoids the data integrity problems that are inherent in it.

Normalization is taught in every database design course. The concept and methods were proposed by [(Codd 1970)], who invented the relational data model, and have been taught to students in numerous database design textbooks like [(Date 2003)].
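The split into four tables described in this note can be sketched in a few lines. This is an illustration under stated assumptions rather than a database design: the rows are invented, plain dictionaries stand in for tables, and the customer name is used as a key only for brevity (a real design would assign a customer identifier). Total price is left out because it can be derived from quantity and unit price.

```python
# Hypothetical "spreadsheet style" rows:
# (date, order_no, customer, address, item_id, description, quantity, unit_price)
flat_rows = [
    ("2024-01-05", 1001, "Ada Lopez", "1 Main St, Springfield", "A1", "Widget", 2, 5.00),
    ("2024-01-05", 1001, "Ada Lopez", "1 Main St, Springfield", "B2", "Gadget", 1, 12.50),
    ("2024-02-10", 1002, "Ada Lopez", "1 Main St, Springfield", "A1", "Widget", 4, 5.00),
]


def normalize(rows):
    """Split flat rows into customers, orders, order items, and items."""
    customers, orders, order_items, items = {}, {}, [], {}
    for date, order_no, name, address, item_id, desc, qty, price in rows:
        customers[name] = {"address": address}           # stored once per customer
        orders[order_no] = {"date": date, "customer": name}
        items[item_id] = {"description": desc, "unit_price": price}
        order_items.append({"order": order_no, "item": item_id, "quantity": qty})
    return customers, orders, order_items, items


customers, orders, order_items, items = normalize(flat_rows)


def order_total(order_no):
    """Derive an order's total price instead of storing it redundantly."""
    return sum(
        line["quantity"] * items[line["item"]]["unit_price"]
        for line in order_items
        if line["order"] == order_no
    )


if __name__ == "__main__":
    print(len(customers), len(orders), len(order_items), len(items))
    print(order_total(1001))
```

Because the customer and item details now live in their own tables, removing an order's rows from `orders` and `order_items` leaves them intact.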

[194] The “Internet of Things” concept spread very quickly after it was proposed in 1999 by Kevin Ashton, who co-founded the Auto-ID center at MIT that year to standardize RFID and sensor information. For a popular introduction, see [(Gershenfeld, Krikorian, and Cohen 2004)]. For a recent technical survey and a taxonomy of application domains and scenarios see [(Atzori, Iera, and Morabito 2010)].

[195] Pattern analysis can help escape this dilemma by enabling predictive modeling to make optimal use of the data. In designing smart things and devices for people, it is helpful to create a smart model in order to predict the kinds of patterns and locations relevant to the data collected or monitored. These allow designers to develop a set of dimensions and principles that will act as smart guides for the development of smart things. Modeling helps to enable automation, security, or energy efficiency, and baseline models can be used to detect anomalies. As for location, exact locations are unnecessary; use of a “symbolic space” to represent each “sensing zone” (e.g., rooms in a house) and an individual’s movement history as a string of symbols (e.g., abcdegia) works sufficiently as a model of prediction. See [(Das et al. 2002)].

[196] Well, maybe not anything. Books list traditional meanings of various names, charts rank names by popularity in different eras, and dozens of websites tout themselves as the place to find a special and unique name. See http://www.ssa.gov/oact/babynames/ for historical trends about baby names in the US with an interactive visualization at http://www.babynamewizard.com/voyager#.

Different countries have rules about characters or words that may be used in names. In Germany, for example, the government regulates the names parents can give to their children; there’s even a book, the International Handbook of Forenames, to guide them [(Kulish 2009)]. In Portugal, the Ministry of Justice publishes lists of prohibited names (BBC News, 2007a). Meanwhile, in 2007, Swedish tax officials rejected a family’s attempt to name their daughter Metallica (http://news.bbc.co.uk/2/hi/6525475.stm).

We can also change our names, whether it is a woman taking on her husband’s surname after marriage or, like the California man who changed his name to “Trout Fishing,” someone simply finding a name that suits better than the given one.

[197] While you may think that certain terms are more obviously “good” than others, studies show that “there is no one good access term for most objects. The idea of an ‘obvious,’ ‘self-evident,’ or ‘natural’ term is a myth!” [(Furnas et al. 1987, p. 967)].

[198] [(Spencer 2009)]. For free listing, see http://boxesandarrows.com/beyond-cardsorting-free-listing-methods-to-explore-user-categorizations/.

[199] The most common names for this service were activities, calendar and events, but in all over a hundred different names were suggested, including cityevents, whatup, sparetime, funtime, weekender, and nightout. “People use a surprisingly great variety of words to refer to the same thing,” Furnas wrote. “If everyone always agreed on what to call things, the user’s word would be the designer’s word would be the system’s word. … Unfortunately, people often disagree on the words they use for things” [(Furnas et al. 1987, p. 964)].

[200] This example comes from [(Farish 2002)], who analyzes “What’s in a Name?” and suggests that multiple names for the same thing might be a good idea because non-technical business users, data analysts, and system implementers need to see things differently and no one standard for assigning names will work for all three audiences.

[201] See, for example, Handbook of Cross-Cultural Marketing, [(Kaynak 1997)].

[202] See “As easy as YZX,” http://www.economist.com/node/760345. In addition, the convention to list the co-authors of scientific publications in alphabetic order has been shown to affect reputation and employment by giving undeserved advantages to people whose names start with letters that come early in the alphabet. This bias might also affect admission to selective schools. [(Efthyvoulou 2008)].

[203] The Kentucky Fried Chicken franchise solved this problem by changing its name to KFC, which you can now find in Beijing, Moscow, London and other locations not anywhere near Kentucky and where many people have probably never heard of the place.

Why is the professional basketball team in Los Angeles called the “Lakers” when there are few natural lakes there? The team was originally located in Minneapolis, Minnesota, a state nicknamed “The Land of 10,000 Lakes.”

[204] Tim Berners-Lee, the founder of the web, famously argued that “Cool URIs Don’t Change” [(Berners-Lee 1998)].

[205] Any online citation to one of the West printed court reports will use the West format. However, when Mead Data wanted to use the West page numbers in its LEXIS online service to link to specific pages, West sued for copyright infringement. The citation for the West Publishing vs. Mead Data Central case is 799 F.2d 1219 (8th Cir. 1986), which means that the case begins on page 1219 of volume 799 in the set of opinions from the 8th Circuit Court of Appeals that West published in print form. West won the case and Mead Data had to pay substantial royalties. Fortunately, the logic behind this decision was repudiated by the US Supreme Court a few years later in a case that West published as Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991), and West can no longer claim copyright on page numbers.

[206] When George Orwell gave the title 1984 to a novel he wrote in 1949, he intended it as a warning about a totalitarian future as the Cold War took hold in a divided Europe, but today 1984 is decades in the past and the title does not have the same impact.

[207] [(Dorai and Venkatesh 2002)].

[208] [(Queenan 2011)].

Most common US surnames: http://names.mongabay.com/most_common_surnames.htm.

Chad Ochocinco story: http://en.wikipedia.org/wiki/Chad_Ochocinco.

Fake names at Starbucks: http://online.wsj.com/article/SB10001424053111904106704576582834147448392.html.

Twitter on sports jerseys: http://www.forbes.com/sites/alexknapp/2011/12/30/pro-lacrosse-team-replaces-names-with-twitter-handles-on-jerseys/?partner=technology_newsletter.

[209] Identifiers with meaningful internal structure are said to be structured or intelligent. Those that contain no additional information are sometimes said to be unstructured, opaque, or dumb. The 8 in the ISBN example is a check digit, not technically part of the identifier, that is algorithmically derived from the other digits to detect errors in entering the ISBN.
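How a check digit is derived from the other digits can be shown with a short sketch. It assumes the 13-digit ISBN format, in which the first twelve digits are weighted alternately by 1 and 3; the sample ISBN is chosen for illustration and is not necessarily the one used in the chapter's example. The older 10-digit ISBN format uses a different calculation.

```python
def isbn13_check_digit(first12: str) -> int:
    """Compute the ISBN-13 check digit from the first twelve digits."""
    digits = [int(ch) for ch in first12 if ch.isdigit()]
    if len(digits) != 12:
        raise ValueError("expected exactly 12 digits")
    # Weights alternate 1, 3, 1, 3, ... starting with the first digit.
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return (10 - total % 10) % 10


def is_valid_isbn13(isbn: str) -> bool:
    """Check a full ISBN-13, ignoring hyphens and spaces."""
    digits = [ch for ch in isbn if ch.isdigit()]
    if len(digits) != 13:
        return False
    return isbn13_check_digit("".join(digits[:12])) == int(digits[12])


if __name__ == "__main__":
    print(isbn13_check_digit("978030640615"))
    print(is_valid_isbn13("978-0-306-40615-7"))
    print(is_valid_isbn13("978-0-306-40651-7"))  # two digits transposed
```

The check digit catches a mistyped digit or a swapped adjacent pair like the one above, but it says nothing about whether the ISBN was ever actually assigned to a book.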

[210] [(McCartney 2006)].

[211] [(Svenonius 2000)] calls vocabulary control “the sine qua non of information organization” (p. 89). “The imposition of vocabulary control creates an artificial language out of a natural language” (p. 89), leaving behind an official, normalized set of terms and their uses.

[212] This mapping is “the means by which the language of the user and that of a retrieval system are brought into sync” [(Svenonius 2000, p. 93)] and allows an information-seeker to understand the relationship between, say, Samuel Clemens and Mark Twain. The Library of Congress (LOC) maintains a list of standard, accepted names for authors, subjects, and titles called the Name Authority File. See http://id.loc.gov/authorities/names.html.

[213] Pan-European Species Directory Infrastructure (PESI): http://www.eu-nomen.eu/pesi; Consortium for the Barcode of Life (CBOL): http://www.barcoding.si.edu/; NatureServe: http://services.natureserve.org/BrowseServices/getSpeciesData/getSpeciesListREST.jsp.

[214] [(Hemerly 2011)].

[215] This rations / radio confusion is described in [(Wheatley 2004)]. In 2008 a similar mistake in managing inventory at a US military warehouse led to missile launch fuses being sent to Taiwan instead of helicopter batteries, causing a high-level diplomatic furor when the Chinese government objected to this as a treaty violation [(Hoffman 2008)].

[216] Organizing systems in libraries, museums, and businesses often give sequential accession numbers to resources when they are added to a collection, but these identifiers are of no use outside of the context in which they are assigned, as when a union catalog or merged database is created.

[217] A more general technique is to use the Universally Unique Identifier (UUID) standard, which standardizes some algorithms that generate 128-bit tokens that, for all practical purposes, will be unique for hundreds, if not thousands, of years.
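As a minimal sketch, Python's standard `uuid` module can generate such tokens. The helper name `new_resource_id` is hypothetical, and the choice of the random (version 4) algorithm is an assumption; the standard defines other algorithms as well.

```python
import uuid


def new_resource_id() -> str:
    """Return a freshly generated random (version 4) UUID as a string."""
    return str(uuid.uuid4())


first = new_resource_id()
second = new_resource_id()

# Each identifier is a 128-bit value, conventionally written as 36 characters.
print(first)
print(len(uuid.UUID(first).bytes) * 8)
# Two independently generated identifiers are, in practice, never equal.
print(first != second)
```

Unlike sequential accession numbers, these identifiers need no central registry, so collections that assign them independently can later be merged without renumbering.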

[218] [(OASIS 2003)]. The Organization for the Advancement of Structured Information Standards (OASIS) XML Common Biometric Format (XCBF) was developed to standardize the use of biometric data like DNA, fingerprints, iris scans, and hand geometry to verify identity (https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xcbf).

[219] [(Coyle 2006, p. 429)].

[220] IPv6 for Internet addresses. The threat of exhaustion was the motivation for remedial technologies, such as classful networks, Classless Inter-Domain Routing (CIDR) methods, and Network Address Translation (NAT), that extend the usable address space.
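A rough sense of the scale involved can be had from Python's standard `ipaddress` module. This is only an illustrative sketch: the variable names are hypothetical, and the address block used is a range reserved for documentation, not a real network.

```python
import ipaddress

# Size of the entire IPv4 and IPv6 address spaces (32-bit versus 128-bit).
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
ipv6_total = ipaddress.ip_network("::/0").num_addresses
print(ipv4_total)
print(ipv6_total // ipv4_total)  # how many times larger the IPv6 space is

# CIDR lets a block be split at any prefix length instead of fixed classes.
# Here a /24 documentation block is divided into four /26 subnets.
block = ipaddress.ip_network("192.0.2.0/24")
subnets = list(block.subnets(new_prefix=26))
for subnet in subnets:
    print(subnet, subnet.num_addresses)
```

Finer-grained allocation of this kind stretches a limited identifier space, whereas IPv6 addresses the problem by making the space itself vastly larger.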

[221] Digital Object Identifier (DOI) system (http://www.doi.org). However, DOI has its own issues: it is a highly political, publisher-controlled system, not a universal solution to persistence.

[222] This is called the Paradox of Theseus, a philosophical debate since ancient times. Every day that Theseus’s ship is in the harbor, a single plank gets replaced, until after a few years the ship is completely rebuilt: not a single original plank remains. Is it still the ship of Theseus? And suppose, meanwhile, the shipbuilders have been building a new ship out of the replaced planks? Is that the ship of Theseus? [(Furner 2008, p. 6)].

[223] See [(Renear and Dubin 2003)], [(Wynholds 2011)].

[224] See http://www.nationsonline.org/oneworld/hist_country_names.htm for a list of formerly used country names and their respective effectivity.

[225] See [(Gravois 2010)]. One specific example of this effect of international geopolitics on an organizing system involves the northern border of the Crimean Peninsula. When running a query for “Ukraine” via google.com/maps (USA), the border appears as a dotted line, which reflects a “neutral” perspective in the aftermath of recent political and military conflicts. Alternatively, when submitting the same query via google.com.ua/maps (Ukraine), there is no border at all, which reflects a Ukrainian perspective that the Crimean Peninsula is part of Ukraine. Lastly, when the query is submitted via google.ru/maps (Russia), the border is represented as a solid line, which reflects a Russian perspective that the territory is part of Russia. A 2014 study of Google Maps found 32 situations where the answer to “what country is that on the map?” depended on where it was asked [(Yanovsky 2014)].

[226] Effectivity in the tax code is simple compared to that relating to documents in complex systems, like commercial aircraft. Because of their long lifetimes (the Boeing 737 has been flying since the 1960s) and continual upgrading of parts like engines and computers, each airplane has its own operating and maintenance manual that reflects changes made to the plane over time. Every change to the plane requires an update to the repair manual, making the old version obsolete. And while an aircraft mechanic might refer to “the 737 maintenance manual,” each 737 aircraft actually has its own unique manual.

[227] A notary public is used to verify that a signature on an important document, such as a mortgage or other contract, is authentic, much as signet rings and sealing wax once proved that no one had tampered with a document since it was sealed.

License

The Discipline of Organizing Copyright © by Robert J. Glushko. All Rights Reserved.
