Chapter 3. Activities in Organizing Systems
Robert J. Glushko
Erik Wilde
Jess Hemerly
Isabelle Sperano
Robyn Perry
Table of Contents
3.2.2. Looking “Upstream” and “Downstream” to Select Resources
3.3.1. Organizing Physical Resources
3.3.3. Organizing Digital Resources
3.3.4. Organizing With Descriptive Statistics
3.3.5. Organizing with Multiple Resource Properties
3.4. Designing Resource-based Interactions
3.4.1. Affordance and Capability
3.4.2. Interaction and Value Creation
3.5.1. Motivations for Maintaining Resources
3.6. Key Points in Chapter Three
Introduction
There are four activities that occur naturally in every organizing system; how explicit they are depends on the scope (the breadth or variety of the resources) and the scale (the number of resources) that the organizing system encompasses. Consider the routine, everyday task of managing your wardrobe. When you organize your clothes closet, you are unlikely to write a formal selection policy that specifies what things go in the closet. You do not consciously itemize and prioritize the ways you expect to search for and locate things, and you are unlikely to consider explicitly the organizing principles that you use to arrange them. From time to time you will put things back in order and discard things you no longer wear, but you probably will not schedule this as a regular activity on your calendar.
Your clothes closet is an organizing system, defined as “an intentionally arranged collection of resources and the interactions they support.” As such, it exposes these four highly interrelated and iterative activities:
Selecting
Determining the scope of the organizing system by specifying which resources should be included. (Should I hang up my sweaters in the clothes closet or put them in a dresser drawer in the bedroom?)
Organizing
Specifying the principles or rules that will be followed to arrange the resources. (Should I sort my shirts by color, sleeve type, or season?)
Designing resource-based interactions
Designing and implementing the actions, functions or services that make use of the resources. (Do I need storage places for clothes to be laundered? Should I have separate baskets for white and colors? Dry cleaning?)
Maintaining
Managing and adapting the resources and the organization imposed on them as needed to support the interactions. (When is it time to straighten up the closet? What about mending? Should I toss out clothes based on wear and tear, how long I have owned them, or whether I am tired of them? What about excess hangers?)
Figure 3.1, “Four Activities in all Organizing Systems,” illustrates these four activities, framing the depiction of the organizing and interaction design activities shown in Figure 1.1, “An Organizing System,” with the selection and maintenance activities that necessarily precede and follow them.
Figure 3.1. Four Activities in all Organizing Systems.
Four activities take place in all organizing systems: selection of resources for a collection; intentional organization of the resources; design and implementation of interactions with individual resources or with the collection; and maintenance of the resources and the interactions over time.
These activities are deeply ingrained in academic curricula and professional practices, with domain-specific terms for their methods and results. Libraries and museums usually make their selection principles explicit in collection development policies. Adding a resource to a library collection is called acquisition, but adding to a museum collection is called accessioning. Documenting the contents of library and museum collections to organize them is called cataloging. Circulation is a central interaction in libraries, but because museum resources do not circulate, the primary interactions for museum users are viewing or visiting the collection. Maintenance activities are usually described as preservation or curation.
In business information systems, selection of resources can involve data generation, capture, sampling, or extraction. Adding resources could involve loading, integration, or insertion. Schema development and data transformation are important organizing activities. Supported interactions could include querying, reporting, analysis, or visualization. Maintenance activities are often described as deletion, purging, data cleansing, governance, or compliance.
What about “Creating” Resources?
Domain-specific methods and vocabularies evolve over time to capture the complex and distinctive sets of experiences and practices of their respective disciplines. We can identify correspondences and overlapping meanings, but they are not synonyms or substitutes for each other. We propose more general terms like selection and maintenance, not as lowest common denominator replacements for these more specialized ones, but to facilitate communication and cooperation across the numerous disciplines that are concerned with organizing.
Part of what a database administrator can learn from a museum curator follows from the rich associations the curator has accumulated around the concept of curation that are not available around the more general concept of maintenance. Without the shared concept of maintenance to bridge their disciplines, this learning could not take place.
Navigating this chapter
In the section called “The Concept of “Resource”” and the section called “What Is Being Organized?” we briefly discussed the fundamental concept of a resource. In this chapter, we describe the four primary activities with resources, using examples from many different kinds of organizing systems.
We emphasize the activities of organizing and of designing resource-based interactions that make use of the organization imposed on the resources. We discuss selection and maintenance to create the context for the organizing activities and to highlight the interdependencies of organizing and these other activities. This broad survey enables us to compare and contrast the activities in different resource domains, setting the stage for a more thorough discussion of resources and resource description in Chapter 4, Resources in Organizing Systems and Chapter 5, Resource Description and Metadata.
Selecting Resources
Selecting is the process by which resources are identified, evaluated, and then added to a collection in an organizing system. Selection is first shaped by the domain and then by the scope of the organizing system, which can be analyzed through six interrelated aspects:
the number and nature of users
the time span or lifetime over which the organizing system is expected to operate
the size of the collection
the expected changes to the collection
the physical or technological environment in which the organizing system is situated or implemented
the relationship of the organizing system to other ones that overlap with it in domain or scope.
(In Chapter 11, The Organizing System Roadmap, we discuss these six aspects in more detail.)
Selection Criteria
For some types of resources, the specifications that guide selection can be precise and measurable. Precise specifications are especially important when an organizing system will contain or make use of all resources of a particular type, or if all the resources produced from a particular source become part of the organizing system on some regular schedule. Selection specifications can also be shaped by laws, regulations or policies that require or prohibit the collection of certain kinds of objects or types of information.[44]
For example, when a manufacturer of physical goods selects the materials or components that are transformed into its products, it carefully evaluates the candidate resources and their suppliers before making them part of its supply chain. The manufacturer would test the resources against required values of measurable characteristics like chemical purity, strength, capacity, and reliability. A business looking for transactional or demographic data to guide a business expansion strategy would specify different measurable characteristics; data files must be valid with respect to a schema, must contain no duplicates or personally identifiable information, and must be less than one month old when they are delivered. Similarly, employee selection has become highly data-intensive; employers hire people after assessing the match between their competencies and capabilities (expressed verbally or in a resume, or demonstrated in some qualification test) and what is needed to do the required activities.[45]
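The measurable criteria for the business data example can be expressed as an automated acceptance check. The following is a minimal sketch, not a definitive implementation: the field names, the list of fields treated as personally identifiable, and the approximation of "one month" as 31 days are all assumptions made for illustration, and schema validity is reduced here to the presence of required fields.

```python
from datetime import date, timedelta

# Hypothetical selection criteria; every name below is an illustrative assumption.
REQUIRED_FIELDS = {"customer_id", "region", "purchase_total"}
PII_FIELDS = {"name", "email", "phone"}
MAX_AGE = timedelta(days=31)  # "less than one month old", approximated as 31 days


def selection_problems(records, created_on, delivered_on):
    """Return the reasons to reject a delivered data file (empty list if acceptable)."""
    problems = []
    seen = set()
    for i, record in enumerate(records):
        fields = set(record)
        missing = REQUIRED_FIELDS - fields
        if missing:
            problems.append(f"record {i}: missing fields {sorted(missing)}")
        exposed = PII_FIELDS & fields
        if exposed:
            problems.append(f"record {i}: contains PII fields {sorted(exposed)}")
        # Exact duplicates only: identical field/value pairs.
        key = tuple(sorted(record.items()))
        if key in seen:
            problems.append(f"record {i}: duplicate")
        seen.add(key)
    if delivered_on - created_on >= MAX_AGE:
        problems.append("file is too old at delivery")
    return problems


records = [
    {"customer_id": 1, "region": "west", "purchase_total": 10.0},
    {"customer_id": 1, "region": "west", "purchase_total": 10.0},
    {"customer_id": 2, "region": "east", "email": "someone@example.com"},
]
problems = selection_problems(records, date(2024, 1, 1), date(2024, 1, 15))
```

A file is selected only when the returned list is empty. Returning every reason for rejection, rather than stopping at the first, gives the supplier a complete description of what must change.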
Selection is an essential activity in creating organizing systems whose purpose is to combine separate web services or resources to create a composite service or application according to the business design philosophy of Service Oriented Architecture (SOA).[46] When an information-intensive enterprise or application combines its internal services with ones provided by others via Application Programming Interfaces (APIs), the resources are selected to create a combined collection of services according to the “core competency” principle: resources are selected and combined to exploit the first party’s internal capabilities and those of its service partners better than any other combination of services could. For example, instead of writing millions of lines of code and collecting detailed maps to build an interactive map in an application, you can get access to the Google Maps organizing system with just a few lines of code.[47] (See the sidebar, Selection of Web-based Resources)
Selection of Web-based Resources
The nature and scale of the web change how we collect resources and fundamentally challenge how we think of resources in the first place. Web-based resources cannot be selected for a collection by consulting a centralized authoritative directory, catalog, or index because one does not exist. ProgrammableWeb and other directories organize thousands of web-accessible APIs, and the dominant resource-organizing firms Amazon, Salesforce, Facebook, and Twitter offer hundreds of APIs to access massive amounts of information about products, people, and posts, but APIs enable access to only a fraction of the web’s content. And although your favorite web search engine consults an index or directory of web resources when you enter a search query, you do not know where that index or directory came from or how it was assembled.[48][49]
However, the web has universal scope and global reach, making most of the web irrelevant to most people most of the time. Researchers have attacked this problem by treating the web as a combination of a very large number of topic-based or domain-specific collections of resources, and then developing techniques for extracting these collections as digital libraries targeted for particular users and uses.[50]
Scientific and business data are ideally selected after assessments of their quality and their relevance to answering specific questions. But this is easy to say and hard to do. It is essential to assess the quality of individual data items to find data entry problems such as misspellings and duplicate records, or data values that are illegal, statistical outliers, or otherwise suspicious. It is also essential to assess the quality of data as a collection to determine if there are problems in what data was collected, by whom or how it was collected and managed, the format and precision in which it is stored, whether the schema governing each instance is rigorous enough, and whether the collection is complete. In addition, copyright, licensing, consumer protection laws, competitive considerations, or simply the lack of incentives to share resources make it difficult to obtain the best or most appropriate resources.[51] (See the sidebar, Assessing and Addressing Data Quality)
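One of the item-level checks mentioned above, finding statistical outliers, can be sketched with a common rule of thumb. The example assumes Tukey's interquartile range rule with the conventional multiplier of 1.5; the function name and the sample values are illustrative, and a flagged value is only a candidate for review, not proof of an error.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one suspicious entry.
readings = [10, 12, 11, 13, 12, 11, 95]
suspicious = iqr_outliers(readings)
```

A rule based on quartiles is used here because, unlike a rule based on the mean and standard deviation, it is not itself distorted by the extreme values it is meant to find.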
Both libraries and museums typically formalize their selection principles in collection development policies that establish priorities for acquiring resources that reflect the people they serve and the services they provide to them. The diversity of user types in public libraries and many museums implies that narrowly-targeted criteria would produce a collection of resources that would fail to satisfy many of the users. As a result, libraries typically select resources on the basis of broader criteria like their utility and relevance to their user populations, and try to choose resources that add the most value to their existing collections, given the cost constraints that most libraries are currently facing. Museums often emphasize intrinsic value, scarcity, or uniqueness as selection criteria, even if the resources lack any contemporary use.[52]
Looking “Upstream” and “Downstream” to Select Resources
In the section called “Selection Criteria” we discussed the activity of selecting resources by assessing their conformance with specifications for required properties or capabilities. However, if you can determine where the resources come from, you can make better selection decisions by evaluating the people, processes, and organizing systems that create them. Using the analogy of a river, we can follow a resource “upstream” from us until we find the “headwaters.” Physical resources might have their headwaters in a factory, farm, or artist’s studio. Digital resources might have headwaters in a government agency, a scientist’s laboratory, or a web-based commerce site.
When interaction resources (the section called “The Concept of “Interaction Resource””) are incorporated into the organizing system that creates them, as when records of a person’s choices and behaviors are used to personalize subsequent information, the headwaters are obviously easy to find. However, even though finding the headwaters where resources come from is often not easy and sometimes not possible, that is where you are most likely to find the people best able to answer the questions, described in Chapter 2, Design Decisions in Organizing Systems, that define any organizing system. The resource creators or producers will know the assumptions and tradeoffs they made that influence whether the resources will satisfy your requirements, and you can assess what they (or their documents that describe the resources) tell you and the credibility they have in telling it. You should also try to evaluate the processes or algorithms that produce the resources, and then decide if they are capable of yielding resources of acceptable quality.
The best outcome is to find a credible supplier of good quality resources. However, if an otherwise desirable supplier does not currently produce resources of sufficient quality, it is worth trying to improve the quality by changing the process using instruction or incentives. Advocates for open government have succeeded in getting numerous US government entities to publish data for free in machine-readable formats, but it was partly as a result of somewhat subversive demonstration projects and shaming that the government finally created data.gov in 2009. A clear lesson from the “quality movement” and statistical process control is that interventions that fix quality problems at their source are almost always a better investment than repeated work to fix problems that were preventable (see endnote[297]). But if you cannot find the headwaters or you are not able to address quality problems at their source, you can sometimes transform the resources to give them the characteristics or quality they need.[53] (See the sidebar, Assessing and Addressing Data Quality, and the section called “Transforming Resources for Interactions”.)
Assessing and Addressing Data Quality
If an organizing system uses data acquired from some external source, it is essential to assess its quality as an “intake” process. Ideally, the data comes with a schema that explicitly specifies what is expected, including legal structures, data types, and values (See the section called “Structuring Descriptions”). This intake process runs tests that find problems and then runs processes to fix the problems.
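An intake process of this kind, which first tests and then repairs, might be sketched as follows. The schema, the field names, and the repair rules are hypothetical; a real intake process would use whatever schema language accompanies the data and would need repair rules appropriate to its domain.

```python
# Hypothetical schema: each field has an expected type and, optionally, legal values.
SCHEMA = {
    "sku": {"type": str},
    "quantity": {"type": int},
    "status": {"type": str, "allowed": {"in_stock", "backordered"}},
}


def find_problems(record):
    """Test a record against SCHEMA and list what is wrong with it."""
    problems = []
    for field, rule in SCHEMA.items():
        if field not in record:
            problems.append((field, "missing"))
        elif not isinstance(record[field], rule["type"]):
            problems.append((field, "wrong type"))
        elif "allowed" in rule and record[field] not in rule["allowed"]:
            problems.append((field, "illegal value"))
    return problems


def repair(record):
    """Apply conservative fixes: trim whitespace, coerce digit strings, lowercase coded values."""
    fixed = dict(record)
    for field, rule in SCHEMA.items():
        value = fixed.get(field)
        if isinstance(value, str):
            value = value.strip()
            if rule["type"] is int and value.lstrip("-").isdigit():
                value = int(value)
            elif "allowed" in rule:
                value = value.lower()
        fixed[field] = value
    return fixed


def intake(records):
    """Accept clean or repairable records; set aside the rest with their problems."""
    accepted, rejected = [], []
    for record in records:
        candidate = record if not find_problems(record) else repair(record)
        remaining = find_problems(candidate)
        if remaining:
            rejected.append((record, remaining))
        else:
            accepted.append(candidate)
    return accepted, rejected


accepted, rejected = intake([
    {"sku": "A1", "quantity": 3, "status": "in_stock"},
    {"sku": " B2 ", "quantity": "7", "status": "In_Stock"},
    {"sku": "C3", "quantity": "many", "status": "lost"},
])
```

Records that cannot be repaired are kept with the list of problems found, so that a person can decide whether to correct them by hand or return them to the source.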
Other data quality problems are harder to detect because they are contextual; a data value might be valid in some contexts but the same value might be invalid in others. For example, if you live in San Francisco and your credit card is used for transactions in Barcelona or Berlin, it could be fraud, or maybe you are on vacation. Similarly, high or low ratings for business establishments on sites like Yelp might be appropriate responses to excellent or poor service, but might also reflect “pay for rating” manipulation in the former case, and efforts by competitors to undermine rival businesses in the latter.[54]
When you cannot obtain resources directly from their source, even if you have confidence in their quality at that point, it is important to analyze any evidence or records of their use or interactions as they flow downstream. (See the section called “Resources over Time”) Physical resources are often associated with printed or digital documents that make claims about their origin and authenticity, and often have bar codes, RFID tags, or other technological mechanisms that enable them to be tracked from their headwaters to the places where they are used. Tracking is very important for data resources because they can often be added to, derived from, or otherwise changed without leaving visible traces. Just as the water from melted mountain snow becomes less pure as it flows downstream, a data resource can become “dirty” or “noisy” over time, reducing its quality from the perspective of another person or computational agent further downstream. Data often gets dirty when it is combined with other datasets that contain duplicate or seemingly-duplicate information. Data can also become dirty when the hardware or software that stores it changes. Subtle differences in representation formats, transaction management, enforcement of integrity constraints, and calculations of derived values can change the original data.
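The problem of seemingly-duplicate information that arises when datasets are combined can be illustrated with a small sketch. It assumes, purely for illustration, that records carry a name and a postal code and that two records describe the same thing when those fields match after normalization; real record matching usually requires more careful, domain-specific rules.

```python
import re


def match_key(record):
    """Normalize the fields so superficially different records compare equal."""
    name = re.sub(r"[^a-z0-9 ]", "", record["name"].lower())
    name = " ".join(name.split())  # collapse runs of whitespace
    postal_code = record["postal_code"].replace(" ", "").upper()
    return (name, postal_code)


def combine(*datasets):
    """Merge datasets, keeping one record per key and reporting suspected duplicates."""
    groups = {}
    for dataset in datasets:
        for record in dataset:
            groups.setdefault(match_key(record), []).append(record)
    unique = [group[0] for group in groups.values()]
    suspects = [group for group in groups.values() if len(group) > 1]
    return unique, suspects


# Hypothetical records from two sources describing overlapping companies.
source_a = [{"name": "Acme, Inc.", "postal_code": "94720"}]
source_b = [
    {"name": "ACME  Inc", "postal_code": "94720"},
    {"name": "Bolt Ltd", "postal_code": "10001"},
]
unique, suspects = combine(source_a, source_b)
```

Reporting the suspected duplicates instead of silently discarding them preserves a trace of how the combined collection was produced, which is the kind of tracking the paragraph above recommends.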
In addition, a data resource can become inaccurate or obsolete simply because the world that the data describes has changed with the passage of time. People move, change jobs, get married or divorced, or die. Likewise, companies move, merge, get spun off, or go out of business. A poll taken a year before an election is often not a good predictor of the ultimate winner.[55]
Organizing Resources
In this book we use property in a generic and ordinary sense as a synonym for feature or “characteristic.” Many cognitive and computer scientists are more precise in defining these terms and reserve property for binary predicates (e.g., something is red or not, round or not). If multiple values are possible, the property is called an attribute, “dimension,” or “variable.” Feature is used in data science and machine learning contexts for both “raw” or observable variables and “latent” ones, extracted or constructed from the original set.[56]
For most types of resources, any number of principles could be used as the basis for their organization depending on the answers to the “why?” (the section called “Why Is It Being Organized?”), “how much?” (the section called “How Much Is It Being Organized?”), and “how?” (the section called “How (or by Whom) Is It Organized?”) questions posed in Chapter 2, Design Decisions in Organizing Systems.
A simple principle for organizing resources is colocation: putting all the resources in the same location, whether in the same container, on the same shelf, or in the same email in-box. However, most organizing systems use principles that are based on specific resource properties or properties derived from the collection as a whole. What properties are significant and how to think about them depends on the number of resources being organized, the purposes for which they are being organized, and on the experiences and implicit or explicit biases of the intended users of the organizing system. The implementation of the organizing system also shapes the need for, and the nature of, the resource properties.[57]
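The difference between colocation and organization based on resource properties can be made concrete with the shirts from the closet example. In this sketch the property names and values are illustrative assumptions; the point is only that the same collection yields different arrangements depending on which properties are chosen as the organizing principle.

```python
from collections import defaultdict

# Illustrative resources; the properties and their values are assumptions.
shirts = [
    {"id": 1, "color": "blue", "sleeve": "long", "season": "winter"},
    {"id": 2, "color": "white", "sleeve": "short", "season": "summer"},
    {"id": 3, "color": "blue", "sleeve": "short", "season": "summer"},
]


def organize(resources, *properties):
    """Group resource ids by the values of the chosen properties, in order."""
    groups = defaultdict(list)
    for resource in resources:
        key = tuple(resource[name] for name in properties)
        groups[key].append(resource["id"])
    return dict(groups)


colocated = organize(shirts)                # no properties: everything in one place
by_color = organize(shirts, "color")        # one property
by_season_and_sleeve = organize(shirts, "season", "sleeve")  # two properties
```

Calling `organize` with no properties puts every shirt in a single group, which corresponds to colocation; each added property subdivides the arrangement further.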
“Subject matter” organization involves the use of a classification system that provides categories and descriptive terms for indicating what a resource is about. Because they use aboutness properties that are not directly perceived, methods for assigning subject classifications are intellectually intensive and in many cases require rigorous training to be performed consistently and appropriately.[58] Nevertheless, the cost and time required for this human effort motivate the use of computational techniques for organizing resources.
As computing power steadily increases, the bias toward computational organization gets even stronger. However, an important concern arises when computational methods for organizing resources use so-called “black box” methods that create resource descriptions and organizing principles that are not inspectable or interpretable by people. In some applications more efficient information retrieval or question answering, more accurate predictions, or more personalized recommendations justify making the tradeoff. But comprehensibility is critical in many medical, military, financial, or scientific applications, where trusting a prediction can have life or death implications or cause substantial time or money to be spent.[59]
Organizing Physical Resources
Similarly, because they have different material manifestations, we usually organize our printed books in a different location than our record albums, which might be near but remain separate from our CDs and DVDs. This is partly because the storage environments for physical resources (shelves, cabinets, closets, and so on) have co-evolved with the physical resources they store.[60]
Organizing People into Businesses
How people are organized into businesses is the essence of the discipline of management, and different aspects are taught in industrial organization and behavior, operations, entrepreneurship, and other courses. Organizing people in a business is often called “human resource management,” and many of the principles for organizing physical resources and information resources apply to organizing people.
Regardless of how the firm is organized, we can analyze it using economist Ronald Coase’s idea of “transaction costs,” which a business incurs in searching for and negotiating with suppliers, business partners, and customers, and in particular we can consider how new information and computing technologies reduce these costs to make the firm more efficient while remaining flexible.[61]
Organizing with Properties of Physical Resources
This inescapable aspect of visual perception was first formalized by German psychologists starting a century ago as the Gestalt principles (see the sidebar, Gestalt Principles). Likewise, because people have limited attentional capacity, we ignore a lot of the ongoing complexity of visual (and auditory) stimulation, making us perceive our sensory world as simpler than it really is. Taken together, these two ideas explain why we automatically or “pre-attentively” organize separate things we see as groups or patterns based on their proximity and similarity. They also explain why arranging physical resources using these quickly perceived attributes can seem more aesthetic or satisfying than organizing them using properties that take more time to understand. Look at the cover of this book; the most organized arrangement of the colors and shapes just jumps out at you more than the others.
Psychologists Max Wertheimer, Wolfgang Köhler, and Kurt Koffka proposed several principles (proximity, similarity, continuity, connection, enclosure, and closure) that explain how our visual system imposes order on what it sees. There are always multiple interpretations of the sensory stimuli gathered by our visual system, but the mind imposes the simplest ones: things near each other are grouped, complex shapes are viewed as simple shapes that are overlapping, missing information needed to see separate visual patterns as continuous or whole is filled in, and ambiguous figure-ground illusions are given one interpretation at a time.
Koffka’s pithy way of explaining the core idea of all the principles was that “The whole is other than the sum of the parts,” which has been distorted over time to the cliché that “the whole is more than the sum of the parts.”[62]
Designers of graphics and information visualizations rely on Gestalt rules because the automatic interpretations created by the human visual system enable their designs to be understood more quickly. This of course implies that designs that violate the Gestalt rules will be harder to understand. Camouflage—the use of disruptive coloration, colors and patterns that resemble backgrounds, countershading, shadow elimination, and similar techniques that make it difficult for the visual system to detect objects and edges—proves the power of Gestalt processing.[63]
Some arrangements of physical resources are constrained or precluded by resource properties that might cause problems for other resources or for their users. Hazardous or flammable materials should not be stored where they might spill or ignite; lions and antelopes should not share the same zoo habitat or the former will eat the latter; adult books and movies should not be kept in a library where children might accidentally find them; and people who are confrontational, passive aggressive, or arrogant do not make good team members when tough decisions need to be made. For almost any resource, it seems possible to imagine a combination with another resource that might have unfortunate consequences. We have no shortage of professional certifications, building codes, MPAA movie ratings, and other types of laws and regulations designed to keep us safe from potentially dangerous resources.
Organizing with Descriptions of Physical Resources
Organizing Places
Places are physical resources, but unlike the previous two subsections where we treat the environment as given (the library or museum building, the card catalog or bookshelf) and discuss how we organize resources like books in that environment, we can take an alternative perspective and discuss how we design that physical environment. These environments could be any of the following:
The land itself, as when we lay out city plans when organizing how people live together and interact in cities.
A “built environment,” a human-made space, particular building, or a set of connected spaces and buildings. A built environment could be a museum, airport, hospital, casino, department store, farm, road system, or any kind of building or space where resources are arranged and people interact with them.
The orientation and navigation aids that enable users to understand and interact in built environments. These are resource descriptions that support the interaction requirements of the users.
These are not entirely separable contexts, but they are easier to discuss as if they were.
Organizing the Land
Cities naturally emerge in places that can support life and commerce. Almost all major cities are built on coasts or rivers because water provides sustenance, transportation and commercial links, and power to enable industry. Many very old cities have crowded and convoluted street plans that do not seem intentionally organized, but grid plans in cities also have a very long history. Cities in the Middle East were laid out in rough grids as far back as 2000 BCE or earlier. Using long axes was a way to create an impression of importance and power.
Because the United States, and especially the American West, was heavily settled much more recently than most of Europe and Asia, it was a place for people to experiment with new ideas in urban design. The natural human tendency to impose order on habitation location had ample room to do just that. The easiest and most efficient way to organize space is using a coordinate grid, with streets intersecting at perpendicular angles. Salt Lake City, Albuquerque, Phoenix, and Seattle are notable examples of grid cities. An interesting hybrid structure exists in Washington, DC, which has radiating diagonal avenues overlaid on a grid.[64]
Organizing Built Environments
Built environments influence the expectations, behaviors, and experiences of everyone who enters the space: employees, visitors, customers, and inhabitants are all subject to the design of the spaces they occupy. These environments can be designed to encourage or discourage interactions between people, to create a sense of freedom or confinement, to reward exploration or enforce efficiency, and much more. The arrangement of the resources in a built environment also encourages or discourages interactions, and sometimes the built environment is designed with a specific collection of resources in mind to enable and reinforce some particular interaction goals or policies.
If we contrast the built environments of museums, airports, and casinos, the ways in which each of them facilitates or constrains interactions become more obvious. Museums are often housed in buildings designed as architectural monuments that over time become symbols of national, civic, or cultural identity. Many old art museums mimic classical architecture, with grand stairs flanked by tall columns. They have large and dramatic entry halls that invite visitors inside. Modern museums are decidedly less traditional, and some people complain that the architecture of modern art museums can overshadow the art collection contained within because people are induced to pay more attention to the building than to its contents.
Some recently built airports have been designed with architectural flair, but airport design is more concerned with efficiency, walkability (maybe with the aid of moving walkways), navigability, and basic comfort for travelers getting in and out of the airport. Wide walkways, multiple staircases, and people movers whose doors open in one direction at a time all encourage people to move in certain directions, sometimes without the people even realizing they are being directed.
If you have ever been lost in a casino or had trouble finding the exit, you can be sure you experienced a casino that achieved its main design goals: keeping people inside and making it easy for them to lose track of time, because casinos lack both windows and clocks. As American architect Robert Venturi points out, “The intricate maze under the low ceiling never connects with outside light or outside space…This disorients the occupant in space and time… He loses track of where he is and when it is.”[65]
If one accepts the premise that values and bias are at work in decisions about organizing systems, it is difficult not to see them in built environments. Consider queue design in banks, supermarkets, or boarding airplanes. Assuming that it is desirable to organize people efficiently to minimize wait times and crowding, how should the queue be designed? How many categories of people should there be? What is the basis for the categories?
However, consider the dynamic created by queue design at Disneyland to give priority to people with physical limitations and disabilities. This seemingly socially respectful decision was exploited by a devious collaboration between disabled people and wealthy non-disabled people who hired them to pose as family members, enabling the entire “family” to cut ahead of everyone else. In response, Disney modified the policy favoring disabled patrons, causing numerous complaints about Disney’s insensitivity to their concerns.[66]
There are many other examples of how values and biases become part of built environments. In the mid-20th century the road systems of Long Island in New York were designed with low overpasses, which prevented public buses from passing under them, effectively segregating the beaches. The trend in college campus design after the student protests of the 1960s and 1970s was to create layouts that would prevent or frustrate large demonstrations.[67]
Orientation and Wayfinding Mechanisms
It is easy to move through an environment and stay oriented if the design is simple and consistent, but most built environments must include additional features or descriptions to assist people in these tasks. Distinctive architectural elements can create landmarks for orientation, and spaces can be differentiated with color, lighting, furnishings, or other means. More ubiquitous mechanisms include signs, room numbers, or directional arrows highlighting the way and distance to important destinations.
In airports, for example, there are many orientation signs and display terminals that help passengers find their departure gates, baggage, or ground transportation services. In contrast, casinos provide little orientation and navigation support because increased confusion leads to lengthier visits, and more gambling on the part of the casino’s visitors.
A recent innovation in wayfinding and orientation mechanisms is to give them sensing and communication capabilities so they can identify people by their smartphones and then provide personalized directions or information. These so-called “beacon” systems have been deployed at numerous airports, including London’s Gatwick, San Francisco, and Miami.[68]
Organizing Digital Resources
Organizing systems that arrange digital resources like digital documents or information services have some important differences from those that organize physical resources. Because digital resources can be easily copied or interlinked, they are free from the “one place at a time” limitation.[69] The actual storage locations for digital resources are no longer visible or very important. It hardly matters if a digital document or video resides on a computer in Berkeley or Bangalore if it can be located and accessed efficiently.[70]
Moreover, because the functions and capabilities of digital resources are not directly manifested as physical properties, the constraints imposed on all material objects do not matter to digital content in many circumstances.[71]
An emerging issue in the field of digital humanities is the requirement to recognize the materiality of the environment that enables people to create and interact with digital resources. Even if the resources themselves are intangible, it can be necessary to study and preserve the technological and social context in which they exist to fully understand them.[72]
An organizing system for digital resources can also use digital description resources that are associated with them. Since the incremental costs of adding processing and storage capacity to digital organizing systems are small, collections of both primary digital resources and description resources can be arbitrarily large. Digital organizing systems can support collections and interactions at a scale that is impossible in organizing systems that are entirely physical, and they can implement services and functions that exploit the exponentially growing processing, storage and communication capabilities available today. This all sounds good, unless you are the small local business with limited onsite inventory that cannot compete with global web retailers that offer many more choices from a network of warehouses.[73]
Information resources in either physical or digital form are typically organized using intrinsic properties like author names, creation dates, publisher, or the set of words that they contain. Information resources can also be organized using assigned properties like subject classifications, names, or identifiers. Information resources can also be organized using behavioral or transactional properties collected about individuals or about groups of people with similar interaction histories. For example, Amazon and Netflix use browsing and purchasing behavior to make book and movie recommendations.[74]
Memories can be viewed either as physical (because at some level they are represented in the brain) or as digital (because they are retrieved as electrical impulses), but memory techniques like the method of loci and memory palaces reify this duality in an interesting way.
While physical resources must be stored in physical locations, our powerful spatial memory provides an opportunity for us to, in a sense, store mental resources in physical locations. Our hippocampus, the brain component dedicated to memory, is highly developed for storing and recalling memories of physical locations. The ancient Greeks relied on this capability and devised a mnemonic system—the method of loci—which involved attaching things to remember, the key ideas in a speech perhaps, to well-known physical locations. While giving the speech, then, all one must do is imagine walking through that physical location from idea to idea. Today, champion memorizers use this technique to associate items with places in vividly imagined “memory palaces.” While you may not be interested in memorizing the order of a deck of cards, recognizing the power of our spatial memory may be worth considering when designing your organizing system or when analyzing the successes or failures of a system.[75]
Organizing Web-based Resources
The Domain Name System (DNS) is the most inherent scheme for organizing web resources. Top-level domains for countries (.us, .jp, .cn, etc.) and generic resource categories (.com, .edu, .org, .gov, etc.) provide some clues about the resources organized by a website. These clues are most reliable for large established enterprises and publishers; we know what to expect at ibm.com, Berkeley.edu, and sfgov.org.[76]
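As a rough sketch of how a top-level domain can serve as a clue, the following Python snippet extracts the last label of a host name and looks it up in a small table. The `TLD_HINTS` table and the `tld_hint` function are illustrative assumptions, not part of DNS itself, and the hints are informal expectations rather than guarantees about a site's content.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table: informal expectations only, not guarantees.
TLD_HINTS = {
    "com": "generic: commercial",
    "edu": "generic: education",
    "org": "generic: organization",
    "gov": "generic: government",
    "us": "country: United States",
    "jp": "country: Japan",
    "cn": "country: China",
}

def tld_hint(url):
    """Return (top-level domain, informal hint) for a URL."""
    host = urlsplit(url).hostname or ""   # hostname is lowercased by urlsplit
    tld = host.rsplit(".", 1)[-1]
    return tld, TLD_HINTS.get(tld, "unknown")

for url in ("https://www.ibm.com/", "http://Berkeley.edu/", "https://sfgov.org/"):
    print(url, tld_hint(url))
```

The lookup says nothing about whether the clue is reliable; as the text notes, that depends on who operates the site.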
The network of hyperlinks among web resources challenges the notion of a collection, because it makes it impractical to define a precise boundary around any collection smaller than the complete web.[77] Furthermore, authors are increasingly using “web-native” publication models, creating networks of articles that blur the notions of articles and journals. For example, scientific authors are interconnecting scientific findings with their underlying research data, with discipline-specific data repositories, or with software for analyzing, visualizing, simulating, or otherwise interacting with the information.[78]
The conventional library is both a collection of books and the physical space in which the collection is managed. On the web, rich hyperlinking and the fact that the actual storage location of web resources is unimportant to the end users fundamentally undermine the idea that organizing systems must collect resources and then arrange them under local control to be effective. The spectacular rise during the 1990s of the AOL “walled garden,” created on the assumption that the open web was unreliable, insecure, and pernicious, was for a time a striking historical reminder and warning to designers of closed resource collections until its equally spectacular collapse in the following decade.[79] But Facebook so far is succeeding by following a walled garden strategy.
“Information Architecture” and Organizing Systems
The discipline known as information architecture can be viewed as a specialized approach for designing the information models and their systematic manifestations in user experiences on websites and in other information-intensive organizing systems.[80] Abstract patterns of information content or organization are sometimes called architectures, so it is straightforward from the perspective of the discipline of organizing to define the activity of information architecture as designing an abstract and effective organization of information and then exposing that organization to facilitate navigation and information use. Note how the first part of this definition refers to intentional arrangement of resources, and the second to the interactions enabled by that arrangement.
The Activities of Information Architecture
Selecting Resources: To make good choices about what content to include in an information system or service, methods and tools for creating and organizing the information that is potentially available are important. Glushko and McGrath’s method for creating a “Document Inventory” and Halvorson and Rach’s “Information Inventory” both use a matrix or grid format to list information sources and various associated properties. Once the inventory is completed, the information must be evaluated with respect to the user and information requirements. This usually requires a more fine-grained analysis to choose the most reliable or reusable source when there are alternatives. This process is usually called content auditing, and tools or templates for organizing the work are easy to find on the web.
Organizing Resources: Tidwell proposes a set of design patterns for input forms, text and graphic editors, information graphics, calendars, and other common types of web applications that organize resources. Morville and Rosenfeld classify design patterns as “organization schemes” and “organization structures,” reinforcing the idea that information architecture is a sub-specialty of the discipline of organizing.
Designing Interactions: Kalbach presents design patterns and implementations for navigation interactions. Resmini and Rosati discuss architectures and examples for information architectures that interconnect physical and digital channels. Marcotte introduces techniques for adapting user interfaces to the size and capabilities of different devices, collectively called responsive web design.
Information architects use a variety of tools for representing information and process models. Common ones include site maps, workflow and dataflow diagrams, and wireframe models. Brown’s Communicating Design and Abel and Baillie’s The Language of Content Strategy are concise sources.[81]
Some information design conventions have become design patterns. Documents use headings, boxes, white space, and horizontal rules to organize information by type and category. Large type signifies more important content than small type, red type indicates an advisory or warning, and italics or bold says “pay attention.”
Some patterns are general and apply to an entire website, page, or interface genre such as a government site, e-commerce site, blog, social network site, home page, “about us” page, and so on. Other patterns are more specific and affect a part of a site or a single component of a page (e.g., autocompletion of a text field, breadcrumb menu, slideshow).
In websites, different categories of content or interactions are typically arranged in different menus. The choices within each menu are then arranged to reflect typical workflows or ordered according to some commonly used property like size, percentage, or price.
All design patterns reflect and reinforce the user’s past experiences with content and interface components, and this familiarity reduces the cognitive complexity of user interface interaction, requiring users to pay less attention.[82]
However, interface designers can take advantage of this familiarity and employ design patterns in a less beneficial way to manipulate users, control their behaviors, or trick them into taking actions they do not intend. Patterns used this way are sometimes called Dark Patterns.
Some websites and applications employ Dark Patterns, which rely on user familiarity with good design patterns to induce users to take actions or fail to take actions in ways counter to their best interests. For example, a website may exploit familiar patterns to induce users to click on an ad disguised as a news item, sign up for unwanted e-mails, disclose personal information, or ignore important terms and conditions because they are buried in tiny text or in unusual locations.
Darkpatterns.org collects and classifies dark patterns. The largest categories are “bait and switch” (suggest one action but cause another), “trick questions” (misleading phrasing of an option), and “misdirection” (focusing attention on one thing to distract from another). The website has numerous examples of interfaces that try to get users to install additional software or change their defaults to a company’s product during installation. Other examples are from commerce sites that conceal the cheapest options, add additional fees at the very end of the purchase process, or make it difficult to accurately compare costs.
These practices are enough of a concern that some governments have begun to regulate the information that must be provided to consumers when purchasing digital products. The Directive on Consumer Rights published by the European Commission in June 2014 contains instructions about design choices that should be avoided, such as allowing additional purchases and payments without the consumer’s consent. The Directive even includes a model set of patterns to help designers comply with it.[83]
Dark patterns can be used to manipulate interactions with physical resources too. Gas pumps with three or four grades of gasoline typically arrange the grades in order of price, with the cheapest gas at the left and the most expensive on the right. Some gas stations instead put the cheapest gas in the middle, which causes inattentive customers who are relying on the usual pattern to buy more expensive gas than they intended.
Many organizing systems need to support interactions to find, identify, and select resources. Some of these systems contain both physical and digital resources, as in a bookstore with both web and physical channels, and many interactions are implemented across more than one device. Both the cross-channel and multiple-device situations create user expectations that interactions will be consistent across these different contexts. Starting with a conceptual model and separating content and structure from presentation, as we discussed in the section called “The Concept of “Organizing Principle””, gives organizing systems more implementation alternatives and makes them more robust in the face of technology diversity and change.
A model-based foundation is also essential in information visualization applications, which depict the structure and relationships in large data collections using spatial and graphical conventions to enable user interactions for exploration and analysis. By transforming data and applying color, texture, density, and other properties that are more directly perceptible, information visualization applications enable people to obtain more information than they can from text displays.[84]
Organizing With Descriptive Statistics
Descriptive statistics summarize a collection of resources or dataset with two types of measures:
Measures of central tendency: Mean, median, and mode; which measure is appropriate depends on the level of measurement represented in the numbers being described (these measures and the concept of levels of measurement are defined in the section called “Organizing Digital Resources”).
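The three measures of central tendency can be sketched with Python's standard `statistics` module. The values below are illustrative assumptions: `ages` stands for a ratio-level measurement, for which the mean and median are meaningful, while `sex_codes` stands for a nominal category encoded as numbers, for which only the mode makes sense.

```python
from statistics import mean, median, mode

# Illustrative values only (assumed for this sketch).
ages = [32, 19, 23, 34, 52, 60, 36, 38, 23]   # ratio-level measurements
sex_codes = [1, 0, 0, 1, 1, 1, 0, 0, 1]       # nominal codes, not quantities

age_mean = mean(ages)        # arithmetic average
age_median = median(ages)    # middle value of the sorted list
sex_mode = mode(sex_codes)   # most frequent category code

print(f"mean age   = {age_mean:.1f}")
print(f"median age = {age_median}")
print(f"modal sex code = {sex_mode}")
```

Computing `mean(sex_codes)` would run without error, but the result would not describe anything meaningful, which is why the level of measurement has to be known before choosing a summary.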
Exploratory Analysis to Understand Data
A dataset whose fields or attributes lack information about data types and units of measure has little use because the data lacks meaning. When some, but not all, parts of the data are named or annotated, avoid over-interpreting these descriptions’ meanings. (See the section called “Naming Resources”.)
We will do some exploratory analysis to understand what an example dataset contains and how we might use it. For our example, we consider a collection of a few hundred records from a healthcare study, whose first eight records and first five data fields in each record are shown in Figure 3.2a, “Example Dataset”.
ID | Sex | Temp | Age | Weight | …
1 | 1 | 97.6 | 32 | 135 |
2 | 0 | 97.6 | 19 | 118 |
3 | 0 | 97.6 | 23 | 128 |
4 | 1 | 98.7 | 34 | 140 |
5 | 1 | 98.5 | 52 | 162 |
6 | 1 | 98.7 | 60 | 160 |
7 | 0 | 98.3 | 36 | 148 |
8 | 0 | 98.3 | 38 | 155 |
… | … | | | |
260 | 1 | 99.0 | 23 | 123 |
The “Temp” column contains several hundred different numeric values in the complete dataset, ranging from 96.8 to 100.6, with a mean of 98.6. These values are sensible if the label “Temp” means the under-the-tongue body temperature in degrees Fahrenheit of the study participant when the other measures were obtained. This type of data is usefully viewed as a histogram to get a sense of the spread and shape, shown in Figure 3.2b, “Temperature”.
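A histogram of this kind can be sketched in a few lines of Python. The snippet uses only the nine temperature values visible in the example dataset, not the complete dataset, and the half-degree bin width and the `bin_start` helper are assumptions made for illustration.

```python
from collections import Counter

# The nine "Temp" values visible in the example dataset (not the full data).
temps = [97.6, 97.6, 97.6, 98.7, 98.5, 98.7, 98.3, 98.3, 99.0]

def bin_start(value, width_tenths=5):
    """Lower edge of the bin containing value; bins are 0.5 degrees wide."""
    tenths = round(value * 10)   # work in integer tenths to avoid float drift
    return (tenths // width_tenths) * width_tenths / 10

hist = Counter(bin_start(t) for t in temps)

# Text rendering: one row per bin, one '#' per observation.
for start in sorted(hist):
    print(f"{start:5.1f} | {'#' * hist[start]}")
```

With so few values the shape is not informative; the same binning applied to all the records is what reveals the spread around the mean.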
The End of Average tells the story of how the U.S. military designed aircraft cockpits beginning in 1926 on the basis of the average dimensions of a 1926 pilot. In 1950, researchers measured over four thousand pilots only to discover that no actual pilot had average values on all the measures, and recommended adjustable seats and controls in cockpit design.
The “Weight” column has about 220 different numeric values, from 82 to 300, and judging from this range we can infer that the weights are measured in pounds. The data follows an uneven distribution with peaks around 160 and 200, and a small peak at 300. This odd shape appears in the histogram of Figure 3.2d, “Weight”. The two peaks in this so-called multi-modal histogram suggest that this measure is mixing two different kinds of resources, and indeed it is because weights of men and women follow different distributions. It would thus be useful to use the categorical “Sex” data to separate these populations, and Figure 3.2e, “Sex and Weight: Female” shows how analyzing weight for women and men as different populations is much more informative as an organizing principle than combining them.
Figure 3.2e. Sex and Weight: Female
Figure 3.2f. Sex and Weight: Male
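Separating a mixed population by a categorical property can be sketched as a simple grouping step. The snippet below uses only the nine records visible in the example dataset. Because the dataset does not say which "Sex" code stands for which sex, the codes are left uninterpreted here, and the variable names are assumptions of this sketch.

```python
from collections import defaultdict
from statistics import mean

# (Sex code, Weight) pairs from the nine visible records of the example dataset.
records = [(1, 135), (0, 118), (0, 128), (1, 140), (1, 162),
           (1, 160), (0, 148), (0, 155), (1, 123)]

weights_by_sex = defaultdict(list)
for sex_code, weight in records:
    weights_by_sex[sex_code].append(weight)

# Summarize each subpopulation separately instead of the combined column.
mean_weight_by_sex = {code: mean(ws) for code, ws in weights_by_sex.items()}

for code in sorted(mean_weight_by_sex):
    print(f"sex code {code}: n={len(weights_by_sex[code])}, "
          f"mean weight={mean_weight_by_sex[code]:.2f}")
```

Plotting a separate histogram for each group, rather than just computing a mean, is what would reproduce the kind of comparison shown in the two figures.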
Detecting Errors and Fraud in Data
Because of the very high transaction rate and the relatively small probability of fraud, credit card fraud is detected using machine learning algorithms. The classifier is trained with known good and bad transactions using properties like average amount, frequency, and location to develop a model of each cardholder’s “data behavior” so that a transaction can quickly be assigned a probability that it is fraudulent. (More about this kind of computational classification in Chapter 7, Categorization: Describing Resource Classes and Types.)[86]
Organizing with Multiple Resource Properties
If multiple resource properties are considered in a fixed order, the resulting arrangement forms a logical hierarchy. The top-level categories of resources are created based on the values of the property evaluated first, and then each category is further subdivided using other properties until each resource is classified in only a single category. Consider the hierarchical system of folders used by a professor to arrange the digital resources on his computer; the first level distinguishes personal documents from work-related documents; work is then subdivided into teaching and research, teaching is subdivided by year, and year divided by course.
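The way a fixed property order produces a hierarchy can be sketched as a recursive grouping. The documents, property names, and the `build_hierarchy` function below are hypothetical; they loosely follow the professor's folder example but are not taken from any real system.

```python
def build_hierarchy(resources, property_order):
    """Group resources by each property in turn, producing nested dicts."""
    if not property_order:
        return [r["name"] for r in resources]   # leaf: names of the resources
    first, rest = property_order[0], property_order[1:]
    groups = {}
    for r in resources:
        groups.setdefault(r[first], []).append(r)
    return {value: build_hierarchy(members, rest)
            for value, members in groups.items()}

# Hypothetical documents described by three properties.
docs = [
    {"name": "taxes.pdf",    "kind": "personal", "area": "finance",  "year": 2023},
    {"name": "syllabus.pdf", "kind": "work",     "area": "teaching", "year": 2023},
    {"name": "exam.docx",    "kind": "work",     "area": "teaching", "year": 2024},
    {"name": "grant.docx",   "kind": "work",     "area": "research", "year": 2024},
]

folders = build_hierarchy(docs, ["kind", "area", "year"])
by_year_first = build_hierarchy(docs, ["year", "kind", "area"])
print(folders)
print(by_year_first)
```

The same four documents yield a different tree when the properties are evaluated in a different order, which is the sense in which the hierarchy depends on the fixed order chosen.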
An alternative to hierarchical organization that is often used in digital organizing systems is faceted classification, in which the different properties for the resources can be evaluated in any order. For example, you can select wines from the wine.com store catalog by type of grape, cost, or region and consider these property facets in any order. Three people might each end up choosing the same moderately-priced Kendall Jackson California Chardonnay, but one of them might have started the search based on price, one based on the grape varietal, and the third with the region. This kind of interaction in effect generates a different logical hierarchy for every different combination of property values, and each user made his final selection from a different set of wines.
Faceted classification allows a collection of description resources to be dynamically re-organized into as many categories as there are combinations of values on the descriptive facets, depending on the priority or point of view the user applies to the facets. Of course this only works because the physical resources are not themselves being rearranged, only their digital descriptions.
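Faceted selection can be sketched as a sequence of filters over description records, applied in whatever order the user prefers. The catalog entries, facet names, and functions below are hypothetical and are not drawn from any actual store catalog.

```python
# Hypothetical description records; only descriptions are filtered,
# the physical bottles are never rearranged.
wines = [
    {"name": "Wine A", "grape": "Chardonnay", "region": "California", "price_band": "moderate"},
    {"name": "Wine B", "grape": "Chardonnay", "region": "Burgundy",   "price_band": "premium"},
    {"name": "Wine C", "grape": "Zinfandel",  "region": "California", "price_band": "moderate"},
    {"name": "Wine D", "grape": "Pinot Noir", "region": "Oregon",     "price_band": "moderate"},
]

def narrow(items, facet, value):
    """Keep only the items whose facet has the given value."""
    return [w for w in items if w[facet] == value]

def browse(items, selections):
    """Apply facet selections in order; return the names remaining after each step."""
    path = []
    for facet, value in selections:
        items = narrow(items, facet, value)
        path.append([w["name"] for w in items])
    return path

by_price = browse(wines, [("price_band", "moderate"), ("grape", "Chardonnay"),
                          ("region", "California")])
by_grape = browse(wines, [("grape", "Chardonnay"), ("region", "California"),
                          ("price_band", "moderate")])
print(by_price)
print(by_grape)
```

Both users arrive at the same wine, but the intermediate sets they chose from differ, which corresponds to the different logical hierarchies described above.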
Applications that organize large collections of digital information, including those for search, natural language processing, image classification, personalized recommendation, and other computationally intensive domains, often use huge numbers of resource properties (which are often called “features” or “dimensions”). For example, in document collections each unique word might initially be treated as a feature by machine learning algorithms, so there might be tens of thousands of features.
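Treating each unique word as a feature can be sketched with a simple word-count representation. The two documents, the tokenizer, and the names used below are assumptions made for illustration; real systems typically add steps such as stemming or stop-word removal that are not shown here.

```python
import re
from collections import Counter

# Two tiny hypothetical documents.
docs = {
    "d1": "Organizing systems arrange resources",
    "d2": "Resources support interactions; interactions create value",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# Each unique word in the collection becomes one feature (dimension).
vocabulary = sorted({word for text in docs.values() for word in tokenize(text)})

def to_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = {doc_id: to_vector(text) for doc_id, text in docs.items()}
print(vocabulary)
print(vectors)
```

Even these two short sentences produce eight features, which suggests how quickly the count grows to tens of thousands for a realistic collection.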
Chapter 8, Classification: Assigning Resources to Categories explains principles and methods for hierarchical and faceted classification in more detail.
Designing Resource-based Interactions
We need to focus on the interactions that are enabled because of the intentional acts of description or arrangement that transform a collection of resources into an organizing system. With physical resources, it is easy to distinguish the interactions that are designed into and directly supported by an organizing system because of intentional acts of description or arrangement from those that can take place with resources after they have been accessed. For example, when a book is checked out of a library it might be read, translated, summarized, criticized, or otherwise used—but none of these interactions would be considered a capability of the book that had been designed into the library. Some physical resources can initiate interactions, as surely “human resources” and “smart” objects with sensors and other capabilities can, but most physical resources are passive. We will discuss this idea of resource agency in the section called “Resource Agency”.
Additional issues in the design of interactions with resources are whether users have direct or mediated access to the resources, and whether they interact with the resources themselves or only with copies or descriptions of them. For example, users have direct access to original resources in a collection when they browse through library stacks or wander in museum galleries.[87] Users have mediated or indirect access when they use catalogs or search engines. Because digital resources can be easily reproduced, it can be difficult to distinguish a copy from the original, which raises questions of authenticity we will discuss in the section called “Authenticity”.
Affordance and Capability
The concept of affordance, introduced by J. J. Gibson, then extended and popularized by Donald Norman, captures the idea that physical resources and their environments have inherent actionable properties that determine, in conjunction with an actor’s capabilities and cognition, what can be done with the resource.[88]
Including capabilities and cognition brings accessibility considerations into the definition of affordance. A resource is only accessible when it supports interactions, and it is ineffective design to implement interactions with resources that some people are unable to perform. A person who cannot see text cannot read it, and a person who uses a wheelchair cannot select a book from a tall library shelf. Describing or transforming resources to ensure their accessibility is discussed in greater detail in the section called “Accessibility”.
We can analyze the organizing systems with physical resources to identify the affordances and the possible interactions they imply. We can compare the affordances or overall interaction capability enabled by different organizing systems for some type of physical resources, and we often do this without thinking about it. The tradeoffs between the amount of work that goes into organizing a collection of resources and the amount of work required to find and use them are inescapable when the resources are physical objects or information resources are in physical form. We can immediately see that storing information on scrolls does not enable the random access capability that is possible with books.
First, the affordances we can perceive might not be tied to any useful interaction. Donald Norman joked that every computer screen within reaching distance affords touching, but unless the display is touch-sensitive, this affordance only benefits companies that sell screen-cleaning materials.[89]
Second, most of the interactions that are supported by digital resources are not apparent when you encounter them. You cannot tell from their names, but you probably know from past experience what interactions are possible with files of types “.doc” and “.pdf.” You probably do not know what interactions take place with “.xpi” and “.mobi” files.[90]
Once you have discovered it, the capability of digital resources and information systems can be assessed by counting the number of functions, services, or application program interfaces. However, this very coarse measure does not take into account differences in the capability or generality of a particular interaction. For example, two organizing systems might both have a search function, but differences in the operators they allow, the sophistication of pre-processing of the content to create index terms, or their usability can make them vastly differ in power, precision, and effectiveness.[91]
For example, a “smart travel agent” service can use a user’s appointment calendar, past travel history, and information sources like airline and hotel reservation services to transform a minimal interaction like “book a business trip to New York for next week’s meeting” into numerous hidden queries that would have otherwise required separate interactions. These queries are interconnected by logical or causal dependencies that are represented by information that overlaps between them. For example, all travel-related services (airlines, hotels, ground transportation) need the traveler’s identity and the time and location of his travel. A New York trip might involve all of these services, and they need to fit together in time and location for the trip to make sense. The hotel reservation needs to begin the day the flight arrives in the destination city, the limousine service needs to meet the traveler shortly after the plane lands, and the restaurant reservation should be convenient in time and location to the hotel.[92]
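The dependencies among the hidden queries can be sketched as consistency checks over the information they share. Everything in the snippet below is hypothetical: the dates, the one-hour waiting limit, and the `itinerary_problems` function are assumptions for illustration, not a description of any actual travel service.

```python
from datetime import datetime, date, timedelta

def itinerary_problems(flight_arrival, hotel_check_in, limo_pickup,
                       max_wait=timedelta(hours=1)):
    """Return a list of ways the bookings fail to fit together."""
    problems = []
    # The hotel reservation should begin the day the flight arrives.
    if hotel_check_in != flight_arrival.date():
        problems.append("hotel stay does not start on the arrival day")
    # The limousine should meet the traveler shortly after landing.
    wait = limo_pickup - flight_arrival
    if wait < timedelta(0) or wait > max_wait:
        problems.append("limousine pickup is not shortly after landing")
    return problems

# Hypothetical bookings that share the same arrival time and place.
arrival = datetime(2024, 5, 6, 15, 30)
ok = itinerary_problems(arrival, date(2024, 5, 6), datetime(2024, 5, 6, 16, 0))
bad = itinerary_problems(arrival, date(2024, 5, 7), datetime(2024, 5, 6, 14, 0))
print(ok)
print(bad)
```

Each check relies on a value, here the arrival time, that appears in more than one booking, which is the overlap that lets a single request be expanded into coordinated queries.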
Interaction and Value Creation
Furthermore, Apte and Mason recognized that the proportions of these three types of value creating activities can be treated as design parameters, especially where the value created by retrieving or computing information could be completely separated from the value created by physical actions and person-to-person encounters. This configuration of value creation enables automated self-service, in which the human service provider can be replaced by technology, and outsourcing, in which the human provider is separated in space or time from the customer.[93]
Value Creation with Physical Resources
A large university library contains millions of books and academic journals, and access to those resources can require a long walk deep into the library stacks after a consultation with a reference librarian or a search in a library catalog. For decades library users searched through description resources—first printed library cards, and then online catalogs and databases of bibliographic citations—to locate the primary resources they wanted to access. The surrogate descriptions of the resources needed to be detailed so that users could assess the relevance of the resource without expending the significant effort of obtaining and examining the primary resource.[94]
However, for most people the primary purpose of interacting with a library is to access the information contained in its resources. Many people prefer accessing digital documents or books to accessing the original physical resource because the incidental physical and interpersonal interactions are unnecessary. In addition, many library searches are for known items, which are easily supported by digital search.[95]
In some organizing systems robotic devices, computational processes, or other entities that can act autonomously with no need for a human agent carry out interactions with physical resources. Robots have profoundly increased efficiency in materials management, “picking and packing” in warehouse fulfillment, office mail delivery, and in many other domains where human agents once located, retrieved, and delivered physical resources. A “library robot” system that can locate books and grasp them from the shelves can manage seven times as many books in the same space used by conventional open stacks.[96]
An automated robot library system at San Francisco State University.
(Photo by Scott Abel. Used with permission.)
Value Creation with Digital Resources
With digital resources, neither physical manipulation nor interpersonal contact is required, and the essence of the interaction is information exchange or symbolic manipulation of the information contained in the resource.[97] Put another way, by replacing interactions that involve people and physical resources with symbolic ones, organizing systems can lower costs without reducing user satisfaction. This is why so many businesses have automated their information-intensive processes with self-service technology.
Similarly, web search engines eliminate the physical effort required to visit a library and enable users to consult more readily accessible digital resources. A search engine returns a list of the page titles of resources that can be directly accessed with just another click, so it takes little effort to go from the query results to the primary resource. This reduces the need for the rich surrogate descriptions that libraries have always been known for because it enables rapid evaluation and iterative query refinement.[98]
Stop and Think: Browsing for Books
Libraries recognize that they need to do a better job integrating their collections into the “web spaces” and web-based activities of their users if they hope to change the provably suboptimal strategies of “information foraging” most people have adopted that rely too much on the web and too little on the library.[99] Some libraries are experimenting with Semantic Web and “Linked Data” technologies that would integrate their extensive bibliographic resources with resources on the open web.[100]
Museums have aggressively embraced the web to provide access to their collections. While few museum visitors would prefer viewing a digital image over experiencing an original painting, sculpture, or other physical artifact, the alternative is often no access at all. Most museum collections are far larger than the space available to display them, so the web makes it possible to provide access to otherwise hidden resources.[101]
Richer interactions with digital text resources are possible when they are encoded in an application or presentation-independent format. Automated content reuse and “single-source publishing” is most efficiently accomplished when text is encoded in XML, but much of this XML is produced by transforming text originally created in word processing formats. Once it is in XML, digital information can be distributed, processed, reused, transformed, mixed, remixed, and recombined into different formats for different purposes, applications, devices, or users in ways that are almost impossible to imagine when it is represented in a tangible (and therefore static) medium like a book on a shelf or a box full of paper files.[103]
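Single-source publishing can be sketched with Python's standard XML library: one presentation-independent source is transformed into two different output formats. The element names (`article`, `title`, `para`) and the two rendering functions are hypothetical and do not correspond to any particular publishing vocabulary.

```python
import xml.etree.ElementTree as ET
from html import escape

# A hypothetical presentation-independent source document.
source = """<article>
  <title>Library Hours</title>
  <para>The library opens at 9 am.</para>
  <para>It closes at 5 pm.</para>
</article>"""

root = ET.fromstring(source)

def to_html(article):
    """Render the article as an HTML fragment."""
    parts = [f"<h1>{escape(article.findtext('title'))}</h1>"]
    parts += [f"<p>{escape(p.text)}</p>" for p in article.findall("para")]
    return "\n".join(parts)

def to_plain_text(article):
    """Render the same article as plain text with an upper-case title."""
    lines = [article.findtext("title").upper(), ""]
    lines += [p.text for p in article.findall("para")]
    return "\n".join(lines)

print(to_html(root))
print(to_plain_text(root))
```

Because the source records what each piece of text is rather than how it should look, adding a third output format would require only another rendering function, with no change to the content.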
As a result, digital books are somewhat controversial and problematic for libraries, whose access models were created based on the economics of print publication and the social contract of the copyright first sale doctrine that allowed libraries to lend printed books.[104]
Accessibility
The United Nations Convention on the Rights of Persons with Disabilities recognizes accessibility to information and communications technologies as a basic human right. There is also a strong business case for accessibility: studies show that accessible websites are used more often, are easier to maintain, and produce better search results.[105]
Many of the techniques for making a resource accessible involve transforming the resource or its description into a different form so someone who could not perceive it or interact with it in its original form can now do so. The most common operating systems all come with general-purpose accessibility features such as reading text aloud, recognizing speech, magnifying text, increasing cursor size, signaling with flashing lights instead of with sounds, providing keyboard shortcuts for selecting and navigating, and connecting to devices for displaying Braille. Google Translate converts text in one language to another, and many people use it to create a rough draft that is finished by a human translator.[106]
Other techniques are not generic and automatic, and instead require investment by authors or designers to make information accessible. Websites are more accessible when images or other non-text content types have straightforward titles, captions, and “alt text” that describes what they are about. Consistent placement and appearance of navigation controls and interaction widgets is essential; for example, in a shopping site “My Cart” might always be found at the top right corner of the page.[107]
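As a minimal sketch of how an author might audit a page for the “alt text” mentioned above, the following Python example uses the standard library HTML parser to list images that have no usable alt attribute. The class name, function name, and sample markup are hypothetical, invented for this illustration. The check is deliberately crude: it flags an empty alt attribute as missing, even though an empty value is sometimes used on purpose for purely decorative images.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> that lacks a non-empty alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        # A missing or whitespace-only alt gives a screen reader nothing to announce.
        if alt is None or not alt.strip():
            self.missing.append(attributes.get("src", "(no src)"))


def images_missing_alt(html):
    """Return the src values of images in the markup that need alt text."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


# Hypothetical page fragment: one described image and two undescribed ones.
sample_page = """
<img src="cart.png" alt="My Cart">
<img src="banner.png">
<img src="logo.png" alt="  ">
"""
print(images_missing_alt(sample_page))
```

A report like this only tells an author where descriptions are absent; whether an existing description is straightforward and useful still requires human judgment.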
If authors apply semantic and structural markup to the text and use formats that distinguish it from presentation instructions, page outlines and summaries can be generated to enhance navigation, and search can be made more precise by limiting it to particular sections or content types. As the “Information IQ” of the source format increases, more can be done to make it more accessible (see the section called “Resource Format” and Figure 4.3, “Information IQ.”).[108]
The Smithsonian Museum in Washington, DC invites visitors to record audio descriptions on mobile devices of the nearly 137 million objects in its collection, and then makes these available to everyone. This is just a small part of its efforts to make its exhibits more accessible. A company called D-Scriptive enables blind people to enjoy Broadway shows more by recording hundreds of audio descriptions that are synchronized with dialog spoken by the actors.[109]
Transcriptions created by skilled people are highly accurate but labor-intensive to produce, so speech-to-text software is increasingly being used to transcribe speech using pre-trained acoustic and language models. Training these models is computationally intensive, and there are many clever techniques to acquire the “labeled” inputs. However, most of them are conceptually simple; they take the huge amount of data collected by voice search applications and analyze what the searcher does with the results to assess the accuracy of the transcription. Transcription accuracy can be improved when models can be specialized by industry or application. For example, speech-to-text software for doctors is trained to recognize medical terminology, while software for use by generic voice recognition services like Apple’s is trained to understand dictation and commands or questions one would ask of a smartphone.
Since text transcripts are machine-readable, unlike audio or video files, adding text transcripts makes it possible for search engines to index audio and video in ways that were previously impossible. Pop Up Archive, an audio search company in Oakland, California, works with speech-to-text software specially trained for news media and spoken word content to make radio, podcasts, and archival audio searchable. A challenge for audio search is that even though a transcription with a few mistakes works just fine for search engines, people often expect transcriptions to be perfect.[110]
When the speech is in a language that is not understood, it needs to be translated as well. Perhaps you have watched a movie on an international flight and were able to choose from subtitles in many different languages. Creating subtitles for a foreign film is an asynchronous task that is substantially easier than doing a real-time translation, and the demand for skilled translators for speeches and other synchronous situations (and interpreters, who translate speech to sign language for people with hearing disabilities) remains high.
Access Policies
Because of their commercial and competitive purposes, organizing systems in business domains are more likely to enforce a granular level of access control that distinguishes people according to their roles and the nature of their interactions with resources. For example, administrative assistants in a company’s Human Resources department are not allowed to see salaries; HR employees in a benefits administration role can see salaries but not change them; management-level employees in HR can change the salaries. Some firms limit access to specific times from authorized computers or IP addresses.[111]
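The Human Resources example can be sketched as a small role-based access check. This is only an illustration under stated assumptions: the role names, the action labels “view” and “change,” the office network range, and the business hours are all hypothetical, and a real system would draw them from its own policy definitions.

```python
import ipaddress
from enum import Enum


class Role(Enum):
    HR_ASSISTANT = "administrative assistant"
    BENEFITS_ADMIN = "benefits administrator"
    HR_MANAGER = "HR manager"


# Permissions on salary records, following the example in the text.
SALARY_PERMISSIONS = {
    Role.HR_ASSISTANT: frozenset(),                  # may not see salaries
    Role.BENEFITS_ADMIN: frozenset({"view"}),        # may see but not change
    Role.HR_MANAGER: frozenset({"view", "change"}),  # may see and change
}

# Hypothetical restrictions on where and when access is allowed.
AUTHORIZED_NETWORK = ipaddress.ip_network("10.20.0.0/16")
BUSINESS_HOURS = range(8, 18)  # 08:00 through 17:59


def is_allowed(role, action):
    """Return True if the role may perform the action on salary records."""
    return action in SALARY_PERMISSIONS.get(role, frozenset())


def request_permitted(role, action, client_ip, hour):
    """Combine the role check with the network and time-of-day restrictions."""
    return (
        is_allowed(role, action)
        and ipaddress.ip_address(client_ip) in AUTHORIZED_NETWORK
        and hour in BUSINESS_HOURS
    )


print(is_allowed(Role.BENEFITS_ADMIN, "view"))
print(request_permitted(Role.HR_MANAGER, "change", "10.20.3.4", 9))
```

Keeping the permissions in a table rather than scattering them through conditional logic makes the policy itself a resource that can be inspected and maintained.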
We can further contrast access policies based on their origins or motivations.
Designed resource access policies are established by the designer or operator of an organizing system to satisfy internally generated requirements. Examples of designed access policies are:
giving more access to paying users than to users who do not pay;
Imposed policies are mandated by an external entity, and the organizing system must comply with them. For example, an organizing system might have to follow information privacy, security, or other regulations that restrict access to resources or the interactions that can be made with them.
University libraries typically complement or replace parts of their print collections with networked access to digital content licensed from publishers. Typical licensing terms then require them to restrict access to users that are associated with the university, either by being on campus or by using virtual private network (VPN) software that controls remote access to the library network.[112] Copyright law limits the uses of a substantial majority of the books in the collections of major libraries, prohibiting them from being made fully available in digital formats. Museums often prohibit photography because they do not own the rights to modern works they display.
Maintaining Resources
Maintaining resources is an important activity in every organizing system because resources must be available at the time they are needed. Beyond these basic shared motivations are substantial differences in maintenance goals and methods depending on the domain of the organizing system.
However, different domains sometimes use the same terms to describe different maintenance activities and different terms for similar activities. Common maintenance activities are storage, preservation, curation, and governance. Storage is most often used when referring to physical or technological aspects of maintaining resources; backup (for short-term storage), archiving (for long-term storage), and migration (moving stored resources from one storage device to another) are similar in this respect. The other three terms generally refer to activities or methods that more closely overlap in meaning; we will distinguish them in the section called “Preservation” through the section called “Governance”.
Selection and maintenance are interdependent. Selection is based on an initial set of rules that determine which resources enter the organizing system. Maintenance includes the work to preserve the resources, the processes for evaluating and revising the original selection criteria, and the removal of resources from the system when they no longer need to be preserved. More stringent rules for selecting resources generally imply a maintenance plan that carefully enforces the same constraints that limit selection. This is just common sense whether the resource is a piece of art, an automobile, a software package, or a star basketball player; if you worked hard to find or paid a lot to acquire a resource, you are going to take care of it and will not soon be buying another one.
Ideally, maintenance requirements for resources should be anticipated when organizing principles are defined and implemented. Resource descriptions to support preservation of digital resources are especially important.[113]
Motivations for Maintaining Resources
The concept of memory institution broadly applies to a great many organizing systems that share the goal of preserving knowledge and cultural heritage.[114] The primary resources in libraries, museums, data archives or other memory institutions are fixed cultural, historic, or scientific artifacts that are maintained because they are unique and original items with future value. This is why the Musée du Louvre preserves the portrait of the Mona Lisa and the United States National Archives preserves the Declaration of Independence.[115]
In contrast, in business organizing systems, many of the resources that are collected and managed have limited intrinsic value. The motivation for preservation and maintenance is economic; resources are maintained because they are essential in running the business. For example, businesses collect and preserve information about employees, inventory, orders, invoices, etc., because doing so serves internal goals of efficiency, revenue generation, and competitive advantage. The same resources (e.g., customer information) are often used by more than one part of the business.[116] Maintaining the accuracy and consistency of changing resources is a major challenge in business organizing systems.[117]
Unlike libraries, archives, and museums, indefinite preservation is not the central goal of most business organizing systems. These organizing systems mostly manage information needed to carry out day-to-day operations or relatively recent historical information used in decision support and strategic planning. In addition to these internal mandates, businesses have to conform to securities, taxation, and compliance regulations that impose requirements for long-term information preservation.[118]
Of course, libraries, museums, and archives also confront economic issues as they seek to preserve and maintain their collections and themselves as memory institutions.[119] They view their collections as intrinsically valuable in ways that firms generally do not. Because of this, extensive energy goes into preservation, protection, and storage of resources in memory institutions, and resources are rarely discarded or de-accessioned. Art galleries are an interesting hybrid because they organize and preserve collections that are valuable, but if they do not manage to sell some things, they will not stay in business.
In between these contrasting purposes of preservation and maintenance are the motives in personal collections, which occasionally are created because of the inherent value of the items but more typically because of their value in supporting personal activities. Some people treasure old photos or collectibles that belonged to their parents or grandparents and imagine their own children or grandchildren enjoying them, but many old collections seem to end up as offerings on eBay. In addition, many personal organizing systems are task-oriented, so their contents need not be preserved after the task is completed.[120]
Preservation
When the goal is indefinite preservation, other maintenance issues arise if resources deteriorate or are damaged. How much of an artifact’s worth is locked in with the medium used to express it? How much restoration should be attempted? How much of an artifact’s essence is retained when digitized?
Digitization and Preserving Resources
Technological obsolescence is the major challenge in maintaining digital resources. Its most visible form is a result of the relentless evolution of the physical media and environments used to store digital information in institutional, business, and personal organizing systems. Computer data began to be stored on magnetic tape and hard disk drives six decades ago, on floppy disks four decades ago, on CDs three decades ago, on DVDs two decades ago, on solid-state drives half a decade ago, and in “cloud-based” or “virtual” storage environments in the last decade. As the capacity of storage technologies grows, economic and efficiency considerations often make the case to adopt new technology to store newly acquired digital resources and raise questions about what to do with the existing ones.[121]
The second challenge might seem paradoxical. Even though digital storage capacity increases at a staggering pace, the expected useful lifetimes of the physical storage media are measured in years or at best in decades. Colloquial terms for this problem are data rot or “bit rot.” In contrast, books printed on acid-free paper can last for centuries. The contrast is striking; books on library shelves do not disappear if no one uses them, but digital data can be lost if no one wants access to it within a year or two after its creation.[122]
However, limits to the physical lifetime of digital storage media are much less significant than the third challenge, the fact that the software and its associated computing environment used to parse and interpret the resource at the time of preservation might no longer be available when the resource needs to be accessed. Twenty-five years ago most digital documents were created using the WordPerfect word processor, but today the vast majority are created using Microsoft Word and few people still use WordPerfect. Software and services that convert documents from old formats to new ones are widely available, but they are only useful if the old file can be read from its legacy storage medium.[123]
The Hathi Trust Digital Library
The Hathi Trust is a worldwide partnership of several dozen major research institutions and libraries dedicated to “collecting, organizing, preserving, communicating, and sharing the record of human knowledge.” The Hathi Trust was established in 2008 to coordinate the efforts of libraries in managing the digital copies of the books they received in return for providing books to Google for its book digitization projects. Since then the Hathi Trust has broadened its scope to include the public domain books collected by the Internet Archive and numerous other digital collections, and today its digital library has over ten million volumes. The costs of running the Hathi Trust and its digital library are shared in a transparent manner by the institutions that contributed digital collections or that want access to them, which reduces the costs for everyone compared to a “go it alone” strategy. The Hathi Trust Digital Library has separate modes for catalog search and full-text search of the library contents, unlike commercial search engines that do not distinguish them. A second important difference between the Hathi Trust Digital Library and commercial search engines is the absence of display advertising and “sponsored search” results.
(Interoperability and integration are discussed in Chapter 10, Interactions with Resources.)
Preserving the Web
Preservation of web resources is inherently problematic. Unlike libraries, museums, archives, and many other kinds of organizing systems that contain collections of unchanging resources, organizing systems on the web often contain resources that are highly dynamic. Some websites change by adding content, and others change by editing or removing it.[124]
Longitudinal studies have shown that hundreds of millions of web pages change at least once a week, even though most web pages never change or change infrequently.[125] Nevertheless, the continued existence of a particular web page is hardly sufficient to preserve it if it is not popular and relevant enough to show up in the first few pages of search results. Persistent access requires preservation, but preservation is not meaningful if there is no realistic probability of future access.
Comprehensive web search engines like Google and Bing use crawlers to continually update their indexed collections of web pages and their search results link to the current version, so preservation of older versions is explicitly not a goal. Furthermore, search engines do not reveal any details about how frequently they update their collections of indexed pages.[126]
The Internet Archive and the “Wayback Machine”
The Internet Archive (Archive.org), founded by Brewster Kahle, makes preservation of the web its first and foremost activity, and when you enter a URI into its “Wayback Machine” you can see what a site looked like at different moments in time. For example, www.berkeley.edu was archived about 2500 times between October 1996 and January 2013, including about twice a week on average during all of 2012. Even so, since a large site like berkeley.edu often changes many times a day, the Wayback Machine’s preservation of berkeley.edu is incomplete, and it only preserves a fraction of the web’s sites. Since 2006 the Internet Archive has hosted the “Archive-It” service to enable hundreds of schools, libraries, historical societies, and other institutions to archive collections of digital resources.[127]
Preserving Resource Instances
A focus on preserving particular resource instances is most clear in museums and archives, where collections typically consist of unique and original items. There are many copies and derivative works of the Mona Lisa, but if the original Mona Lisa were destroyed none of them would be acceptable as a replacement.[128]
Archivists and historians argue that it is essential to preserve original documents because they convey more information than just their textual content. Paul Duguid recounts how a medical historian used faint smells of vinegar in 18th-century letters to investigate a cholera epidemic because disinfecting letters with vinegar was thought to prevent the spread of the disease. Obviously, the vinegar smell would not have been part of a digitized letter.[129]
Zoos often give a distinctive or attractive animal a name and then market it as a special or unique instance. For example, the Berlin Zoo successfully marketed a polar bear named Knut to become a world famous celebrity, and the zoo made millions of dollars a year through increased visits and sales of branded merchandise. Merchandise sales have continued even though Knut died unexpectedly in March 2011, which suggests that the zoo was less interested in preserving that particular polar bear than in preserving the revenue stream based on that resource.[130]
Recent developments in sensor technology enable very extensive data collection about the state and performance of machines, engines, equipment, and other types of physical resources, including human ones. (Are you wearing an activity tracker right now?) When combined with historical information about maintenance activity, predictive analytics techniques can use this data to determine normal operating ranges and indicators of coming performance degradation or failures. Predictive maintenance can maximize resource lifetimes while minimizing maintenance and inventory costs. These techniques have recently been used to predict when professional basketball players are at risk of an injury, potentially enabling NBA teams to identify the best time to rest their star players without impairing their competitive strategy.[131]
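The idea of deriving a normal operating range from historical data and flagging departures from it can be sketched with a simple statistical rule. This is not the predictive analytics used in practice, which typically relies on much richer models; it only illustrates the principle. The sensor values, the function names, and the choice of three standard deviations are assumptions made for this example.

```python
from statistics import mean, stdev


def normal_range(history, k=3.0):
    """Estimate a normal operating range as mean +/- k sample standard deviations."""
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread


def out_of_range(readings, low, high):
    """Return (index, value) pairs for readings outside the normal range."""
    return [(i, v) for i, v in enumerate(readings) if not low <= v <= high]


# Hypothetical bearing temperatures (degrees Celsius) recorded during healthy operation.
baseline = [70.1, 69.8, 70.4, 70.0, 69.9, 70.2, 70.3, 69.7]
low, high = normal_range(baseline)

# Hypothetical new readings; the last two drift upward.
recent = [70.0, 70.2, 71.5, 73.0]
print(out_of_range(recent, low, high))
```

With these made-up numbers the last two readings fall outside the estimated range, which is the kind of signal that might prompt an inspection before a failure occurs.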
Preserving Resource Types
(Photo by Mike Saechang. Creative Commons CC BY-SA 2.0 license.)
Some business organizing systems are designed to preserve types or classes of resources rather than resource instances. In particular, systems for content management typically organize a repository of reusable or “source” information resources from which specific “product” resources are then generated. For example, content management systems might contain modular information about a company’s products that are assembled and delivered in sales or product catalogs, installation guides, operating guides, or repair manuals.[132]
Businesses strive to preserve the collective knowledge embodied in the company’s people, systems, management techniques, past decisions, customer relationships, and intellectual property. Much of this knowledge is tacit or informal “know how” (knowing how to get things done or knowing how things work). Knowledge management systems (KMS) are a type of business organizing system whose goal is to capture and systematize these information resources.[133] As with content management, the focus of knowledge management is the reuse of “knowledge as type,” putting the focus on the knowledge rather than the specifics of how it found its way into the organizing system.
Libraries have a similar emphasis on preserving resource types rather than instances. The bulk of most library collections, especially public libraries, is made up of books that have many equivalent copies in other collections. When a library has a copy of Moby Dick it is preserving the abstract work rather than the particular physical instance—unless the copy of Moby Dick is a rare first edition signed by Melville.
Even when zoos give their popular animals individual names, it seems logical that the zoo’s goal is to preserve animal species rather than instances because any particular animal has a finite lifespan and cannot be preserved forever.[134]
Preserving Resource Collections
In some organizing systems any specific resource might be of little interest or importance in its own right but is valuable because of its membership in a collection of essentially identical items. This is the situation in the data warehouses used by businesses to identify trends in customer or transaction data or in the huge data collections created by scientists. These collections are typically analyzed as complete sets. A scientist does not borrow a single data point when she accesses a data collection; she borrows the complete dataset consisting of millions or billions of data points. This requirement raises difficult questions about what additional software or equipment needs to be preserved in an organizing system along with the data to ensure that it can be reanalyzed.[135]
Sometimes, specific items in a collection might have some value or interest on their own, but they acquire even greater significance and enhanced meaning because of the context created by other items in the collection that are related in some essential way. The odd collection of “things people swallow that they should not” at the Mütter Museum is a perfect example.[136]
Curation
For almost a century curation has referred to the processes by which a resource in a collection is maintained over time, which may include actions to improve access or to restore or transform its representation or presentation.[137]
Furthermore, especially in cultural heritage collections, curation also includes research to identify, describe, and authenticate resources in a collection. Resource descriptions are often updated to reflect new knowledge or interpretations about the primary resources.[138]
Institutional Curation
Curation in these institutional contexts requires extensive professional training. The institutional authority empowers individuals or groups to make curation decisions. No one questions whether a museum curator or a compliance manager should be doing what they do.[139]
Individual Curation
The Life-Changing Magic of Tidying Up
Many of the ever-growing number of self-help books about organizing seem to approach it as an intellectual contest to devise more elaborate and optimized storage strategies. Marie Kondo’s wildly popular 2014 book The Life-Changing Magic of Tidying Up, an international best-seller, has upended the conversation with an unapologetic dogma of removal that promises to yield a happier (and much more minimalist) life for individuals with their at-home organizing systems.
Kondo’s method mandates that only what brings one joy may be kept. Everything else must be tossed: unused gifts, books kept only for reference but never referenced, unworn clothing, and anything else that does not bring its owner joy. Kondo’s approach is designed for personal organizing systems, and would be difficult to implement in systems used by multiple individuals, much less institutions. However, Kondo’s rejection of the concept that things should be saved for a rainy day might benefit organizations by making them more attentive to the costs of maintaining resources with no current use.
While people must make up their own minds about how they manage their possessions, there is compelling evidence from cognitive science and behavioral economics that decision-making throughout the day can be mentally exhausting. Kondo’s approach implicitly recognizes this limitation by requiring cognitive energy up front to reduce the total number of resources to the bare minimum necessary (by one’s own “joy standards”). This philosophy has people spend decision-making energy where it counts the most and makes it easier to make maintenance decisions over time.
Curation by individuals has been studied a great deal in the research discipline of Personal Information Management (PIM).[140] Much of this work has been influenced for decades by a seminal article written by Vannevar Bush titled “As We May Think.” Bush envisioned the Memex, “a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility.” Bush’s most influential idea was his proposal for organizing sets of related resources as “trails” connected by associative links, the ancestor of the hypertext links that define today’s web.[141]
Social and Web Curation
Many individuals spend a great amount of time curating their own websites, but when a site can attract large numbers of users, it often allows users to annotate, “tag,” “like,” “+1,” and otherwise evaluate its resources. The concept of curation has recently been adapted to refer to these volunteer efforts of individuals to create, maintain, and evaluate web resources.[142] The massive scale of these bottom-up and distributed activities is curation by “crowdsourcing,” the continuously aggregated actions and contributions of users.[143]
The informal and organic “folksonomies” that result from their aggregated effort create organization and authority through network effects.[144] This undermines traditional centralized mechanisms of organization and governance and threatens any business model in publishing, education, and entertainment that has relied on top-down control and professional curation.[145] Professional curators are not pleased to have the ad hoc work of untrained people working on websites described as curation.
Most websites are not curated in a systematic way, and the decentralized nature of the web and its easy extensibility means that the web as a whole defies curation. It is easy to find many copies of the same document, image, music file, or video and not easy to determine which is the original, authoritative or authorized version. Broken links return “Error 404 Not Found” messages.[146]
Problems that result from lazy or careless webmastering are minor compared to those that result from deliberate misclassification, falsification, or malice. An entirely new vocabulary has emerged to describe these web resources with bad intent: “spam,” “phishing,” “malware,” “fakeware,” “spyware,” “keyword stuffing,” “spamdexing,” “META tag abuse,” “link farms,” “cybersquatters,” “phantom sites,” and many more.[147] Internet service providers, security software firms, email services, and search engines are engaged in a constant war against these kinds of malicious resources and techniques.[148]
Computational Curation
Computational curation is more predictable than curation done by people, but search engines have long been accused of bias built into their algorithms. For example, Google’s search engine has been criticized for giving too much credibility to websites with .edu domain names, to sites that have been around for a long time, and to sites that are owned by or that partner with the company, like Google Maps or YouTube.[149]
In organizing systems that contain data, there are numerous tools for name matching, the task of determining when two different text strings denote the same person, object, or other named entity. This problem of eliminating duplicates and establishing a controlled or authoritative version of the data item arises in numerous application areas but familiar ones include law-enforcement and counter-terrorism. Done incorrectly, it might mean that you end up on a “watch list” and experience difficulties every time you want to fly commercially.
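A minimal sketch of name matching, assuming a simple approach of normalizing each string and then scoring similarity with Python’s standard difflib module. Production name-matching tools use far more elaborate methods (phonetic codes, cultural naming conventions, additional attributes such as birth dates). The function names and the 0.85 threshold here are hypothetical choices for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, strip accents and punctuation, and sort the name tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in without_accents.lower())
    return " ".join(sorted(cleaned.split()))


def similarity(a, b):
    """Return a ratio in [0, 1] comparing the normalized forms of two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def same_entity(a, b, threshold=0.85):
    """Guess whether two strings denote the same entity (threshold is an assumption)."""
    return similarity(a, b) >= threshold


# Different surface forms of the same name normalize to the same string.
print(normalize("Glushko, Robert J."))
print(same_entity("Glushko, Robert J.", "Robert J. Glushko"))
print(same_entity("Erik Wilde", "Jess Hemerly"))
```

The threshold embodies the trade-off described above: set it too low and distinct people are merged into one record, which is how an innocent traveler can end up matched to a watch list entry; set it too high and true duplicates go undetected.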
An extremely promising new approach to computational curation involves using scientific measuring equipment to analyze damaged physical resources and then building software models of the resources that can be manipulated to restore the resources or otherwise improve access to their content. For example, the first sound recordings were made using rotating wax cylinders; sounds caused a diaphragm to vibrate, the pattern of vibration was transferred into a connected stylus, which then cut a groove into the wax. When the cylinder was rotated past a passive stylus, it would vibrate according to the groove pattern, and the amplified vibrations could be heard as the replayed sound. Unfortunately, wax cylinders from the 19th century are now so fragile that they would fall apart if they were played. This dilemma was resolved by Carl Haber, an experimental physicist at the Lawrence Berkeley Laboratory. Haber applied image processing techniques to microscope-detailed scans of the grooves in the wax cylinders, so that measurements of the grooves could then be transformed to reproduce the sounds captured in them.[150]
A second example of computational curation applied to digital preservation is work done by a research team led by Melissa Terras and Tim Weyrich at University College London to build a 3-dimensional model of a 17th-century “Great Parchment Book” damaged in an 18th-century fire. The parchment was singed, shriveled, creased, folded, and nearly impossible to read (see website). After traditional document restoration techniques (e.g., illustrated in photos in the section called “Preservation”) went as far as they could, the researchers used digital image capture and modeling techniques to create a software model of the parchment that could stretch and flatten the digital document to discover text hidden by the damage.
Discarding, Removing, and Not Keeping
So far, we have discussed maintenance as activities involved in preserving and protecting resources in an organizing system over time. An essential part of maintenance is the phasing out of resources that are damaged or unusable, expired or past their effectivity dates, or no longer relevant to any interaction.
Many organizations admit to a distinct lack of strategy in the removal aspect of maintenance. A firm with outdated storage technology might have to discard older data simply to make room for new data, and might do so without considering that keeping some summary statistics would be valuable for historical analysis. Other firms might be biased towards keeping information just because they went to the trouble of collecting or acquiring it. Some amount of “intelligent” removal is an essential ingredient in any maintenance regime, and a popular book argues forcefully for continually discarding resources from personal organizing systems as a method of focusing on the resources that really matter. (See the sidebar, The Life-Changing Magic of Tidying Up.)
Efforts by libraries to automate the discarding of books that have not circulated for several years might seem like the obvious counterpart to their automated acquisition, but such efforts often produce passionate complaints from library patrons.[151]
Other domains have other mechanisms and terms for removing resources. Employees are removed by firing, layoff, or retirement. Athletes are cut or waived or sent down from a sports team if their performance deteriorates.
Keeping an organizing system current often involves some amount of elimination of older resources in order to make space for the new: in fashion retail, the floor is constantly restocked with the latest styles. Software development teams will halt active support and documentation efforts for legacy versions.
Information resources are often discarded to comply with laws about retaining sensitive data. Governments and office holders sometimes destroy documents that might prove damaging or embarrassing if they are discovered through Freedom of Information requests or by opposing political parties.
More positively, the “right to be forgotten” movement and intentional destruction of information records about prior bankruptcy, credit problems, or juvenile arrests after a certain period of time has passed can be seen as a policy of “social forgetfulness” that gives people a chance to get on with their lives.[152]
It is worth noting that the ability to discard without having to reuse is relatively recent. Historically, the urge and need to discard have clashed with the availability of resources. In the Middle Ages, liturgical texts or music would be phased out, perhaps when the music had gone out of style or when entire sections of the liturgy were phased out by decree. When this happened, the parchment or vellum would be reused, either by scraping it down or by flipping it over, pasting it in a book, and using the other side. The former of these solutions often created a palimpsest, a document or other resource in which the remnants of older content remain visible under the new.
Some people have difficulty in discarding things, regardless of their actual value. This behavior is called hoarding, and is now regarded as a kind of obsessive-compulsive disorder that requires treatment because it can cause emotional, physical, social, and even legal problems for the hoarder and family members. It seems unsympathetic that many TV shows and stories have been produced about especially compulsive hoarding. A famous example is that of the Collyer brothers in New York, who shut themselves off from the world for years, and when they were found dead inside their home in 1947 it contained 140 tons of collected items, including 25,000 books, fourteen pianos, thousands of bottles and tin cans, hundreds of yards of fabrics, and even a Model T car chassis.[153]
Governance
Governance overlaps with curation in meaning, but typically has more of a policy focus (what should be done) than a process focus (how to do it). Governance is also more frequently used to describe curation in business and scientific organizing systems than in libraries, archives, and museums. Governance has a broader scope than curation because it extends beyond the resources in a collection and also applies to the software, computing, and networking environments needed to use them. This broader scope also means that governance must specify the rights and responsibilities for the people who might interact with the resources, the circumstances under which that might take place, and the methods they would be allowed to use.
Corporate governance is a common term applied to the ongoing maintenance and management of the relationship between operating practices and long-term strategic goals.[154]
Data governance policies are often shaped by laws, regulations or policies that prohibit the collection of certain kinds of objects or types of information. Privacy laws prohibit the collection or misuse of personally identifiable information about healthcare, education, telecommunications, and video rental, and in some countries they also restrict the information collected during web browsing.[155]
Governance in Business Organizing Systems
Governance is essential to deal with the frequent changes in business organizing systems and the associated activities of data quality management, access control to ensure security and privacy, compliance, deletion, and archiving. For many of these activities, effective governance involves the design and implementation of standard services to ensure that the activities are performed in an effective and consistent manner.[156]
Stop and Think: Business Data Governance
Today’s information-intensive businesses capture and create large amounts of digital data. The concept of “business intelligence” emphasizes the value of data in identifying strategic directions and the tactics to implement them in marketing, customer relationship management, supply chain management and other information-intensive parts of the business.[157] A management aspect of governance in this domain is determining which resources and information will potentially provide economic or competitive advantages and determining which will not. A conceptual and technological aspect of governance is determining how best to organize the useful resources and information in business operations and information systems to secure the potential advantages.
Business intelligence is only as good as the data it is based on, which makes business data governance a critical concern that has rapidly developed its own specialized techniques and vocabulary. The most fundamental governance activity in information-driven businesses is identifying the “master data” about customers, employees, materials, products, suppliers, etc., that is reused by different business functions and is thus central to business operations.[158]
Because digital data can be easily copied, data governance policies might require that all sensitive data be anonymized or encrypted to reduce the risk of privacy breaches. To identify the source of a data breach or to facilitate the assertion of a copyright infringement claim, a digital watermark can be embedded in digital resources.[159]
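As a rough illustration of what such a policy might look like in practice, the following Python sketch replaces sensitive field values with keyed hashes before a record is copied elsewhere. This is pseudonymization, which is weaker than full anonymization, and it is offered only as a sketch under stated assumptions: the field names, the sample record, and the helper functions `pseudonymize` and `pseudonymize_record` are hypothetical, and a real deployment would also need secure key management.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of value as a hex string.

    The same value and key always give the same token, so records can
    still be linked, but the original value is not stored in the copy.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_record(record: dict, sensitive_fields: set, secret_key: bytes) -> dict:
    """Copy a record, replacing the values of sensitive fields with tokens."""
    return {
        field: pseudonymize(str(value), secret_key) if field in sensitive_fields else value
        for field, value in record.items()
    }


# Hypothetical example data; the key would normally come from a secrets store.
key = b"example-key-not-for-production"
customer = {"name": "Pat Example", "email": "pat@example.com", "region": "West"}
safe_copy = pseudonymize_record(customer, {"name", "email"}, key)
```

Using a keyed hash rather than a plain hash is a design choice: without the key, someone holding the copy cannot simply hash guessed names or email addresses to re-identify the records.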
Governance in Scientific Organizing Systems
Scientific data poses special governance problems because of its enormous scale, which dwarfs the datasets managed in most business organizing systems. A scientific data collection might contain tens of millions of files and many petabytes of data. Furthermore, because scientific data is often created using specialized equipment or computers and undergoes complex workflows, it can be necessary to curate the technology and processing context along with data in order to preserve it. An additional barrier to effective scientific data curation is the lack of incentives in scientific culture and publication norms to invest in data retention for reuse by others. [160]
Almost all scientists admit that they are holding “dark data,” data that has never been made available to the rest of the scientific community. There may only be a few scientists worldwide who would want to see a particular dataset, but there are many thousands of these datasets. Other dark data comes from research that fails to find effects; because these negative findings are less likely to be published, literature reviews can be skewed by their omission. Just as Netflix makes the long tail of movies available, perhaps dark data would become more accessible if it could be easily uploaded to a Netflix for Science. [(Heidorn 2008)]
Key Points in Chapter Three
3.6.1. Which activities are common to all organizing systems?
3.6.3. What is the first decision to be made when creating an organizing system?
3.6.4. Why does selection by memory institutions differ from sampling in scientific research?
3.6.5. Does making selection principles clear and consistent ensure that they are good ones?
3.6.6. How does “looking upstream” support better resource selection?
3.6.7. What is a resource property?
3.6.8. What is the relationship between resource properties and organizing principles?
3.6.9. What problems can arise when arranging physical resources?
3.6.13. What is materiality?
3.6.14. Why is the level of measurement important when organizing numeric data?
3.6.15. How can statistics help organize a set of resources?
3.6.16. What factors affect the organization of resources?
3.6.17. What is the fundamental tradeoff faced when organizing physical resources?
3.6.18. What are affordance and capability?
3.6.19. What does it mean for a resource to be accessible?
3.6.21. What is the basis of value creation when interacting with a digital resource?
3.6.22. What factors improve the usability of digital resources?
3.6.23. What is preservation?
3.6.24. What is the relationship between digitization and preservation?
3.6.25. What are curation and governance?
3.6.26. In what ways can computation improve the maintenance of resources?
3.6.27. For what reasons is discarding resources an essential maintenance activity?
3.6.28. What is the role of governance in business organizing systems?
3.6.29. How is governance different in scientific organizing systems?
Which activities are common to all organizing systems?
Selection, organizing, interaction design, and maintenance activities occur in every organizing system.
Are selection, organizing, interaction design, and maintenance the same activities in every organizing system?
These activities are not identical in every domain, but the general terms enable communication and learning about domain-specific methods and vocabularies.
What is the first decision to be made when creating an organizing system?
The most fundamental decision for an organizing system is determining its resource domain, the group or type of resources that are being organized.
Why does selection by memory institutions differ from sampling in scientific research?
Memory institutions select rare and distinctive resources, but in scientific research, a sample must contain representative instances.
Does making selection principles clear and consistent ensure that they are good ones?
Even when the selection principles behind a collection are clear and consistent, they can be unconventional, idiosyncratic, or otherwise biased.
How does “looking upstream” support better resource selection?
If you can determine where the resources come from, you can make better selection decisions by evaluating the people, processes, and organizing systems that create them. (See the section called “Looking “Upstream” and “Downstream” to Select Resources”)
What is a resource property?
In this book we use property in a generic and ordinary sense as a synonym for feature or “characteristic.” Many cognitive and computer scientists are more precise in defining these terms and reserve property for binary predicates (e.g., something is red or not, round or not). If multiple values are possible, the property is called an attribute, “dimension,” or “variable.”
What is the relationship between resource properties and organizing principles?
Most organizing systems use principles that are based on specific resource properties or properties derived from the collection as a whole.
What problems can arise when arranging physical resources?
Some arrangements of physical resources are constrained or precluded by resource properties that might cause problems for other resources or for their users. (See the section called “Organizing with Properties of Physical Resources”)
What are some of the ways in which the mind follows Gestalt principles and imposes simpler interpretations on visual sensations?
There are always multiple interpretations of the sensory stimuli gathered by our visual system, but the mind imposes the simplest ones: things near each other are grouped, complex shapes are viewed as simple shapes that are overlapping, missing information needed to see separate visual patterns as continuous or whole is filled in, and ambiguous figure-ground illusions are given one interpretation at a time. (See the sidebar, Gestalt Principles)
How can built environments influence the expectations, behaviors, and experiences of everyone who enters the space?
Built environments can be designed to encourage or discourage interactions between people, to create a sense of freedom or confinement, to reward exploration or enforce efficiency.
How can we define the activity of “Information Architecture” using the language of the discipline of organizing?
It is straightforward from the perspective of the discipline of organizing to define the activity of information architecture as designing an abstract and effective organization of information and then exposing that organization to facilitate navigation and information use. (See the section called ““Information Architecture” and Organizing Systems”)
What is materiality?
An emerging issue in the field of digital humanities is the requirement to recognize the materiality of the environment that enables people to create and interact with digital resources.
Why is the level of measurement important when organizing numeric data?
The level of measurement (nominal, ordinal, interval, or ratio) of data determines how much quantitative organization of your data will be sensible. (See the section called “Organizing With Descriptive Statistics”)
How can statistics help organize a set of resources?
Statistical descriptions summarize a set of resources, and reveal other details that enable comparison of instances with the collection as a whole (such as identifying outliers). (See the section called “Organizing With Descriptive Statistics”)
What factors affect the organization of resources?
Multiple properties of the resources, the person organizing or intending to use them, and the social and technological environment in which they are being organized can collectively shape their organization. (See the section called “Organizing with Multiple Resource Properties”)
What is the fundamental tradeoff faced when organizing physical resources?
The tradeoff between the amount of work that goes into organizing a collection of resources and the amount of work required to find and use them is inescapable when the resources are physical objects or information resources in physical form.
What are affordance and capability?
What does it mean for a resource to be accessible?
Why are techniques for transforming the format of a resource or its description important in achieving accessibility?
Many of the techniques for making a resource accessible involve transforming the resource or its description into a different form so someone who could not perceive it or interact with it in its original form can now do so.
What is the basis of value creation when interacting with a digital resource?
With digital resources, the essence of the interaction is information exchange or symbolic manipulation of the information contained in the resource. (See the section called “Value Creation with Digital Resources”)
What factors improve the usability of digital resources?
The variety and functions of interactions with digital resources are determined by the amount of structure and semantics represented in their digital encoding, in the descriptions associated with the resources, or by the intelligence of the computational processes applied to them. (See the section called “Value Creation with Digital Resources”)
What is preservation?
Preservation of resources means maintaining them in conditions that protect them from physical damage or deterioration.
What is the relationship between digitization and preservation?
Preservation is often a key motive for digitization, but digitization alone is not preservation. (See the section called “Digitization and Preserving Resources”)
What are curation and governance?
The essence of curation and governance is having clear policies for collecting resources and maintaining them over time that enable people and automated processes to ensure that resource descriptions or data are authoritative, accurate, complete, consistent, and non-redundant. (See the section called “Curation” and the section called “Governance”)
In what ways can computation improve the maintenance of resources?
Data cleaning algorithms can eliminate duplicate data, search engines can improve the relevance of results using selection and navigation behavior, and sensor data can predict when machines need servicing.
For what reasons is discarding resources an essential maintenance activity?
An essential part of maintenance is the phasing out of resources that are damaged or unusable, expired or past their effectivity dates, or no longer relevant to any interaction. (See the section called “Discarding, Removing, and Not Keeping”)
What is the role of governance in business organizing systems?
Governance is essential to deal with frequent changes in business organizing systems, data quality management, access control to ensure security and privacy, compliance, deletion, and archiving. (See the section called “Governance in Business Organizing Systems”)
How is governance different in scientific organizing systems?
Scientific data poses special governance problems because of its scale. (See the section called “Governance in Scientific Organizing Systems”)
[44] Some governments attempt to preserve and prevent misappropriation of cultural property by enforcing import or export controls on antiquities that might be stolen from archaeological sites [(Merryman 2006)]. For digital resources, privacy laws prohibit the collection or misuse of personally identifiable information about healthcare, education, telecommunications, video rental, and might soon restrict the information collected during web browsing.
[45] The popular LinkedIn site, which has hundreds of millions of resumes that it data mines to find statistically superior job candidates, is a gold mine for the company because it makes money by referring those candidates to potential employers. Data-intensive hiring practices in baseball are entertainingly presented in the book Moneyball [(Lewis 2003)] or the 2011 movie starring Brad Pitt. Pro football teams have begun to assess college football players by comparing them statistically with the best pro players [(Robbins, 2016)].
Many examples of business strategies that required significant investment to acquire data assets with no current value are reported in [(Provost and Fawcett 2013)].
[46] See [(Cherbakov et al. 2005)], [(Erl 2005a)]. The essence of SOA is to treat business services or functions as components that can be combined as needed. An SOA enables a business to quickly and cost-effectively change how it does business and whom it does business with (suppliers, business partners, or customers). SOA is generally implemented using web services that exchange Extensible Markup Language (XML) documents in real-time information flows to interconnect the business service components. If the business service components are described abstractly it can be possible for one service provider to be transparently substituted for another (a kind of real-time resource selection) to maintain the desired quality of service. For example, a web retailer might send a Shipping Request to many delivery services, one of which is selected to provide the service. It probably does not matter to the customer which delivery service handles his package, and it might not even matter to the retailer.
[47] The idea that a firm’s long-term success can depend on just a handful of critical capabilities that cut across current technologies and organizational boundaries makes a firm’s core competency a very abstract conceptual model of how it is organized. This concept was first proposed by [(Pralahad and Hamel 1990)], and since then there have been hundreds of business books that all say essentially the same thing: you cannot be good at everything; choose what you need to be good at and focus on getting better at it; let someone else do the things that you do not need to be good at doing.
[48] See [(Borgman 2000)] on digitization and libraries. But while shared collections benefit users and reduce acquisition costs, if a library has defined itself as a physical place and emphasizes its holdings (the resources it directly controls), it might resist anything that reduces the importance of its physical reification, the size of its holdings, or the control it has over resources [(Sandler 2006)]. A challenge facing conventional libraries today is to make the transition from emphasizing creation and preservation of physical collections to facilitating the use and creation of knowledge regardless of its medium and the location from which it is accessed.
[49] [(Arasu et al. 2001)], [(Manning et al. 2008)]. The web is a graph, so all web crawlers use graph traversal algorithms to find URIs of web resources and then add any hyperlink they find to the list of URIs they visit. The sheer size of the web makes crawling its pages a bandwidth- and computation-intensive process, and since some pages change frequently and others not at all, an effective crawler must be smart about how it prioritizes the pages it collects and how it re-crawls pages. A web crawler for a search engine can determine the most relevant, popular, and credible pages from query logs and visit them more often. For other sites, a crawler adjusts its “revisit frequency” based on the “change frequency” [(Cho and Garcia-Molina 2000)].
[50] Web resources are typically discovered by computerized “web crawlers” that find them by following links in a methodical automated manner. Web crawlers can be used to create topic-based or domain-specific collections of web resources by changing the “breadth-first” policy of generic crawlers to a “best-first” approach. Such “focused crawlers” only visit pages that have a high probability of being relevant to the topic or domain, which can be estimated by analyzing the similarity of the text of the linking and linked pages, terms in the linked page’s URI, or locating explicit semantic annotation that describes their content or their interfaces if they are invokable services [(Bergmark et al. 2002)], [(Ding et al. 2004)].
[51] The FTC Fair Information Practice Principles say that consumer data collected for one purpose cannot be used for other purposes without the consumer’s consent. They are sometimes called the consumer privacy bill of rights.
See also [(Zhu et al., 2014)] and [(Marchioni et al., 2012)].
[52] Large research libraries have historically viewed their collections as their intellectual capital and have policies that specify the subjects and sources that they intend to emphasize as they build their collections. See [(Evans 2000)]. Museums are often wary of accepting items that might not have been legally acquired or that have claims on them from donor heirs or descendant groups; in the USA, much controversy exists because museums contain many human skeletal remains and artifacts that Native American groups want to be repatriated.
[53] See [(Tauberer 2014)] for a history of “civic hacking” and the open data movement.
[54] On data modeling: see [(Kent 2012)], [(Silverston 2000)], [(Glushko and McGrath 2005)]. For data warehouses see [(Turban et al. 2010)].
For a classification and review of data cleaning problems and methods, see [(Rahm and Do, 2000)]. A recent and popular analysis that describes data cleaning as “data wrangling, data munging, and data janitor work” is [(Lohr 2014)]. For a survey of anomaly detection see [(Chandola 2009)].
[56] See [(Barsalou and Hale 1983)] for a rigorous contrast between feature lists and other representational formalisms in models of human categories.
[57] For example, a personal or small organizing system would typically use properties that are easy to identify and understand. In contrast, an organizing system for very large collections of resources, or data about them, would choose properties that are statistically optimal, even if they are not interpretable by people, because of the greater need for operational efficiency and predictive accuracy.
[58] Libraries and bookstores use different classification systems. The kitchen in a restaurant is not organized like a home kitchen because professional cooks think of cooking differently than ordinary people do. Scientists use the Latin or binomial (genus + species) scheme for identifying and classifying living things to avoid the ambiguities and inconsistencies of common names, which differ across languages and often within different regions in a single language community.
[59] [(Freitas 2014)] and [(Burrell 2015)].
[60] Many of the ancient libraries in Greece and Rome have been identified by archaeologists by characteristic architectural features [(Casson 2002)]. See also [(Battles 2003)].
[61] [(Robertson 2015)] and [(Coase 1937)].
[62] The Gestalt principles are a staple in every introductory psychology textbook, but the classic text [(Koffka 1935)] has recently been reprinted. A group of distinguished contemporary researchers in visual perception [(Wagemans et al, 2012)] recently reviewed the history and impact of Gestalt psychology on its hundredth anniversary.
[63] Texts that ground graphic design and information visualization in Gestalt principles include [(Cairo 2012)] and [(Few 2004)]. [(Johnson 2013)] explains them within the broader scope of user interface design.
[64] Salt Lake City takes the use of a grid to an extreme because the central area is extremely flat. Streets are named by numbers and letters, so you might find yourself at the intersection of “North A Street” and “3rd Avenue N,” or at the intersection of “W 100 S” and “S 200 W.” It is a little creepy to think that your street address is a pinpoint location in the big grid.
(See also Pierre Charles L’Enfant’s plan for DC at http://en.wikipedia.org/wiki/Pierre_Charles_L%27Enfant)
This is not to say that imposing arbitrary grids on top of a physical environment to create a simple and easily understood organization is always desirable. It is essential that any organization imposed on a region be sensitive to any social, cultural, linguistic, ethnic, or religious organizing systems already in place. Much of the recent conflict and instability in the Middle East can be attributed to the implausibly straight line borders drawn by the French and British to carve up the defeated Ottoman Empire a century ago. Because the newly-created countries of Syria and Iraq lacked ethnic and religious cohesion, they could only be held together by dictatorships. [(Trofimov 2015)]
[65] [(Shiner 2007)]. The comparison of the organizing systems in casinos and airports comes from [(Curran 2011)]. [(Venturi 1972)]
[66] The number of queues, their locations and their layout (if spatial) is referred to as the “queue configuration.” The “queue discipline” is the policy for selecting the next customer from the queue. The most common discipline is “first come, first served” (FCFS). Frequent, higher-paying, or other customer segments might have their own queue with FCFS applied within it.
See the New York Post article at http://nypost.com/2013/05/14/rich-manhattan-moms-hire-handicapped-tour-guides-so-kids-can-cut-lines-at-disney-world/
[67] The designer of the road system, Robert Moses, heralded as the master builder of mid-20th century New York City, built roads to enforce his idea of who should frequent Long Island (affluent whites). The overpasses were intentionally designed with clearances (often around nine feet) that were too low for public buses. Consequently, low-income bus riders (largely people of color) had no way to get to beaches. See [(Winner 1980)].
[68] [(Arthur and Passini 1992)], [(McCartney 2015)]. McCartney, Scott. Technology will speed you through the airport of the future. Wall Street Journal, July 15, 2015.
[69] In principle, it is easy to make perfect copies of digital resources. In practice, however, many industries employ a wide range of technologies including digital rights management, watermarking, and license servers to prevent copying of documents, music or video files, and other digital resources. The degree of copying allowed in digital organizing systems is a design choice that is shaped by law.
[70] Web-based or “cloud” services are invoked through URIs, and good design practice makes them permanent even if the implementation or location of the resource they identify changes [(Berners-Lee 1998)]. Digital resources are often replicated in content delivery networks to improve performance, reliability, scalability, and security [(Pathan et al. 2008)]; the web pages served by a busy site might actually be delivered from different parts of the world, depending on where the accessing user is located.
[71] Whether a digital resource seems intangible or tangible depends on the scale of the digital collection and whether we focus on individual resources or the entire collection. An email message is an identified digital resource in a standard format, RFC 2822 [(Resnick 2001)]. We can compare different email systems according to the kinds of interactions they support and how easy it is to carry them out, but how email resources are represented does not matter to us and they surely seem intangible. Similarly, the organizing system we use to manage email might employ a complex hierarchy of folders or just a single searchable in-box, but whether that organization is implemented in the computer or smart phone we use for email or exists somewhere “in the cloud” for web-based email does not much matter to us either. An email message is tangible when we print it on paper, but all that matters then is that there is a well-defined mapping between the different representations of the abstract email resource.
[72] [(Schreibman, Siemens, and Unsworth 2005)] and [(Leonardi 2010)]. For example, a “Born-Digital Archives” program at Emory University is preserving a collection of the author Salman Rushdie’s work that includes his four personal computers and an external hard drive; see [(Kirschenbaum 2008)] and [(Kirschenbaum et al. 2009)].
[73] For example, a car dealer might be able to keep track of a few dozen new and used cars on his lot even without a computerized inventory system, but web-based AutoTrader.com offered more than 2,000,000 cars in 2012. The cars are physical resources where they are located in the world, but they are represented in the AutoTrader.com organizing system as digital resources, and cars can be searched for using any combination of the many resource properties in the car listings: price, body style, make, model, year, mileage, color, location, and even specific car features like sunroofs or heated seats.
[74] Even when organizing principles such as alphabetical, chronological, or numerical ordering do not explicitly consider physical properties, how the resources are arranged in the “storage tier” of the organizing system can still be constrained by their physical properties and by the physical characteristics of the environments in which they are arranged. Books can only be stacked so high whether they are arranged alphabetically or by frequency of use, and large picture books often end up on the taller bottom shelf of bookcases because that is the only shelf they fit. Nevertheless, it is important to treat these idiosyncratic outcomes in physical storage as exceptions and not let them distort the choice of the organizing principles in the “logic tier.”
[75] [(Spence 1985)] This memory technique has continued to be used since, and in addition to being found in tips for studying and public speaking, is applied in memorization competitions. For example, journalist and author Joshua Foer, in his book on memory and his journey from beginner to winning the 2006 U.S. Memory Championship [(Foer 2011)], wrote that Scott Hagwood, a four-time winner of the same competition, used locations in Architectural Digest to place his memories.
[76] The Domain Name System (DNS) [(Mockapetris 1987)] is the hierarchical naming system that enables the assignment of meaningful domain names to groups of Internet resources. The responsibility for assigning names is delegated in a distributed way by the Internet Corporation for Assigned Names and Numbers (ICANN) (http://www.icann.org). DNS is an essential part of the Web’s organizing system but predates it by several years.
[77] HTML5 defines a “manifest” mechanism for making the boundary around a collection of web resources explicit even if somewhat arbitrary to support an “offline” mode of interaction in which all needed resources are continually downloaded (http://www.w3.org/TR/html5/browsers.html#offline), but many people consider it unreliable and subject to strange side effects.
[78] [(Aalbersberg and Kahler 2011)].
[80] This definition of information architecture combines those in a Wikipedia article (http://en.wikipedia.org/wiki/Information_architecture) and in a popular book with the words in its title [(Morville and Rosenfield 2006)]. Given the abstract elegance of “information” and “architecture,” any definition of “information architecture” can seem a little feeble.
See [(Resmini and Rosati 2011)] for a history of information architecture.
[81] See [(Halvorson and Rach 2012)], [(Tidwell 2008)], [(Morville and Rosenfield 2006)], [(Kalbach 2007)], [(Resmini and Rosati 2011)], [(Marcotte 2011)], [(Brown 2010)], [(Abel and Baillie 2014)].
[82] Some popular collections of design patterns are [(Van Duyne et. al, 2006)], [(Tidwell 2010)], and http://ui-patterns.com/
[83] The Directives can be found at http://ec.europa.eu/consumers/consumer_rights/rights-contracts/directive/index_en.htm
[84] The classic text about information visualization is The Visual Display of Quantitative Information [(Tufte 1983)]. More recent texts include [(Few 2012)] and [(Yau 2011)].
[86] See https://chapters.theiia.org/ottawa/Documents/Digital_Analysis.pdf for a short introduction to data analysis for fraud detection. See [(Durtschi et al 2004)] for the use of Benford’s Law in forensic accounting.
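Benford’s Law says that in many naturally occurring sets of numbers the leading digit d appears with probability log10(1 + 1/d), so about 30% of values begin with 1 and fewer than 5% begin with 9. The following is a minimal sketch of the kind of first-digit comparison an analyst might start with; the function names and the ledger amounts are invented for illustration and are not taken from the cited sources.

```python
import math
from collections import Counter


def benford_expected(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)


def leading_digit(value: float) -> int:
    """First significant digit of a nonzero number."""
    # Scientific notation puts the first significant digit first, e.g. '4.73e-02'.
    return int(f"{abs(value):.15e}"[0])


def first_digit_shares(values):
    """Observed share of each leading digit among the nonzero values."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Hypothetical ledger amounts, invented for illustration only.
amounts = [1200.50, 1875.00, 932.10, 143.75, 2100.00, 118.20, 3050.00, 1640.00]
observed = first_digit_shares(amounts)

for d in range(1, 10):
    print(d, f"expected={benford_expected(d):.3f}", f"observed={observed[d]:.3f}")
```

A sample this small proves nothing either way; in practice the comparison is run over thousands of transactions, and a large deviation is only a prompt for closer inspection, not evidence of fraud.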
[87] Except when the resources on display are replicas of the originals, which is more common than you might suspect. Many nineteenth-century museums in the United States largely contained copies of pieces from European museums. Today, museums sometimes display replicas when the originals are too fragile or valuable to risk damage [(Wallach 1998)]. Whether the “resource-based interaction” is identical for the replica and original is subjective and depends on how well the replica is implemented.
[88] [(Gibson 1977)], [(Norman 1988)]. See also [(Norman 1999)] for a short and simple explanation of Norman’s (re-)interpretation of Gibson.
[90] The “.xpi” file type is used for Mozilla/Firefox browser extensions, small computer programs that can be installed in the browser to provide some additional user interface functionality or interaction. The “.mobi” file type was originally developed to enable better document display and interactions on devices with small screens. Today its primary use is as the base ebook format for the Amazon Kindle, except that the Kindle version is more highly compressed and locked down with digital rights management.
[91] See [(Hearst 2009)], [(Buettcher et al. 2010)].
[92] [(Glushko and Nomorosa 2013)].
[93] [(Apte and Mason 1995)] introduced this framework to analyze services rather than interactions per se.
[94] Furthermore, many of the resources might not be available in the user’s own library and could only be obtained through inter-library loan, which could take days or weeks.
[95] In contrast, far fewer interactions in museum collections are searches for known items, and serendipitous interactions with previously unknown resources are often the goal of museum visitors. As a result, few museum visitors would prefer an online visit to experiencing an original painting, sculpture, or other physical artifact. However, it is precisely because of the unique character of museum resources that museums allow access to them but do not allow visitors to borrow them, in clear contrast to libraries.
[96] [(Viswanadham 2002)], [(Madrigal 2009)], [(Prats et al. 2008)]. A video of a robot librarian in action at the University of Missouri, Kansas City is at http://www.youtube.com/watch?v=8wJJLlTq7ts.
See also the Popular Science article “How It Works: Underground Robot Library” available at http://www.popsci.com/content/underground-robot-library.
[97] Providing access to knowledge is a core mission of libraries, and it is worth pointing out that library users obtain knowledge both from the primary resources in the library collection and from the organizing system that manages the collection.
[98] It also erodes the authority and privilege that apply to resources because they are inside the library when a web search engine can search the “holdings” of the web faster and more comprehensively than you can search a library’s collection through its online catalog.
[100] [(Byrne and Goddard 2010)].
[101] See [(Simon 2010)]. An exemplary project to enhance museum access is Delphi [(Schmitz and Black 2008)], the collections browser for the Phoebe A. Hearst Museum of Anthropology at the University of California, Berkeley. Delphi very cleverly uses natural language processing techniques to build an easy-to-use faceted browsing user interface that lets users view over 600,000 items stored in museum warehouses. Delphi is being integrated into CollectionSpace (http://www.collectionspace.org/), an open source web collections management system for museum collections, collaboratively being developed by the University of California, Berkeley, Cambridge University, the Ontario College of Art and Design (OCAD), and numerous museums.
[103] Even sophisticated text representation formats such as XML have inherent limitations: one important problem that arises in complex management scenarios, humanities scholarship, and bioinformatics is that XML markup cannot easily represent overlapping substructures in the same resource [(Schmidt 2009)].
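The overlap problem can be seen with any XML parser. The sketch below uses Python’s standard-library parser; the element names `line` and `sentence` and the sample text are invented for illustration. A sentence that starts in one verse line and ends in the next cannot be marked up as a single element, because XML elements must nest properly.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Lines nested inside a poem: a strict hierarchy, which XML handles easily.
nested = "<poem><line>first line of verse</line><line>second line of verse</line></poem>"

# A sentence that begins in one line and ends in the next: the two
# structures overlap, so the tags cannot be properly nested.
overlapping = (
    "<poem><line>first <sentence>line of verse</line>"
    "<line>second line</sentence> of verse</line></poem>"
)

print(is_well_formed(nested))       # the hierarchical version parses
print(is_well_formed(overlapping))  # the overlapping version is rejected
```

Workarounds exist, such as empty “milestone” elements that mark only where a structure starts and ends, but they give up the direct element-per-structure representation.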
[104] Digital books change the economics, and first sale is not as well established for digital works, which are licensed rather than sold [(Aufderheide and Jaszi 2011)]. To protect their business models, many publishers are limiting the number of times ebooks can be lent before they “self-destruct.” Some librarians have called for boycotts of publishers in response (http://boycottharpercollins.com).
In contrast to these new access restrictions imposed by publishers on digital works, many governments as well as some progressive information providers and scientific researchers have begun to encourage the reuse and reorganization of their content by making geospatial, demographic, environmental, economic, and other datasets available in open formats, as web services, or as data feeds rather than as “fixed” publications [(Bizer 2009a)], [(Robinson et al. 2008)]. And we have made this book available as an open content repository so that it can be collaboratively maintained and customized.
[105] We cannot explain this any better than the UN does: “The Convention follows decades of work by the United Nations to change attitudes and approaches to persons with disabilities. It takes to a new height the movement from viewing persons with disabilities as ‘objects’ of charity, medical treatment and social protection towards viewing persons with disabilities as ‘subjects’ with rights, who are capable of claiming those rights and making decisions for their lives based on their free and informed consent as well as being active members of society.” See https://www.un.org/disabilities/default.asp?navid=12&pid=150
[106] See Microsoft Windows Accessibility, Apple Accessibility, and Android Accessibility Features.
[107] The Web Accessibility Initiative works to make the Web accessible to people with visual, auditory, speech, cognitive, neurological, and physical disabilities.
[108] Accessible Rich Internet Applications (ARIA)
[109] Smithsonian Guidelines for Accessible Exhibition Design
Because not every performance of a Broadway show is exactly the same, the D-Scriptive audio snippets are tied to particular bits of dialog. The theater’s stage manager triggers the D-Scriptive system to broadcast the corresponding visual explanations to audience members listening on earpieces. For example, in The Lion King a snippet might explain that “on the left are two giraffes and a cheetah.” [(Giridharadas 2014)]
In 2015 Netflix began a similar audio description service to accompany some of its original series. See http://blog.netflix.com/2015/04/netflix-begins-audio-description-for.html
[110] For a recent historical and highly technical review of speech recognition written by some of the most prominent researchers in the field, see [(Huang, Baker, and Reddy 2014)]. An easier-to-read story about Apple’s Siri voice recognition program is [(Geller 2012)]. Popup archive is https://www.popuparchive.org/ and its audio search service is https://www.audiosear.ch/
[111] These access controls to the organizing system or its host computer are enforced using passwords and more sophisticated software and hardware techniques. Some access control policies are mandated by regulations to ensure privacy of personal data, and policies differ from industry to industry and from country to country. Access controls can improve the credibility of information by identifying who created or changed it, especially important when traceability is required (e.g., financial accounting).
[112] In response to this trend, however, many libraries are supporting “open access” initiatives that strive to make scholarly publications available without restriction [(Bailey 2007)]. Libraries and ebook vendors are engaged in a tussle about the extent to which the “first sale” rule that allows libraries to lend physical books without restrictions also applies to ebooks [(Howard 2011)].
[113] [(Guenther and Wolfe 2009)].
[114] This is the historical and dominant conception of the research library, but libraries are now fighting to prove that they are much more than just repositories because many of their users place greater value on “on-the-fly access” to current materials. See [(Teper 2005)] for a sobering analysis of this dilemma.
[115] Today the United States National Archives displays the Declaration of Independence, Bill of Rights, and the U.S. Constitution in sealed titanium cases filled with inert argon gas. Unfortunately, for over a century these documents were barely preserved at all; the Declaration hung on the wall at the United States Patent Office in direct sunlight for about 40 years.
[116] Customer information drives day-to-day operations, but is also used in decision support and strategic planning.
[117] For businesses “in the world,” a “customer” is usually an actual person whose identity was learned in a transaction, but for many web-based businesses and search engines a customer is a computational model extracted from browser access and click logs that is a kind of “theoretical customer” whose actual identity is often unknown. These computational customers are the targets of the computational advertising in search engines.
[118] The Sarbanes-Oxley Act in the United States and similar legislation in other countries require firms to preserve transactional and accounting records and any document that relates to “internal controls,” which arguably includes any information in any format created by any employee [(Langevoort 2006)]. Civil procedure rules that permit discovery of evidence in lawsuits have long required firms to retain documents, and the proliferation of digital document types like email, voice mail, shared calendars and instant messages imposes new storage requirements and challenges [(Levy and Casey 2006)]. However, if a company has a data retention policy that includes the systematic deletion of documents when they are no longer needed, courts have noted that this is not willful destruction of evidence.
[119] Libraries are increasingly faced with the choice of providing access to digital resources through renewable licensing agreements, “pay per view” arrangements, or not at all. To some librarians, however, the failure to obtain permanent access rights “offends the traditional ideal of libraries” as memory institutions [(Carr 2010)].
[120] For example, students writing a term paper usually organize the printed and digital resources they rely on; the former are probably kept in folders or in piles on the desk, and the latter in a computer file system. This organizing system is not likely to be preserved after the term paper is finished. An exception that proves the rule is the task of paying income taxes, for which (in the USA) taxpayers are legally required to keep evidence for up to seven years after filing a tax return (http://www.irs.gov/Businesses/Small-Businesses-&-Self-Employed/How-long-should-I-keep-records%3F).
[123] Many of those WordPerfect documents were stored on floppy disks because floppy disk drives were built into almost every personal computer for decades, but it would be hard to find such disk drives today. And even if someone with a collection of word processor documents stored on floppy disks in 1995 had copied those files to newer storage technologies, it is unlikely that the current version of the word processor would be able to read them. Software application vendors usually preserve “backwards compatibility” with earlier versions for a few years to give users time to update their software, but few would support older versions indefinitely because doing so can make it difficult to implement new features.
[124] This is tautologically true for sites that publish news, weather, product catalogs with inventory information, stock prices, and similar continually updated content because many of their pages are automatically revised when events happen or as information arrives from other sources. It is also true for blogs, wikis, Facebook, Flickr, YouTube, Yelp and the great many other “Web 2.0” sites whose content changes as they incorporate a steady stream of user-generated content.
[125] [(Fetterly et al. 2003)].
[126] However, when a website disappears its first page can often be found in the search engine’s index “cache” rather than by following what would be a broken link.
[127] Brewster Kahle has been described as a computer engineer, Internet entrepreneur, Internet activist, advocate of universal access to knowledge, and digital librarian (http://en.wikipedia.org/wiki/Brewster_Kahle). In addition to websites, the Internet Archive preserves several million books, over a million pieces of video, 400,000 news programs from broadcast TV, over a million audio recordings, and over 100,000 live music concerts.
The Memento project has proposed a specification for using HTTP headers to perform “datetime negotiation” with the Wayback Machine and other archives of web pages, making it unnecessary for Memento to save anything on its own. Memento is implemented as a browser plug-in to “browse backwards in time” whenever older versions of pages are available from archives that use its specification [(VandeSompel 2010)].
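As a rough illustration of datetime negotiation: in the Memento specification (later published as RFC 7089), a client asks a “TimeGate” for a page as it existed at some moment by sending an Accept-Datetime request header, and the archive labels the version it returns with a Memento-Datetime response header. The sketch below only constructs such a request and does not send it; the TimeGate address is a hypothetical placeholder, and the function names are invented for this example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Hypothetical TimeGate address; a real archive publishes its own.
TIMEGATE = "http://timegate.example.org/timegate/"


def accept_datetime_value(moment: datetime) -> str:
    """Format a moment as an HTTP date in GMT, as Accept-Datetime expects."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def build_memento_request(page_url: str, moment: datetime) -> Request:
    """Build (but do not send) a request for `page_url` as of `moment`."""
    headers = {"Accept-Datetime": accept_datetime_value(moment)}
    return Request(TIMEGATE + page_url, headers=headers)


wanted = datetime(2009, 1, 1, tzinfo=timezone.utc)
request = build_memento_request("http://example.com/", wanted)

print(request.full_url)
print(request.header_items())
```

Because the negotiation is carried entirely in headers, the same original URI can be used to ask for any point in time, which is what lets a browser plug-in “browse backwards” without keeping its own copies.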
[128] People might still enjoy the many Mona Lisa parodies and recreations. See http://www.megamonalisa.com, http://www.oddee.com/item_96790.aspx, http://www.chilloutpoint.com/art_and_design/the-best-mona-lisa-parodies.html.
[129] [(Brown and Duguid 2002)].
[132] The set of content modules and their assembly structure for each kind of generated document conforms to a template or pattern that is called the document type model when it is expressed in XML.
[133] Company intranets, wikis, and blogs are often used as knowledge management technologies; Lotus Notes and Microsoft SharePoint are popular commercial systems. (See the case study in the section called “Knowledge Management for a Small Consulting Firm”.)
[134] In addition, the line between “preserving species” and “preserving marketing brands” is a fine one for zoos with celebrity animals, and in animal theme parks like SeaWorld, it seems to have been crossed. “Shamu” was the first killer whale (orca) to survive long in captivity and performed for several years at SeaWorld San Diego. Shamu died in 1971 but over forty years later all three US-based SeaWorld parks have Shamu shows and Shamu webcams.
[135] [(Manyika et al. 2011)].
[136] The College of Physicians of Philadelphia’s Mütter Museum houses a novel collection of artifacts meant to “educate future doctors about anatomy and human medical anomalies.” No museum in the world is like it; it contains display cases full of human skulls, abnormal fetuses in jars, preserved human bodies, a garden of medicinal herbs, and many other unique collections of resources.
However, one sub-collection best reflects the distinctive and idiosyncratic selection and arrangement of resources in the museum. Chevalier Jackson, a distinguished laryngologist, collected over 2,000 objects extracted from the throats of patients. Because of the peculiar and educational focus of this collection, and because there are few shared characteristics of “things people swallow that they should not,” the characteristics and principles used to organize and describe the collection would be of little use in another organizing system. What other collection would include toys, bones, sewing needles, coins, shells, and dental material? It is hard to imagine any other collection that would include all of these items plus a fully annotated record of the sex and approximate age of the patient, the amount of time the extraction procedure took, the tool used, and whether or not the patient survived.
[137] Curation is a very old concept whose Medieval meaning focused on the “preservation and cure of souls” by a pastor, priest, or “curate” [(Simpson and Weiner 1989)]. A set of related and systematized curation practices for some class of resources is often called a curation system, especially when they are embodied in technology.
[138] Information about which resources are most often interacted with in scientific or archival collections is essential in understanding resource value and quality.
[139] In memory institutions, the most common job titles include “curator” or “conservator.” In for-profit contexts where “governance” is more common than “curation,” job titles reflect that difference. In addition to “governance,” job titles often include “recordkeeping,” “compliance,” or “regulatory” prefixes to “officer,” “accountant,” or “analyst” job classifications.
[140] Because personal collections are strongly biased by the experiences and goals of the organizer, they are highly idiosyncratic, but still often embody well-thought-out and carefully executed curation activities [(Kirsh 2000)], [(Jones 2007)], [(Marshall 2007)], [(Marshall 2008)].
[141] [(Bush 1945)]. Bush imagined that Memex users could share these packages of trails and that a profession of trailbuilders would emerge. However, he did not envision that the Memexes themselves could be interconnected, nor did he imagine that their contents could be searched computationally.
[143] The most salient example of this so-called “community curation” activity is the work to maintain the Wikipedia open-source encyclopedia according to a curation system of roles and functions that governs how and under what conditions contributors can add, revise, or delete articles; receive notifications of changes to articles; and resolve editing disputes [(Lovink and Tkacz 2011)]. Some museums and scientific data repositories also encourage voluntary curation to analyze and classify specimens or photographs [(Wright 2010)].
[145] Some popular “community content” sites like Yelp where people rate local businesses have been criticized for allowing positive rating manipulation. Yelp has also been criticized for allowing negative manipulation of ratings when competitors slam their rivals.
[146] The resource might have been put someplace else when the site was reorganized or a new web server was installed. It is no longer the same resource because it will have another URI, even if its content did not change.
[147] All of these terms refer to types of web resources or techniques whose purpose is to mislead people into doing things or letting things be done to their computers that will cost them their money, time, privacy, reputation, or worse. We know too well what spam is. Phishing is a type of spam that directs recipients to a fake website designed to look like a legitimate one to trick them into entering account numbers, passwords, or other sensitive personal information. Malware, fakeware, or spyware sites offer tempting downloadable content that installs software designed to steal information from or take control of the visiting computer. Keyword stuffing, spamdexing, and META tag abuse are techniques that try to mislead search engines about the content of a resource by annotating it with false descriptions. Link farms or scraper sites contain little useful or original content and exist solely for the purpose of manipulating search engine rankings to increase advertising revenue. Similarly, cybersquatters register domain names with the hope of profiting from the goodwill of a trademark they do not own.
[149] [(Diaz 2005)], [(Grimmelmann 2009)].
[150] See the video of Haber explaining how this works. Haber has recently been able to build a version of his scanning and image processing technology for use outside the laboratory that he calls Irene (Image, Reconstruct, Erase Noise, Etc.). See [(Cowen 2015)] and [(Wilkinson 2014)].
[151] For an explanation of automated acquisition see Eva Guggemos, professional archivist and academic librarian, at https://www.quora.com/How-do-libraries-decide-which-books-to-purchase-and-which-books-to-remove-from-circulation.
For a cogent discussion of when and for what reasons weeding must take place in university libraries, see https://mrlibrarydude.wordpress.com/2014/03/12/why-we-weed-book-deselection-in-academic-libraries/.
A typical reaction when libraries discard books is described in [(Jackman 2015)].
[152] [(Blanchette and Johnson 2002)].
[153] [(Neziroglu 2014)] and [(Lidz 2003)].
[154] Libraries and museums must also deal with long-term strategy, but the lesser visibility of library governance and museum governance might simply reflect the greater concerns about fraud and malfeasance in for-profit business contexts than in non-profit contexts and the greater number of standards or “best practices” for corporate governance [(Kim, Nofsinger, and Mohr 2009)].
[155] Data governance decisions are also often shaped by the need to conform to information or process model standards, or to standards for IT service management like the Information Technology Infrastructure Library (ITIL). See http://www.itil-officialsite.com/.
[156] In this context, these management and maintenance activities are often described as “IT governance” [(Weill and Ross 2004)]. Data classification is an essential IT governance activity because the confidentiality, competitive value, or currency of information are factors that determine who has access to it, how long it should be preserved, and where it should be stored at different points in its lifecycle.
[158] This master data must be continually “cleansed” to remove errors or inconsistencies, and “de-duplication” techniques are applied to ensure an authoritative source of data and to prevent the redundant storage of many copies of the same resource. Redundant storage can result in wasted time searching for the most recent or authoritative version, cause problems if an outdated version is used, and increase the risk of important data being lost or stolen [(Loshin 2008)].
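One simple form of de-duplication can be sketched as follows, assuming that two customer records describe the same customer when their names and email addresses match after trivial cleanup, and that the most recently updated copy is the authoritative one. The field names (`name`, `email`, `updated`) and the sample records are hypothetical; real master data management tools use much more elaborate matching rules.

```python
def normalize(record):
    """Build a comparison key that ignores case and stray whitespace."""
    name = " ".join(record["name"].lower().split())
    email = record["email"].strip().lower()
    return (name, email)


def deduplicate(records):
    """Keep only the most recently updated record for each normalized key."""
    best = {}
    for record in records:
        key = normalize(record)
        # ISO 8601 date strings sort chronologically, so string comparison works.
        if key not in best or record["updated"] > best[key]["updated"]:
            best[key] = record
    return list(best.values())


# Hypothetical customer records, invented for illustration only.
customers = [
    {"name": "Ada  Lovelace", "email": "ADA@example.com", "updated": "2011-06-30"},
    {"name": "ada lovelace", "email": "ada@example.com ", "updated": "2012-03-01"},
    {"name": "Charles Babbage", "email": "cb@example.com", "updated": "2010-01-15"},
]

for record in deduplicate(customers):
    print(record["name"], record["updated"])
```

Exact-key matching like this misses near-duplicates such as misspelled names, which is why cleansing and de-duplication are usually treated as continual activities rather than one-time fixes.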
[160] Recently imposed requirements by the National Science Foundation (NSF), National Institutes of Health (NIH), and other research granting agencies for researchers to submit “data management plans” as part of their proposals should make digital data curation a much more important concern [(Borgman 2011)]. (NSF Data Management Plan Requirements: http://www.nsf.gov/eng/general/dmp.jsp).