Chapter 10. Interactions with Resources
Vivien Petras
Robert J. Glushko
Ian MacFarland
Karen Joy Nomorosa
J.J.M. Ekaterin
Hyunwoo Park
Robyn Perry
Sean Marimpietri
Table of Contents
10.2. Determining Interactions
10.2.2. Socio-Political and Organizational Constraints
10.3. Reorganizing Resources for Interactions
10.3.1. Identifying and Describing Resources for Interactions
10.3.2. Transforming Resources for Interactions
10.4. Implementing Interactions
10.4.1. Interactions Based on Instance Properties
10.4.2. Interactions Based on Collection Properties
10.4.3. Interactions Based on Derived Properties
10.4.4. Interactions Based on Combining Resources
10.5. Evaluating Interactions
10.6. Key Points in Chapter Ten
Introduction
An organizing system without interactions is a sad one indeed.
Interactions are the answer to two of the fundamental questions we posed back in Chapter 1, Foundations for Organizing Systems: why and when are the resources organized?
In this chapter we will pivot from design for interactions to the design of interactions—and to do this we must pause to consider the question of “when?” In the section called “When Is It Being Organized?”, we contrasted organization done “on the way in” with that done “on the way out,” but this distinction is not always a particularly relevant one. Consider a bookshelf: if you do not organize its resources on the way in (i.e., when you put a book on the shelf), you cannot really organize them on the way out; you just have a disorganized bookshelf. When the time comes to retrieve a book, you’ll have to employ a brute-force linear search, reading every spine until you find the one you want, and that search will leave the remaining books on the shelf no more organized than before.
(Figure: Most Common Museum Interaction. Screenshot by Ian MacFarland.)
Some organizing systems have the power to determine the description standards that others must use. Walmart, the largest retailer in the United States, has devised an organizing system for its supply chain that supports access and movement of physical goods with maximal efficiency and effectiveness. This system saves the corporation money on inventory management and distribution, but to maximize savings, Walmart requires its suppliers to employ the same data model, follow company-set standards, and adopt new technologies such as bar codes and RFID tags that support the highly efficient interactions it requires.[572]
Still others choose to abide by what a standard-setting body decides, or participate in laborious, democratic processes to align their organizing practices and interactions.[573] Libraries and museums are the classic examples of this. The most important interaction in a library, of course, is borrowing: checking out a book to use it off the premises, and checking it back in when you’re done. Patrons search descriptions in a catalog to find books on a certain topic, by a certain author, or with a certain title, and access them by fetching them from the stacks or asking a librarian to retrieve them. As institutions that serve the public interest, libraries adhere to standards and democratic processes to ensure consistent and familiar user experiences for patrons, but also to enable powerful search interactions such as union catalogs, where resource descriptions from multiple libraries are merged before they are offered for search. Union catalogs allow patrons to find out with a single search whether a resource is available from any library that is accessible to them.[574]
The digitization of museum resources also allows visitors to experience them from a perspective that might not be possible in a physical museum. For example, in Google’s Art Project, users can zoom in to view fine details of digitized paintings.[575] Museums are starting to leverage technology and the popularity of Web 2.0 features such as tagging and social networking to attract new audiences.[576]
Implemented in 2004, the MuseumFinland project aims to provide a portal for publishing heterogeneous museum collections on the Semantic Web.[577] Institutions such as the Getty Information Institute and the International Committee for Documentation of the International Council of Museums have worked on standards that ensure worldwide consistency in how museums manage information about their collections.[578]
Navigating This Chapter
This chapter concentrates on the processes that develop interactions based on leveraging the resources of organizing systems to provide valuable services to their users (human or computational agents). It will discuss the determination of the appropriate interactions (the section called “Determining Interactions”), the organization of resources for interactions (the section called “Reorganizing Resources for Interactions”), the implementation of interactions (the section called “Implementing Interactions”), and their evaluation and adaptation (the section called “Evaluating Interactions”). Although the fundamental questions pertain to all types of organizing systems, this chapter focuses on systems that use computers to satisfy their goals.
Determining Interactions
Stop and Think: Constraint vs Flexibility
User Requirements
Users (human or computational agents) search or navigate resources in organizing systems not just to identify them, but also to obtain and further use the selected resources (e.g., read, cluster, annotate, buy, copy, distribute, or adapt). How resources are used and by whom affects how much of the resource or its description is exposed, across which channels it is offered, and the precision and accuracy of the interaction.
Information needs of computational agents are determined by rules and criteria set by the creators of the agents (i.e., the function or goal of the agent). When a computational agent interacts with another computational agent or service by using its API, in the ideal case its output precisely satisfies those information needs.
While search queries are explicitly stated user information needs, organizing systems increasingly attempt to solicit the user’s context or larger work task in order to provide more suitable or precise interactions. Factors such as level of education, physical disabilities, location, time, or deadline pressure often specify and constrain the types of resources needed as well as the types of interactions the user is willing or able to engage in. Implicit information can be collected from user behavior, for example search or buying history, current user location or language, and social or collaborative behavior (other people with the same context). Methods for explicitly soliciting user requirements include observation, surveys, focus groups, interviews, and work task analysis, among others.[579]
Designers of organizing systems must recognize that people are not perfectly capable and rational decision makers. Limited memory and attention capacities prevent people from remembering everything and make them unable to consider more than a few things or choices at once. As a result of these fundamental limitations, people consciously and unconsciously reduce the cognitive effort they make when faced with decisions.
One important way in which these limits affect behavior is what Barry Schwartz calls the Paradox of Choice. You might think that people would prefer many options rather than just a few, because more options make it more likely that one best meets their requirements. In fact, because considering more choices requires more mental effort, abundance can cause stress and indecision and might cause people to give up. For example, when 24 different types of jam were offered at an upscale market, more people stopped to taste than when only 6 were offered, but a greater percentage of the people presented with the smaller number of options actually made a purchase.[580]

We see the same phenomenon when we compare libraries and bookstores. A rational book seeker should prefer the detailed classification system used in libraries over the very coarse BISAC system used in bookstores. However, many people say that the detailed system makes them work too hard, leading to calls for new libraries to adopt the bookstore organizing system. (See the section called “The BISAC Classification”.)

People can also avoid making choices if a system proposes or pre-selects an option for them that becomes the default if they do nothing. Often people make only a cursory assessment of how well that option satisfies a requirement, and if it is good enough they do not consider any other alternatives.[581]

The study of the limits to human rationality in decision-making is the centerpiece of the discipline known as behavioral economics. Sunstein and Thaler popularized the application of behavioral economics as “libertarian paternalism,” with the goal of encouraging the design of organizing systems and policies that maintain or increase freedom of choice while at the same time influencing people to make choices that they would judge as good ones. This perspective is nicely captured by the title of their best-selling book, Nudge. Many government agencies and businesses in the US and elsewhere are building “nudging” principles into policies and products in areas such as social services, healthcare, and financial services because of the complexity of their offerings.

Behavioral economics complements the discipline of organizing by offering insights into the thinking and behavior of typical users that can lead to classifications and choices that make them more effective and satisfied. However, its principles can also be used to design organizing systems that manipulate people into taking actions and making choices that they did not intend or that are not in their best interests. (See Dark Patterns.)[582]
Socio-Political and Organizational Constraints
An important constraint for interaction design choices is the access policies imposed by the producers of organizing systems, as already described in the section called “Access Policies”. If resources or their descriptions are restricted, interactions may not be able to use certain properties and therefore cannot be supported.
Information and economic power asymmetry
Standards
Industry-wide or community standards can be essential in enabling interoperability between systems, applications, and devices. A standard interface describes the data formats and protocols to which systems should conform.[583] Failure to adhere to standards complicates the merging of resources from different organizing systems. Challenges to standardization include organizational inertia; closed policies, processes, or development groups; intellectual property; credentialing; lack of specifications; competing standards; high implementation costs; lack of conformance metrics; lack of clarity or awareness; and abuse of standards as trade barriers.
Public policy
Beyond businesses and standards-setting organizations, the government sector wields substantial influence over the implementation and success of possible interactions in organizing systems. As institutions with large and captive constituencies, governments and governmental entities have influence similar to that of large businesses because of their size and their substantial impact on society at large. Different forms of government around the world, ranging from centrally planned autocracies to loosely organized nation-states, can have far-reaching consequences for how resource description policies are designed. Laws and regulations regarding data privacy prevent organizing systems from recording certain user data, thereby prohibiting interactions based on this information.[584]
Nevertheless, intra-organizational constraints are inherently less deterministic than inter-organizational ones, because a decision-maker with broad authority can decide that some interaction is important enough to warrant a change of institutional policies, formats, or even category systems. (See the section called “Institutional Categories”.)
Regulatory Constraints: Right to be Forgotten
The ruling, a 2014 decision by the European Court of Justice requiring search engines to honor individuals’ requests to remove certain search results about them, had its foundations in the EU’s 1995 Data Protection Directive, a data protection law crafted before the dominance of the Internet and search engines. While many privacy advocates hailed it as a victory, others in technology and media firms decried it as censorship. Either way, it highlighted the need for the European Commission to update and modernize its data policy; a proposal has been before the European Parliament since 2012, and plans for its adoption were underway as of summer 2014. (Source: EC fact sheet on the “right to be forgotten” ruling.)
Reorganizing Resources for Interactions
Commonly, interactions are determined at the beginning of the development process for an organizing system. It follows that most required resource descriptions (which properties of a resource are documented in an organizing system) need to be clarified at the beginning of the development process as well; that is, resource descriptions are determined based on the desired interactions that an organizing system should support. Most of these processes have been described in detail in Chapter 5, Resource Description and Metadata, Chapter 6, Describing Relationships and Structures, and Chapter 9, The Forms of Resource Descriptions.
Resources from different organizing systems are often aggregated to be accessed within one larger organizing system (warehouses, portals, search engines, union catalogs, cross-brand retailers), which requires resources and resource descriptions to be transformed in order to adapt to the new organizing system with its extended interaction requirements.[585] Elsewhere, legacy systems often need to be updated to accommodate new standards, technologies, and interactions (e.g., mobile interfaces for digital libraries). That means that the necessary resources and resource descriptions for an interaction need to be identified and, if necessary, changes have to be made in the description of the resources. Sometimes, resources are merged or transformed in order to perform new interactions.
Identifying and Describing Resources for Interactions
Individual and collection resource descriptions need to be carefully considered in order to record the necessary information for the designed interactions. (See Chapter 9, The Forms of Resource Descriptions.) The type of interaction determines whether new properties need to be derived or computed with the help of external factors and whether these properties will be represented permanently in the organizing system (e.g., an extended topical description added due to a user comment) or created on the fly whenever a transaction is executed (e.g., a frequency count).
Transforming Resources for Interactions
Infrastructure or notation transformation
When resources are aggregated, the organizing systems must have a common basic infrastructure to communicate with one another and speak the same language. This means that participating systems must have a common set of communication protocols and an agreed upon way of representing information in digital formats, i.e., a notation (the section called “Notations”), such as the Unicode encoding scheme.[586]
Writing system transformation
During a writing system transformation (Chapter 9, The Forms of Resource Descriptions), the syntax or vocabulary—also called the data exchange format—of the resource description will be changed to conform to another model, e.g., when library records are mapped from the MARC21 standard to the Dublin Core format in order to be aggregated, or when information in a business information system is transformed into an EDI or XML format so that it can be sent to another firm.[587] Sometimes customized vocabularies are used to represent certain types of properties. These vocabularies were probably introduced to reduce errors or ambiguity or abbreviate common organizational resource properties. These customized vocabularies need to be explained and agreed upon by organizations combining resources to prevent interoperability problems.
Semantic transformation
Agreeing on a category or classification system (Chapter 7, Categorization: Describing Resource Classes and Types & Chapter 8, Classification: Assigning Resources to Categories) is crucial so that organizing systems agree semantically—that is, so that resource properties and descriptions share not only technology but also meaning. For example, because the US Census has often changed its system of race categories, it is difficult to compare data from different censuses without some semantic transformation to align the categories.[588]
Resource or resource description transformation
Resources or resource descriptions are often directly transformed, as when they are converted to another file format. In computer-based interactions like search engines, text resources are often pre-processed to remove some of the ambiguity inherent in natural language. These steps, collectively called text processing, include decoding, filtering, tokenization, normalization, stopword elimination, and stemming. (See the sidebar, Text Processing.)
A digital resource is first a sequence of bits. Decoding transforms those bits into characters according to the encoding scheme used, extracting the text from its stored form. (See the section called “Notations”.)
Filtering removes formatting and non-semantic markup that encapsulates the text, because this information is rarely used as the basis of further interactions.
Tokenization segments the stream of characters (in an encoding scheme, a space is also a character) into textual components, usually words. In English, a simple rule-based system can separate words using spaces. However, punctuation makes things more complicated; for example, periods at the end of sentences should be removed, but periods in numbers should not. Other languages introduce other problems for tokenization; in Chinese, spaces do not mark the divisions between individual concepts.
Normalization removes superficial differences in character sequences, for example, by transforming all capitalized characters into lower-case. More complicated normalization operations include the removal of accents, hyphens, or diacritics and merging different forms of acronyms (e.g., U.N. and UN are both normalized to UN).
Stopwords are words in a language that occur very frequently and are not very semantically expressive; they are usually articles, pronouns, prepositions, or conjunctions. Because they occur in nearly every text, they cannot distinguish one resource from another and can therefore be removed. Of course, in some cases removing stopwords can remove semantically important phrases (e.g., “To be or not to be”).
Stemming and lemmatization normalize inflectional and derivational variations in terms, e.g., by removing the “-ed” from verbs in the past tense. This homogenization can be done by following rules (stemming) or by using dictionaries (lemmatization). Rule-based stemming algorithms are easy to implement, but can produce wrongly normalized word groups, for example when “university” and “universe” are both stemmed to “univers.”
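A minimal sketch in Python can make these steps concrete. The stopword list and suffix-stripping rules below are deliberately crude stand-ins for real tokenizers and stemmers, but they show how the pipeline turns raw text into normalized terms:

import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}
SUFFIXES = ("ing", "ed", "es", "s")  # crude, illustrative stemming rules

def tokenize(text):
    # Split on anything that is not a letter, digit, or apostrophe.
    return re.findall(r"[A-Za-z0-9']+", text)

def normalize(tokens):
    return [t.lower() for t in tokens]

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in remove_stopwords(normalize(tokenize(text)))]

print(preprocess("The universities of the UN are planning new courses."))
# ['universiti', 'un', 'plann', 'new', 'cours'] -- note the over-stemming of "universities"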
Transforming Resources from Multiple or Legacy Organizing Systems
The traditional approach to enabling heterogeneous organizing systems to be accessed together has been to fully integrate them, which has allowed the “unrestricted sharing of data and business processes among any connected applications and data sources” in the organization.[589] This can be a strategic approach to improving the management of resources, resource descriptions, and organizing systems as a whole, especially when organizations have disparate systems and redundant information spread across different groups and departments. However, it can also be a costly approach, as integration points may be numerous, with vastly different technologies needed to get one system to integrate with another. Maintenance also becomes an issue, as changes in one system may entail changes in all systems integrating with it.[590]
Planning the transformation of resources from different organizing systems to be merged in an aggregation is called data mapping or alignment. In this process, aspects of the description layers (most often writing system or semantics) are compared and matched between two or more organizing systems. The relationship between each component may be unidirectional or bidirectional.[591] In addition, resource properties and values that are semantically equivalent might have different names (the vocabulary problem of the section called “The Vocabulary Problem”). The purpose of mapping may vary from allowing simple exchanges of resource descriptions, to enabling access to longitudinal data, to facilitating standardized reporting.[592] The preservation of version histories of resource description elements and relations in both systems is vital for verifying the validity of the data map.
Similar to mapping, a straightforward approach to transformation is the use of crosswalks, which are equivalence tables that relate resource description elements, semantics, and writing systems from one organizing system to those of another.[593] Crosswalks not only enable systems with different resource descriptions to exchange information in real time, but are also used by third-party systems, such as harvesters and search engines, to generate union catalogs and perform queries on multiple systems as if they were one consolidated system.[594]
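The following Python sketch illustrates the idea of a crosswalk as an equivalence table. The field tags and element names are simplified illustrations loosely modeled on MARC and Dublin Core, not a complete or authoritative mapping:

MARC_TO_DC = {
    "100": "creator",    # main entry, personal name
    "245": "title",
    "260": "publisher",
    "650": "subject",
}

def crosswalk(marc_record):
    # Translate a flat {tag: value} record into Dublin Core-style pairs.
    # Tags with no mapping are simply dropped, which is one way precision
    # gets lost in a "dumbed down" transformation.
    dc_record = {}
    for tag, value in marc_record.items():
        element = MARC_TO_DC.get(tag)
        if element:
            dc_record.setdefault(element, []).append(value)
    return dc_record

print(crosswalk({"100": "Shakespeare, William", "245": "Hamlet", "650": "Tragedy"}))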
In the digital library space, WorldCat allows users to access many library databases to locate items in their community libraries and, depending on patron privileges, to request items through their local libraries from libraries all over the world. For this powerful tool to accurately locate holdings in each library, two resource description standards are involved. At the book publisher, wholesaler, and retailer end, the international standard Online Information Exchange (ONIX) is used to standardize books and serials metadata throughout the supply chain.[595] ONIX is implemented in book suppliers’ internal and customer-facing information systems to track products and to facilitate the generation of advance information sheets and supplier catalogs.[596] At the library end, the Machine-Readable Cataloging (MARC) formats manage and communicate bibliographic and related information.[597] When a member library acquires a title, information in ONIX format is sent from the supplier to the Online Computer Library Center (OCLC) where it is matched with a corresponding MARC record in the WorldCat database by using an ONIX to MARC crosswalk.[598] This enables WorldCat to provide accurate real-time holdings information of its member libraries.
As the number of organizing systems increases, crosswalks and mappings become increasingly impractical if each pair of organizing systems requires a separate crosswalk. A more efficient approach would be the use of one vocabulary or format as a switching mechanism (also called a pivot or hub language) for all other vocabularies to map towards.[599] Another possibility, which is often used in asymmetric power relationships between organizing systems, is to force all systems to adhere to the format that is used by the most powerful party.
Modes of Transformation
The use of automatic tools to create these alignments becomes vital in ensuring their accuracy and robustness. Graphical mapping tools provide users with a graphical user interface to connect description elements from source to target by drawing a line from one to the other.[600] Other tools perform automatic mappings based on predetermined rules and criteria.[601]
We often perform manual run-time transformations for decisions that require consulting more than one organizing system in our daily lives. For example, when planning a vacation, we use a variety of systems to negotiate a wide set of ad hoc requirements such as our resources and time, our fellow travelers and their availability, and the bookings for hotel and transportation, as well as desirable destinations and their various offerings. We somehow reconcile the different descriptions used in each of the systems and match these against each other so that the relevant information can be combined and compared. Even though the systems use different formats, vocabularies and structures, they are targeted toward human users and are relatively easy to interpret. For automatic run-time transformations, which need to be handled computationally, designers face the challenge of creating more structured processes for merging information from different systems.[602]
Granularity and Abstraction
Within writing system and semantic transformations, issues of granularity and level of abstraction (the section called “Determining the Scope and Focus” and the section called “Category Abstraction and Granularity”) pose the most challenges to cross-organizing system interoperability.[603] Granularity refers to the level of detail or precision for a specific information resource property. For instance, the postal address of a particular location might be represented as several different data items, including the number, street name, city, state, country and postal code (a high-granularity model). It might also be represented in one single line including all of the information above (a low-granularity model). While it is easy to create the complete address by aggregating the different information components from the high-granularity model, it is not as easy to decompose the low-granularity model into more specific information components.
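A short sketch (in Python, with illustrative field names) makes the asymmetry concrete: the high-granularity model can always produce the single-line form, while the single-line form cannot reliably be decomposed without error-prone parsing:

from dataclasses import dataclass

@dataclass
class HighGranularityAddress:
    number: str
    street: str
    city: str
    state: str
    postal_code: str
    country: str

    def as_single_line(self) -> str:
        # Aggregating fine-grained fields into one line is trivial...
        return f"{self.number} {self.street}, {self.city}, {self.state} {self.postal_code}, {self.country}"

@dataclass
class LowGranularityAddress:
    full_address: str
    # ...but decomposing a single line back into its components requires
    # parsing heuristics that are brittle across formats and countries.

addr = HighGranularityAddress("2220", "Piedmont Ave", "Berkeley", "CA", "94720", "USA")
print(addr.as_single_line())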
This does not mean, however, that a high-granularity model is always the best choice, especially if the context of use does not require it, as there are corresponding tradeoffs in terms of efficiency and speed in assembling and processing the resource information. (See the sidebar, AccuWeather Request Granularity)
AccuWeather Request Granularity
Requests for AccuWeather data have exploded in recent years, owing largely to automated requests from mobile devices keeping weather apps updated. The company has dealt with this challenge by truncating the GPS coordinates sent by a mobile device when it requests weather data (a transformation to lower granularity). If the request with the truncated coordinates is identical to one recently made, a cached version of the content is served, resulting in 300 million to 500 million fewer requests a day.[604]
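The following Python sketch shows one plausible way such a strategy could work; the two-decimal precision and the stand-in weather lookup are assumptions for illustration, not AccuWeather’s actual implementation:

from functools import lru_cache

def truncate(lat: float, lon: float, places: int = 2) -> tuple:
    # Lowering the granularity of the coordinates so nearby devices
    # resolve to the same grid cell (about 1 km at 2 decimal places).
    factor = 10 ** places
    return (int(lat * factor) / factor, int(lon * factor) / factor)

@lru_cache(maxsize=100_000)
def forecast_for(cell: tuple) -> str:
    # Stand-in for an expensive call to a weather service.
    return f"Forecast for grid cell {cell}"

def handle_request(lat: float, lon: float) -> str:
    return forecast_for(truncate(lat, lon))

# Two nearby devices share one cached response:
print(handle_request(37.871535, -122.272747))
print(handle_request(37.871912, -122.273301))
print(forecast_for.cache_info().hits)  # 1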
Accuracy of Transformations
Automatic mapping tools can only be as accurate as the specifications and criteria that are included in the mapping guidelines. Intellectual checks and tests performed by humans are almost always necessary to validate the accuracy of the transformation. Because description systems vary in expressive power and complexity, challenges to transformations may arise from differences in semantic definitions, rules about whether an element is required or can have multiple values, hierarchical or value constraints, and controlled vocabularies.[605] As a result of these complexities, absolute transformations that ensure exact mappings will result in a loss of precision if the source description system is substantially richer than the target system.
In practice, relative crosswalks, in which all elements in a source description are mapped to at least one target regardless of semantic equivalence, are often implemented. This lowers the quality and accuracy of the mapping and can result in “down translation” or “dumbing down” of the system for resource description. Because of mapping compromises due to different granularity or abstraction levels, transformations between organizing systems usually result in less granular or specific resource descriptions. Consequently, whereas some interactions are now enabled (e.g., cross-organizing-system search), others that were once possible can no longer be supported. For example, conflating the geographical and personal subject fields of one system (e.g., geographical subject = Alberta, person subject = Virginia) into a joint subject field (e.g., subject = Alberta, Virginia) to match the resource description of another system no longer allows searches that distinguish between these specific categories.
Implementing Interactions
The next sections describe some common interactions in digital organizing systems. One way to distinguish among them is to consider the source of the algorithms used to perform them: information retrieval interactions (e.g., search and browse), machine learning interactions (e.g., cluster, classify, extract), and natural language processing interactions (e.g., named entity recognition, summarization, sentiment analysis, anaphora resolution). Another way to distinguish among interactions is to note whether resources are changed during the interaction (e.g., annotate, tag, rate, comment) or unchanged (e.g., search, cluster). Yet another way is to distinguish interactions by their absolute and relative complexity, i.e., by the progression of actions or steps needed to complete the interaction. Here, we distinguish interactions based on the different resource description layers they act upon.
Chapter 3, Activities in Organizing Systems, introduced the concept of affordance or behavioral repertoire—the inherent actionable properties that determine what can be done with resources. We will now look at affordances (and constraints) that resource properties pose for interaction design. The interactions that an individual resource can support depend on the nature and extent of its inherent and described properties and internal structure. However, the interactions that can be designed into an organizing system can be extended by utilizing collection properties, derived properties, and any combination thereof. These three types of resource properties can be thought of as creating layers because they build on each other.
Interactions can be distinguished by four layers:
Interactions based on properties of individual resources
Resource properties have been described extensively in Chapter 4, Resources in Organizing Systems and Chapter 5, Resource Description and Metadata. Any information or property that describes the resource itself can be used to design an interaction. If a property is not described in an organizing system or does not pertain to certain resources, an interaction that needs this information cannot be implemented. For example, a retail site like Shopstyle cannot reliably offer search by clothing color if this property is not contained in the resource description.
Interactions based on collection properties
Collection-based properties are created when resources are aggregated. (See Chapter 1, Foundations for Organizing Systems.) An interaction that compares individual resources to a collection average (e.g., average age of publications in a library or average price of goods in a retail store) can only be implemented if the collection average is calculated.
Interactions based on derived or computed properties
Interactions based on combining resources
Interactions Based on Instance Properties
Boolean Retrieval
In a Boolean search, a query is specified by stating the information need and using operators from Boolean logic (AND, OR, NOT) to combine the components. The query is compared to individual resource properties (most often terms), where the result of the comparison is either TRUE or FALSE. The TRUE results are returned as a result of the query, and all other results are ignored. A Boolean search does not compare or rank resources so every returned resource is considered equally relevant. The advantage of the Boolean search is that the results are predictable and easy to explain. However, because the results of the Boolean model are not ranked by relevance, users have to sift through all the returned resource descriptions in order to find the most useful results.[606][607]
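A minimal Python sketch of Boolean retrieval over a toy inverted index shows why every result is equally “relevant”:

documents = {
    1: "automobile accident report",
    2: "automobile repair manual",
    3: "bicycle repair manual",
}

# Build the inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in documents.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def AND(a, b): return index.get(a, set()) & index.get(b, set())
def OR(a, b):  return index.get(a, set()) | index.get(b, set())
def NOT(a):    return set(documents) - index.get(a, set())

print(AND("automobile", "repair"))                   # {2}
print(OR("automobile", "bicycle"))                   # {1, 2, 3}
print(AND("repair", "manual") & NOT("automobile"))   # {3}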
Tag / Annotate
A tagging or annotation interaction allows a user (either a human or a computational agent) to add information to the resource itself or the resource descriptions. A typical tagging or annotation interaction locates a resource or resource description and lets the user add their chosen resource property. The resulting changes are stored in the organizing system and can be made available for other interactions (e.g., when additional tags are used to improve the search). An interaction that adds information from users can also enhance the quality of the system and improve its usability.[608]
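A tagging interaction can be sketched in a few lines of Python; the storage model and function names here are illustrative only:

tags = {}   # resource id -> set of user-contributed tags

def add_tag(resource_id, tag):
    # Store the user's chosen property alongside the resource description.
    tags.setdefault(resource_id, set()).add(tag.lower())

def search_by_tag(tag):
    # Later interactions (such as search) can use the added descriptions.
    return [rid for rid, t in tags.items() if tag.lower() in t]

add_tag("book-42", "Shakespeare")
add_tag("book-42", "tragedy")
print(search_by_tag("tragedy"))   # ['book-42']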
Interactions Based on Collection Properties
Ranked Retrieval with Vector Space or Probabilistic Models
Ranked retrieval sorts the results of a search according to their relevance with respect to the information need expressed in a query. The Vector Space and Probabilistic approaches introduced here use individual resource properties like term occurrence or term frequency in a resource and collection averages of terms and their frequencies to calculate the rank of a resource for a query.[609]
The simplicity of the Boolean model makes it easy to understand and implement, but its binary notion of relevance does not fit our intuition that terms differ in how much they suggest what a document is about. Gerard Salton invented the vector space model of information retrieval to enable a continuous measure of relevance.[610] In the vector space model, each resource and query in an organizing system is represented as a vector of terms. Resources and queries are compared by comparing the directions of their vectors in an n-dimensional space (as many dimensions as there are terms in the collection), with the assumption that “closeness in space” means “closeness in meaning.”
In contrast to the vector space model, the underlying idea of the probabilistic model is that given a query and a resource or resource description (most often a text), probability theory is used to estimate how likely it is that the resource is relevant to the information need. A probabilistic model returns a list of resources ranked by their estimated probability of relevance with respect to the information need, so that the resource with the highest probability of being relevant is ranked highest. In the vector space model, by comparison, the resource whose term vector is most similar to the query term vector (based on frequency counts) is ranked highest.[611]
Both models utilize an intrinsic resource property called term frequency (tf): for each term, tf measures how many times the term appears in a resource. It is intuitive that term frequency has some ability to summarize a resource: if a term such as “automobile” appears frequently in a resource, we can assume that one of the topics discussed in the resource is automobiles and that a query for “automobile” should retrieve it. One complication with the term frequency measure occurs when resource descriptions have different lengths (a very common occurrence in organizing systems). To compensate for different resource description lengths, which would bias the term frequency count and the calculated relevance toward longer documents, term frequencies are normalized by the description length rather than used as raw counts.
Relying solely on term frequency to determine the relevance of a resource for a query has a drawback: if a term occurs in all resources in a collection, it cannot distinguish among them. For example, if every resource discusses automobiles, all resources are potentially relevant for an “automobile” query. Hence, there should be an additional mechanism that penalizes terms appearing in too many resources. This is done with inverse document frequency, which signals how often a term or property occurs across a collection.
Inverse document frequency (idf) is a collection-level property. The document frequency (df) is the number of resources containing a particular term. The inverse document frequency for a term t is defined as idf_t = log(N/df_t), where N is the total number of documents. The inverse document frequency of a term decreases as more documents contain the term, providing a discriminating factor for the importance of terms in a query. For example, in a collection containing resources about automobiles, an information retrieval interaction can handle a query for “automobile accident” by lowering the importance of “automobile” and increasing the importance of “accident” in the resources selected for the result set.
As a first step of a search, resource descriptions are compared with the terms in the query. In the vector space model, a similarity metric between resource description vectors and query vectors, combining term frequency and inverse document frequency, is used to rank resources according to their relevance with respect to the query.[612]
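The following Python sketch puts these pieces together for a toy collection: tf-idf weights are computed using the idf definition above, and resources are ranked by the cosine similarity between their vectors and the query vector. No text processing is applied, and the corpus is invented for illustration:

import math
from collections import Counter

docs = {
    "d1": "automobile accident on the highway",
    "d2": "automobile repair and automobile parts",
    "d3": "highway construction report",
}
N = len(docs)
tokenized = {d: text.split() for d, text in docs.items()}

# Document frequency and inverse document frequency (collection-level properties).
df = Counter()
for terms in tokenized.values():
    df.update(set(terms))
idf = {t: math.log(N / df[t]) for t in df}

def tfidf_vector(terms):
    tf = Counter(terms)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(v1, v2):
    dot = sum(v1[t] * v2.get(t, 0.0) for t in v1)
    norm = math.sqrt(sum(w * w for w in v1.values())) * math.sqrt(sum(w * w for w in v2.values()))
    return dot / norm if norm else 0.0

query = tfidf_vector("automobile accident".split())
ranking = sorted(((cosine(query, tfidf_vector(t)), d) for d, t in tokenized.items()), reverse=True)
print(ranking)  # d1 ranks highest: it contains both query terms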
The probability ranking principle is mathematically and theoretically better motivated than the vector space ranking principle. However, multiple methods have been proposed to estimate the probability of relevance. Well-known probabilistic retrieval methods are Okapi BM25, language models (LM) and divergence from randomness models (DFR).[613] Although these models vary in their estimations of the probability of relevance for a given resource and differ in their mathematical complexity, intrinsic properties of resources like term frequency and collection-level properties like inverse document frequency and others are used for these calculations.
Synonym Expansion with Latent Semantic Indexing
Latent semantic indexing is a variation of the vector space model in which a mathematical technique known as singular value decomposition is used to combine similar term vectors into a smaller number of vectors that describe their “statistical center.”[614] This method is based mostly on collection-level properties like co-occurrence of terms in a collection. Based on the terms that occur across the resources in a collection, the method calculates which terms might be synonyms of each other or otherwise related. Put another way, latent semantic indexing groups terms into topics. Let us say the terms “roses” and “flowers” often occur together in the resources of a particular collection. Latent semantic indexing recognizes statistically that these terms are related, and replaces the representations of “roses” and “flowers” with a computed “latent semantic” term that captures the fact that they are related, reducing the dimensionality of the resource description (see the section called “Vocabulary Control as Dimensionality Reduction”). Since queries are translated into the same set of components, a query for “roses” will also retrieve resources that mention only “flowers.” This increases the chance of a resource being found relevant to a query even if the query terms do not match the resource description terms exactly; the technique can therefore improve the quality of search.
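A small sketch using NumPy’s singular value decomposition illustrates the idea; the term-document counts are invented so that “roses” and “flowers” co-occur and therefore end up close together in the reduced space:

import numpy as np

terms = ["roses", "flowers", "garden", "engine", "automobile"]
#                  d1  d2  d3  d4
A = np.array([[2,  1,  0,  0],    # roses
              [1,  2,  0,  0],    # flowers
              [1,  1,  0,  0],    # garden
              [0,  0,  2,  1],    # engine
              [0,  0,  1,  2]])   # automobile

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                              # keep the two strongest "latent" dimensions
term_vectors = U[:, :k] * S[:k]    # each row: a term in the reduced topic space

def similarity(t1, t2):
    v1, v2 = term_vectors[terms.index(t1)], term_vectors[terms.index(t2)]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(similarity("roses", "flowers"))     # close to 1: treated as near-synonyms
print(similarity("roses", "automobile"))  # close to 0: unrelated topics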
Latent semantic indexing has been shown to be a practical technique for estimating the substitutability or semantic equivalence of words in larger text segments, which makes it effective in information retrieval, text categorization, and other NLP applications like question answering. In addition, some people view it as a model of the computational processes and representations underlying substantial portions of how knowledge is acquired and used, because latent semantic analysis techniques produce measures of word-word, word-passage, and passage-passage relations that correlate well with human cognitive judgments and phenomena involving association or semantic similarity. These include vocabulary tests, rate of vocabulary learning, essay tests, prose recall, and analogical reasoning.[615]
Structure-Based Retrieval
When the internal structure of a resource is represented in its resource description a search interaction can use the structure to retrieve more specific parts of a resource. This enables parametric or zone searching, where a particular component or resource property can be searched while all other properties are disregarded.[616] For example, a search for “Shakespeare” in the title field in a bibliographic organizing system will only retrieve books with Shakespeare in the title, not as an author. Because all resources use the same structure, this structure is a collection-level property.
A format like XML enables structured resource descriptions and is therefore very suitable for search and for structured navigation and retrieval. XPath (see the section called “Structural Relationships within a Resource”) describes how individual parts of XML documents can be reached within the internal structure. XML Query Language (XQuery), a structure-based retrieval language for XML, executes queries that can satisfy both topical and structural constraints in XML documents. For example, a query can ask for documents that contain the word “apple” in their text and that also mention “apple” in a title, a subtitle, or a glossary of terms.
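The sketch below approximates zone searching in Python using the standard library’s limited XPath support; the XML structure and field names are illustrative rather than drawn from any particular bibliographic format:

import xml.etree.ElementTree as ET

catalog_xml = """<catalog>
  <book><title>Hamlet</title><author>William Shakespeare</author></book>
  <book><title>Shakespeare: A Biography</title><author>Peter Ackroyd</author></book>
  <book><title>Ulysses</title><author>James Joyce</author></book>
</catalog>"""
catalog = ET.fromstring(catalog_xml)

def search_zone(root, zone, term):
    # Match the query term only against one zone of each record,
    # ignoring all other properties.
    return [book for book in root.findall("book")
            if term.lower() in (book.findtext(zone) or "").lower()]

for book in search_zone(catalog, "title", "shakespeare"):
    print(book.findtext("title"), "/", book.findtext("author"))
# Only "Shakespeare: A Biography" matches in the title zone;
# Hamlet, where Shakespeare is the author, is not returned.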
Clustering / Classification
Clustering (the section called “Categories Created by Clustering”) and computational classification (the section called “Key Points in Chapter Eight”) are both interactions that use individual and collection-level resource properties. During clustering (unsupervised learning), all resources are compared and grouped with respect to their similarity to each other. During computational classification (supervised learning), an individual resource or a group of resources is compared to a given classification or controlled vocabulary in an organizing system, and the resource is assigned to the most similar class or descriptor. Another example of a classification interaction is spam detection. (See the section called “Key Points in Chapter Eight”.) Author identification or characterization algorithms attempt to determine the author of a given work (a classification interaction) or to characterize the type of author who wrote or should write a work (a clustering interaction).
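The Python sketch below contrasts the two interactions on invented two-dimensional feature vectors: a couple of k-means iterations group resources without predefined classes, while a nearest-centroid rule assigns a new resource to one of two given classes. It is a toy illustration, not a production algorithm:

import numpy as np

resources = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])

# Clustering (unsupervised): two iterations of k-means with k = 2.
centroids = resources[:2].copy()                 # naive initialization
for _ in range(2):
    labels = np.array([np.argmin(((r - centroids) ** 2).sum(axis=1)) for r in resources])
    centroids = np.array([resources[labels == k].mean(axis=0) for k in range(2)])
print("cluster labels:", labels)                 # [0 0 1 1]

# Classification (supervised): assign a new resource to a known class.
class_centroids = {"automobiles": np.array([1.0, 0.0]), "gardening": np.array([0.0, 1.0])}
new_resource = np.array([0.8, 0.3])
assigned = min(class_centroids, key=lambda c: np.linalg.norm(new_resource - class_centroids[c]))
print("classified as:", assigned)                # automobiles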
Interactions Based on Derived Properties
(Figure: Retail Store Activity Tracking. Photo by Flickr user m01229, Creative Commons license; heatmap illustration by Ian MacFarland.)
Popularity-Based Retrieval
Google’s PageRank (see the section called “Structural Relationships between Resources”) is the most well-known popularity measure for websites.[617] The basic idea of PageRank is that a website is as popular as the number of links referencing it. The actual calculation of a website’s PageRank involves more sophisticated mathematics than counting in-links, because the source of the links also matters: links from high-quality websites contribute more to a website’s PageRank than other links, and links to low-quality websites will hurt a website’s PageRank.
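A compact sketch of the basic PageRank computation (power iteration on an invented four-page link graph, with the conventional damping factor of 0.85) shows how link sources matter:

links = {                       # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = sorted(links)
n = len(pages)
damping = 0.85

rank = {p: 1.0 / n for p in pages}
for _ in range(50):
    new_rank = {p: (1 - damping) / n for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
# C ranks highest: it is linked from A, B, and D, including the popular A.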
Citation-Based Retrieval
Translation
Parallel corpora, which are the same or similar texts in different languages, are a way to overcome many of these challenges. The Bible or the protocols of United Nations (UN) meetings are popular examples because they exist in parallel in many languages. A machine learning algorithm can learn from these corpora how phrases and other grammatical structures are translated in different contexts. This knowledge can then be applied to further resource translation interactions.
Interactions Based on Combining Resources
Mash-Ups
A mash-up combines data from several resources, which enables an interaction to present new information that arises from the combination.[618] For example, housing advertisements have been combined with crime statistics on maps to graphically identify rentals that are available in relatively safe neighborhoods.
(Figure: Mash-up of Housing and Crime Stats. Screenshots by Ian MacFarland.)
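A mash-up of this kind can be sketched as a simple join on a shared key; the listings, neighborhoods, and crime rates below are invented for illustration:

listings = [
    {"address": "12 Oak St", "rent": 1800, "neighborhood": "Elmwood"},
    {"address": "89 Pine Ave", "rent": 1500, "neighborhood": "Downtown"},
]
crime_rates = {"Elmwood": 2.1, "Downtown": 7.4}   # incidents per 1,000 residents

def safe_rentals(listings, crime_rates, max_rate):
    # Combine the two sources on the shared neighborhood key and keep
    # only listings in neighborhoods below the crime-rate threshold.
    return [dict(l, crime_rate=crime_rates.get(l["neighborhood"]))
            for l in listings
            if crime_rates.get(l["neighborhood"], float("inf")) <= max_rate]

for rental in safe_rentals(listings, crime_rates, max_rate=5.0):
    print(rental)
# Only the Elmwood listing qualifies; a real mash-up would plot the results on a map.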
Linked Data Retrieval and Resource Discovery
As described in the section called “The Semantic Web World”, linked data relates resources across different organizing system technologies via standardized, unique identifiers (URIs). This simple approach connects resources from different systems with each other so that a cross-system search is possible.[619] For example, two different online retailers selling a Martha Stewart bedspread can link to a page describing the bedspread on the Martha Stewart website. Both retailers use the same unique identifier for the bedspread, which leads back to the Martha Stewart site.
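A sketch of the underlying mechanism: both retailers’ records carry the same URI (a hypothetical one, invented for this example), so a query keyed on that identifier connects offers across systems:

# Hypothetical identifier; not a real Martha Stewart URL.
MARTHA_URI = "http://www.marthastewart.com/products/bedspread-123"

retailer_a = [{"sku": "A-991", "price": 89.99, "sameAs": MARTHA_URI}]
retailer_b = [{"item": "B-204", "price": 84.50, "sameAs": MARTHA_URI}]

def offers_for(uri, *catalogs):
    # A cross-system search: gather every record that points at the shared URI.
    return [entry for catalog in catalogs for entry in catalog if entry.get("sameAs") == uri]

print(offers_for(MARTHA_URI, retailer_a, retailer_b))   # both offers, linked via the URI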
Evaluating Interactions
Efficiency
Effectiveness
Effectiveness evaluates whether an interaction produces correct output or results. An effective interaction achieves relevant, intended, or expected results. The concept of relevance and its relationship to effectiveness is pivotal in information retrieval and machine learning interactions. (See the section called “Relevance”.) Effectiveness measures are usually developed in the fields that developed the algorithms for the interaction, such as information retrieval or machine learning. Precision and recall are the fundamental measures of relevance or effectiveness in information retrieval and machine learning interactions. (See the section called “The Recall / Precision Tradeoff”.)
Relevance
Historically, relevance has been addressed in logic and philosophy since the notion of inference was codified (to infer B from A, A must be relevant to B). Other fields have attempted to deal with relevance as well: sociology, linguistics, and psychology in particular. The subject knowledge view, subject literature view, logical view, system’s view, destination’s view, pertinence view, pragmatic view and the utility-theoretic interpretation are different perspectives on the question of when something is relevant.[620] In 1997, Mizzaro surveyed 160 research articles on the topic of relevance and arrived at this definition: “relevance can be seen as a point in a four-dimensional space, the values of each of the four dimensions being: (i) Surrogate, document, information; (ii) query, request, information need, problem; (iii) topic, context, and each combination of them; and (iv) the various time instants from the arising of problem until its solution.”[621] This rather abstract definition points to the terminological ambiguity surrounding the concept.
The Recall / Precision Tradeoff
Precision measures the accuracy of a result set, that is, how many of the retrieved resources for a query are relevant. Recall measures the completeness of the result set, that is, how many of the relevant resources in a collection were retrieved. Let us assume that a collection contains 20 relevant resources for a query. A retrieval interaction returns 10 resources in a result set, and 5 of the retrieved resources are relevant. The precision of this interaction is 50% (5 out of 10 retrieved resources are relevant); the recall is 25% (5 out of 20 relevant resources were retrieved).[622]
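The same worked example, expressed as a calculation:

relevant_in_collection = 20
retrieved = 10
relevant_retrieved = 5

precision = relevant_retrieved / retrieved                 # 5 / 10 = 0.5
recall = relevant_retrieved / relevant_in_collection       # 5 / 20 = 0.25

print(f"precision = {precision:.0%}, recall = {recall:.0%}")  # precision = 50%, recall = 25%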
The completeness and granularity of the organizing principles in an organizing system have a large impact on the trade-off between recall and precision. (See Chapter 4, Resources in Organizing Systems.) When resources are organized in fine-grained category systems and many different resource properties are described, high-precision searches are possible because a desired resource can be searched as precisely as the description or organization of the system allows. However, very specialized description and organization may preclude certain resources from being found; consequently, recall might be sacrificed. If the organization is superficial—like your sock drawer, for example—you can find all the socks you want (high recall) but you have to sort through a lot of socks to find the right pair (low precision). The trade-off between recall and precision is closely associated with the extent of the organization.
Satisfaction
Key Points in Chapter Ten
10.6.1. Where do interactions come from in an organizing system?
10.6.2. What are the most common interactions with resources in organizing systems?
10.6.3. What factors distinguish interactions?
10.6.4. What prevents people from making perfectly rational decisions?
10.6.5. What activities, with respect to resources, are typically required to enable interactions?
10.6.6. How can we distinguish or classify transformations in organizing systems?
10.6.7. What factors distinguish implementations of resource-based interactions?
10.6.8. What evaluation criteria distinguish interactions?
10.6.9. What is relevance?
10.6.10. What is the recall and precision trade-off?
10.6.11. How does granularity of organization affect recall and precision?
Where do interactions come from in an organizing system?
Interactions arise naturally from the affordances of resources or are purposefully designed into organizing systems.

What are the most common interactions with resources in organizing systems?
Accessing and merging resources are fundamental interactions that occur in almost every organizing system.

What factors distinguish interactions?
User requirements, which layer of resource properties is used, and the legal, social, and organizational environment can distinguish interactions.

What prevents people from making perfectly rational decisions?
Limited memory and attention capacities prevent people from remembering everything and make them unable to consider more than a few things or choices at once.

What activities, with respect to resources, are typically required to enable interactions?
In order to enable interactions, it is necessary to identify, describe, and sometimes transform the resources in an organizing system. (See the section called “Identifying and Describing Resources for Interactions”.)

How can we distinguish or classify transformations in organizing systems?
Merging transformations can be distinguished by type (mapping or crosswalk), time (design time or run time), and mode (manual or automatic).

What factors distinguish implementations of resource-based interactions?
Implementations can be distinguished by the source of the algorithm (information retrieval, machine learning, natural language processing), by their complexity (number of actions needed), by whether resources are changed, or by the resource description layers they are based on.

What evaluation criteria distinguish interactions?
Important aspects for the evaluation of interactions are efficiency (timeliness and cost-effectiveness), effectiveness (accuracy and relevance), and satisfaction (positive attitude of the user).

What is relevance?
Relevance describes how well a resource satisfies an information need; it has been defined from many perspectives, from topical matching to utility for the user. (See the section called “Relevance”.)

What is the recall and precision trade-off?
The trade-off between recall and precision decides whether a search finds all relevant documents (high recall) or only relevant documents (high precision).

How does granularity of organization affect recall and precision?
The extent of the organizing principles impacts recall and precision: more fine-grained organization allows for more precise interactions.
[572] Walmart uses its market power to impose technology and process decisions on its suppliers and partners. See [(Fishman 2003)], [(Grean and Shaw 2005)], [(Wilbert 2006)]. Walmart’s website for suppliers is http://walmartstores.com/Suppliers/248.aspx/
[573] To make it easier to use and reuse content, and to integrate different learning tools into a single Learning Management System (LMS), the IMS Global Learning Consortium, an organization composed of 140 members from leading educational institutions and education-related companies, has released specifications called Common Cartridge and Learning Tools Interoperability (http://www.imsglobal.org/commoncartridge.html). The specifications provide a common format and guidelines for constructing tools and creating content that can be easily imported into learning management systems. Common Cartridge (CC) specifications give detailed descriptions of the directory structure, metadata, and information models associated with a particular learning object. For example, a learning package from a provider such as McGraw-Hill may contain content from a book, some interactive quizzes, and some multimedia to support the text. CC specifies how files would be organized within a directory, how links would be represented, how the package would communicate with a backend server, how to describe each of the components, and the like. This enables a professor or a student using any capable learning management system to import a “cartridge” of learning material and have it appear in a consistent manner with all other learning materials within the LMS. Content providers therefore need not maintain multiple versions of the same content just to conform to the formats of different systems, allowing them to focus their resources on creating more content rather than maintaining what they already have. Looking at this in the context of the interoperability framework, we see that while information from providers is in a structured digital form, the main problem was that users were consuming the content in competing systems that each had their own data formats for accepting content. Huge publishers, wanting to increase distribution of their product, offered their content in all these different formats. While the specifications refer to the technical considerations in creating content and tools, the process of getting to that point involved a lot of organizational and political discussion. Internally, content and LMS providers needed to set aside the necessary resources to refactor their products to conform to the standards. Externally, competing providers had to collaborate with one another to create the specifications.
[574] http://www.worldcat.org/.
[576] [(Srinivasan, Boast, Furner, Becvar 2009)].
[577] [(Hyvönen et al. 2004)]. Museum visitors are presented with intelligent, content-based search and browsing services that offer a consolidated view across Finnish museums from the National Museum to the Lahti City Museum. To enable these goals, MuseumFinland mapped the variety of existing terms used by different museums onto shared ontologies, which now enable aggregated searching and browsing.
[578] [(Bower and Roberts 2001)].
[579] A conceptual framework for analyzing users and their work tasks for design requirements is [(Fidel 2012)]. A general survey of design methods is given in [(Hanington and Martin 2012)]. Designing particularly for successful interactions (services) is discussed in [(Polaine, Løvlie and Reason)]. [(Resmini and Rosati 2011)] describe designing for engaged users using cross-channel, cross-media information architecture.
[580] [(Schwartz 2005)] and [(Iyengar 2000)].
[581] A comprehensive review of the power of defaults in software and other technical systems from the perspectives of law, computer science, and behavioral economics is [(Kesan and Shah 2006)].
[582] [(Simon 1982)], [(Tversky and Kahneman 1974)], [(Kahneman and Tversky 1979)], [(Kahneman 2003)], and [(Thaler 2008)].
[584] A good example of the importance of standards and interoperability rules is e-government, the delivery of government services through electronic means. These services span government-to-citizen, government-to-business, government-to-employee, and government-to-government interactions, and vice versa [(Guijarro 2007)], [(Scholl 2007)]. They range from a government unit providing a portal where citizens can apply for a driver’s license or file their taxes, to more complex implementations such as allowing different government agencies to share pertinent information with one another, for example providing information on driver’s license holders to the police. Because the government interacts with heterogeneous entities and their various systems, e-government planners must consider how to integrate and interoperate with different systems and data models. Countries belonging to the Organization for Economic Cooperation and Development (OECD) have continuously refined their strategies for e-government.
A highly successful business-to-government example is the use of the Universal Business Language (UBL) by the Government of Denmark. UBL is a “royalty-free library of standard electronic XML business documents such as purchase orders and invoices” [oasis-open.org]. The Government of Denmark localized these standards and mandated that all organizations wanting to do business with the government use these formats for invoicing. By automating the matching process between an electronic order and an electronic invoice, the government expects total potential savings of about 160 million Euros per year [UBL case study], highlighting the value of a standard format by which businesses can send orders and invoices electronically.
[585] Major library system vendors now market so-called discovery portals to their customers, which allow libraries to integrate their local catalogs with central indexes of journal and other full-text databases. The advantage of discovery portals is seamless patron access to all of the library’s electronic materials (including externally licensed databases) while maintaining a local, customized look and feel. On the other hand, by providing out-of-the-box solutions, vendors bind libraries more closely to their products.
See, for example, Ex Libris Primo (http://www.exlibrisgroup.com/category/PrimoOverview/) or OCLC WorldCat Local (http://www.oclc.org/worldcatlocal/default.htm).
[586] While data encoding describes how information is represented and data exchange formats describe how it is structured, communication protocols govern how information is exchanged between systems. Protocols dictate how documents are enclosed within messages and how those messages are transmitted across the network, addressing concerns such as message format, error detection and reporting, security, and encryption. Many communication protocols are in use today, including the File Transfer Protocol (FTP), the Hypertext Transfer Protocol (HTTP) that underlies the web, the Post Office Protocol (POP) commonly used for email, and other protocols in the Transmission Control Protocol/Internet Protocol (TCP/IP) suite. Product manufacturers also employ proprietary protocols, such as Apple’s and Cisco’s protocol suites, and particular kinds of networks, such as mobile wireless networks, have corresponding protocols of their own.
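To make the layering concrete, here is a small sketch of how a protocol encloses a document within a message; the hostname, path, and document content are invented for illustration.

```python
# Sketch of how a communication protocol wraps a document in a message:
# the same XML order could travel over HTTP, FTP, or e-mail; only the framing changes.
document = b'<order id="PO-1001"><item>A4 paper</item></order>'

# An HTTP/1.1 request encloses the document as the message body, preceded by
# headers that tell the receiver how to interpret the payload (Content-Type)
# and where the message ends (Content-Length).
http_message = (
    b"POST /orders HTTP/1.1\r\n"
    b"Host: supplier.example.com\r\n"
    b"Content-Type: application/xml\r\n"
    b"Content-Length: " + str(len(document)).encode("ascii") + b"\r\n"
    b"\r\n"
) + document

print(http_message.decode("ascii"))
```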
[587] Electronic Data Interchange (EDI) is used to exchange formatted messages between computers or systems. Organizations use it to conduct business transactions electronically without human intervention, such as sending and receiving purchase orders or exchanging invoice information. Four main standards have been developed for EDI: the UN/EDIFACT standard recommended by the United Nations (UN), the ANSI ASC X12 standard widely used in the US, the TRADACOMS standard widely used in the UK, and the ODETTE standard used in the European automotive industry. These standards include formats for a wide range of business activities, such as shipping notices, fund transfers, and the like. EDI messages are highly formatted, and the meaning of the transmitted information depends heavily on its position in the document. For instance, the line BEG*00*NE*MOG009364501**950910*CSW11096^ in an EDI document corresponds to a line in the X12 standard for Purchase Orders (transaction set 850). “BEG” specifies the start of a Purchase Order Transaction Set. The asterisk (*) delimits the items in the line, with each value corresponding to a particular field or information component described in the standard. “NE,” for example, corresponds to the Purchase Order Type Code, which in this instance is “New Order.” As the example shows, a description of the transmitted information is not available within the document itself. Instead, the exchanging parties must agree on these formats beforehand and ensure that each piece of information appears at the right position in the document so that the receiving party can interpret it correctly, as the parsing sketch below illustrates.
The EDI sample comes from http://miscouncil.org.
The American National Standards Institute (ANSI) can be found at http://www.ansi.org.
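A minimal sketch of how a receiver might parse the BEG segment quoted in the note above, splitting on position as the standard requires. Only the segment identifier and the Purchase Order Type Code reading come from the note; the remaining field labels are plausible but assumed.

```python
# Positional parsing of the quoted X12 BEG segment; field labels beyond those
# described in the note are assumptions for illustration.
ELEMENT_SEPARATOR = "*"
SEGMENT_TERMINATOR = "^"

segment = "BEG*00*NE*MOG009364501**950910*CSW11096^"
elements = segment.rstrip(SEGMENT_TERMINATOR).split(ELEMENT_SEPARATOR)

beg = {
    "segment_id": elements[0],      # "BEG": start of a Purchase Order Transaction Set
    "purpose_code": elements[1],    # "00"
    "po_type_code": elements[2],    # "NE": New Order
    "po_number": elements[3],       # "MOG009364501"
    "release_number": elements[4],  # "" (empty element: field not supplied)
    "po_date": elements[5],         # "950910"
    "contract_number": elements[6], # "CSW11096"
}
# The value is meaningful only because both parties agreed on the layout beforehand.
print(beg["po_type_code"])  # "NE"
```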
[588] This and other examples of difficult categorizations can be found in [(Bowker and Star 2000)].
[590] Allowing unrestricted access to data and business processes also becomes a problem when working across organizations. Fully integrating systems between two companies, for instance, may expose business intelligence and information that should be kept private. This degree of exposure is too much for most businesses, regardless of whether the relationship with the other business is collaborative or competitive. There are also security issues to consider, as collaborating organizations would need access to private networks and secure servers. The heterogeneity of supporting organizing systems, along with the need to evolve quickly with the rapid changes in an organization’s competitive and collaborative environment, has pushed organizations to shift from vertical, isolated structures to a more loosely coupled, ecosystem paradigm. This has led to more componentized and modularized systems that need only exchange information or transform resources when an interaction requires it.
[591] To illustrate the difference between a unidirectional and a bidirectional map, consider two systems, the Systematized Nomenclature of Medicine — Clinical Terms (SNOMED-CT) and the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM).
SNOMED-CT is a medical language system for clinical terminology maintained by the International Health Terminology Standards Development Organization (IHTSDO) and a designated electronic exchange standard for clinical health information in US Federal Government systems (http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html).
ICD-10-CM, on the other hand, is an international diagnostic classification system for general epidemiological, health management, and clinical use, maintained by the World Health Organization (WHO) and used for coding and classifying morbidity data from inpatient and outpatient records, physicians’ offices, and most National Center for Health Statistics (NCHS) surveys (http://www.who.int/classifications/icd/en/). Because SNOMED-CT distinguishes far more clinical concepts than ICD-10-CM has codes, several SNOMED-CT concepts can map to a single ICD-10-CM code, so the map is reliable in only one direction.
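A minimal sketch of that asymmetry, using made-up placeholder codes rather than real SNOMED-CT or ICD-10-CM identifiers:

```python
# Sketch of why a terminology map can be unidirectional: several fine-grained
# source codes collapse onto one target code, so the reverse direction is
# ambiguous. The codes below are invented placeholders.
forward_map = {
    "SRC-0001": "TGT-A",   # two distinct clinical concepts...
    "SRC-0002": "TGT-A",   # ...classified under the same statistical category
    "SRC-0003": "TGT-B",
}

# Reversing the map loses information: "TGT-A" has two possible sources.
reverse_map = {}
for source, target in forward_map.items():
    reverse_map.setdefault(target, []).append(source)

print(reverse_map["TGT-A"])  # ['SRC-0001', 'SRC-0002']: no single answer going back
```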
[592] [(McBride et al. 2006)].
[594] http://journal.code4lib.org/articles/54 (Section 1.), http://www.dlib.org/dlib/june06/chan/06chan.html.
[595] [(EDItEUR 2009a)].
[596] [(EDItEUR 2009b)].
[597] http://www.loc.gov/marc/.
[598] [(Godby, Smith, and Childress 2008)], Sections 1 and 2.
[599] “Toward element-level interoperability in bibliographic metadata” [(Godby, Smith, and Childress 2008)], Sec. 4.4, “Switching-Across.” Consider how the Getty has created a crosswalk that uses the Categories for the Description of Works of Art (CDWA) to switch among eleven metadata standards, including Machine-Readable Cataloging/Anglo-American Cataloging Rules (MARC/AACR) and Dublin Core (DC). In this instance, the “Creation Date” element in CDWA is mapped to “260c Imprint — Date of Publication, Distribution, etc.” in MARC/AACR and to “Date.Created” in DC. Although this creates a two-step look-up in real time, a direct mapping of this element from MARC/AACR to DC is no longer necessary for the systems to interoperate.
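A small sketch of the switching-across idea, using only the “Creation Date” correspondence cited above; the data structures are illustrative, not the Getty crosswalk itself.

```python
# Each standard is mapped once to the hub (CDWA); any pairwise mapping is then
# derived by a two-step lookup. Only the "Creation Date" correspondence comes
# from the note; the structure is illustrative.
to_hub = {
    ("MARC/AACR", "260c Imprint - Date of Publication, Distribution, etc."): "Creation Date",
    ("DC", "Date.Created"): "Creation Date",
}
from_hub = {
    ("Creation Date", "MARC/AACR"): "260c Imprint - Date of Publication, Distribution, etc.",
    ("Creation Date", "DC"): "Date.Created",
}

def switch(element, source, target):
    """Map a source element to a target element via the CDWA hub."""
    hub_element = to_hub[(source, element)]
    return from_hub[(hub_element, target)]

print(switch("260c Imprint - Date of Publication, Distribution, etc.", "MARC/AACR", "DC"))
# -> "Date.Created", without a direct MARC-to-DC mapping ever being authored
```

The benefit scales quickly: with a hub, each of the eleven standards needs only maps to and from CDWA rather than a direct map to every other standard.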
[600] More commonly, graphical data mapping tools are included in an extract, transform, and load (ETL) database suite that provides additional powerful data transformation capabilities. Whereas data mapping is the first step in capturing the relationships between different systems, data transformation entails code generation that uses the resulting maps to produce an executable transformation program that converts the source data into the target format. ETL tools extract the needed information from outside sources, transform it into information the target system can use by applying the necessary data mappings, and then load it into that system.
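A minimal sketch of that extract-transform-load pattern; the source rows, field map, and in-memory target are invented for illustration.

```python
# Toy ETL pipeline: extract source rows, transform them with a data mapping,
# and load them into a (here, in-memory) target catalog.
source_rows = [
    {"AUTH_NAME": "Calvino, Italo", "PUB_YR": "1972", "TTL": "Invisible Cities"},
]

FIELD_MAP = {"AUTH_NAME": "creator", "PUB_YR": "date", "TTL": "title"}  # the data mapping

def transform(row):
    """Convert a source record into the target schema using the data mapping."""
    return {target: row[source] for source, target in FIELD_MAP.items()}

def load(records, target):
    """Load transformed records into the target system."""
    target.extend(records)

target_catalog = []
load([transform(row) for row in source_rows], target_catalog)  # extract -> transform -> load
print(target_catalog)
```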
[601] Languages such as XSLT and the Turing eXtender Language (TXL) facilitate data transformation, while various commercial data warehousing tools provide functionality such as single- or multiple-source acquisition, data cleansing, and statistical and analytical capabilities. Based on XML, XSLT is a declarative language designed for transforming XML documents into other documents. For example, XSLT can be used to convert XML data into HTML documents for web display or PDF for print or screen display. XSLT processing runs an input document in XML format and one or more XSLT style sheets through a template-processing engine to produce a new document.
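As an illustration of that processing model, here is a minimal sketch that applies a tiny XSLT 1.0 style sheet using the third-party lxml library; the book document and the style sheet are invented examples.

```python
# Minimal XSLT sketch: an XML source document plus a style sheet yields an HTML
# document, run here through the third-party lxml library.
from lxml import etree

xml_doc = etree.XML("<book><title>Invisible Cities</title></book>")

stylesheet = etree.XML("""
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html><body><h1><xsl:value-of select="/book/title"/></h1></body></html>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(stylesheet)   # the template-processing engine
print(str(transform(xml_doc)))       # HTML suitable for web display
```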
[603] For an in-depth discussion of interoperability challenges, see Chapter 6 of [(Glushko and McGrath 2005)].
[605] [(Chan and Zeng 2006)]. Section 4.3.
[606] Each of the four information retrieval models discussed in the chapter has a different combination of the comparing, ranking, and locating activities. Boolean and vector space models compare the description of the information need with the description of the information resource. Vector space and probabilistic models rank the information resources in the order in which they can satisfy the user’s query. Structure-based search locates information using the internal or external structure of the information resource.
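A schematic sketch of the comparing and ranking activities in the vector space spirit: documents and the query are reduced to term-count vectors, compared by cosine similarity, and ranked by score. Real systems add term weighting (such as tf-idf) and much more.

```python
# Toy comparing-and-ranking sketch: bag-of-words vectors and cosine similarity.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = {
    "d1": Counter("the museum catalog describes each painting".split()),
    "d2": Counter("the library catalog describes each book".split()),
}
query = Counter("museum painting catalog".split())

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # d1 is ranked above d2: it better satisfies the information need
```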
[607] Our discussion of information retrieval models in this chapter does not attempt to address information retrieval at the level of theoretical and technical detail that informs work and research in this field [(Manning et al. 2008)], [(Croft et al. 2009)]. Instead, our goal is to introduce IR from a more conceptual perspective, highlighting its core topics and problems using the vocabulary and principles of IO as much as possible.
[608] A good discussion of the advantages and disadvantages of tagging in the library field can be found in [(Furner 2007)].
[609] [(Manning et al. 2008)], Ch. 1.
[610] Salton was generally viewed as the leading researcher in information retrieval during the last part of the 20th century, until his death in 1995. The vector model was first described in [(Salton, Wong, and Yang 1975)].
[611] [(Manning et al. 2008)], p.221.
[612] See [(Manning et al. 2008)], Ch. 6 for more explanations and references on the vector space model.
[613] See [(Robertson 2005)], [(Manning et al. 2008)], Ch. 12 for more explanations and references.
[614] [(Deerwester, Dumais et al. 1990)].
[615] See [(Dumais 2003)].
[616] [(Manning et al. 2008)], Section 6.1.
[619] [(Bizer, Heath, and Berners-Lee 2009)].
[620] Space does not permit significant discussion of these views here; see [(Saracevic 1975)] and [(Schamber et al. 1990)].
[622] Recall and precision are only the foundation of the measures that have been developed in information retrieval to evaluate the effectiveness of search algorithms. See [(Baeza-Yates and Ribeiro 2011)]; [(Manning et al. 2008)], Ch. 8; and [(Demartini and Mizzaro 2006)].
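For concreteness, a minimal sketch of the two foundational measures on a toy result set; the document identifiers are invented.

```python
# Precision: fraction of retrieved resources that are relevant.
# Recall: fraction of relevant resources that were retrieved.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d7"}

true_positives = retrieved & relevant
precision = len(true_positives) / len(retrieved)  # 2/4 = 0.5
recall = len(true_positives) / len(relevant)      # 2/3 ~= 0.67

print(precision, recall)
```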