Databases. Each graduate of the Master of Library and Information Science program is able to design, query, and evaluate information retrieval systems.
In the age of technology, librarians are expected not only to preserve and provide access to human knowledge but also to know how to design the systems that provide such access. Understanding how databases work is essential to a librarian’s ability to format and structure knowledge so as to facilitate its discovery and retrieval by the user. At the core of information retrieval are the principles of designing, querying, and evaluating a database, which comprises four main components: data, hardware, software, and people.
Human-centered information retrieval design is grounded in the ability to anticipate and accurately assess the information needs of database users. With these needs in mind, the main goal of database designers and indexers is to develop a user-friendly database that matches them. Successful database design, in my opinion, starts with a thorough appraisal of a collection by specialists with cross-disciplinary expertise. Accurate assessment of a cultural object, such as a real-photo postcard, often requires advanced knowledge of history, art, geography, foreign languages, and other related disciplines. This is particularly challenging when works do not contain information about their author, date of production, and other identifying details. If I cannot tell what these images are about, it is unlikely that I will succeed in describing them, disambiguating them from their counterparts, and designing a database in a way that affords efficient discovery of these images in the digital collection. De Polo and Minelli aptly describe the challenges involved: “A picture of an old man sitting on a chair, for example, has a different cultural value if one knows that the person portrayed is Théophile Gautier (a famous French author)” (2006, p. 102).
Upon completion of collection assessment, the database design process can begin. The process involves developing a data structure and levels of description, based on the collection-level and item-level attributes; deciding on the metadata schema and controlled vocabularies that can accommodate the desired levels of description of the collection and its items; and determining the field entry rules in the chosen schema. Next, developers select a database management and information retrieval system, such as WebData Pro or CONTENTdm, in which the database can be built. While building fields in the database, developers set up the field properties that afford the entry of the collection data, including field types (textbox, checkbox, list, comment (textarea), upload) and field data types (smallint, varchar, tinytext, etc.). They specify the maximum character length for the fields, determine which fields will hold multiple values, determine which fields are searchable, and create data validation rules. Finally, in order to provide a search interface for users, database developers design a query page, configuring it so as to include the database fields that users are most likely to employ in their queries.
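The field-definition steps above can be sketched in miniature. The following is a hypothetical illustration only, using SQLite through Python rather than the WebData Pro or CONTENTdm systems discussed here; the table and field names are invented for the example:

```python
import sqlite3

# Hypothetical schema for a postcard collection, illustrating field data
# types, maximum character lengths, and a data validation rule.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE postcards (
        id          INTEGER PRIMARY KEY,
        title       VARCHAR(255) NOT NULL,   -- textbox field, max length 255
        description TEXT,                    -- comment (textarea) field
        year        SMALLINT,                -- numeric field
        searchable  BOOLEAN DEFAULT 1,       -- flag: expose on the query page
        CHECK (year IS NULL OR year BETWEEN 1839 AND 2100)  -- validation rule
    )
""")
conn.execute("INSERT INTO postcards (title, year) VALUES (?, ?)",
             ("Real-photo postcard, unidentified street scene", 1912))
row = conn.execute("SELECT title, year FROM postcards").fetchone()
print(row)
```

An attempt to insert a record with, say, a year of 1700 would violate the CHECK constraint, which is the database-level analogue of the data validation rules mentioned above.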
Once a database has been set up, indexers can use it to create surrogate records for a collection. To facilitate disambiguation of database records, ensure consistency of information retrieval, and afford cross-collection interoperability, controlled vocabularies (CVs) are used during data entry. A controlled vocabulary establishes a one-to-one relationship between words and their referents, so that a term always refers to only one concept while a concept is always referred to by only one term. Pre-coordinate and post-coordinate controlled vocabularies both provide distinct advantages and disadvantages: “In pre-coordinated CV systems, multiple concepts are brought together in one term. An illustrative example is the Library of Congress Subject Headings (LCSH), which yield entries such as: Insurance, Unemployment –Switzerland –Statistics,” the Data Documentation Initiative explains. “This method allows for disambiguation of the relationship of the concepts in the term that might not be possible in post-coordinated systems, such as whether a term is a qualifier of another. In post-coordinated or faceted systems, concepts are kept broad and separate and selected and joined in the process of searching with Boolean operators.”
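The one-to-one term-to-concept principle can be sketched as a simple lookup table. This is a minimal illustration, not any particular thesaurus; the terms and the `to_preferred` helper are invented for the example:

```python
# A controlled vocabulary as a one-to-one mapping: each concept has exactly
# one preferred term, and entry terms (synonyms) are redirected to it
# at indexing time.
PREFERRED = {"automobiles", "children", "postcards"}
ENTRY_TERMS = {            # "use" references: variant -> preferred term
    "cars": "automobiles",
    "autos": "automobiles",
    "kids": "children",
    "postal cards": "postcards",
}

def to_preferred(term: str) -> str:
    """Normalize an indexer-supplied term to its preferred form."""
    term = term.lower().strip()
    if term in PREFERRED:
        return term
    if term in ENTRY_TERMS:
        return ENTRY_TERMS[term]
    raise KeyError(f"term not in vocabulary: {term!r}")

print(to_preferred("Cars"))  # -> automobiles
```

Whatever variant an indexer types, the record is indexed under one preferred term, which is what guarantees consistent retrieval across the collection.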
When designing a database, information professionals ought to think of the database searchers, the ways in which they might search, or query, the database, and the kinds of help they might need while searching. Belkin’s ASK theory—that people do not have queries but rather an anomalous state of knowledge—provides a helpful conceptual framework for human-centered information retrieval design:
- information needs are not in principle precisely specifiable;
- it is possible to elicit problem statements from information system users from which representations of the ASK underlying the need can be derived;
- there are classes of ASKs; and
- all elements of information retrieval systems ought to be based on the user’s ASK.
Building a database in a way that accommodates the users’ anomalous state of knowledge involves providing users with different ways of querying the database. “In contrast to browsing, information searching requires the more structured and targeted search for information that meets pre-determined criteria,” Rowley and Hartley point out (2008, p. 114). The most popular search strategies include: keyword search, phrase search, subject search, Boolean logic, truncation, proximity, and nesting.
- Keyword searching, also known as simple search or quick search, searches for words in the natural language fields, such as title, summary, or the text fields of a record, as well as list-based fields, such as controlled vocabulary-indexed fields. In CONTENTdm, for example, the keyword search by default searches for strings across all fields in all collections for all terms that the user puts in the search box, in any order. The disadvantage of this search is that it only retrieves exact matches and does not retrieve the keyword’s synonyms.
- Phrase searching retrieves only those records that contain the words in the particular order in which they appear in the phrase the user types in the search box.
- Subject search will only target the database records’ subject field. While producing high-precision results, this search affords interoperability only with those databases that employ the same set of controlled vocabularies in their subject fields as the native database.
- Boolean logic allows using Boolean operators AND, OR, and NOT with the search terms, in order to narrow or expand a search, or to exclude search terms from the query results.
- By adding a wildcard symbol to the end of a word, truncation allows searching on the root of a word to retrieve all records containing words with that root. The most common truncation symbol is the asterisk (*).
- In a proximity search, the AROUND operator is used when one looks for a combination of search terms, and the number indicates the maximum distance between the two terms. The proximity search [“pirates” AROUND(5) “caribbean”] in Google, for example, retrieved 15,800,000 results at the time of searching.
- Finally, nesting allows building a more complex search query by placing parentheses around strings of searches containing Boolean operators; e.g., a user searching for a philatelist association in Philadelphia or Pennsylvania can use a nested search: “philatel* AND (philadelphia OR pennsylvania).”
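The truncation and nested Boolean strategies above can be sketched against a toy set of records. This is an illustrative simplification, assuming invented record texts and a hypothetical `matches` helper, not the matching logic of any real retrieval system:

```python
import re

# Toy record set: id -> free-text field.
records = {
    1: "Philatelist Association of Philadelphia annual report",
    2: "Pennsylvania philately newsletter",
    3: "Caribbean pirates in popular culture",
}

def matches(record: str, term: str) -> bool:
    """Match a single term; a trailing * truncates to the word root."""
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"   # truncation
    else:
        pattern = r"\b" + re.escape(term) + r"\b"         # exact word
    return re.search(pattern, record, re.IGNORECASE) is not None

# Nested Boolean query: philatel* AND (philadelphia OR pennsylvania)
hits = [rid for rid, text in records.items()
        if matches(text, "philatel*")
        and (matches(text, "philadelphia") or matches(text, "pennsylvania"))]
print(hits)  # -> [1, 2]
```

The truncated term philatel* catches both “Philatelist” and “philately,” while the parenthesized OR group is evaluated before the AND, exactly as nesting prescribes.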
Once a database has been designed, its information retrieval effectiveness is evaluated in beta testing. The database effectiveness is drawn from measurements of recall, i.e. how well the database system aggregates all of the relevant records, and precision, or how well the database system discriminates, i.e. retrieves only the records that are relevant to the user’s query. According to Rowley and Hartley (2008, p. 293), database evaluation results can include four possible outcomes:
- hits, i.e. some relevant documents are successfully retrieved
- noise, i.e. some items that are not relevant are retrieved
- misses, i.e. the search fails to retrieve some relevant items
- dodged, i.e. some irrelevant items are not retrieved.
Recall and precision ratios are usually expressed as percentages, calculated as follows, where a is the number of relevant records retrieved, b the number of irrelevant records retrieved, and c the number of relevant records not retrieved:
Recall = Total relevant retrieved / Total relevant in system x 100, i.e. Recall = a / (a+c) x 100
Precision = Total relevant retrieved / Total retrieved x 100, i.e. Precision = a / (a+b) x 100
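The two ratios above can be computed directly. The function names and the sample counts below are invented for illustration:

```python
def recall(relevant_retrieved: int, relevant_missed: int) -> float:
    """Recall = a / (a + c) x 100, where c = relevant records missed."""
    return relevant_retrieved / (relevant_retrieved + relevant_missed) * 100

def precision(relevant_retrieved: int, irrelevant_retrieved: int) -> float:
    """Precision = a / (a + b) x 100, where b = irrelevant records retrieved."""
    return relevant_retrieved / (relevant_retrieved + irrelevant_retrieved) * 100

# A search retrieves 8 relevant records (hits) and 2 irrelevant ones (noise),
# while 4 relevant records in the system are missed:
print(round(recall(8, 4), 1))   # -> 66.7
print(precision(8, 2))          # -> 80.0
```

In Rowley and Hartley’s terms, a corresponds to hits, b to noise, and c to misses; the dodged items do not enter either ratio.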
Several methods exist to measure the database search effectiveness by correlation of precision and recall; one of them is discussed in Evidence 3 below.
“Together, higher recall and precision measures imply a better, more successful search,” Ma and Cole explain. “However, an actual successful search depends not only on performance, but also on a user’s search strategy, query construction, and proper searching behavior” (2001, p. 285). To tackle this challenge, the following approaches have been used in the field:
- pooling: asking a number of searchers to conduct a search, assuming that if a sufficient number of search occasions has been performed, most of the relevant records will be found;
- expert searching: measuring a novice searcher’s performance against expert searchers (Rowley and Hartley, 2008, pp. 296-297).
To support my skills in this competency, I present the following examples of my work:
- The Database 1: Descriptive Metadata and Database Design assignment from LIBR 202 (Information Retrieval)
- The Online Collection assignment from LIBR 284 (Seminar in Archives and Records Management – Topic: Digitization and Digital Preservation)
- The Database 2: Subject Analysis assignment from LIBR 202 (Information Retrieval)
LIBR 202 Database 1: Descriptive Metadata and Database Design
The first piece of evidence to demonstrate my mastery of competency E is the Database 1: Descriptive Metadata and Database Design assignment from LIBR 202 (Information Retrieval). In this project, I worked with four of my classmates to design a database, a data structure (a set of descriptive metadata), and indexing rules for a collection of 25 postcards. First, we appraised the postcard collection, identified its potential user base, and formulated a statement of purpose, in order to determine the search functionality our database ought to have and to tailor it to the users’ needs. We then created surrogate records in WebData Pro (5 records each) and tested the online collection, in order to evaluate our database’s effectiveness (alpha test). In addition, we tested a database designed by another group of classmates (beta test) and reported on the results of both tests in a group paper. The other group, in turn, tested the database our group had designed.
In the individual paper, I analyze the other group’s feedback on our database, give my own assessment of our data structure and field entry rules’ effectiveness, provide recommendations for future database improvements, and review my contribution to the group project.
While working on this assignment, I learned to apply in practice the main principles of information organization, such as information architecture, document representation, and multifindability. In the process of designing a database, I was able to see information retrieval concepts such as aboutness, databases, surrogates, data structure, metadata, and indexes in action. The results of my work demonstrate my understanding of the importance of proper attribute elicitation and disambiguation in database design.
LIBR 284 Online Collection assignment
The second piece of evidence to demonstrate my mastery of competency E is an Online Collection assignment from LIBR 284 (Seminar in Archives and Records Management – Topic: Digitization and Digital Preservation). In this group assignment, I worked with two of my classmates to create an online collection of 45 images and artifacts within CONTENTdm—a digital collection management system provided by OCLC. In the course of the project, each of us selected 15 images and artifacts from our personal archives, digitized them using best practices in digitization, and then worked collaboratively to design a database in order to make our collection available for discovery in CONTENTdm, in anticipation of multidisciplinary information needs of our intended user group. The title of our collection is Growing Up in a Pre-Millennial World.
The database design component of this assignment involved identifying user audience and user needs and developing multifaceted information retrieval functionality. We designed the collection database so that users can search it by title, subject, description, original date, geographic coverage, creator, contributors, type, format, and language. In addition, from the landing page, the collection can be browsed using a list of suggested topics, such as children, families, girls, hairstyles, clothing and dress, food, and vacations.
In order to facilitate resource organization and discovery in the Growing Up in a Pre-Millennial World collection, provide digital identification, ensure resource interoperability, archiving, and long-term preservation, we designed descriptive, administrative, and structural metadata for the collection records, using The NINCH Guide recommendations (Humanities Advanced Technology and Information Institute & National Initiative for a Networked Cultural Heritage, 2002, p. 115). In order to afford the cross-institutional, federated search and retrieval of collection items in the real-world digital library environment, we applied the Dublin Core Metadata Initiative (DCMI) standards while designing our metadata fields and used established controlled vocabularies, including Thesaurus for Graphic Materials (TGM), Art & Architecture Thesaurus (AAT), and DCMI Type Vocabulary [DCMITYPE].
Thanks to this assignment, I learned to create online digital collections in CONTENTdm. My work on it helped me understand the importance of using the most up-to-date image production and metadata standards across all subsets of the collection in order to ensure interoperability and sustainability of the collection in the digital library environment.
LIBR 202 Database 2: Subject Analysis
The third piece of evidence to demonstrate my mastery of competency E is the Database 2: Subject Analysis assignment from LIBR 202 (Information Retrieval). In this project, I worked with four of my classmates to develop a post-coordinate vocabulary for 35 scholarly articles related to library and information science, and then used the vocabulary to index the articles in WebData Pro (7 articles each). We then developed specific information need case studies and conducted a series of need-based test queries, in order to evaluate the effectiveness of the subject-indexing strategy we had developed. Specifically, the results of queries in the controlled vocabulary-indexed field, Descriptors, were compared to the results of queries in the natural language fields, Title and Abstract. Effectiveness was measured by correlation of precision and recall.
In the individual part of this project, I present my individual qualitative and quantitative analysis of our group test results. In particular, I discuss the subject indexing effectiveness and explore the relationship of subject access to the larger issues of information retrieval.
This piece of evidence demonstrates my competence in database evaluation. I learned to apply in practice both quantitative and qualitative database evaluation methods. The work I completed helped me realize that the performance of an information retrieval system depends on many variables, in particular the indexing language, the search interface design, and the search engine functionality.
Through the coursework I completed in the MLIS program, I have acquired knowledge of the main principles guiding the design, querying, and evaluation of databases; I also had the opportunity to apply this knowledge in practice, designing databases in WebData Pro and CONTENTdm. The database assignments taught me that only when the aboutness of an object—a set of terms evoking a subject concept, which is hopefully shared by many people, including authors, indexers, and the users of the system—is identified properly can a database designer accurately elicit the object’s main attributes and design the search fields and controlled vocabulary matching the main information stored in that object, in order to provide for its successful discovery and retrieval. Finally, I also learned that measuring information retrieval system performance and evaluating the IR outcomes allows planning how to build better, more efficient information retrieval systems.
De Polo, A., & Minelli, S. (2006). Digital access to a photographic collection. In L. MacDonald (Ed.), Digital heritage: Applying digital imaging to cultural heritage (pp. 93-114). Oxford: Elsevier.
Humanities Advanced Technology and Information Institute, & National Initiative for a Networked Cultural Heritage. (2002). The NINCH guide to good practice in the digital representation and management of cultural heritage materials. Washington, D.C.: National Initiative for a Networked Cultural Heritage.
Ma, W., & Cole, T. W. (2001, March 15-18). Test and Evaluation of an Electronic Database Selection Expert System. Retrieved from http://www.ala.org/acrl/sites/ala.org.acrl/files/content/conferences/pdf/ma.pdf
Rowley, J. E., & Hartley, R. J. (2008). Organizing knowledge: An introduction to managing access to information. Aldershot, England: Ashgate.