Schedule of Full and Short Paper Sessions with Abstracts

Tuesday July 23 | Wednesday July 24 | Thursday July 25


10:45 AM Session #1 Web 2.0
Session Chair: Paul Bogen

Identification of Useful User Comments in Social Media: A Case Study on Flickr Commons - Student Best Paper Award Nominee
Elaheh Momeni (University of Vienna - Multimedia Information Systems, Austria); Ke Tao (Delft University of Technology, Netherlands); Bernhard Haslhofer (Cornell University Information Science, United States); Geert-Jan Houben (Delft University of Technology, Netherlands)
Abstract Cultural institutions are increasingly opening up their repositories and contributing digital objects to social media platforms such as Flickr. In return they often receive user comments containing information that could be incorporated in their catalog records. However, users have different backgrounds and expertise and the quality level of those comments varies from useful to useless. Since judging the usefulness of a potentially large number of user comments is a labor-intensive task, our aim is to provide automated support for filtering potentially useful social media comments on digital objects. In this paper, we discuss the notion of usefulness in the context of social media comments and present a machine-learning approach to automatically classify comments according to their usefulness. Our approach makes use of syntactic and semantic comment features and also considers user context. We present the results of an experiment we did on user comments received in six different Flickr Commons collections. They show that a few relatively straightforward features can be used to infer useful comments with up to 89% accuracy.
WikiMirs: A Mathematical Information Retrieval System for Wikipedia
Xuan Hu (Peking University, China); Xiaoyan Lin (Peking University, China); Liangcai Gao (Institute of Computer Science and Technology, Peking University, China); Zhi Tang (Peking University, China); Xiaofan Lin (United States); Josef Baker (University of Birmingham, United Kingdom)
Abstract Mathematical formulae in structural formats such as MathML and LaTeX are becoming increasingly available. Moreover, repositories and websites, including ArXiv and Wikipedia, and growing numbers of digital libraries use these structural formats to present mathematical formulae. This presents an important new and challenging area of research, that of Mathematical Information Retrieval (MIR). In this paper, we propose WikiMirs, a tool to facilitate mathematical formula retrieval in Wikipedia. WikiMirs is aimed at searching for similar mathematical formulae based upon both textual and spatial similarities, using a new indexing and matching model developed for layout structures. A hierarchical generalization technique is proposed to generate sub-trees from presentation trees of mathematical formulae, and similarity is calculated based upon matching at different levels of these trees. Experimental results show that WikiMirs can efficiently support sub-structure matching and similarity matching of mathematical formulae. Moreover, WikiMirs obtains both higher accuracy and better ranked results over Wikipedia in comparison to Wikipedia Search and Egomath. We conclude that WikiMirs provides a new, alternative, better service for users to search mathematical expressions within Wikipedia.
Interacting with and through a Digital Library Collection: Commenting behavior in Flickr’s The Commons
Sally Jo Cunningham (New Zealand); Malika Mahoui (United States)
Abstract There is growing interest by digital collection providers to engage collection users in interacting with the collection (e.g. by tagging or annotating collection contents) and with the collection organizers and other users (e.g. to form loose ‘communities’ associated with the collection). At present, little has been documented as to the uptake of these mechanisms in specific collections, or the range of behaviors that emerge as users bend existing facilities to their own needs. This paper is one step in that direction: it describes the social information behaviors exhibited in a cultural heritage photography collection in The Commons on Flickr, and suggests implications for digital library design in response to these behaviors.
A Comparative Study of Academic Impact and Wikipedia Ranking
Xin Shuai (United States); Zhuoren Jiang (China); Xiaozhong Liu (United States); Johan Bollen (United States)
Abstract In addition to its broad popularity, Wikipedia is also widely used for scholarly purposes. Many Wikipedia pages pertain to academic papers, scholars and topics, providing a rich ecology for scholarly uses. Although many recognize the scholarly potential of Wikipedia, as a crowdsourced encyclopedia its authority and quality are questioned due to the lack of rigorous peer review and supervision. Scholarly references and mentions on Wikipedia may thus shape the "societal impact" of a certain scholarly communication item, but it is not clear whether they shape actual "academic impact". In this paper we compare the impact of papers, scholars, and topics according to two different measures, namely scholarly citations and Wikipedia mentions. Our results show that academic and Wikipedia impact are positively correlated. Papers, authors, and topics that are mentioned on Wikipedia have higher academic impact than those that are not mentioned. Our findings validate the hypothesis that Wikipedia can help assess the impact of scholarly publications and underpin relevance indicators for scholarly retrieval or recommendation systems.


10:45 AM Session #2 Preservation I
Session Chair: Martin Klein

A Distributed Archival Network for Process-Oriented Autonomic Long-Term Digital Preservation
Ivan Subotic (University of Basel, Imaging and Media Lab, Switzerland); Lukas Rosenthaler (University of Basel, Imaging and Media Lab, Switzerland); Heiko Schuldt (University of Basel, Switzerland)
Abstract The reliable and consistent long-term preservation of digital content and its associated metadata is becoming increasingly important in a large variety of applications, even though the media on which this data is stored is potentially subject to failures, or the data formats may become obsolete over time. A common approach is to replicate data across several sites to increase its availability. Nevertheless, network, software, or hardware failures as well as the evolution of data formats have to be coped with in a timely and, ideally, an autonomous way, without intervention of an administrator. In this paper we present DISTARNET, a distributed, autonomous long-term digital preservation system. Essentially, DISTARNET exploits dedicated processes to ensure the integrity and consistency of data with a given replication degree. At the data level, DISTARNET supports complex data objects, the management of collections, annotations, and arbitrary links between digital objects. At the process level, dynamic replication management, consistency checking, and automated recovery of the archived digital objects are provided utilizing autonomic behavior governed by preservation policies without any centralized component. We present the concepts and implementation of the distributed DISTARNET preservation approach. Most importantly, we provide details of the qualitative and quantitative evaluation of the DISTARNET system. The former addresses the effectiveness of the internal preservation processes while the latter evaluates DISTARNET's performance regarding the overall archiving storage capacity and scalability.
Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in Acyclic Walks Through a Web Archive - Student Best Paper Award Nominee
Scott Ainsworth (Old Dominion University, United States); Michael Nelson (Old Dominion University, United States)
Abstract When a user views an archived page using the archive's user interface (UI), the user selects a datetime to view from a list. The archived web page, if available, is then displayed. From this display, the web archive UI attempts to simulate the web browsing experience by smoothly transitioning between archived pages. During this process, the target datetime changes with each link followed, drifting away from the datetime originally selected. When browsing sparsely archived pages, this nearly silent drift can be many years in just a few clicks. We conducted 200,000 random walks of archived pages, following up to 50 links per walk, comparing the results of two target datetime policies. The Sliding Target policy allows the target datetime to change as it does in archive UIs such as the Internet Archive's Wayback Machine. The Sticky Target policy, represented by the Memento API, keeps the target datetime the same throughout the walk. We found that the Sliding Target policy drift increases with the number of walk steps, number of domains visited, and choice (number of links available). However, the Sticky Target policy controls temporal drift, holding it to less than 30 days on average without regard to walk length or the number of domains visited. The Sticky Target policy shows some increase as choice increases, but this may be caused by other factors. We conclude that based on walk length, the Sticky Target policy generally produces at least 30 days less drift than the Sliding Target policy.
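The difference between the two policies in the abstract above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the toy archive, the page names, and the helpers `nearest_capture` and `walk_drift` are hypothetical, and choosing the capture closest in time to the target is an assumed resolution rule, not necessarily the one used in the study.

```python
from datetime import datetime

def nearest_capture(captures, target):
    """Return the capture datetime closest to the target datetime (assumed rule)."""
    return min(captures, key=lambda c: abs(c - target))

def walk_drift(archive, path, start, policy):
    """Return the drift in days from `start` at each step of a walk.

    archive: dict mapping a page name to its capture datetimes (toy data).
    path:    page names visited in order.
    policy:  "sliding" (target follows the last capture shown) or
             "sticky" (target stays at `start` for the whole walk).
    """
    if policy not in ("sliding", "sticky"):
        raise ValueError("policy must be 'sliding' or 'sticky'")
    target = start
    drifts = []
    for page in path:
        memento = nearest_capture(archive[page], target)
        drifts.append(abs(memento - start).days)
        if policy == "sliding":
            target = memento  # the target moves to the capture just shown
    return drifts

# Hypothetical, sparsely archived pages.
archive = {
    "a": [datetime(2005, 1, 1), datetime(2006, 3, 1)],
    "b": [datetime(2004, 6, 1), datetime(2005, 10, 1)],
    "c": [datetime(2003, 9, 1), datetime(2005, 8, 1)],
}
path = ["a", "b", "c"]
start = datetime(2005, 6, 1)

sliding = walk_drift(archive, path, start, "sliding")
sticky = walk_drift(archive, path, start, "sticky")
print("sliding drift (days):", sliding)
print("sticky drift (days): ", sticky)
```

In this toy walk the sliding target moves further from the selected datetime at every step, while the sticky target keeps resolving each page against the original datetime.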
Medusa at the University of Illinois at Urbana-Champaign: A Digital Preservation Service Based on PREMIS
Kyle Rimkus (University of Illinois at Urbana-Champaign, United States); Thomas Habing (University of Illinois at Urbana-Champaign, United States)
Abstract The Medusa digital preservation service at the University of Illinois at Urbana-Champaign provides a storage environment for digital content selected for long-term retention by content managers and producers affiliated with the Library in order to ensure its enduring access and use. This paper reports on Medusa development, with emphasis on the research processes that informed key decisions related to its design, the central role of PREMIS metadata in its architecture, the digital preservation benefits leveraged from the service’s underlying Dell DX6000 object storage platform, and future directions of integrating PREMIS management into a Fedora repository architecture. In so doing, it describes a strategy of digital preservation content management that draws strength from the creation and management of comprehensive preservation metadata records and content bit-streams in a storage environment whose architecture provides certain digital preservation features often managed in the repository software layer.
First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites
Richard Schneider (Harding University, United States); Frank McCown (Harding University, United States)
Abstract Smartphones and tablets are increasingly used to access the Web, and many websites now provide alternative sites tailored specifically for these mobile devices. Web archivists are turning their attention to this equally ephemeral Mobile Web and are in need of tools to aid in its capture. We present Findmobile, a tool for automating the discovery of mobile websites. We tested our tool in an experiment examining 10K popular websites and found that the technique most frequently used by popular websites to direct mobile users to mobile sites was redirection. We found that nearly half of mobile web pages differ dramatically from their stationary web counterparts, and we found that the most popular websites are those most likely to have mobile-specific pages.


1:45 PM Session #3 Education
Session Chair: Ed Fox

Vertical selection in the information domain of children - Student Best Paper Award Nominee
Sergio Duarte Torres (U. Twente, Netherlands); Djoerd Hiemstra (U. Twente, Netherlands); Theo Huibers (U. Twente, Netherlands)
Abstract In this paper we explore vertical selection methods in aggregated search in the specific domain of topics for children between 7 and 12 years old. A test collection was built consisting of 25 verticals, 3.8K queries, and relevance assessments, for a large sample of these queries, that map relevant verticals to queries. We gather relevance assessments by envisaging two aggregated search systems: one in which the Web vertical is always displayed, and one in which each vertical is assessed independently from the Web vertical. We show that the two approaches lead to different sets of relevant verticals and that the former is prone to bias toward visually oriented verticals. In the second part of this paper we estimate the size of the verticals for the target domain. We show that employing the global size and domain-specific size estimation of the verticals leads to significant improvements when using state-of-the-art methods of vertical selection. We also introduce a novel vertical and query representation based on tags from social media, and we show that its use leads to a significant performance gain.
Automatic Extraction of Core Learning Goals and Generation of Pedagogical Sequences Through a Collection of Digital Library Resources
Ifeyinwa Okoye (University of Colorado at Boulder, United States); Tamara Sumner (University of Colorado at Boulder, United States)
Abstract A key challenge facing educational technology researchers is how to provide structure and guidance when learners use unstructured and open tools such as digital libraries for their own learning. This work attempts to use computational methods to identify that structure in a domain-independent way and support learners as they navigate and interpret the information they find. This article highlights a computational methodology for generating a pedagogical sequence through core learning goals extracted from a collection of resources, which in this case are resources from the Digital Library for Earth System Education (DLESE). This article describes how we use the technique of multi-document summarization to extract the core learning goals from the digital library resources and how we create a supervised classifier that performs a pair-wise classification of the core learning goals; the judgments from these classifications are used to automatically generate pedagogical sequences. Results show that we can extract good core learning goals and make pair-wise classifications that are up to 76% similar to the pair-wise classifications generated from pedagogical sequences created by two Earth Science subject experts. Thus we can dynamically generate pedagogically meaningful learning paths through digital library resources.
Building a Search Engine for Computer Science Course Syllabi
Nakul Rathod (Villanova University, United States); Lillian Cassel (Villanova University, United States)
Abstract Syllabi are rich educational resources. However, finding Computer Science syllabi on a generic search engine does not work well. Towards our goal of building a syllabus collection we have trained various Machine Learning classifiers to recognize Computer Science syllabi from other web pages and the discipline that they represent (AI or SE for instance) among other things. We have crawled 50 Computer Science departments in the US and gathered 100,000 candidate pages. Our best classifiers are more than 90% accurate at identifying syllabi from real-world data. The syllabus repository we created is live for public use [1] and contains more than 3000 syllabi that our classifiers filtered out from the crawl data. We present an analysis of the various feature selection methods and classifiers used.


1:45 PM Session #4 Information Ranking
Session Chair: Kazunari Sugiyama

Ranking experts using Author-Document-Topic graphs
Sujatha Das Gollapalli (Penn State University, United States); Prasenjit Mitra (The Pennsylvania State University, United States); C. Lee Giles (Pennsylvania State University, United States)
Abstract Expert search or recommendation involves the retrieval of people (experts) in response to a query and, on occasion, a given set of constraints. In this paper, we address expert recommendation in academic domains that are different from web and intranet environments studied in TREC. Academic corpora typically comprise scientific research publications, academic homepages and other metadata. We propose and study graph-based models for expertise retrieval for academic corpora with the objective of enabling search using either a topic (e.g. "Information Extraction") or a name (e.g. "Bruce Croft") via a uniform framework. We show that graph-based ranking schemes, despite being "generic", perform on par with expert ranking models specific to topic-based and name-based querying.
Aggregating Productivity Indices for Ranking Researchers across Multiple Areas
Harlley Lima (Universidade Federal de Minas Gerais, Brazil); Thiago Silva (Universidade Federal de Minas Gerais, Brazil); Mirella Moro (Universidade Federal de Minas Gerais, Brazil); Rodrygo Santos (Universidade Federal de Minas Gerais, Brazil); Wagner Meira Jr. (Universidade Federal de Minas Gerais, Brazil); Alberto Laender (Universidade Federal de Minas Gerais, Brazil)
Abstract The impact of scientific research has traditionally been quantified using productivity indices such as the well-known h-index. On the other hand, different research fields (in fact, even different research areas within a single field) may have very different publishing patterns, which may not be well described by a single, global index. In this paper, we argue that productivity indices should account for the singularities of the publication patterns of different research areas, in order to produce an unbiased assessment of the impact of scientific research. Inspired by ranking aggregation approaches in distributed information retrieval, we propose a novel approach for ranking researchers across multiple research areas. Our approach is generic and produces cross-area versions of any global productivity index, such as the volume of publications, citation count and even the h-index. Our thorough evaluation considering multiple areas within the broad field of Computer Science shows that our cross-area indices outperform their global counterparts when assessed against the official ranking produced by CNPq, the Brazilian National Research Council for Scientific and Technological Development. As a result, this paper contributes a valuable mechanism to support the decisions of funding bodies and research agencies, for example, in any research assessment effort.
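For readers unfamiliar with the h-index named in the abstract above, the standard global definition can be sketched as follows. The function name `h_index` and the sample citation counts are illustrative assumptions; the cross-area aggregation proposed in the paper is not specified in the abstract and is not reproduced here.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h

# Hypothetical citation counts for one researcher's papers.
print(h_index([10, 8, 5, 4, 3]))
```

Because the index depends only on raw citation counts, researchers in areas with lower typical citation volumes score lower, which is the kind of bias the abstract argues a single global index can introduce.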
IFME: Information Filtering by Multiple Examples with Under-Sampling in a Digital Library Environment
Mingzhu Zhu (Information Systems Department, New Jersey Institute of Technology, United States); Chao Xu (Information Systems Department, New Jersey Institute of Technology, United States); Yi-Fang Wu (Information Systems Department, New Jersey Institute of Technology, United States)
Abstract With the number of digitized documents increasing exponentially, it is more difficult for users to keep up to date with the knowledge in their domain. In this paper, we present a framework named IFME (Information Filtering by Multiple Examples) in a digital library environment to help users identify the literature related to their interests by leveraging Positive Unlabeled learning (PU learning). Using a few relevant documents provided by a user (the positive set, called P) and considering the documents in an online database as unlabeled data (called U), it ranks the documents in U using a PU learning algorithm. From the experimental results, we found that while the approach performed well when a large set of relevant feedback documents was available, it performed relatively poorly when relevant feedback documents were few. We improved IFME by combining PU learning with under-sampling to tune the performance. Using Mean Average Precision (MAP), our experimental results indicated that with under-sampling, the performance improved significantly even when P was small. We believe the PU learning based IFME framework offers insights for developing more effective digital library systems.
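The Mean Average Precision (MAP) measure used in the abstract above follows a standard definition, sketched here. The function names and the document identifiers are illustrative assumptions; the PU learning ranker itself is not shown because the abstract does not describe its details.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per user or query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical rankings for two users.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
print(mean_average_precision(runs))
```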
Can't see the forest for the trees? A citation recommendation system
Cornelia Caragea (United States); Adrian Silvescu (United States); Prasenjit Mitra (United States); C. Lee Giles (United States)
Abstract Scientists continue to face challenges from the ever-increasing amount of information produced worldwide during the last decades. When writing a paper, an author searches for the most relevant citations that started or were the foundation of a particular topic, which would very likely explain the thinking or algorithms that are employed. The search is usually done using specific keywords submitted to literature search engines such as Google Scholar and CiteSeer. However, finding relevant citations is distinct from producing articles that are only topically similar to an author's proposal. In this paper, we address the problem of citation recommendation using a singular value decomposition approach. The models are trained and evaluated on the CiteSeer digital library. The results of our experiments show that the proposed approach achieves significant success when compared with collaborative filtering methods on the citation recommendation task.


3:45 PM Session #5 Evaluation
Session Chair: Lillian Cassel

Comparative Appraisal: Systematic Assessment of Expressive Qualities
Melanie Feinberg (School of Information, The University of Texas at Austin, United States)
Abstract Clifford Lynch describes the value of digital libraries as adding interpretive layers to collections of cultural heritage materials. However, standard forms of evaluation, which focus on the degree to which a system solves clearly identified problems, are insufficient assessments of the expressive qualities that distinguish such interpretive content. This paper describes a form of comparative, structured appraisal that supplements the existing repertoire of assessment techniques. Comparative appraisal uses a situationally defined set of procedures to be followed by multiple assessors in examining a group of artifacts. While this approach aims for a goal of systematic comparison based on selected dimensions, it is grounded in the recognition that expressive qualities are not conventionally measurable and that absolute agreement between assessors is neither possible nor desirable. The conceptual basis for this comparative method is drawn from the literature of writing assessment.
Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Eleni Afiontzi (Department of Informatics, Athens University of Economics & Business, Greece); Giannis Kazadeis (Department of Informatics, Athens University of Economics & Business, Greece); Leonidas Papachristopoulos (Department of Archives and Library Science, Ionian University, Corfu, Greece); Michalis Sfakakis (Department of Archives and Library Science, Ionian University, Corfu, Greece); Giannis Tsakonas (Department of Archives and Library Science, Ionian University, Corfu, Greece); Christos Papatheodorou (Department of Archives and Library Science, Ionian University, Corfu, Greece)
Abstract The digital library evaluation field has an evolving nature and is characterized by a noteworthy proclivity to enfold various methodological orientations. Given that the scientific literature of the domain is vast, researchers require tools that will exhibit either commonly acceptable practices or areas for further investigation. In this paper, a data mining methodology is proposed to identify prominent patterns in the evaluation of digital libraries. Using Machine Learning techniques, all papers presented in the ECDL and JCDL conferences between 2001 and 2011 were categorized as relevant or non-relevant to the DL evaluation domain. Then, the relevant papers were semantically annotated according to the Digital Library Evaluation Ontology (DiLEO) vocabulary. The produced set of annotations was clustered into evaluation patterns for the most frequently used tools, methods and goals of the domain. Our findings highlight the expressive nature of DiLEO, emphasize semantic annotation as a necessary step in handling domain-centric corpora, and underline the potential of the proposed methodology in the profiling of evaluation activities.
Mendeley Group as a New Source of Interdisciplinarity Study: How Disciplines Interact on Mendeley?
Jiepu Jiang (School of Information Sciences, University of Pittsburgh, United States); Chaoqun Ni (School of Library and Information Science, Indiana University Bloomington, United States); Daqing He (School of Information Sciences, University of Pittsburgh, United States); Wei Jeng (School of Information Sciences, University of Pittsburgh, United States)
Abstract This paper uses Mendeley groups as a new source for the study of interdisciplinarity by examining how disciplines interact with each other in terms of sharing group members. Results show that groups in the disciplines of Medicine, Computer & Information Science, and Biological Sciences are very active, popular, and highly connected in Mendeley. Some disciplines as a whole are not among the most active but do have some groups in which many users participate. In addition, the popularity of groups in certain disciplines, e.g. Environmental Sciences, exceeds their activeness to some extent. Computer and Information Science is highly connected with other disciplines in terms of sharing members and followers. Disciplines such as Sports and Recreation are less interdisciplinary in that they do not share many group members and followers with other disciplines.
Following Bibliometric Footprints: The ACM Digital Library and the Evolution of Computer Science
Shion Guha (Cornell University, United States); Stephanie Steinhardt (Cornell University, United States); Syed Ishtiaque Ahmed (Cornell University, United States); Carl Lagoze (University of Michigan, United States)
Abstract Using scientometric methods, this exploratory work shows evidence of transitions in the field of computer science since the emergence of HCI as a distinct sub-discipline. We mine the ACM Digital Library in order to expose relationships between sub-disciplines in computer science, focusing in particular on the transformational nature of the SIG on Computer-Human Interaction (CHI) in relation to other SIGs. Our results suggest shifts in the field due to broader social, economic and political changes in computing research, and we present implications, which are intended as prolegomena to further investigations.


3:45 PM Session #6 Information Clustering
Session Chair: Xiaozhong Liu

Information-theoretic Term Weighting Schemes for Document Clustering - Vannevar Bush Best Paper Award Nominee
Weimao Ke (Drexel University, United States)
Abstract We propose a new theory that quantifies information in probability distributions and derive a new document representation model for text clustering. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed Least Information Theory (LIT) provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We derive two basic quantities in the document clustering context: 1) LI Binary (LIB), which quantifies information due to the observation of a term's (binary) occurrence in a document; and 2) LI Frequency (LIF), which measures information for the observation of a randomly picked term from the document. Both quantities are computed given term distributions in the document collection as prior knowledge and can be used separately or combined to represent documents for text clustering. Experiments on four benchmark text collections demonstrate strong performance of the proposed methods compared to classic TF*IDF. In particular, the LIB*LIF weighting scheme, which combines LIB and LIF, consistently outperforms TF*IDF in terms of multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering.
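The classic TF*IDF baseline that the abstract above compares against can be sketched as follows. This assumes one common variant (raw term frequency multiplied by the natural log of N/df); the function name `tf_idf` and the toy documents are illustrative. The LIB and LIF formulas are not given in the abstract and are not reproduced here.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: weight = raw term frequency * ln(N / document frequency).
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return weights

# Hypothetical two-document collection.
docs = [["digital", "library", "library"], ["digital", "archive"]]
weights = tf_idf(docs)
print(weights)
```

A term that occurs in every document receives weight zero under this variant, which shows how corpus-level frequencies drive the weighting.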
Exploiting Potential Citation Papers in Scholarly Paper Recommendation - Vannevar Bush Best Paper Award Nominee
Kazunari Sugiyama (National University of Singapore, Singapore); Min-Yen Kan (National University of Singapore, Singapore)
Abstract To help generate relevant suggestions for researchers, recommendation systems have started to leverage the interests latent in the publication profiles of the researchers themselves. While using such a publication citation network has been shown to enhance performance, the network is often sparse, making recommendation difficult. To alleviate this sparsity, we identify "potential citation papers" through the use of collaborative filtering. Also, as different logical sections of a paper have different significance, as a secondary contribution, we investigate which sections of papers can be leveraged to represent papers effectively. Over a scholarly paper recommendation dataset, we show that recommendation accuracy significantly outperforms state-of-the-art recommendation baselines (as measured by nDCG and MRR) when we discover potential citation papers using imputed similarities via collaborative filtering and represent candidate papers using the full text while assigning more weight to the conclusion sections.
Addressing diverse corpora with cluster-based term weighting
Peter Organisciak (University of Illinois at Urbana-Champaign, United States)
Abstract Highly heterogeneous collections present difficulties to term weighting models that are informed by corpus-level frequencies. Collections that span multiple languages or large time periods do not provide realistic statistics on which words are interesting to a system. This paper demonstrates how diverse corpora can frustrate term weighting and proposes a modification that weighs documents according to their class or cluster within the collection. In cases of diverse corpora, the proposed modification better represents the intuitions behind corpus-level document frequencies.
Interactive Search Result Clustering: A Study of User Behavior and Retrieval Effectiveness
Xuemei Gong (Drexel University, United States); Weimao Ke (Drexel University, United States); Yan Zhang (University of Texas at Austin, United States); Ramona Broussard (University of Texas at Austin, United States)
Abstract Scatter/Gather is a document browsing and information retrieval method based on document clustering. It is designed to facilitate user articulation of information needs through iterative clustering and interactive browsing. This paper reports on a study that investigated the effectiveness of Scatter/Gather browsing for information retrieval. We conducted a within-subject user study of 24 college students to investigate the utility of a Scatter/Gather system, to examine its strengths and weaknesses, and to receive feedback from users on the system. Results show that the clustering-based Scatter/Gather method was more difficult to use than the classic information retrieval systems in terms of user perception. However, clustering helped the subjects accomplish the tasks more efficiently. Scatter/Gather clustering was particularly useful in helping users finish tasks that they were less familiar with and allowed them to search with fewer words. Scatter/Gather tended to be more useful when it was more difficult for the user to do query specification for an information need. Topic familiarity and specificity had significant influences on user perceived retrieval effectiveness. The influences appeared to be greater with the Scatter/Gather system compared to a classic search system. Topic familiarity also had significant influences on query formulation.



10:30 AM Session #7 Specialist DLs
Session Chair: Michael Nelson

Tipple: Location-Triggered Mobile Access to a Digital Library for audio books - Vannevar Bush Best Paper Award Nominee
Annika Hinze (University of Waikato, New Zealand)David Bainbridge (University of Waikato, New Zealand)
Abstract This paper explores the role of audio as a means to access books in a digital library while being at the location referred to in the books. The books are sourced from the digital library and can either be accompanied by pre-recorded audio or synthesized using text-to-speech. The paper details the functional requirements, design and implementation of Tipple. The concept was extensively tested in three field studies.
Redeye: A Digital Library for Forensic Document Triage - Vannevar Bush Best Paper Award Nominee
Paul Bogen (Oak Ridge National Laboratory, United States)Amber McKenzie (Oak Ridge National Laboratory, United States)Rob Gillen (Oak Ridge National Laboratory, United States)
Abstract Forensic document analysis has become an important aspect of investigation of many different kinds of crimes from money laundering to fraud and from cybercrime to smuggling. The current workflow for analysts includes powerful tools, such as Palantir and Analyst’s Notebook, for moving from evidence to actionable intelligence and tools for finding documents among the millions of files on a hard disk, such as FTK. However, the analysts often leave the process of sorting through collections of seized documents to filter out the noise from the actual evidence to a highly labor-intensive manual effort. This paper presents the Redeye Analysis Workbench, a tool to help analysts move from manual sorting of a collection of documents to performing intelligent document triage over a digital library. We will discuss the tools and techniques we build upon in addition to an in-depth discussion of our tool and how it addresses two major use cases we observed analysts performing. Finally, we also include a new layout algorithm for radial graphs that is used to visualize clusters of documents in our system.
Local Histories in Global Digital Libraries: Identifying Demand and Evaluating Coverage
Katrina Fenlon (, United States)Virgil Varvel (, United States)
Abstract Digital collections of primary source materials have potential to change how citizen historians and scholars research and engage with local history. The problem at the heart of this study is how to evaluate local history coverage, particularly among large-scale, distributed collections and aggregations. As part of an effort to holistically evaluate one such national aggregation, the Institute of Museum and Library Services Digital Collections and Content (DCC), we conducted a national survey of reference service providers at academic and public libraries throughout the United States. In this paper, we report the results of this survey that appear relevant to local history and collection evaluation, and consider the implications for scalable evaluation of local history coverage in massive, aggregative digital libraries.
Instrument distribution and music notation search for enhancing bibliographic music score retrieval
Laurent Pugin (RISM Switzerland, Switzerland)Rodolfo Zitellini (RISM Switzerland, Switzerland)
Abstract Because of the unique characteristics of music scores, searching bibliographical music collections using traditional library systems can be a challenge. In this paper, we present two specific search functionalities added to the Swiss RISM database and how they improve the user experience. The first is a search functionality for instrument and vocal part distribution that leverages coded information available in the MarcXML records of the database. It enables scores for precise ensemble distribution to be retrieved. The second is a search functionality by music notation excerpts transcribed from the beginning of the pieces, known as music incipits. The music incipit search is achieved using a well-known music information retrieval (MIR) tool, Themefinder. A novelty of our implementation is that it can operate at three different levels (pitch, duration and metric), singularly or combined, and that it is performed through a specifically-developed intuitive graphical interface for note input and parameter selection. The two additions illustrate why it is important to take into consideration the particularities of music scores when designing a search system and how MIR tools can be beneficially integrated into existing heterogeneous bibliographic score collections.


10:30 AM Session #8 Name Extraction
Session Chair: Ron Larsen

A search engine approach to estimating temporal changes in gender orientation of first names
Brittany N. Smith (University of Illinois at Urbana-Champaign, United States)Mamta Singh (University of Illinois at Urbana-Champaign, United States)Vetle I. Torvik (University of Illinois at Urbana-Champaign, United States)
Abstract This paper presents an approach for predicting the gender orientation of any given first name over time based on a set of search engine queries comprised of the name prefixed by masculine and feminine oriented markers (e.g., “Uncle Taylor”). We hypothesize that these markers can capture the great majority of variability in gender orientation, including temporal changes. In order to test this hypothesis, we train a logistic regression model using 129 years of male/female counts of 85,406 names provided by the US Social Security Administration (SSA) in order to assign weights to the markers (measured by adjusted query results) and permit these weights to vary over time. The model misclassifies 2.25% of the people in the SSA dataset, which is slightly higher than the 1.74% pure (within name*year) error rate. Moreover, the model provides predictions for names not observed in the SSA dataset and accounts for naming conventions beyond the USA. The misclassification rate tends to increase over time with some periodic variations e.g., due to increases in immigration and name creativity. Misclassification rates are higher for rare and non-English names as well as words not exclusively used as proper first names. However, the model tends to err on the side of caution by predicting neutral/unknown rather than false positive female (or male). As hypothesized, the markers also capture temporal patterns of androgyny, and in a meaningful manner, e.g., Daughter is a stronger female predictor for recent years while Grandfather is a stronger male predictor around the turn of the 20th century. These results illustrate how a simple query-based strategy can harness the predictive power of a large collection of indexed text documents in a non-consumptive manner. The model has been implemented as a web-tool called Genni (available via that displays the predicted proportion of females vs. males over time for any given name. This should be a valuable resource for those who utilize names in order to discern gender on a large scale, e.g., to study the roles of gender and diversity in scholarly work based on digital libraries and bibliographic databases where the authors’ names are listed.
A Relevance Feedback Approach for the Author Name Disambiguation Problem
Thiago A. Godoi (Institute of Computing - University of Campinas, Brazil)Ricardo Da S. Torres (Institute of Computing - University of Campinas, Brazil)Ariadne M. B. R. Carvalho (Institute of Computing - University of Campinas, Brazil)Marcos André Gonçalves (Dept. of Computer Science - Federal University of Minas Gerais, Brazil)Anderson A. Ferreira (Dept. of Computer Science - Federal University of Ouro Preto, Brazil)Weiguo Fan (Dept. of Computer Science - Virginia Tech, United States)Edward A. Fox (Dept. of Computer Science - Virginia Tech, United States)
Abstract This paper presents a new name disambiguation method that exploits user feedback on ambiguous references across iterations. An unsupervised step is used to define pure training samples, and a hybrid supervised step is employed to learn a classification model for assigning references to authors. Our classification scheme combines the Optimum-Path Forest (OPF) classifier with complex reference similarity functions defined by a Genetic Programming framework. Performed experiments demonstrate that the proposed method yields better results than state-of-the-art disambiguation methods on two traditional datasets.
Extracting and Matching Authors and Affiliations in Scholarly Documents
Huy Do Hoang Nhat (National University of Singapore, Singapore) Muthu Kumar Chandrasekaran (National University of Singapore), Philip S. Cho (Asia Research Institute), and Min Yen Kan (Asia Research Institute & National University of Singapore)
Abstract We introduce Enlil, an information extraction system that discovers the institutional affiliations of authors in scholarly papers. Enlil consists of two steps: one that first identifies authors and affiliations using a conditional random field; and a second support vector machine that connects authors to their affiliations. We benchmark Enlil in three separate experiments drawn from three different sources: the ACL Anthology Corpus, the ACM Digital Library, and a set of cross-disciplinary scientific journal articles acquired by querying Google Scholar. Against a state-of-the-art production baseline, Enlil reports a statistically significant improvement in F1 of nearly 10% (p ≪ 0.01). In the case of multidisciplinary articles from Google Scholar, Enlil is benchmarked over both clean input (F1 > 90%) and automatically-acquired input (F1 > 80%). We have deployed Enlil in a case study involving Asian genomics research publication patterns to understand how government-sponsored collaborative links evolve. Enlil has enabled our team to construct and validate new metrics to quantify the facilitation of research as opposed to direct publication.


1:30 PM Session #9 Metadata
Session Chair: Unmil Karadkar

User-centered Approach in Creating a Metadata Schema for Video Games and Interactive Media
Jin Ha Lee (University of Washington, United States)Hyerim Cho (University of Washington, United States)Violet Fox (University of Washington, United States)Andrew Perti (Seattle Interactive Media Museum, United States)
Abstract Video games and interactive media are becoming an increasingly important part of our culture and everyday life, and consequently of archival and digital library collections. However, existing organizational systems often use vague or inconsistent terms to describe video games or attempt to use schemas designed for textual bibliographic resources. Our research aims to create a standardized metadata schema and encoding scheme that provides an intelligent and comprehensive way to represent video games. We conducted interviews with 24 gamers, focusing on their video game-related information needs and seeking behaviors. We also performed a domain analysis of current organizational systems used in catalog records and popular game websites, evaluating metadata elements used to describe games. With these results in mind, we created a list of elements which form a metadata schema for describing video games, with both a core set of 16 elements and an extended set of 46 elements providing more flexibility in expressing the nature of a game.
Automatic Tag Recommendation for Metadata Annotation Using Probabilistic Topic Modeling - Student Best Paper Award Nominee
Suppawong Tuarob (Pennsylvania State University, United States)Line C. Pouchard (Oak Ridge National Laboratory, United States)C. Lee Giles (Pennsylvania State University, United States)
Abstract The increasing complexity and advancement of ecological and environmental sciences encourage scientists across the world to collect data from multiple places, times, and thematic scales to verify their hypotheses. Accumulated over time, such data increase not only in amount, but also in the diversity of the data sources spread around the world. This poses a huge challenge for scientists who have to manually search for information. To alleviate such problems, ONEMercury has recently been implemented as part of the DataONE project to serve as a portal for accessing environmental and observational data across the globe. ONEMercury harvests metadata from the data hosted by multiple repositories and makes it searchable. However, harvested metadata records are sometimes poorly annotated or lack meaningful keywords, which could impede effective retrieval. Here, we develop algorithms for automatic annotation of metadata. We transform the problem into a tag recommendation problem with a controlled tag library, and propose two variants of an algorithm for recommending tags. Our experiments on four data sets of environmental science metadata records not only show great promise for the performance of our method, but also shed light on the different natures of the data sets.
The User-Centered Development and Testing of a Dublin Core Metadata Tool
Catherine Hall (The iSchool, Drexel University, United States)Michael Khoo (The iSchool, Drexel University, United States)
Abstract Digital libraries are supported by good quality metadata, and thus by the use of good quality metadata tools. The design of metadata tools can be supported by following user-centered design processes. In this paper we discuss the application and evaluation of several cognitively-based rules, derived from the work of Donald Norman, to the design of a metadata tool for administering Dublin Core metadata. One overall finding was that while the use of the rules supported users in their immediate interactions with the tool interface, they provided less support for the more cognitively intensive tasks associated with developing a wider conceptual understanding of the purpose of metadata. The findings have implications for the wider development of tools to support metadata work in digital libraries and allied contexts.
Identification of Works of Manga Using LOD Resources - An Experimental FRBRization of Bibliographic Data of Comic Books -
Wenling He (Graduate School of Library, Information and Media Studies, University of Tsukuba, Japan)Tetsuya Mihara (Graduate School of Library, Information and Media Studies, University of Tsukuba, Japan)Mitsuharu Nagamori (Graduate School of Library, Information and Media Studies, University of Tsukuba, Japan)Shigeo Sugimoto (Graduate School of Library, Information and Media Studies, University of Tsukuba, Japan)
Abstract Manga – a Japanese term meaning graphic novels and comics – has been globally accepted. In Japan, there are a huge number of manga monographs and magazines. Functional Requirements for Bibliographic Records (FRBR) provides useful concepts for readers to identify entities of manga, e.g., Work of manga, Manifestation of manga, and so on. This paper presents a study that identifies works of manga in a set of bibliographic records maintained by the Kyoto International Manga Museum, to help readers find a comic book as an instantiation of a work. Authority data are known to be useful for identifying works from bibliographic records. However, the authority data for manga are not rich, because manga has been regarded as a sub-culture resource and has not been included in library collections. In this study, we used DBpedia, which is a large Linked Open Data (LOD) resource created from Wikipedia, to identify FRBR entities of manga in the bibliographic records. The results of the experiment show that using LOD resources is a reasonable way to find works from bibliographic records, but also that the accuracy and efficiency depend on the quality of the LOD resources used.


1:30 PM Session #10 Web Replication
Session Chair: Rob Sanderson

Reading the Correct History? Modeling Temporal Intention in Resource Sharing
Hany Salaheldeen (Old Dominion University, United States)Michael Nelson (Old Dominion University, United States)
Abstract The web is trapped in the “perpetual now”, and when users traverse from page to page, they are seeing the state of the web resource (i.e., the page) as it exists at the time of the click and not necessarily at the time when the link was made. Thus, a temporal discrepancy can arise between the resource at the time the page author created a link to it and the time when a reader follows the link. This is especially important in the context of social media: the ease of sharing links in a tweet or Facebook post allows many people to author web content, but the space constraints combined with poor awareness by authors often prevent sufficient context from being generated to determine the intent of the post. If the links are clicked as soon as they are shared, the temporal distance between sharing and clicking is so small that there is little to no difference in content. However, not all clicks occur immediately, and a delay of days or even hours can result in reading something other than what the author intended. We introduce the concept of a user’s temporal intention upon publishing a link in social media. We investigate the features that could be extracted from the post, the linked resource, and the patterns of social dissemination to model this user intention. Finally, we analyze the historical integrity of the shared resources in social media across time. In other words, we ask how beneficial knowledge of the author’s intent is in maintaining the consistency of the story being told through social posts and in enriching the archived content coverage and depth of vulnerable resources.
An Evaluation of Caching Policies for Memento TimeMaps
Justin F. Brunelle (Old Dominion University, United States)Michael L. Nelson (Old Dominion University, United States)
Abstract As defined by the Memento Framework, TimeMaps are machine-readable lists of time-specific copies – called “mementos” – of an archived original resource. In theory, as an archive acquires additional mementos over time, a TimeMap should be monotonically increasing. However, there are reasons why the number of mementos in a TimeMap would decrease, for example: archival redaction of some or all of the mementos, archival restructuring, and transient errors on the part of one or more archives. We study TimeMaps for 4,000 original resources over a three-month period, note their change patterns, and develop a caching algorithm for TimeMaps suitable for a reverse proxy in front of a Memento aggregator. We show that TimeMap cardinality is constant or monotonically increasing for 80.2% of all TimeMap changes observed in the observation period. The goal of the caching algorithm is to exploit the ideally monotonically increasing nature of TimeMaps and not cache responses with fewer mementos than the already cached TimeMap. This new caching algorithm uses conditional cache replacement and a Time To Live (TTL) value to ensure the user has access to the most complete TimeMap available. Based on our empirical data, a TTL of 15 days will minimize the number of mementos missed by users, and minimize the load on archives contributing to TimeMaps.
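The conditional cache replacement described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class and parameter names (`TimeMapCache`, `fetch`, `ttl`) are hypothetical, a TimeMap is reduced to a list of memento URIs, and the choice to restart the TTL clock when a smaller response is rejected is an assumption. Only the 15-day TTL and the rule "do not replace a cached TimeMap with one that has fewer mementos" come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

TTL_SECONDS = 15 * 24 * 60 * 60  # the 15-day TTL reported in the abstract


@dataclass
class CacheEntry:
    mementos: List[str]  # memento URIs listed in the cached TimeMap
    fetched_at: float    # time of the last refresh attempt, in seconds


class TimeMapCache:
    """Hypothetical reverse-proxy cache keyed by original-resource URI."""

    def __init__(self, fetch: Callable[[str], List[str]], ttl: float = TTL_SECONDS):
        self._fetch = fetch  # assumed callable that asks the aggregator for a TimeMap
        self._ttl = ttl
        self._entries: Dict[str, CacheEntry] = {}

    def get(self, uri: str, now: float) -> List[str]:
        entry = self._entries.get(uri)
        if entry is not None and now - entry.fetched_at < self._ttl:
            return entry.mementos  # cached copy is still within its TTL

        fresh = self._fetch(uri)
        if entry is None or len(fresh) >= len(entry.mementos):
            # Conditional replacement: accept only TimeMaps that are at
            # least as large as the one already cached.
            entry = CacheEntry(fresh, now)
            self._entries[uri] = entry
        else:
            # Smaller response (e.g. a transient archive error): keep the
            # larger cached TimeMap. Restarting the TTL here is an assumption.
            entry.fetched_at = now
        return entry.mementos
```

A caller would pass the current time explicitly (for example `time.time()`), which also makes the expiry behaviour easy to exercise in tests.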
Extending Sitemaps for ResourceSync
Martin Klein (Los Alamos National Laboratory, United States)Herbert Van de Sompel (Los Alamos National Laboratory, United States)
Abstract The documents used in the ResourceSync synchronization framework are based on the widely adopted document format defined by the Sitemap protocol. In order to address requirements of the framework, extensions to the Sitemap format were necessary. This short paper describes the concerns we had about introducing such extensions, the tests we did to evaluate their validity, and aspects of the framework to address them.
Multimodal Alignment of Scholarly Documents and Their Presentations
Bamdad Bahrani (, Singapore)Min-Yen Kan (, Singapore)
Abstract We present a multimodal system for aligning scholarly documents to corresponding presentations in a fine-grained manner (i.e., per presentation slide and per paper section). Our method improves upon a state-of-the-art baseline that employs only textual similarity. Based on an analysis of baseline errors, we propose a three-pronged alignment system that combines textual, image, and ordering information to establish alignment. Our results show a statistically significant improvement of 25%. We further analyze the results of our system and conclude that dealing with visual elements that appear in documents is an important direction for future work in this area of scholarly communication.


3:30 PM Session #11 Data
Session Chair: Brad Hemminger

Visual-Interactive Querying for Multivariate Research Data Repositories Using Bag-of-Words
Maximilian Scherer (Technische Universität Darmstadt, Germany)Tatiana Von Landesberger (Technische Universität Darmstadt, Germany)Tobias Schreck (University of Konstanz, Germany)
Abstract Large amounts of multivariate data are collected in different areas of scientific research and industrial production. These data are collected, archived and made publicly available by research data repositories. In addition to textual, meta-data based access, content-based approaches are highly desirable to effectively retrieve, discover and analyze data sets of interest. Several such methods, e.g., that allow users to search for particular curve progressions, have been proposed. However, a major challenge when providing content-based access -- interactive feedback during query formulation -- has not received much attention yet. This is important because it substantially improves the user's search effectiveness. In this paper, we present a novel interactive feedback approach for content-based access to multivariate research data. Thereby we enable query modalities that were not available for multivariate data before. We provide instant search results and highlight query patterns in the result set. Real-time search suggestions are computed to give an overview of important patterns to look for in the data repository. We develop a bag-of-words index for multivariate data as the back-end of our approach. We apply our method to a large repository of multivariate data from the climate research domain. We describe a use-case for discovery of interesting weather phenomena using the newly developed visual-interactive query tools.
The Challenges of Digging Data: A Study of Context in Archaeological Data Reuse
Ixchel Faniel (OCLC Research, United States)Eric Kansa (University of California Berkeley, School of Information, United States)Sarah Whitcher Kansa (Alexandria Archive Institute, United States)Julianna Barrera-Gomez (OCLC Research, United States)Elizabeth Yakel (University of Michigan, School of Information, United States)
Abstract Field archaeology only recently developed centralized systems for data curation, management, and reuse. Data documentation guidelines, standards, and ontologies have yet to see wide adoption in this discipline. Moreover, repository practices have focused on supporting data collection, deposit, discovery, and access more than data reuse. In this paper we examine the needs of archaeological data reusers, particularly the context they need to understand, verify, and trust data others collect during field studies. We then apply our findings to the existing work on standards development. We find that archaeologists place the most importance on data collection procedures, but the reputation and scholarly affiliation of the archaeologists who conducted the original field studies, the wording and structure of the documentation created during field work, and the repository where the data are housed also inform reuse. While guidelines, standards, and ontologies address some aspects of the context data reusers need, they provide less guidance on others, especially those related to research design. We argue repositories need to address these missing dimensions of context to better support data reuse in archaeology.
Constructing an Anonymous Dataset From the Personal Digital Photo Libraries of Mac App Store Users
Jesse Prabawa Gozali (National University of Singapore, Singapore)Min-Yen Kan (National University of Singapore, Singapore)Hari Sundaram (Arizona State University, United States)
Abstract Personal digital photo libraries embody a large amount of information useful for research into photo organization, photo layout, and development of novel photo browser features. Even when anonymity can be ensured, amassing a sizable dataset from these libraries is still difficult due to the visibility and cost that would be required from such a study. We explore using the Mac App Store to reach more users to collect data from such personal digital photo libraries. More specifically, we compare and discuss how it differs from common data collection methods, e.g. Amazon Mechanical Turk, in terms of time, cost, quantity, and design of the data collection application. We have collected a large, openly available photo feature dataset in this manner. We illustrate the types of data that can be collected. In 60 days, we collected data from 20,778 photo sets (473,772 photos). Our study with the Mac App Store suggests that popular application distribution channels are viable means to acquire massive data collections for researchers.
Modeling Heterogeneous Data Resources for Social-Ecological Research: A Data-Centric Perspective
Miao Chen (Indiana University, United States)Umashanthi Pavalanathan (Indiana University, United States)Scott Jensen (Indiana University, United States)Beth Plale (Indiana University, United States)
Abstract Digital repositories are grappling with an influx of scientific data brought about by the well-publicized “data deluge” in science, business, and society. One particularly perplexing problem is the long-term storage of and access to complex data sets. This paper presents an integrated approach to data discovery over heterogeneous data resources in social-ecological systems research. Social-ecological systems data is complex because the research draws from both social and physical science. Using a sample set of data resources from the domain, we explore an approach to discovery and representation of this data. Specifically, we develop an ontology-based process of organization and visualization from a data-centric perspective. We define data resources broadly and identify six key categories of resources that include data collected from site visits to common pool resources, the structure of research instruments, domain concepts, research designs, publications, theories and models. We identify the underlying relationships and construct an ontology that captures these relationships using semantic web languages. The ontology and a NoSQL data store at the back end store the data resource instances. These are integrated into a portal architecture we refer to as the Integrated Visualization of Social-Ecological Resources (IViSER) that allows users to both browse the relationships captured in the ontology and easily visualize the granular details of data resources.


3:30 PM Session #12 Historical DLs
Session Chair: Edie Rasmussen

Non-Linear Book Manifolds: Learning from Associations the Dynamic Geometry of Digital Libraries
Richard Nock (CEREGMIA-UAG, France)Frank Nielsen (Sony CS Labs Tokyo, Japan)Eric Briys (CEREGMIA-UAG-Cyberlibris, Belgium)
Abstract Mainstream approaches in the design of virtual libraries basically exploit the same ambient space as their physical twins. Our paper is an attempt to rather capture automatically the actual space on which the books live, and learn the virtual library as a non-linear book manifold. This tackles tantalizing questions, chief among which whether modeling should be static and book focused (e.g. using bag of words encoding) or dynamic and user focused (e.g. relying on what we define as a bag of readers encoding). Experiments on a real-world digital library display that the latter encoding is a serious challenger to the former. Our results also show that the geometric layers of the manifold learned bring sizeable advantages for retrieval and visualization purposes. For example, the topological layer of the manifold allows to craft Manifold association rules; experiments display that they bring dramatic improvements over conventional association rules built from the discrete topology of book sets. Improvements embrace each of the following major standpoints on association rule mining: computational, support, confidence, lift, and leverage standpoint.
LSH-Based Large Scale Chinese Calligraphic Character Recognition
Yuan Lin (Zhejiang University, China)Jiangqin Wu (Zhejiang University, China)Pengcheng Gao (Zhejiang University, China)Yang Xia (Zhejiang University, China)Tianjiao Mao (Zhejiang University, China)
Abstract Chinese calligraphy is the art of handwriting and is an important part of traditional Chinese culture. However, because of the complexity of the shapes and styles of calligraphic characters, it is difficult for most people to recognize them, so a tool that helps users recognize unknown calligraphic characters would be valuable. Well-known OCR (Optical Character Recognition) technology can hardly help, because of the characters' deformation and complexity. Numerous collections of historical Chinese calligraphic works have been digitized and stored in the CADAL (China Academic Digital Associate Library) calligraphic system [1], and a huge database, CCD (Calligraphic Character Dictionary), has been built, which contains character images labeled with semantic meaning. In this paper, an LSH-based large-scale Chinese calligraphic character recognition method based on CCD is proposed. In our method, the GIST descriptor is used to represent the global features of the calligraphic character images, and LSH (locality-sensitive hashing) is used to search CCD for character images similar to the calligraphic character image to be recognized. The recognition is based on a semantic probability computed from the ranks of the retrieved images and their distances to the image to be recognized in the GIST feature space. Our experiments show that our method is effective and efficient for recognizing Chinese calligraphic character images.
Automatic Performance Evaluation of Dewarping Methods in Large Scale Digitization of Historical Documents
Maryam Rahnemoonfar (, United States)
Abstract Geometric distortions are among the most challenging issues in the analysis of historical document images. Such distortions appear as arbitrary warping, folds and page curl, and have detrimental effects on recognition (OCR) and on readability by humans. While there are many dewarping techniques in the literature, their performance cannot be compared in a standard way. In particular, there is no satisfactory method capable of comparing the results of existing dewarping techniques on arbitrarily warped documents. The existing methods either rely on visual comparison of the output and input images or depend on the recognition rate of an OCR system. In the case of historical documents, OCR either is not available or does not generate acceptable results. In this paper an objective and automatic evaluation methodology for document image dewarping techniques is presented. In the first step, all the baselines in the original distorted image as well as in the dewarped image are modelled precisely and automatically. Then, based on the mathematical function of each line, a comprehensive metric that calculates the performance of a dewarping technique is introduced. The presented method does not require user intervention at any stage of evaluation and therefore is quite objective. Experimental results, applied to two state-of-the-art dewarping methods and an industry-standard commercial system, demonstrate the effectiveness of the proposed dewarping evaluation method.
Semiautomatic Recognition and Georeferencing of Places in Early Maps
Winfried Höhn (Universität Würzburg, Germany)Hans-Günter Schmidt (University of Würzburg, Germany)Hendrik Schöneberg (University of Würzburg, Germany)
Abstract Early maps are a valuable resource for historical research, which is why digital libraries for early maps have become a necessary tool for research support in the age of information. In this article we introduce the Referencing and Annotation Tool (RAT), designed to extract information about all places displayed in a map and link them to a place on a modern map. RAT automatically recognizes place markers in an early map according to a template specified by the user and estimates the position of the annotated place in the modern map, thus making georeferencing easier. After a brief summary of related projects, we describe the functionality of the system. We discuss the most important implementation details and factors influencing recognition accuracy and performance. The advantages of our semiautomatic approach are high accuracy and a significant decrease in the user's cognitive load.



09:00 AM Session #13 Preservation II
Session Chair: Tim Cole

Access Patterns for Robots and Humans in Web Archives
Yasmin Alnoamany (Old Dominion University, United States)Michele C. Weigle (Old Dominion University, United States)Michael L. Nelson (Old Dominion University, United States)
Abstract Although user patterns in the live web are well understood, there has been no corresponding study of how users, both humans and robots, access web archives. Based on samples from the Internet Archive's public Wayback Machine, we propose a set of basic usage patterns: Dip (a single access), Slide (the same page at different archive times), Dive (different pages at approximately the same archive time), and Skim (accessing only lists of what pages are archived). Robots are limited almost exclusively to Dips and Skims, but human accesses are more varied across all four types. Robots outnumber humans 10:1 in terms of sessions, 5:4 in terms of raw HTTP accesses, and 4:1 in terms of megabytes transferred. Robots almost always access TimeMaps (95% of accesses), but humans predominantly access the archived web pages themselves (82% of accesses). In terms of unique archived web pages, there is no overall preference for time, but the recent past (within the last year) shows significant repeat accesses.
Free Benchmark Corpora for Preservation Experiments: Using Model-Driven Engineering to Generate Data Sets
Christoph Becker (Vienna University of Technology, Austria)Kresimir Duretec (Vienna University of Technology, Austria)
Abstract Digital preservation is an active area of research, and recent years have brought forward an increasing number of characterisation tools for the object-level analysis of digital content. However, there is a profound lack of objective, standardised and comparable metrics and benchmark collections to enable experimentation and validation of these tools. While fields such as Information Retrieval have for decades been able to rely on benchmark collections annotated with ground truth to enable systematic improvement of algorithms and systems along objective metrics, the digital preservation field is as yet unable to provide the necessary ground truth for such benchmarks. Objective indicators, however, are the key enabler for quantitative experimentation and innovation. This paper presents a systematic model-driven benchmark generation framework that aims to provide realistic approximations of real-world digital information collections with fully known ground truth, enabling systematic quantitative experimentation, measurement, and improvement against objective indicators. We describe the key motivation and idea behind the framework, outline the technological building blocks, and discuss results of the generation of page-based and hierarchical documents from a ground truth model. Based on a discussion of the benefits and challenges of the approach, we outline future work.
A scalable, distributed and dynamic workflow system for digitization processes
Hendrik Schöneberg (University of Würzburg, Germany)Hans-Günter Schmidt (University of Würzburg, Germany)Winfried Höhn (University of Würzburg, Germany)
Abstract Creating digital representations of ancient manuscripts, prints and maps is a challenging task due to the sources’ fragile and heterogeneous natures. Digitization requires a very specialized set of scanning hardware in order to cover the sources’ diversity. The central task is obtaining the maximum reproduction quality while minimizing the error rate, which is difficult to achieve due to the large amounts of image data resulting from digitization, putting huge computational loads on image processing modules, error-detection and information retrieval heuristics. As digital copies initially do not contain any information about their sources’ semantics, additional efforts have to be made to extract semantic metadata. This is an error-prone, time-consuming manual process, which calls for automated mechanisms to support the user. This paper introduces a decentralized, event-driven workflow system designed to overcome the above-mentioned challenges. It leverages dynamic routing between workflow components, thus being able to quickly adapt to the sources’ unique requirements. It provides a scalable approach to soften out high computational loads on single units by using distributed computing and provides modules for automated image pre- / post-processing, error-detection heuristics, data mining, semantic analysis, metadata augmentation, quality assurance and an export functionality to established publishing platforms or long-term storage facilities.
Domain-specific Image Geocoding: A Case Study on Virginia Tech Building Photos
Lin Tzy Li (Institute of Computing - UNICAMP & CPqD, Brazil)Otávio A. B. Penatti (Unicamp, Brazil)Edward A. Fox (Virginia Polytechnic Institute and State University, United States)Ricardo Da Silva Torres (Institute of Computing, University of Campinas, Brazil)
Abstract The use of map-based browser services is of great relevance in several digital libraries. The implementation of such services, however, demands the use of geocoded data collections. This paper investigates the use of several local image content representations in geocoding tasks. The experiments performed demonstrate that some of the evaluated descriptors yield effective results in the task of geocoding building photos related to the Virginia Tech April 16, 2007 school shooting tragedy.