Data Analytics in Digital Library Research

The Digital Library Research Laboratory, directed by Dr. Edward Fox, professor in the Department of Computer Science, integrates the best of information retrieval — multimedia, hypermedia, visualization — with the best and most humanistic aspects of living libraries to further the dissemination of understanding.

The Digital Library Research Laboratory’s 20-plus-node Hadoop cluster in Torgersen Hall room 2070 is used for collecting, processing, and accessing tweets and web pages for research. The 10-Gbps connection to the Virginia Tech Research Network (VT-Rnet) has helped external collaborators, such as the Internet Archive, to gain faster access to the data.

Grants

The faster connection benefited the following grants which existed before the VT-Rnet project began:

  1. University of North Texas (NSF flow through supplement request): CREST Partnership Supplement: Building Capacity in Information Management through a Partnership with Virginia Tech’s Digital Library Technology Center, Fox is PI at Virginia Tech (with main grant to UTEP), 9/1/2015-5/31/2017.
  2. NIH Grant 1R01DA039456-01: The Social Interactome of Recovery: Social Media as Therapy Development; $1,674,440 for 9/15/2014-8/31/2018, PI Warren K. Bickel (VTCRI), Fox as co-PI.
  3. NSF IIS Award 1319578: III: Small: Integrated Digital Event Archiving and Library (IDEAL), $500,000 for 9/1/2013-8/31/2017. Fox is PI with co-PIs Donald Shoemaker, Andrea Kavanaugh, Steven Sheetz, and Kristine Hanna (Internet Archive, replaced by Jefferson Bailey).

The following grants were obtained in part due to the high-speed connection:

  1. NSF CMMI-1638207, CRISP Type 2/Collaborative Research: Coordinated, Behaviorally-Aware Recovery for Transportation and Power Disruptions (CBAR-tpd), $876,913 for 1/1/2017-12/31/2020, PI Pamela Murray-Tuite, Co-PIs Edward Fox, Kris Wernstedt; in collaboration with grant 1638197 for $249,921 to U. Mich. Ann Arbor, PI Seth Guikema
  2. NSF IIS Award 1619028, III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR), 1/1/2017-12/31/2019, Fox is PI with co-PIs Chandan Reddy, Andrea L. Kavanaugh, and Donald J. Shoemaker (in collaboration with Award 1619371 to Internet Archive, PI Jefferson Bailey).
  3. IMLS LG Grant 71-16-003716: Developing Library Cyberinfrastructure Strategy for Big Data Sharing and Reuse, 6/1/2016-5/31/2018, Zhiwu Xie is PI with co-PIs Tyler Walters, Edward Fox, and Pablo Tarazaga.

Theses and Dissertations

The following students completed theses and dissertations that are based, in part, on work done using the high-speed connection to the research network:

  1. S.M. Shamimul Hasan, “A Semantic Web-Based Digital Library Infrastructure to Facilitate Computational Epidemiology”, final defense Aug. 10, 2017, Ph.D. dissertation
  2. Yinlin Chen, “A High-quality Digital Library Supporting Computing Education: The Ensemble Approach”, final defense July 27, 2017, Ph.D. dissertation http://hdl.handle.net/10919/78750.
  3. Shivam Maharshi, “Performance Measurement and Analysis of Transactional Web Archiving”, May 2017, MS thesis, Computer Science, http://hdl.handle.net/10919/78371
  4. Matthew Bock, “A Framework for Hadoop Based Digital Libraries of Tweets”, May 2017, MS thesis, Computer Science, http://hdl.handle.net/10919/78351
  5. Saurabh Chakravarty, “A Large Collection Learning Optimizer Framework”, June 2017, MS thesis, Computer Science, http://hdl.handle.net/10919/78302
  6. Saket Dilip Vishwasrao, “Performance Evaluation of Web Archiving Through In-Memory Page Cache”, June 2017, MS thesis, Computer Engineering, http://hdl.handle.net/10919/78252
  7. Sunshin Lee, “Geo-Locating Tweets with Latent Location Information”, Ph.D. dissertation, defended December 2016, published 2017-02-13 in VTechWorks: http://hdl.handle.net/10919/75022
  8. Mohamed Magdy Gharib Farag, “Intelligent Event Focused Crawling”, Ph.D. dissertation, 23 September 2016, http://hdl.handle.net/10919/73035

Classes

The VT-Rnet connection has also benefited the following classes:

  1. CS5604 (Information Retrieval) was taught in spring and fall of 2016 and fall of 2017. The class was built around students working with the Hadoop cluster and VT-Rnet connection. Student team reports can be found at https://vtechworks.lib.vt.edu/handle/10919/19081.
  2. CS4624 (Multimedia, Hypertext, and Information Access) was offered in spring 2016. Several student team projects directly used our Hadoop cluster and VT-Rnet, and others learned of the connection through presentations by those teams. Student team reports can be found at https://vtechworks.lib.vt.edu/handle/10919/18655.
  3. CS6604 (Digital Libraries) was offered in spring 2017. The Hadoop cluster and VT-Rnet connection were used in several of the projects. Student team reports can be found at https://vtechworks.lib.vt.edu/handle/10919/47780.

Awards

The high-speed connection was used in work which contributed to the following awards:

  1. IEEE Fellow: Professor Fox cited for leadership in digital libraries and information retrieval – starting Jan. 1, 2017.
  2. XCaliber Award 2016 “for extraordinary contributions to technology-enriched learning activities” for the project entitled “Enhanced problem-based learning connecting big data research with classes”, with students: Mohamed Farag, Richard Gruss, Tarek Kanan, Sunshin Lee, Xuan Zhang.

Publications

The following publications contain work accomplished using the high-speed connection:

Journal and Magazine Articles

  1. Edward A. Fox, Zhiwu Xie, Martin J. Klein. Web Archiving and Digital Libraries (WADL) 2016: Highlights and Introduction to this Special Issue. Bulletin of IEEE Technical Committee on Digital Libraries, 13(1), April 2017, 3 pages, http://www.ieee-tcdl.org/Bulletin/v13n1/papers/intro.pdf.
  2. Yinlin Chen, Zhiwu Xie, and Edward A. Fox. A Library to Manage Web Archive Files in Cloud Storage. Bulletin of IEEE Technical Committee on Digital Libraries, 13(1), April 2017, 1 page, http://www.ieee-tcdl.org/Bulletin/v13n1/papers/chen.pdf.
  3. Mohamed Farag and Edward A. Fox. Which webpage should we crawl first? Social media-based webpage source importance guidance. Bulletin of IEEE Technical Committee on Digital Libraries, 13(1), April 2017, 1 page, http://www.ieee-tcdl.org/Bulletin//v13n1/papers/farag.pdf.
  4. Sunshin Lee and Edward A. Fox. Archiving and Analyzing Tweets and Webpages with the DLRL Hadoop Cluster. Bulletin of IEEE Technical Committee on Digital Libraries, 13(1), April 2017, 1 page, http://www.ieee-tcdl.org/Bulletin//v13n1/papers/lee.pdf.
  5. Edward A. Fox, Martin Klein, and Zhiwu Xie. Guest Editors’ Introduction to the Special Issue on Web Archiving. International Journal on Digital Libraries, 18, 2017. DOI: 10.1007/s00799-016-0203-5.
  6. Mohamed Magdy Gharib Farag, Sunshin Lee, Edward A. Fox. Focused Crawling for Events. International Journal on Digital Libraries, 18:1-17, 2017. DOI: 10.1007/s00799-016-0207-1.
  7. Warren Kurt Bickel, Amanda Quisenberry, Prashant Chandrasekar, Mikhail Nikolaas Koffarnus, Edward A. Fox, Chris Franck. The social interactome of recovery: Network topology influences social media engagement. Drug and Alcohol Dependence vol. 171, page e20, 2017. DOI: 10.1016/j.drugalcdep.2016.08.070.
  8. Andrea L. Kavanaugh, Steven D. Sheetz, Rodrigo Sandoval-Almazan, John C. Tedesco, Edward A. Fox. Media use during conflicts: Information seeking and political efficacy during the 2012 Mexican elections. Government Information Quarterly, 33(3): 595-602, Feb. 2016, DOI: 10.1016/j.giq.2016.01.004.
  9. Tarek Kanan, Raed Kanaan, Omar Al-Dabbas, Ghassan Kanaan, Ali Al-Dahoud, Edward Fox. Extracting Named Entities Using Named Entity Recognizer for Arabic News Articles. International Journal of Advanced Studies in Computer Science and Engineering (IJASCSE) 5(11):78-84, 11/30/2016.
  10. Tarek Kanan and Edward A. Fox. Automated Arabic Text Classification with P-Stemmer, Machine Learning, and a Tailored News Article Taxonomy. Journal of the Association for Information Science and Technology (JASIST), 67(11): 2667-2683, Nov. 2016, published online 23 Dec. 2015, DOI: 10.1002/asi.23609.

Refereed Conference and Workshop Papers

  1. Yufeng Ma, Tingting Jiang, Chandani Shrestha, Edward A. Fox, Jian Wu, C. Lee Giles. Scenarios for Advanced Services in an ETD Digital Library. In proceedings of ETD2017, the 20th international symposium on electronic theses and dissertations, Washington, DC, August 7-9, 2017. http://fox.cs.vt.edu/talks/2017/20170807ETD2017etdseer.pptx.
  2. Eduardo P.S. Castro, Saurabh Chakravarty, Eric Williamson, Denilson Alves Pereira, and Edward A. Fox. Classifying Short Unstructured Data Using the Apache Spark Platform. In Proc. ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2017), Toronto, Canada, June 19-23, 2017, 10 pages.
  3. Edward A. Fox. Introduction to Digital Libraries. In Proc. ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2017), Toronto, Canada, June 19-23, 2017, 2 pages.
  4. Edward A. Fox, Zhiwu Xie, and Martin Klein. Web Archiving and Digital Libraries (WADL). In Proc. ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2017), Toronto, Canada, June 19-23, 2017, 2 pages.
  5. Saurabh Chakravarty, Eric Williamson and Edward Fox. Classification of Tweets using Augmented Training. In Proc. WADL 2017, a workshop held in conjunction with ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2017), Toronto, Canada, June 19-23, 2017.
  6. Prashant Chandrasekar, Islam Harb, Elsa Tai, Saurabh Chakravarty, Monika Akbar, Ann Gates, Chris Franck, Warren K. Bickel, and Edward Fox. A DL framework and case studies with linked open data. Paper and presentation at RUMOUR 2017, a workshop held in conjunction with ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2017), Toronto, Canada, June 19-23, 2017.
  7. Xuan Zhang, Zhilei Qiao, Lijie Tang, Weiguo Fan, Edward A. Fox, Gang Wang: Identifying Product Defects from User Complaints: A Probabilistic Defect Model. 22nd Americas Conference on Information Systems, AMCIS 2016, San Diego, CA, USA, August 11-14, 2016. Association for Information Systems 2016. http://aisel.aisnet.org/amcis2016/Decision/Presentations/14/
  8. Venkat Srinivasan and Edward Fox. Progress toward automated ETD cataloging. In Proc. 19th International Symposium on Electronic Theses and Dissertations, ETD 2016 “Data and Dissertations”, 11-13 July 2016, Lille (France), in Session 3.1 on 7/13 “ETD metadata and cataloging”, http://etd2016.sciencesconf.org/92334.
  9. Zhiwu Xie, Krati Nayyar, and Edward A. Fox. Nearline Web Archiving. Paper presented at WADL 2016: Third International Workshop on Web Archiving and Digital Libraries, June 22-23, 2016. In connection with ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2016), Rutgers Univ., Newark, NJ, http://fox.cs.vt.edu/wadl2016.html.
  10. Mohamed Farag and Edward A. Fox. Which webpage should we crawl first? Social media-based webpage source importance guidance. Paper presented at WADL 2016: Third International Workshop on Web Archiving and Digital Libraries, June 22-23, 2016. In connection with ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2016), Rutgers Univ., Newark, NJ, http://fox.cs.vt.edu/wadl2016.html.
  11. Edward A. Fox. Introduction to Digital Libraries. In Proc. ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2016), Rutgers Univ., Newark, NJ, June 19-23, 2016, 283-284, DOI: 10.1145/2910896.2925429.
  12. Edward A. Fox, Zhiwu Xie, and Martin Klein. WADL 2016: Third International Workshop on Web Archiving and Digital Libraries. In Proc. ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2016), Rutgers Univ., Newark, NJ, June 19-23, 2016, 293-294, DOI: 10.1145/2910896.2926735.
  13. Zhiwu Xie, Andrej Galad, Yinlin Chen, and Edward Fox. Are Repositories Impeding Big Data Reuse? In Proc. 11th International Conference on Open Repositories. Trinity College, Dublin, Ireland, 13-16 June 2016.
  14. Warren K. Bickel, Amanda J. Quisenberry, Prashant Chandrasekar, Edward A. Fox, Christopher T. Franck. The Social Interactome of Recovery: Network Topology Influences Social Media Engagement. Proc. 2016 CPDD (College on Problems of Drug Dependence) Annual Meeting, selected based on abstract for oral presentation June 15, 2016.
  15. Andrea Kavanaugh, Steven D. Sheetz, Hamida Skandrani, John C. Tedesco, Yue Sun, and Edward A. Fox. The Use and Impact of Social Media during the 2011 Tunisian Revolution, Proc. 17th International Digital Government Research Conference (dg.o 2016), Fudan University, China, June 8-10, 2016, Yushim Kim and Monica Liu (Eds.). ACM, New York, NY, USA, 20-30, DOI: 10.1145/2912160.2912175.
  16. Warren K. Bickel, Prashant Chandrasekar, Mikhail Koffarnus, Edward A. Fox, Christopher T. Franck, Sandesh Bhandari, Amanda J. Quisenberry. The Social Interactome of Recovery: Social Media Network Topology Influences Relapse. 39th Annual Scientific Meeting of the Research Society on Alcoholism, June 25-29, New Orleans, Louisiana, in Alcoholism: Clinical & Experimental Research, Vol. 40, Issue S1, p. 235A, DOI: 10.1111/acer.13084.

Refereed Posters

  1. Saket Vishwasrao, Zhiwu Xie and Edward Fox. Web Archiving Through In-Memory Page Cache. Poster accepted for WADL (Web Archiving and Digital Libraries) 2017, a workshop held in conjunction with JCDL 2017, Toronto, Canada, June 19-23, 2017.
  2. Yinlin Chen, Zhiwu Xie, and Edward A. Fox. A Library to Manage Web Archive Files in Cloud Storage. Poster presented at WADL 2016: Third International Workshop on Web Archiving and Digital Libraries, June 22-23, 2016. In connection with ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2016), Rutgers Univ., Newark, NJ, http://fox.cs.vt.edu/wadl2016.html.
  3. Sunshin Lee and Edward A. Fox. Archiving and Analyzing Tweets and Webpages with the DLRL Hadoop Cluster. Poster presented at WADL 2016: Third International Workshop on Web Archiving and Digital Libraries, June 22-23, 2016. In connection with ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2016), Rutgers Univ., Newark, NJ, http://fox.cs.vt.edu/wadl2016.html.
  4. Mohamed Farag, Pranav Nakate and Edward A. Fox. Big Data Processing of School Shooting Archives. Poster, in Proc. ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2016), Rutgers Univ., Newark, NJ, June 19-23, 2016, 271-272, DOI: 10.1145/2910896.2925466.

Tutorials

  1. Edward Fox. Introduction to Digital Libraries, JCDL 2017 (ACM/IEEE), full-day on June 19, 2017, Toronto, Ontario, Canada. http://fox.cs.vt.edu/talks/2017/20170613JCDL2017FoxTutorialSlides.pptx.
  2. Edward Fox. Introduction to Digital Libraries, JCDL 2016 (ACM/IEEE), full-day on June 19, 2016, Rutgers U., Newark, NJ.