
Text and Data Mining

A guide to resources available for text and data mining

What is Text and Data Mining?

Text and data mining (TDM) uses computational methods to extract and analyze large quantities of text files or data sets in order to quickly identify patterns and relationships. Text analytics methods include information retrieval, named entity recognition, part-of-speech tagging, sentiment analysis, network analysis, and topic modelling.

Planning Your Project

Finding Corpora

You can find and collect text for analysis from a variety of resources including library content we subscribe to, open access content, social media, and online web resources. Your librarian can help you find corpora suitable for analysis.

Some textual resources are born-digital (e.g., Wikipedia, social media); other works are digitized and converted to text by OCR (optical character recognition). The accuracy of OCR varies by resource, and your corpora may require significant OCR cleaning depending on your research needs.
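
To give a sense of what OCR cleaning can involve, here is a minimal Python sketch of a few common rule-based fixes: normalizing ligatures, rejoining words hyphenated across line breaks, and collapsing stray whitespace. The function name `clean_ocr_text` and the sample string are hypothetical and for illustration only; real corpora usually need rules tailored to the source material.

```python
import re
import unicodedata


def clean_ocr_text(raw: str) -> str:
    """Apply a few conservative, rule-based fixes to OCR output (illustrative only)."""
    # Normalize Unicode so that ligatures such as "ﬁ" become plain "fi".
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "digi-\ntized" -> "digitized".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks into spaces, keeping blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


# Hypothetical sample of OCR output, not taken from any real resource.
sample = "Some works are digi-\ntized and con-\nverted to  text.\n\nOCR quality  varies."
cleaned = clean_ocr_text(sample)
print(cleaned)
```

Note that the hyphen rule also joins genuine compounds that happen to break across lines (for example, "well-\nknown" becomes "wellknown"), so you may want to check rejoined words against a dictionary before accepting them.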

TDM Use and Copyright

Many textual resources are still under copyright, complicating full-text access. The fact that the library subscribes to a journal or database does not necessarily mean that we have TDM rights to that content. Please refer to the Resources that allow TDM tab or contact your subject librarian to find out whether you have TDM rights to the library resource you are interested in. In some cases, we can negotiate additional access for you (often at additional cost that may be borne by the researcher). Please allow sufficient lead time to negotiate additional access.

Responsible Usage

Access to the electronic resources listed on the Libraries’ website and catalog is restricted to current students, staff, and faculty for the purposes of research, teaching, and private study. The terms of each electronic resource (databases, ebooks, etc.) the library provides apply to every enrolled student, faculty member, and staff member at Washington University, as each is considered an “authorized user” allowed to use these products.

Copyright law (including the protections of fair use) and contractual license agreements govern the access, use, and reproduction of electronic resources. License terms can supersede what the law allows. Each licensed product may have more specific or additional permissions or prohibitions. All individual patrons must comply with the specific Terms of Use and/or License Agreement for the applicable electronic resource. For more information, see the Responsible Electronic Resource Use Guide.

Budgeting 

We are happy to consult with researchers on projects using existing content; however, we may be unable to provide licensing or funding for individual text-mining projects whose needs are not covered by university-wide licenses. The library is unable to pay project-by-project fees but will attempt to negotiate with the vendor for a more institutional solution. We highly encourage scholars to consider research or grant funding in these cases.
