
Text and Data Mining

Library Resources

Resources available for text and data mining vary by publisher. If you do not see the resource you are looking for, please contact your subject librarian about obtaining access or where to find corpora for your research needs. 

Individual publishers typically have restrictions on access and use to comply with copyright law and to prevent strain on their servers. Please read carefully any terms and conditions before proceeding. Most providers limit use to non-commercial, research purposes only.
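Because providers restrict access partly to protect their servers, scripts that download content should pace their requests. The following is a minimal sketch of one way to do that, not any provider's required method. The function name `throttled_map` and the one-second default interval are assumptions made for illustration; always use the limits stated in the provider's terms and conditions.

```python
import time


def throttled_map(fetch, items, min_interval=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Yield fetch(item) for each item, starting calls at least
    min_interval seconds apart.

    fetch is any callable that retrieves one item (hypothetical here).
    clock and sleep are injectable so the pacing can be tested
    without actually waiting.
    """
    last_call = None
    for item in items:
        if last_call is not None:
            # Wait out whatever remains of the interval since the last call.
            remaining = min_interval - (clock() - last_call)
            if remaining > 0:
                sleep(remaining)
        last_call = clock()
        yield fetch(item)
```

For example, `list(throttled_map(download_article, article_ids, min_interval=3.0))` would call a hypothetical `download_article` function no more often than once every three seconds.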

| Publisher | Content Available | Access Information |
| --- | --- | --- |
| Adam Matthew | WashU Licensed Content | Adam Matthew Statement on Data Mining |
| Clarivate Analytics | Web of Science | Available Web of Science APIs; Clarivate Developers Portal |
| Elsevier | ScienceDirect | Elsevier text and data mining policy; Elsevier developers portal |
| JSTOR | JSTOR and Portico content (full list) | Constellate text analytics service from JSTOR |
| LexisNexis | LexisNexis Licensed News only (does not include company reports or legal materials) | Contact Data Services for API access |
| Springer Nature | WashU Licensed Content and Open Access content | Springer Nature text and data mining policy; Springer Nature API Portal |
| SAGE Journals | WashU Licensed Content | Sage Text and Data Mining policy |
| Taylor & Francis | WashU Licensed Content | We strongly recommend emailing support@tandfonline.com with a brief description of your planned TDM activity. T&F text and data mining policy |
| Wiley | WashU Licensed Content | Wiley Text and Data Mining |

The following publishers do not permit text and data mining or require additional fees for access:

Freely available content for TDM projects

In addition to the specific resources listed below, check out this list of Open Access disciplinary repositories if you are looking for scholarly literature.

If you are looking for linguistic corpora, consult the Linguistics Research Guide.

| Publisher | Content Available | Access Information |
| --- | --- | --- |
| arXiv | Scholarly articles (non peer-reviewed) in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. | arXiv API access |
| Caselaw Access Project | All U.S. federal and state case law | CaseLaw access policy |
| Congress.gov | Structured data on legislators, bills, bill summaries, amendments, committee reports, appointee nominations, international treaties, and more. Details on specific coverage dates here. | Congress.gov API documentation |
| CrossRef | Metadata records with CrossRef DOIs | CrossRef API documentation |
| Digital Public Library of America | Metadata on items and collections | DPLA API Codex |
| Europeana | Images, newspapers, books, and audio-visual material from cultural institutions across Europe | Europeana API |
| HathiTrust | Metadata, page images, and OCR for 17+ million digitized items in HathiTrust Digital Library | HathiTrust Data Availability and APIs |
| Internet Archive | Wayback Machine, Open Library (books), and Internet Archive metadata | Internet Archive Developers Portal |
| Library of Congress | Chronicling America historical newspapers | Chronicling America API |
| Library of Congress | LC for Robots provides machine-readable access to the Library of Congress' digital collections, including images, laws and regulations, and bibliographic information. | LC for Robots Home |
| National Library of Medicine | Several text mining tools for accessing various NLM databases and biomedical literature. | NLM APIs; NLM Products and Services (filter by API) |
| OECD | Datasets in the catalogue of OECD databases (full list) | OECD data for developers |
| PLOS (Public Library of Science) | Access to article corpus and article metadata | PLOS Text and Data Mining home; PLOS API |
| Project Gutenberg | Over 60,000 books, usually out of copyright. | No API available. Scraping available from mirror sites only. Project Gutenberg Website Terms of Use |
| PubMed Central | Open access full-text scholarly articles that have been published in biomedical and life sciences journals | PubMed Central text mining tools; PubMed Central Developer Portal |
| Text Creation Partnership | Early English Books Online (EEBO), Eighteenth Century Collections Online (ECCO-TCP), and Evans Early American Imprints (Evans-TCP) | TCP Documentation |
| WorldBank | Development data, World Bank operations and financial data, and climate data | World Bank APIs |
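Many of the sources above are queried by sending parameters to a web endpoint. As an illustration only, the sketch below builds a query URL for the arXiv API. The endpoint and the `search_query`, `start`, and `max_results` parameters follow arXiv's public API documentation as of this writing and should be verified there before use; the helper name `arxiv_query_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint as given in arXiv's API documentation; confirm it is still current.
ARXIV_API = "http://export.arxiv.org/api/query"


def arxiv_query_url(search_query, start=0, max_results=10):
    """Return an arXiv API URL for the given search.

    search_query uses arXiv's field prefixes, e.g. "all:electron".
    start and max_results page through the result set.
    """
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    # urlencode percent-encodes reserved characters such as ":" and spaces.
    return f"{ARXIV_API}?{urlencode(params)}"


print(arxiv_query_url("all:electron", max_results=5))
```

The arXiv API responds with an Atom feed, which can be parsed with any XML or feed library. Read arXiv's API terms of use and pace your requests accordingly.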