Research Guides: Text and Data Mining: Resources that allow TDM

Library Resources

Resources available for text and data mining vary by publisher.

Publishers typically restrict access and use in order to comply with copyright law and to prevent strain on their servers. Please read all terms and conditions carefully before proceeding. Most providers limit use to non-commercial research purposes.
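
Because publishers limit automated access partly to protect their servers, a script that sends repeated requests usually needs to pace itself. The following is a minimal sketch of one way to do that in Python. The `Throttle` class and the three-second interval are hypothetical choices made for illustration and are not taken from any publisher's terms; use whatever limit the provider's terms and conditions actually state.

```python
import time


class Throttle:
    """Enforce a minimum pause between successive calls to wait().

    Hypothetical helper for illustration; not part of any publisher's API.
    """

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Assumed interval; replace with the limit stated in the provider's terms.
throttle = Throttle(min_interval_seconds=3.0)
```

Calling `throttle.wait()` immediately before each request keeps successive requests at least the chosen interval apart.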

Publisher	Content Available	Access Information
Adam Matthew	WashU Licensed Content	Adam Matthew TDM information and permission request
Clarivate Analytics	Web of Science	Available Web of Science APIs; Clarivate Developers Portal
Elsevier	ScienceDirect	Elsevier text and data mining policy; Elsevier developers portal
JSTOR	JSTOR and Portico content	Constellate text analytics service from JSTOR
LexisNexis	LexisNexis Licensed News only (does not include company reports or legal materials)	Contact Data Services for API access
Springer Nature	WashU Licensed Content and Open Access content	Springer Nature text and data mining policy; Springer Nature API Portal
SAGE Journals	WashU Licensed Content	Sage Text and Data Mining policy
Taylor & Francis	WashU Licensed Content	We strongly recommend emailing support@tandfonline.com with a brief description of your planned TDM activity. T&F text and data mining policy
Wiley	WashU Licensed Content	Wiley Text and Data Mining Guide

The following publishers do not permit text and data mining or require additional fees for access:

ProQuest (ProQuest TDM Studio available at additional cost)
Factiva
EBSCO

WashU Subject Librarians
If you do not see the resource you are looking for, please contact your subject librarian about obtaining access or where to find corpora for your research needs.
Linguistics Research Guide
If you are looking for linguistic corpora, consult the Linguistics Research Guide.

Free APIs and other content for TDM projects

Index of Open Access disciplinary repositories
In addition to the specific resources listed below, check out this list of Open Access disciplinary repositories if you are looking for scholarly publications.

Publisher	Content Available	Access Information
arXiv	Scholarly articles (not peer-reviewed) in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.	arXiv API access
Caselaw Access Project	All U.S. federal and state case law	Caselaw access policy
Congress.gov	Structured data on legislators, bills, bill summaries, amendments, committee reports, appointee nominations, international treaties, and more. Details on specific coverage dates.	Congress.gov API documentation
CrossRef	Metadata records with CrossRef DOIs	CrossRef API documentation
Digital Public Library of America	Metadata on items and collections	DPLA API Codex
Europeana	Images, newspapers, books, and audio-visual material from cultural institutions across Europe	Europeana API
HathiTrust	Metadata, page images, and OCR for 17+ million digitized items in HathiTrust Digital Library	HathiTrust Data Availability and APIs
Internet Archive	Wayback Machine, Open Library (books), and Internet Archive metadata	Internet Archive Developers Portal
Library of Congress	Chronicling America historical newspapers	Chronicling America API
Library of Congress	LC for Robots provides machine-readable access to the Library of Congress' digital collections, including images, laws and regulations, and bibliographic information.	LC for Robots Home
National Library of Medicine	Several text mining tools for accessing various NLM databases and biomedical literature.	NLM APIs; NLM Products and Services (filter by API)
OECD	Datasets in the catalogue of OECD databases	OECD data for developers
PLOS (Public Library of Science)	Access to article corpus and article metadata	PLOS Text and Data Mining home; PLOS API
Project Gutenberg	Over 60,000 books, usually out of copyright. No API available. Scraping available from mirror sites only.	Project Gutenberg Website Terms of Use
PubMed Central	Open access full-text scholarly articles that have been published in biomedical and life sciences journals	PubMed Central text mining tools; PubMed Central Developer Portal
Text Creation Partnership	Early English Books Online (EEBO), Eighteenth Century Collections Online (ECCO-TCP), and Evans Early American Imprints (Evans-TCP)	TCP Documentation
World Bank	Development data, World Bank operations and financial data, and climate data	World Bank APIs
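
As an illustration of how one of the free APIs listed above might be used, the sketch below builds a query URL for the arXiv API and parses an Atom feed of the kind that API returns. The endpoint and the `search_query`, `start`, and `max_results` parameters follow the arXiv API documentation as we understand it; the function names and the sample feed are hypothetical and exist only for this example. No network request is made here, so please consult the arXiv API terms of use before fetching live data.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint as described in the arXiv API documentation (verify before use).
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_arxiv_query(search_query, start=0, max_results=10):
    """Return a query URL for the arXiv API (hypothetical helper)."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urlencode(params)


def parse_atom_entries(xml_text):
    """Extract the title and summary of each entry in an Atom feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM_NS)
        # Collapse line breaks and repeated spaces inside the text.
        entries.append({
            "title": " ".join(title.split()),
            "summary": " ".join(summary.split()),
        })
    return entries


# Made-up feed standing in for a real API response.
SAMPLE_FEED = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><title>An  example\n title</title>"
    "<summary>A short abstract.</summary></entry>"
    "</feed>"
)

url = build_arxiv_query("all:electron", max_results=5)
records = parse_atom_entries(SAMPLE_FEED)
```

In a real project, the response body fetched from `url` would take the place of `SAMPLE_FEED`, and the parsed titles and summaries would form the corpus to analyze.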