Avoid web scraping or downloading large amounts of content from databases to which the library subscribes. Publishers that allow text and data mining typically provide an API or another sanctioned means of access instead. Using it delivers the data in a stable and secure manner, keeps your work within copyright law, and helps the library comply with its subscription license terms. Failure to comply may lock you out of the content and may jeopardize other library users' access.
Resources available for text and data mining vary by publisher.
Individual publishers typically restrict access and use, both to comply with copyright law and to prevent strain on their servers. Please read any terms and conditions carefully before proceeding. Most providers limit use to non-commercial research purposes.
Publisher | Content Available | Access Information |
---|---|---|
Adam Matthew | WashU Licensed Content | |
Clarivate Analytics | Web of Science | |
Elsevier | ScienceDirect | |
JSTOR | JSTOR and Portico content | Constellate text analytics service from JSTOR |
LexisNexis | LexisNexis Licensed News only (does not include company reports or legal materials) | Contact Data Services for API access |
Springer Nature | WashU Licensed Content and Open Access content | |
SAGE Journals | WashU Licensed Content | Sage Text and Data Mining policy |
Taylor & Francis | WashU Licensed Content | Emailing support@tandfonline.com with a brief description of your planned TDM activity is strongly recommended. |
Wiley | WashU Licensed Content | Wiley Text and Data Mining Guide |
The following publishers do not permit text and data mining or require additional fees for access:
Publisher | Content Available | Access Information |
---|---|---|
arXiv | Scholarly articles (non-peer-reviewed) in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. | arXiv API access |
Caselaw Access Project | All U.S. federal and state case law | CaseLaw access policy |
Congress.gov | Structured data on legislators, bills, bill summaries, amendments, committee reports, appointee nominations, international treaties, and more. Details on specific coverage dates. | Congress.gov API documentation |
CrossRef | Metadata records with CrossRef DOIs | CrossRef API documentation |
Digital Public Library of America | Metadata on items and collections | DPLA API Codex |
Europeana | Images, newspapers, books, and audio-visual material from cultural institutions across Europe | Europeana API |
HathiTrust | Metadata, page images, and OCR for 17+ million digitized items in HathiTrust Digital Library | HathiTrust Data Availability and APIs |
Internet Archive | Wayback Machine, Open Library (books), and Internet Archive metadata | Internet Archive Developers Portal |
Library of Congress | Chronicling America historical newspapers | Chronicling America API |
Library of Congress | LC for Robots provides machine-readable access to the Library of Congress' digital collections, including images, laws and regulations, and bibliographic information. | LC for Robots Home |
National Library of Medicine | Several text mining tools for accessing various NLM databases and biomedical literature. | NLM Products and Services (filter by API) |
OECD | Datasets in the catalogue of OECD databases | OECD data for developers |
PLOS (Public Library of Science) | Access to article corpus and article metadata | |
Project Gutenberg | Over 60,000 books, usually out of copyright. No API available. Scraping available from mirror sites only. | Project Gutenberg Website Terms of Use |
PubMed Central | Open access full-text scholarly articles that have been published in biomedical and life sciences journals | |
Text Creation Partnership | Early English Books Online (EEBO), Eighteenth Century Collections Online (ECCO-TCP), and Evans Early American Imprints (Evans-TCP) | TCP Documentation |
World Bank | Development data, World Bank operations and financial data, and climate data | World Bank APIs |
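
As an illustration of working through an API rather than scraping, the following is a minimal Python sketch of a metadata search against the Crossref REST API, one of the sources listed above. It uses only the standard library. The function names, the one-second pause between requests, and the choice of returned fields are assumptions made for this example, not requirements stated by Crossref or by the library; check the provider's current documentation and terms before running anything at scale.

```python
import json
import time
import urllib.parse
import urllib.request

# Public Crossref REST API endpoint for searching works (metadata only).
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(query, rows=5, mailto=None):
    """Build a Crossref /works search URL (hypothetical helper).

    Crossref asks API users to identify themselves with a contact
    address via the `mailto` parameter.
    """
    params = {"query": query, "rows": rows}
    if mailto:
        params["mailto"] = mailto
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def extract_records(payload):
    """Return (DOI, first title) pairs from a decoded Crossref response."""
    items = payload.get("message", {}).get("items", [])
    return [(item.get("DOI"), (item.get("title") or [""])[0]) for item in items]


def fetch_records(queries, pause_seconds=1.0, mailto=None):
    """Run several searches in sequence, pausing between requests.

    The pause is an assumed courtesy delay to limit load on the server,
    not a documented rate limit.
    """
    results = {}
    for i, query in enumerate(queries):
        if i:
            time.sleep(pause_seconds)
        url = build_query_url(query, mailto=mailto)
        with urllib.request.urlopen(url, timeout=30) as response:
            results[query] = extract_records(json.load(response))
    return results
```

A call such as `fetch_records(["text mining", "digital humanities"], mailto="you@example.edu")` would return a dictionary mapping each search string to a short list of DOI and title pairs; the address shown is a placeholder. Keeping the URL construction and response parsing in separate functions makes it possible to check both without sending any network traffic.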