Measuring the Elephant in the Room: Quantifying Legal Barriers to AI Processing of Full‑Text in Systematic Reviews
Abstract
Why this matters
Generative AI can speed up full‑text screening and data extraction, two of the most labour‑intensive steps in HTA/HEOR evidence synthesis. The obstacle is not only technical but also legal: copyright and license terms restrict AI inference processing of full‑text PDFs. We measured real‑world license coverage and outline practical, compliant operating paths.
Objectives
1. Quantify how many recent review articles permit commercial text‑and‑data mining (TDM)/AI inference.
2. Assess agreement between open‑access (OA) metadata sources.
3. Test whether a rights‑vetted XML service improves coverage.
4. Provide clear operating paths and guardrails for compliant RAG‑style workflows.
Results
- License visibility: OpenAlex had a record for every DOI, but only 1,584/3,712 (43%) exposed a license.
- Inconsistent metadata: 26 items labelled as closed access in OpenAlex paradoxically carried OA licenses.
- Disagreements between OpenAlex and PubMed Central (PMC): Among 440 papers with license metadata in both sources, 39 (9%) showed true license conflicts; PMC was more permissive in 15 of these cases.
- Commercially usable via open access license: 618/3,712 (17%); 3,094 remained restricted or unknown.
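
The coverage figures above can be reproduced from per-article license labels with a few lines of code. The following is a minimal sketch, not the study's actual method: the allow-list `COMMERCIAL_OK`, the lowercase label format (e.g. "cc-by"), and the function names are assumptions made for illustration, and which licenses count as commercially usable for TDM/AI inference is ultimately a legal judgment.

```python
from collections import Counter
from typing import Iterable, Optional

# Assumption: labels are short identifiers such as "cc-by" or "cc-by-nc".
# This allow-list is illustrative only, not the rule set used in the study.
COMMERCIAL_OK = {"cc-by", "cc-by-sa", "cc0", "public-domain"}


def commercial_status(license_label: Optional[str]) -> str:
    """Bucket one license label as commercial_ok, restricted, or unknown."""
    if not license_label or not license_label.strip():
        return "unknown"  # no license exposed in the metadata
    label = license_label.strip().lower()
    return "commercial_ok" if label in COMMERCIAL_OK else "restricted"


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


def coverage(labels: Iterable[Optional[str]]) -> dict:
    """Count and percentage of articles in each status bucket."""
    labels = list(labels)
    counts = Counter(commercial_status(label) for label in labels)
    return {status: (n, pct(n, len(labels))) for status, n in counts.items()}


# Hypothetical four-article sample, not data from the study.
sample = ["cc-by", None, "cc-by-nc", "cc0"]
print(coverage(sample))

# Arithmetic check against the counts reported in the Results.
print(pct(1584, 3712), pct(39, 440), pct(618, 3712))
```

Treating a missing label as "unknown" rather than "restricted" keeps the two cases separable, which matters when a second metadata source or a rights-vetted service may later fill the gap.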
