Measuring the Elephant in the Room: Quantifying Legal Barriers to AI Processing of Full‑Text in Systematic Reviews
Abstract
Why this matters
Generative AI can speed up full‑text screening and data extraction, two of the most labour‑intensive steps in HTA/HEOR evidence synthesis. The challenge is not only technological but also one of licensing: specifically, the copyright limitations on AI inference processing of full‑text PDFs. We measured real‑world coverage and outline practical, compliant operating paths.
Objectives
1. Quantify how many recent review articles permit commercial text‑and‑data mining (TDM)/AI inference.
2. Assess agreement between open‑access (OA) metadata sources.
3. Test whether a rights‑vetted XML service improves coverage.
4. Provide clear operating paths and guardrails for compliant RAG‑style workflows.
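As a rough illustration of objectives 1 and 2, the sketch below sorts a licence string into "commercial_ok", "restricted", or "unknown" and flags a conflict when two metadata sources disagree. The allow-list, the lowercase licence identifiers, and the function names are assumptions made for illustration; they are not the classification rules used in this study, and they are not legal advice.

```python
from typing import Optional

# Assumption: licence identifiers arrive in a lowercase short form such as
# "cc-by" or "cc-by-nc". This allow-list is illustrative only; licences with
# contested TDM status (e.g. no-derivatives variants) are left out on purpose.
COMMERCIAL_OK = {"cc-by", "cc-by-sa", "cc0", "public-domain"}


def classify_licence(licence: Optional[str]) -> str:
    """Map a licence string to 'commercial_ok', 'restricted', or 'unknown'."""
    if licence is None or not licence.strip():
        return "unknown"  # no licence exposed by the metadata source
    key = licence.strip().lower()
    return "commercial_ok" if key in COMMERCIAL_OK else "restricted"


def sources_conflict(licence_a: Optional[str], licence_b: Optional[str]) -> bool:
    """Hypothetical conflict rule: both sources expose a licence and the
    resulting classes differ. A missing licence on either side is not
    counted as a conflict."""
    class_a = classify_licence(licence_a)
    class_b = classify_licence(licence_b)
    if "unknown" in (class_a, class_b):
        return False
    return class_a != class_b
```

For example, `sources_conflict("cc-by", "cc-by-nc")` is true under this rule, while `sources_conflict("cc-by", None)` is not, because one source exposes no licence at all.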
Results
- Licence visibility: OpenAlex had a record for every DOI, but only 1,584/3,712 (43%) exposed a licence.
- Inconsistent metadata: 26 items labelled as closed access in OpenAlex paradoxically carried OA licences.
- Disagreements between OpenAlex and PubMed Central (PMC): Among 440 papers with dual metadata, 39 (9%) showed true licence conflicts; PMC was more permissive in 15 cases.
- Commercially usable via open access licence: 618/3,712 (17%); 3,094 remained restricted or unknown.
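The percentages above follow directly from the reported counts. A minimal arithmetic check, with variable names chosen here for illustration:

```python
def share(part: int, total: int) -> int:
    """Return part/total as a percentage rounded to the nearest whole number."""
    return round(100 * part / total)


# Counts as reported in the Results above.
total_dois = 3712
with_licence = 1584
commercial_ok = 618
dual_metadata = 440
true_conflicts = 39

print(share(with_licence, total_dois))      # 43
print(share(commercial_ok, total_dois))     # 17
print(share(true_conflicts, dual_metadata)) # 9
print(total_dois - commercial_ok)           # 3094 restricted or unknown
```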
