FAIR AI for Evidence Synthesis: A Human-in-the-Loop Workflow for Generating Interoperable Enterprise Knowledge
Abstract
Background: Large enterprises in the biopharmaceutical sector face a significant challenge in leveraging the vast volume of unstructured data within scientific publications. Integrating this evidence into internal R&D workflows is often hampered by a lack of data standardization, hindering downstream analytics and the creation of enterprise-wide knowledge graphs. The FAIR principles (Findable, Accessible, Interoperable, Reusable) provide a crucial framework for data stewardship, but applying them at scale to unstructured literature sources remains a major hurdle.
Methods: We present a novel workflow that operationalizes FAIR principles for evidence synthesis by integrating KiaKia, a Large Language Model (LLM)-powered platform for literature reviews, with enterprise data governance systems. At the core of this workflow are controlled vocabularies and ontologies managed by a central Data Stewardship system. This "single source of truth" is synchronized with KiaKia to guide the data extraction process. The LLM-powered extraction engine is "grounded" to these enterprise dictionaries so that extracted data points (e.g., "progression-free survival") strictly adhere to the organization's approved terminology, ensuring semantic interoperability. Furthermore, the system includes a robust "human-in-the-loop" governance model: when users encounter concepts not present in the dictionary, they can submit suggestions, which are adjudicated by review leads within a dedicated data-cleaning module. Accepted terms are then formally reviewed by Data Stewards for inclusion in the master ontology, creating a closed-loop system that enriches the enterprise vocabulary while ensuring all downstream data remains 100% compliant.
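The grounding and suggestion loop described above can be sketched as follows. This is a minimal, hypothetical illustration only: the class and method names (ControlledVocabulary, ground, approve) are invented for this sketch and do not reflect KiaKia's actual implementation or API; it assumes the dictionary can be modeled as a synonym-to-canonical-term mapping plus a review queue.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class ControlledVocabulary:
    """Hypothetical model of an enterprise dictionary: approved terms and
    their synonyms map to canonical labels; unknown terms are queued for
    Data Steward adjudication rather than emitted downstream."""
    preferred: dict[str, str] = field(default_factory=dict)  # synonym -> canonical term
    suggestions: list[str] = field(default_factory=list)     # human-in-the-loop review queue

    def ground(self, raw_term: str) -> str | None:
        """Return the canonical term for an extracted data point, or
        queue the term for review and return None (not yet compliant)."""
        key = raw_term.strip().lower()
        if key in self.preferred:
            return self.preferred[key]
        if raw_term not in self.suggestions:
            self.suggestions.append(raw_term)  # awaits review-lead / steward decision
        return None

    def approve(self, raw_term: str, canonical: str) -> None:
        """Steward accepts a suggestion, enriching the master vocabulary
        so future extractions of this term are grounded automatically."""
        self.preferred[raw_term.strip().lower()] = canonical
        self.suggestions = [s for s in self.suggestions if s != raw_term]

# Illustrative usage with made-up terms:
vocab = ControlledVocabulary(preferred={
    "pfs": "progression-free survival",
    "progression-free survival": "progression-free survival",
})
vocab.ground("PFS")              # resolves to the approved canonical label
vocab.ground("overall response")  # unknown: queued for adjudication, returns None
vocab.approve("overall response", "overall response rate")
vocab.ground("overall response")  # now grounded to the newly approved term
```

The key design point this sketch captures is that non-compliant terms are never silently passed downstream; they are held in the review queue until a human decision closes the loop.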
Results: This workflow transforms unstructured text into fully compliant, analysis-ready data. At Roche, data processed by KiaKia is seamlessly ingested by downstream systems, populating Snowflake data warehouses. This interoperability is a direct result of enforcing a common language at the point of extraction. The resulting structured data powers a suite of business analytics dashboards in platforms like Tableau, enabling researchers and decision-makers to derive insights from a harmonized, high-quality evidence base.
Conclusion: By tightly coupling the power of LLMs with rigorous, human-led data governance and existing enterprise ontologies, our workflow provides a scalable and reliable solution for creating FAIR data assets from scientific literature. This methodology not only accelerates evidence synthesis but also builds a lasting, interoperable foundation for advanced analytics and the construction of comprehensive internal knowledge graphs, turning unstructured information into a strategic enterprise asset.
