The Indiana Bioscience Research Institute (IBRI), a leading bioscience research institute based in the Midwest and a catalyst for innovation, is working toward its mission of becoming the first industry-inspired institute in the development of solutions that improve the health of people living with diabetes, metabolic disease, and poor nutrition. Amazon Web Services (AWS) and Virtusa were brought in to give the IBRI and Fuse by Cardinal Health the technical skills and infrastructure needed to assess and optimize a simulated electronic health record (EHR) dataset from Fuse so that it matches the characteristics of the IBRI’s type 2 diabetes (T2D)-related EHR datasets.
The IBRI uses these EHR datasets with its research collaborators to drive insights into patient subgroups, disease characteristics, and complications of patients with T2D, and to develop and validate T2D-related prediction models. An improved data analytics platform, together with a “realistic” simulated dataset from Fuse that matches the complexities of a real EHR dataset, has the potential to accelerate some of the IBRI’s research activities.
The institute drives its research with EHR datasets provided under appropriate data use agreements. Although this data is anonymized, it carries specific use restrictions as well as internal biases. The challenges are:
- EHR data is cleaned and processed on secure computers that are difficult for the scientists to access.
- Too much time is spent on engineering, profiling, and testing datasets.
- The database architecture for supporting and regularly accessing large amounts of data is limited.
- No alternate data sources (e.g., simulated EHR data) are available to drive certain appropriate research activities.
Virtusa’s vLife™ platform, powered by AWS, together with access to and optimization of the simulated data stored on it, helped the IBRI test secure, collaborative analytics capabilities to accelerate research and innovation:
- Dashboards: Interactive dashboards deliver real-time, aggregated insights from profiled data and provide charts and self-service visualizations.
- Machine learning (ML) model build: Features can be rapidly accessed and transformed to enable machine learning and to test model accuracy across datasets.
- Synthetic data lake: Massive amounts of data are stored and quickly accessed, with support for pre-built database views that can be queried instantly.
- Data loading and quality control: Data files placed in the data lake are loaded into the database automatically, and quality control metrics are compiled automatically.
- Simulated data generator: Simulated data can be iteratively improved to match the characteristics of the real-world EHR datasets, and machine learning is used to test deep correlations (e.g., complications or clinical variables) within the “real” vs. “simulated” data.
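To make the quality control and "real" vs. "simulated" comparison ideas above more concrete, here is a minimal Python sketch. It does not reproduce the vLife™ platform's actual method, which the case study does not describe. The column names (`hba1c`, `bmi`), the chosen metrics (missing rate, standardized mean difference, and a gap in Pearson correlation), and the toy values are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def missing_rate(values):
    """Fraction of entries that are None (a simple quality control metric)."""
    return sum(v is None for v in values) / len(values)


def standardized_mean_difference(real, simulated):
    """Difference in means scaled by the pooled standard deviation."""
    real = [v for v in real if v is not None]
    simulated = [v for v in simulated if v is not None]
    pooled = sqrt((pstdev(real) ** 2 + pstdev(simulated) ** 2) / 2)
    if pooled == 0:
        return 0.0
    return (mean(simulated) - mean(real)) / pooled


def pearson(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy) if sx and sy else 0.0


def compare(real, simulated, column_pairs):
    """Build a small report comparing two column-oriented datasets."""
    report = {"columns": {}, "correlation_gap": {}}
    for col in real:
        report["columns"][col] = {
            "missing_real": missing_rate(real[col]),
            "missing_simulated": missing_rate(simulated[col]),
            "smd": standardized_mean_difference(real[col], simulated[col]),
        }
    for a, b in column_pairs:
        # A gap near zero means the simulated data preserves the relationship.
        gap = pearson(simulated[a], simulated[b]) - pearson(real[a], real[b])
        report["correlation_gap"][(a, b)] = gap
    return report


# Made-up toy values for illustration only; not IBRI or Fuse data.
real = {"hba1c": [6.5, 7.0, 8.0, None], "bmi": [28.0, 30.0, 34.0, 31.0]}
simulated = {"hba1c": [6.5, 7.0, 8.0, 7.5], "bmi": [34.0, 30.0, 28.0, 31.0]}
report = compare(real, simulated, [("hba1c", "bmi")])
```

In this toy example the simulated `bmi` column has the same mean as the real one, so a per-column check alone would pass, yet the relationship between `hba1c` and `bmi` is reversed. The correlation gap exposes that mismatch, which is the kind of signal an iterative generator could use to decide where the simulated data needs further tuning.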
AWS Services Used
- Amazon EC2
- Amazon SageMaker
- Amazon S3
- Amazon Athena
- Amazon CloudFront
- Amazon Redshift
- Amazon RDS
- Elastic Load Balancing (ELB)
Researchers at the IBRI now have access to an AWS-based platform and a simulated diabetes-optimized EHR dataset that uses deep computer learning technologies to accelerate research related to the IBRI’s mission.
- AWS Cloud creates a single source of truth for all data, giving researchers access to real-time data to make better decisions faster.
- Machine learning automates the tedious work of data engineering, profiling, and testing, keeping researchers’ focus on discovering new and innovative solutions.
- The realistic simulated EHR data provides an alternate tool for research collaborations with numerous advantages detailed in the references listed below.
Article: The New Synthetic
Article: Unlock the Power of Simulated Data to Accelerate Research