Our experience, coupled with our deep digital engineering capabilities, made us a vendor of choice as the client set out to automate data loading and eliminate operational bottlenecks.
Using Healthfirst’s ODS platform (built on Amazon S3) and Amazon Aurora PostgreSQL, we ingested data into an Amazon Redshift data mart with PySpark ETL. Key aspects of the delivered solution:
- Rearchitected the data integration layer with a parameterized, configurable open-source data transformation framework that manages end-to-end functionality
- Developed the framework using Sqoop, PySpark, AWS Glue, the AWS Glue Data Catalog, and Amazon S3
- Automated the error detection process, leading to faster root-cause identification
- Ongoing IT support for incident management with resolution SLAs
- Change Data Capture (CDC) that identifies changed records using SHA-256 hashing (SHA-2 family), with support for Type 1, Type 2, and hybrid Slowly Changing Dimensions (SCD)
- Audit component that captures attributes in maximum detail
- Restart and recovery from any point of failure
- Logging enabled by default for every process
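
To illustrate the hash-based change detection and Type 2 SCD handling listed above, here is a minimal sketch in plain Python. It is not the delivered framework, which runs on PySpark. The function names (`row_hash`, `apply_scd2`) and the column names (`member_id`, `plan`, `row_hash`, `valid_from`, `valid_to`, `is_current`) are hypothetical and chosen only for illustration.

```python
import hashlib
from datetime import date


def row_hash(row, tracked_cols):
    """Return the SHA-256 hex digest of the tracked column values.

    Assumption: values are stringified and joined with a unit separator;
    None is treated the same as an empty string.
    """
    payload = "\x1f".join("" if row[c] is None else str(row[c]) for c in tracked_cols)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def apply_scd2(dimension, incoming, key, tracked_cols, load_date):
    """Apply Type 2 history: close changed rows and append new versions.

    Returns a new list; the input dimension rows are not mutated.
    """
    result = [dict(r) for r in dimension]
    current = {r[key]: r for r in result if r["is_current"]}

    for row in incoming:
        new_hash = row_hash(row, tracked_cols)
        existing = current.get(row[key])

        # Same hash means no tracked attribute changed: nothing to do.
        if existing is not None and existing["row_hash"] == new_hash:
            continue

        # Changed record: expire the current version.
        if existing is not None:
            existing["is_current"] = False
            existing["valid_to"] = load_date

        # New or changed record: append a fresh current version.
        new_version = {
            **row,
            "row_hash": new_hash,
            "valid_from": load_date,
            "valid_to": None,
            "is_current": True,
        }
        result.append(new_version)
        current[row[key]] = new_version

    return result


# Hypothetical sample data: the first load creates two members.
first_load = [{"member_id": 1, "plan": "A"}, {"member_id": 2, "plan": "B"}]
dim_v1 = apply_scd2([], first_load, "member_id", ["plan"], date(2024, 1, 1))

# Second load: member 1 changes plan, member 2 is unchanged, member 3 is new.
second_load = [
    {"member_id": 1, "plan": "C"},
    {"member_id": 2, "plan": "B"},
    {"member_id": 3, "plan": "A"},
]
dim_v2 = apply_scd2(dim_v1, second_load, "member_id", ["plan"], date(2024, 2, 1))
```

A Type 1 variant would overwrite the tracked columns of the existing row instead of expiring it, and a hybrid dimension would apply one rule or the other per column. Comparing a single hash per row avoids column-by-column comparison, which is the usual motivation for this design.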
AWS Services Used:
- Amazon Redshift
- Amazon Aurora
- Amazon S3
- Amazon EMR
- AWS Glue