Engineering at Work: Verana Health’s New Self-service Platform for Data Labeling to Help Improve Efficiency and Data Quality for Clinical Research

Authors:

Karim Abdelkader, Machine Learning Engineer; Yana Nikitina, Verana Health Vice President of Engineering

During this National Engineering Week (Feb. 19-25), we’re calling attention to the important role that Verana Health engineers play in elevating the quality of curated real-world data (RWD) to power research. One innovation our engineers have shepherded is a self-service platform that helps our data scientists, clinicians, biostatisticians, and others label and validate both structured and unstructured data, improving operational efficiency and data quality. Veranans now use this platform to create labeling jobs that classify data (i.e., text, medical images and handwritten clinician notes) into categories. Clinically labeled data is essential for training our supervised machine learning (ML) and natural language processing (NLP) models to detect patterns or structures in RWD, so that when new data is ingested, the models can make informed clinical predictions about how to classify it.

Before the self-service platform existed, a common scenario was that a Verana Health scientist who lacked the required access or technical skills to create a labeling job would ask a member of our Engineering team to create it. This was a complicated process involving multiple manual, time-consuming steps, all for a single request.

Self-service to the rescue

Faced with numerous requests to create labeling jobs, our Engineering team saw that a solution was needed to reduce inefficiencies and avoid unnecessary use of support services. In response, the team built an automated, self-service platform that allows Veranans to submit labeling jobs by creating a Jira ticket (Jira is a software tool used for project management and issue tracking).

The Jira ticket contains custom fields that translate to configuration options for the labeling job, such as the type of data being labeled, whether it is a binary or a multilabel classification job, and the labels used to categorize the data. Once these fields have been filled in and the ticket created, a Jira Automation rule kicks off and sends the values of these custom fields to GitHub, the software development platform.
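To make the mapping from ticket fields to configuration options concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not our production code: the field names (`data_type`, `classification_type`, `labels`), the comma-separated label format, and the JSON output are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class LabelingJobConfig:
    """Hypothetical configuration for one labeling job."""
    data_type: str            # e.g. "text" or "image"
    classification_type: str  # "binary" or "multilabel"
    labels: list[str]         # categories shown to the labelers


def ticket_to_config(fields: dict) -> LabelingJobConfig:
    """Normalize raw ticket field values (assumed to be strings)."""
    # Assumption: labels arrive as one comma-separated string.
    labels = [part.strip() for part in fields["labels"].split(",") if part.strip()]
    return LabelingJobConfig(
        data_type=fields["data_type"].strip().lower(),
        classification_type=fields["classification_type"].strip().lower(),
        labels=labels,
    )


def config_to_json(config: LabelingJobConfig) -> str:
    """Serialize the configuration in a stable, human-readable form."""
    return json.dumps(asdict(config), indent=2, sort_keys=True)
```

Normalizing case and whitespace at this stage means that later checks can compare values exactly, without having to account for how a person typed them into the ticket.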

When our GitHub repository receives the ticket values, the first GitHub Actions workflow converts them into a formatted configuration file that is easily readable by our system. When that process completes, the next GitHub Actions workflow runs validation tests on the configuration to ensure job health and runs data quality checks to ensure there are no null, missing or duplicate values in the data. If any test fails, an error message is sent to the Jira ticket with the failure reason and instructions for possible fixes. Once the ticket is updated with the correct values, another Jira Automation rule sends the new values to GitHub, where a GitHub Actions workflow updates the configuration file and reruns the validation tests.
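The two kinds of checks described above, configuration validation and data quality checks, might look roughly like the following Python sketch. The rules, field names (`record_id`, `text`), and message wording are assumptions chosen for illustration; the actual tests are not detailed in this post.

```python
from collections import Counter

# Assumption: only these two job types are supported.
ALLOWED_TYPES = {"binary", "multilabel"}


def validate_config(config: dict) -> list[str]:
    """Return human-readable problems found in a job configuration."""
    errors = []
    if config.get("classification_type") not in ALLOWED_TYPES:
        errors.append("classification_type must be 'binary' or 'multilabel'")
    labels = config.get("labels") or []
    if len(labels) < 2:
        errors.append("at least two labels are required")
    if len(set(labels)) != len(labels):
        errors.append("labels must be unique")
    return errors


def check_data_quality(records: list[dict],
                       id_field: str = "record_id",
                       text_field: str = "text") -> list[str]:
    """Flag null or empty values and duplicate identifiers."""
    errors = []
    # Rows whose text is missing, None, or only whitespace.
    empty_rows = [
        index for index, record in enumerate(records)
        if record.get(text_field) is None or not str(record[text_field]).strip()
    ]
    if empty_rows:
        errors.append(f"null or empty '{text_field}' in rows {empty_rows}")
    # Identifiers that appear more than once.
    counts = Counter(record.get(id_field) for record in records)
    duplicates = sorted(
        (key for key, count in counts.items() if count > 1 and key is not None),
        key=str,
    )
    if duplicates:
        errors.append(f"duplicate '{id_field}' values: {duplicates}")
    return errors
```

Returning a list of messages, rather than raising on the first failure, lets a workflow report every problem in a single comment, so the requester can fix them all before the tests rerun.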

While Jira Automation and GitHub Actions orchestrate and connect the platform, other tools and languages work behind the scenes. Python is the main language used to run the validation scripts, while Terraform is used to build and manage the associated cloud infrastructure. Additionally, as our platform works through these steps, it provides constant updates not only on the Jira ticket but also through Slack, another internal communication tool.
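As one illustration of how such status updates could be produced, the sketch below builds a message and posts it to a Slack incoming webhook using only the Python standard library. This post does not describe how the platform actually talks to Slack, so the webhook approach, the function names, and the message wording are all assumptions.

```python
import json
import urllib.request


def build_status_message(ticket_key: str, errors: list[str]) -> dict:
    """Build a Slack webhook payload summarizing a validation run."""
    if not errors:
        text = f"{ticket_key}: all validation checks passed."
    else:
        details = "\n".join(f"- {error}" for error in errors)
        text = f"{ticket_key}: validation failed.\n{details}"
    # Slack incoming webhooks accept a JSON body with a "text" field.
    return {"text": text}


def post_to_slack(webhook_url: str, message: dict) -> int:
    """POST the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping message construction separate from the network call makes the wording easy to test without contacting Slack.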

Once all tests pass, the labeling job is automatically launched in Amazon SageMaker Ground Truth on AWS (Amazon Web Services), and the clinical team can start labeling the data in a secure and managed environment.

Automation benefits 

By helping to automate and democratize the process for internal stakeholders to launch labeling jobs themselves, Verana Health can dramatically accelerate its development cycles. With the previous system, it took up to a day, on average, to manually create each labeling job. Using the new automated self-service platform, requests can be completed in five minutes, with no manual steps needed.

Speed isn’t the only tangible benefit of our self-service platform. Because each request no longer depends on manual work by an engineer, the platform can scale as necessary, avoiding the slowdowns and backlogs that build up when requests wait in a queue. Many labeling jobs can be processed in parallel without delay.

Not surprisingly, automation and self-service are boosting efficiencies across Verana Health’s quantitative sciences (QS), clinical and machine learning/artificial intelligence (MLAI) teams. Our platform significantly reduces the delays caused by waiting for manual processes and by back-and-forth communication.

Our new self-service platform also enhances the quality of labeling jobs, because the automated processes can detect duplicates and missing values that humans may overlook, keeping unnecessary data that might negatively impact downstream teams out of the pipeline.

Another benefit of our self-service platform is transparency. The entire process is visible to QS and clinical teams, and test results are immediately shared with team members; when a test fails, they are also told which steps are needed to debug and fix the issue.

Finally, the platform provides the kind of democratization of data that fuels collaboration and clinical breakthroughs by giving QS and clinical teams an accessible way to create labeling jobs directly, regardless of technical background.

Conclusion

Automated self-service for the creation of data labeling jobs significantly improves internal efficiencies at Verana Health and positively impacts the projects, deliverables and teams that rely on data labeling. Our platform accelerates the labeling process and enables it to scale while helping to prevent errors that could reduce data quality in the modules we build. The improvements in speed, scalability, efficiency, data quality, transparency and collaboration will ultimately benefit Verana Health customers in the form of cleaner RWD, and healthcare consumers through improved treatments.
