Using SQL to Build a Cohort to evaluate Diabetes & ED Visits

Updated May 29, 2025

Project: Cohort Builder – Diabetes + ED Visits

This SQL script defines a real-world-style cohort of patients with:

  • A first diagnosis of diabetes (excluding prediabetes)
  • Diagnosed before age 40
  • Who had an emergency department (ED) visit within 6 months of diagnosis

Technologies and Skills Used

  • PostgreSQL
  • Common Table Expressions (CTEs)
  • Window functions
  • Clinical reasoning applied to synthetic EHR data
  • Relational Databases

Data Source

The synthetic data comes from SyntheaTM, an open-source patient population simulation made available by The MITRE Corporation which can be found at this link. I used the 1k patient sample.

Final Cohort

patient diabetes_start_date age_at_diagnosis encounter_start
14dc5e57-1b84-3305-c042-86c9fc7e4996 12/29/2012 29 2013-02-09T04:21:38Z

Although the synthetic data only returns one patient meeting these criteria, the logic is easily extendable to other conditions or encounter types. The sample size is only 1000 patients and tweaking the criteria for age and time between diabetes diagnosis and ED visit did not increase the number of available patients.

Code Files

My GitHub repository containing the code can be found here.

References

Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, Scott McLachlan, Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record, Journal of the American Medical Informatics Association, Volume 25, Issue 3, March 2018, Pages 230–238, https://doi.org/10.1093/jamia/ocx079