Visualizing data across your project – How we automated combining 7 different data sources ranging from APIs, databases, static CSVs, DHIS2, website images and email attachments(!) into a single dashboard
We’ve developed a strong partnership with Village Reach, a team that stewards a logistics management system (OpenLMIS) and works to solve health care delivery challenges in low resource communities. Their team in Mozambique recently reached out to us with a problem that we’re seeing more and more frequently among clients: the need to automatically combine data from multiple sources in the same dashboard. Usually we find the biggest problem our partners face usually isn’t around visualizing data; rather, it’s around how to pull together a bunch of real-time data from multiple sources so they can make sense of it. This means automating the process of ingesting the right data, in the right format, to generate a unified data view for use across all levels and verticals of a team. Prior to this project1, Village Reach staff in Mozambique had to manually collect raw data from two electronic systems – SELV2 and DHIS23 – four CSV data updates sent via email, and a series of JPEG images produced by a cold chain management system called NexLeaf. Their current approach was to manually compile and share via an Excel spreadsheet something that was very time consuming and limiting from a data analysis standpoint.
Since we have been collaborating closely with Village each to develop Canopy (it’s currently being used to support the OpenLMIS v3 reporting stack), Village Reach wanted to see if we could assist them in using Canopy to automatically pull these data sources together into a real-time dashboard solution. We had a tight timeline, two months from start to finish, to see if what we were developing with Canopy was up to the challenge.
The platform involves a few components that perform specialized functions:
- Canopy Flow is used to collect data from each of the 7 different sources and merge it together into a unified data view
- Canopy EDW stores the data and generates the indicator values
- And Canopy Discover visualizes those indicators in an easy-to-understand interface with self-service capabilities like modifying charts or building custom visualizations
We began with a two-pronged approach. Our data engineering team dug into the systems with APIs (DHIS2, SELV, and NexLeaf) to understand how to retrieve data from them and how they structure data. At the same time, our data analyst team worked with Village Reach to catalog the complete list of indicators across the 7 different data sources on the current dashboard, capture any future ones, and come up with technical definitions mapped to the source systems. It was a lot of data to map but fortunately our partners were well prepared and excellent in helping us understand the context during this initial period (and throughout the project) which was key to help ensuring the projects success.
Some detail on the data sources:
- SELV provides data stock levels and stock outs. Because SELV doesn’t have an API, we configured Canopy Flow to fetch facility data related to these indicators from the SELV database on a nightly basis. Although SELV only stores data at a monthly level granularity, we need to fetch it nightly because a submission could arrive at any point during the month.
- DHIS2 contains data on vaccine coverage, utilization, and wastage. Its smallest unit of information is available at the granularity of facility and month. We use Canopy Flow to fetch data from a few DHIS2 APIs on a schedule.
- e-SISTAFE provides financial data that comes in at the granularity of province and month. e-SISTAFE does not have an API, so a Ministry of Health official generates a CSV extract containing its data. There are 3 other CSV files that Ministry of Health staff compile manually that contain data on community outreach and engagement, with the granularity of district and quarter. To handle these different CSV files, we use Canopy Flow to read emails from a dedicated inbox. It parses email attachments that meet a specific set of criteria like the file type, name, columns, and data types, and rejects submissions that fail data validation checks. Because this comes in at no predictable or schedulable time, we use data in the email subject line to assign it to a reporting period.
- NexLeaf provides JPEG images of charts that are at the granularity of district and month. Unfortunately, we can’t access the raw data from this version of NexLeaf, so Ben, a member of the Village Reach team, wrote a node.js app that runs on a nightly cron job. The app scrapes images off of the web from NexLeaf’s API, and stores them in a database. We then built a component to return those images via an HTML iFrame embedded in Canopy Discover.
Once we understood the source data and the indicators, we defined a schema for how this multisource data would be merged to generate all of the desired indicators.
The end result: a single dashboard powered by Canopy Discover which combines data from 7 sources, and that updates itself automatically. Ministry of Health staff across Mozambique have a central place to view vaccine supply chain indicators, including things like coverage rates, stockouts, vaccine utilization and wastage, and financial data.
Vaccine coverage rates filtered to a particular province for the current reporting year, displaying trends over time, monthly totals, and a table for exploration. Note the actual province name has been removed.
A couple of key things we learned along the way:
- Manually-entered CSV data is tricky! Because we had some in-country Ministry of Health staff who were working with older versions of Excel, we could not rely on data validation checks built into the software. We needed to control for casing, date formatting, etc. in Canopy itself; we needed to convert “MAPUTO CIDADE” or “cidade de maputo” to “Maputo Cidade,” for example. As a result of this, we’ve been strongly recommending the use of a form-based data capture system (like Ona) instead of CSV files when it comes to manually-entered data. Note that CSVs with a standard schema and data structure are much easier to handle
- It is important to thoroughly examine source system APIs and understand the level of effort it working with them. Just because the system has an API doesn’t mean that API gives you the information you need in the format you need it. APIs that produce JPEG images are obviously very different from those that produce JSON files, and so they need to be scoped differently
To learn more, contact us at email@example.com
By Clay Crosby of Ona.
- The PAV project, PAV stands for Programa Alargado de Vacinação, the Portuguese equivalent of Expanded Program for Immunization (EPI). Its goal is to ensure vaccines are delivered to health care facilities across Mozambique, and the dashboard we’re describing here gives PAV staff a clear picture into all of the component systems involved
- Sistema Electrónico de Logística de Vacinas, an electronic logistics management system for vaccines
- District Health Information System 2, a national health database which holds aggregate, monthly, facility-level data of key health indicators