Dashboards Powered by Multiple Datasources

November 2, 2018 in Client projects, Data visualization

Visualizing data across your project – how we automated combining 7 different data sources, ranging from APIs, databases, and static CSVs to DHIS2, website images, and email attachments(!), into a single dashboard

We've developed a strong partnership with Village Reach, a team that stewards a logistics management system (OpenLMIS) and works to solve health care delivery challenges in low-resource communities. Their team in Mozambique recently reached out to us with a problem we're seeing more and more frequently among clients: the need to automatically combine data from multiple sources in the same dashboard. We find the biggest problem our partners face usually isn't visualizing data; rather, it's pulling together real-time data from multiple sources so they can make sense of it. This means automating the process of ingesting the right data, in the right format, to generate a unified data view for use across all levels and verticals of a team.

Prior to this project, Village Reach staff in Mozambique had to manually collect raw data from two electronic systems – SELV and DHIS2 – plus four CSV data updates sent via email and a series of JPEG images produced by a cold chain management system called NexLeaf. They compiled everything by hand and shared it in an Excel spreadsheet, which was time consuming and limiting from a data analysis standpoint.

Since we have been collaborating closely with Village Reach to develop Canopy (it's currently being used to support the OpenLMIS v3 reporting stack), Village Reach wanted to see if we could assist them in using Canopy to automatically pull these data sources together into a real-time dashboard solution. We had a tight timeline, two months from start to finish, to see if what we were developing with Canopy was up to the challenge.

The platform involves a few components that perform specialized functions:

  • Canopy Flow collects data from each of the 7 different sources and merges it into a unified data view
  • Canopy EDW stores the data and generates the indicator values
  • Canopy Discover visualizes those indicators in an easy-to-understand interface with self-service capabilities like modifying charts or building custom visualizations
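
As a rough mental model of how these pieces fit together, the platform behaves like a collect, store-and-compute, publish pipeline. The sketch below is an illustration in Python, not Canopy's actual code; every function name and field is invented:

    # Illustrative only: a toy collect -> store/compute -> publish pipeline.
    # Canopy Flow, EDW, and Discover are real products; the functions and fields
    # below are invented stand-ins to show the shape of the data flow.

    def collect_sources():
        """Canopy Flow role: pull raw records from each configured source."""
        return [
            {"source": "SELV", "facility": "F001", "period": "2018-10", "stock_on_hand": 120},
            {"source": "DHIS2", "facility": "F001", "period": "2018-10", "doses_administered": 95},
        ]

    def store_and_compute(records):
        """Canopy EDW role: merge records into one view keyed by facility and period."""
        merged = {}
        for record in records:
            key = (record["facility"], record["period"])
            values = {k: v for k, v in record.items() if k not in ("source", "facility", "period")}
            merged.setdefault(key, {}).update(values)
        return merged

    def publish(merged):
        """Canopy Discover role: expose the unified view to dashboards (printed here)."""
        for (facility, period), values in merged.items():
            print(facility, period, values)

    publish(store_and_compute(collect_sources()))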

We began with a two-pronged approach. Our data engineering team dug into the systems with APIs (DHIS2, SELV, and NexLeaf) to understand how to retrieve data from them and how they structure data. At the same time, our data analyst team worked with Village Reach to catalog the complete list of indicators across the 7 different data sources on the current dashboard, capture any future ones, and come up with technical definitions mapped to the source systems. It was a lot of data to map, but fortunately our partners were well prepared and excellent at helping us understand the context during this initial period (and throughout the project), which was key to ensuring the project's success.

Some detail on the data sources:

  • SELV provides data on stock levels and stockouts. Because SELV doesn't have an API, we configured Canopy Flow to fetch facility data related to these indicators from the SELV database on a nightly basis. Although SELV only stores data at monthly granularity, we need to fetch it nightly because a submission could arrive at any point during the month.
  • DHIS2 contains data on vaccine coverage, utilization, and wastage. Its smallest unit of information is available at the granularity of facility and month. We use Canopy Flow to fetch data from a few DHIS2 APIs on a schedule (a simplified sketch of these scheduled pulls, for both SELV and DHIS2, follows this list).
  • e-SISTAFE provides financial data at the granularity of province and month. e-SISTAFE does not have an API, so a Ministry of Health official generates a CSV extract containing its data. There are 3 other CSV files that Ministry of Health staff compile manually, containing data on community outreach and engagement at the granularity of district and quarter. To handle these different CSV files, we use Canopy Flow to read emails from a dedicated inbox. It parses email attachments that meet a specific set of criteria (file type, name, columns, and data types) and rejects submissions that fail data validation checks (see the validation sketch after this list). Because these files arrive at no predictable or schedulable time, we use data in the email subject line to assign each submission to a reporting period.
  • NexLeaf provides JPEG images of charts at the granularity of district and month. Unfortunately, we can't access the raw data from this version of NexLeaf, so Ben, a member of the Village Reach team, wrote a node.js app that runs on a nightly cron job. The app pulls the chart images from NexLeaf's API and stores them in a database. We then built a component to return those images via an HTML iframe embedded in Canopy Discover.
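
For the two systems Canopy Flow polls on a schedule, the fetch step boils down to a nightly database query (SELV) or a REST call (DHIS2). The sketch below is a simplified illustration rather than Canopy Flow's actual code: it assumes SELV sits on a PostgreSQL database, the table, column, and dimension names are placeholders, and the DHIS2 call uses the generic analytics endpoint.

    import psycopg2   # assumes a PostgreSQL backend for SELV; schema below is hypothetical
    import requests

    def fetch_selv_stock(conn_params, period):
        """Nightly pull of facility stock data for the current month from the SELV database."""
        with psycopg2.connect(**conn_params) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    """
                    SELECT facility_id, product_code, stock_on_hand, stocked_out
                    FROM facility_stock            -- hypothetical table name
                    WHERE reporting_period = %s
                    """,
                    (period,),
                )
                return cur.fetchall()

    def fetch_dhis2_values(base_url, auth, data_element, org_unit, period):
        """Scheduled pull of one indicator slice from the DHIS2 analytics API."""
        response = requests.get(
            f"{base_url}/api/analytics.json",
            params={"dimension": [f"dx:{data_element}", f"ou:{org_unit}", f"pe:{period}"]},
            auth=auth,       # e.g. ("reporting_user", "********")
            timeout=60,
        )
        response.raise_for_status()
        return response.json()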
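
The emailed CSVs are the least predictable source, so most of the work is validation. Here is a simplified sketch of the kind of checks described above; the expected columns, filename pattern, and subject-line convention are assumptions made for the example, not the project's real configuration.

    import csv
    import io
    import re

    EXPECTED_COLUMNS = ["province", "district", "indicator", "value"]        # hypothetical schema
    FILENAME_PATTERN = re.compile(r"^community_outreach_.*\.csv$", re.IGNORECASE)
    SUBJECT_PERIOD = re.compile(r"(\d{4})\s*Q([1-4])")                       # e.g. "Dados 2018 Q3"

    def parse_submission(subject, filename, attachment_bytes):
        """Validate one emailed CSV attachment and tag it with a reporting period.
        Returns (rows, period); raises ValueError so bad submissions can be rejected."""
        if not FILENAME_PATTERN.match(filename):
            raise ValueError(f"Unexpected attachment name: {filename}")

        period_match = SUBJECT_PERIOD.search(subject)
        if not period_match:
            raise ValueError(f"No reporting period found in subject: {subject!r}")
        period = f"{period_match.group(1)}Q{period_match.group(2)}"

        reader = csv.DictReader(io.StringIO(attachment_bytes.decode("utf-8-sig")))
        columns = [c.strip().lower() for c in (reader.fieldnames or [])]
        if columns != EXPECTED_COLUMNS:
            raise ValueError(f"Unexpected columns: {reader.fieldnames}")

        rows = []
        for line_number, row in enumerate(reader, start=2):   # header is line 1
            try:
                row["value"] = float(row["value"])
            except (TypeError, ValueError):
                raise ValueError(f"Non-numeric value on line {line_number}: {row['value']!r}")
            rows.append(row)
        return rows, period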

Once we understood the source data and the indicators, we defined a schema for how this multisource data would be merged to generate all of the desired indicators.
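
The schema itself isn't reproduced in this post, but conceptually it maps each indicator to a source system and a native granularity, then rolls finer-grained values up to a shared level so the sources can sit side by side. A loose illustration follows; all names and grains below are hypothetical simplifications.

    from collections import defaultdict

    # Hypothetical indicator-to-source mapping; the real project schema is more detailed.
    INDICATOR_GRAIN = {
        "stock_on_hand":      ("SELV",      ("facility", "month")),
        "vaccine_coverage":   ("DHIS2",     ("facility", "month")),
        "vaccine_wastage":    ("DHIS2",     ("facility", "month")),
        "budget_execution":   ("e-SISTAFE", ("province", "month")),
        "community_outreach": ("email CSV", ("district", "quarter")),
    }

    def roll_up_to_province(records, facility_to_province):
        """Aggregate facility/month values to province/month so they can be compared
        with sources that only report at the province level."""
        totals = defaultdict(float)
        for record in records:   # record = {"facility": ..., "period": ..., "value": ...}
            province = facility_to_province[record["facility"]]
            totals[(province, record["period"])] += record["value"]
        return dict(totals)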

The end result: a single dashboard powered by Canopy Discover that combines data from 7 sources and updates itself automatically. Ministry of Health staff across Mozambique now have a central place to view vaccine supply chain indicators, including coverage rates, stockouts, vaccine utilization and wastage, and financial data.

Provincial-level utilization rates, aggregated for the current reporting year and trended over time. Note actual province names have been removed.


Vaccine coverage rates filtered to a particular province for the current reporting year, displaying trends over time, monthly totals, and a table for exploration. Note the actual province name has been removed.


A couple of key things we learned along the way:

  • Manually-entered CSV data is tricky! Because some in-country Ministry of Health staff were working with older versions of Excel, we could not rely on data validation checks built into the software. We needed to control for casing, date formatting, etc. in Canopy itself; for example, we needed to convert "MAPUTO CIDADE" or "cidade de maputo" to "Maputo Cidade" (a small normalization sketch follows this list). As a result, we've been strongly recommending a form-based data capture system (like Ona) instead of CSV files for manually-entered data. Note that CSVs with a standard schema and data structure are much easier to handle.
  • It is important to thoroughly examine source system APIs and understand the level of effort involved in working with them. Just because a system has an API doesn't mean that API gives you the information you need in the format you need it. APIs that produce JPEG images are obviously very different from those that produce JSON, so they need to be scoped differently.
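
The normalization itself is mundane but essential. Here is a sketch of the idea for the province-name example above; it is not Canopy's actual cleaning code, and a real alias table would cover every province and district.

    import unicodedata

    # Aliases seen in hand-compiled spreadsheets, mapped to a canonical name.
    # Only the "Maputo Cidade" variants from the example above are shown.
    ALIASES = {
        "maputo cidade": "Maputo Cidade",
        "cidade de maputo": "Maputo Cidade",
    }

    def strip_accents(text):
        return "".join(c for c in unicodedata.normalize("NFKD", text) if not unicodedata.combining(c))

    def canonical_province(raw):
        """Normalize casing, accents, and stray whitespace, then map known aliases."""
        key = " ".join(strip_accents(raw).lower().split())
        return ALIASES.get(key, raw.strip().title())

    assert canonical_province("MAPUTO CIDADE") == "Maputo Cidade"
    assert canonical_province("cidade de maputo") == "Maputo Cidade"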

To learn more, contact us at hello@canopyinsights.com

By Clay Crosby of Ona.

GDPR and PII: Canopy Has You Covered

April 24, 2018 in Uncategorized

The European Union has introduced important new privacy legislation that will have wide-ranging effects on how most organizations handle data related to people. This legislation, the General Data Protection Regulation (GDPR), focuses primarily on the collection of personally identifiable information (PII) and goes into effect May 25th, 2018.

The GDPR applies to any organization that collects or processes PII on EU residents. In today's global world, any organization, large or small, operating anywhere in the world, could be affected by this regulation if an EU resident chooses to use its service. And the penalties are steep.

Stop the Madness! Enterprise Data Warehouses Are Better Than Custom Data Integration Tools

February 8, 2018 in General


Everyone is interested in the fancy outputs of mobile data collection: eye-catching and informative web-based data visualizations. Yet few people care about the mess of intermediate analytics and custom data engineering work needed to fuse multiple raw data sources into a single database and then visualize that data coherently.

The easiest way to handle the messy intermediate work is to create a custom solution, i.e. a transactional database combined with a common web framework. This seems like a reasonable solution that fulfills project requirements.

How harmful could custom data integration tools be? Quite harmful, unfortunately.

For unique problems that only need to be solved once, this solution makes sense. However, data integration problems are not unique. Problems such as connecting to other systems (e.g. DHIS2, OpenMRS, iHRIS), adjusting or cleaning raw data, and fine-tuning charts and other visualizations need to be solved in almost every project.

We all end up rebuilding the same thing over and over. Worse still, custom solutions become unmaintained, dated and eventually discarded abandonware after funding ceases.

At Ona, we are software solution builders and often see these patterns, but we do not have the time or the budget to do anything about them. Instead, we write good code – extensible, built on common standards, and thoroughly tested – but this narrow focus on technical excellence has let us overlook the fact that we have spent more resources building a series of separate web applications than we would have if we had invested in a long-term solution.

Open source data tools are the answer

What are our options? Going back and combining these separate applications into a single platform is untenable, because we would be forced to accommodate design decisions made for those individual applications, which becomes an endless game of whack-a-mole with every upstream code change.

We started looking for an answer by looking at our previous work. Every tool we’ve built breaks down into three components:

  1. Aggregation
  2. Merging
  3. Analytics and Visualization

Well-architected and well-maintained open source solutions exist for each of these components.

Open source data tools aren’t enough

Although these tools exist, they are hard to use. First, they are not packaged in an easy-to-use way, sometimes purposefully. Second, they are built to scale and solve harder technical problems rarely encountered in ICT4D to date.

They are built to operate with high availability, high scalability, high throughput, and low latency – making them challenging to set up and operate in developing country contexts.

So our new problem is: how can we make the awesome but complicated existing tools easier to use?

Enter the Enterprise Data Warehouse (EDW), the modular, open source tool we’re building to make integrating existing data tools simple. The EDW will fill a huge need in the ICT4D open source community and let us build better solutions faster while being adaptable to the changing needs of our partners.

Side note: building an EDW is a significant challenge (read: big investment) – so much so that it is the competitive advantage of a number of companies.

EDW tools should be common in ICT4D

We see building an Enterprise Data Warehouse as a disruptive change for ICT4D. A well-tested, industry-standard software package will make it easy for groups to solve the data problems they are facing today. At the same time, an EDW means faster integration with promising new technologies like artificial intelligence, comprehensive data security, and the Internet of Things.

To accelerate this shift in the industry, Ona has released open source Ansible roles to automatically set up and configure data warehouse components.

Additionally, we offer our partners an EDW solution that addresses the unique needs of the ICT4D space including data sovereignty, aggregating data across heterogeneous data sources, mixed deployments that are both in the cloud and on-premise, limited hardware resources, and limited or unstable network connectivity.

Program monitoring based on real-time data is an important part of all the projects we are involved in, and it will become essential to any project's success in the increasingly network-connected future (e.g. the Internet of Things). Building these tools the correct way, by custom building as little as possible, is the most cost-effective and, we believe, the best approach.

The EDW prepares us for what’s next

An ecosystem of tools covering all the use-cases we outlined above has only become viable in the past year or so, due to two recent changes: first, these tools have matured and become widely available; second, successful projects now require real-time feedback and expect to be used at scale. These changes have shifted the balance and justify a more thoughtful approach to data systems tooling.

An Enterprise Data Warehouse built from open source components is the perfect solution for this job. It future-proofs our tools and our partners’ projects. We can operate efficiently from village-scale to planet-scale, and when we need to incorporate new technologies, it’s as close to plug-and-play as possible.

For example, a global organization that is deploying what it expects to grow into a planet-scale SMS-reporting tool will eventually have datasets in the high terabytes and then petabytes. To handle a rapidly growing dataset of this size, process it in real time, and accomplish its mission effectively, this organization will need an EDW that can accommodate that growth while still handling its current smaller-scale needs at a reasonable cost.

In addition to reducing duplicate work, our approach – implementing on top of open, standards-based data platforms – will mean solutions that cost less and give builders more flexibility. These are essential features of successful ICT4D projects, which supports our raison d'être at Ona: to build technology solutions that improve people's lives.

Enterprise Data Warehouse Example

In the context of health systems, our streaming data architecture means we can already create a single pipeline that receives information from an electronic medical record system, enhances it with demographic data, and then visualizes indicators on a website, all without building custom software.

This is what it looks like:

[Diagram: Enterprise Data Warehouse]
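
As a toy, code-level illustration of what that pipeline does (in practice this is configuration of existing components rather than custom code, and the field names and demographic lookup here are invented), the enrichment step is essentially a join against reference data before indicators are derived:

    # Illustrative only: a stand-in for the enrichment stage of the streaming pipeline.
    DEMOGRAPHICS = {
        "district-01": {"population_under_1": 4200, "region": "Northern"},
    }

    def enrich(event, demographics=DEMOGRAPHICS):
        """Join one medical-record event with demographic reference data."""
        return {**event, **demographics.get(event["district"], {})}

    def coverage_by_district(events):
        """Example derived indicator: doses administered per child under one, by district."""
        doses, population = {}, {}
        for event in map(enrich, events):
            district = event["district"]
            doses[district] = doses.get(district, 0) + event.get("doses_administered", 0)
            population[district] = event.get("population_under_1", 0)
        return {d: doses[d] / population[d] for d in doses if population.get(d)}

    print(coverage_by_district([{"district": "district-01", "doses_administered": 350}]))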


Using industry-standard data platforms lets us reconfigure and reuse the same system for different health use-cases or for any particular data needs a client might have. We can also extend this system by adding machine learning tools and connecting them to existing platforms, products, and data.

Most importantly, our clients can access the visualization and data ingestion platform themselves. They can play with the charts and data pipelines to discover uses we would have never imagined.

If you are interested in exploring how an Enterprise Data Warehouse could help address your needs please leave a comment below.

By Roger Wong and Peter Lubell-Doughtie of Ona.

We'd like to acknowledge the World Health Organization, Johns Hopkins University, VillageReach, the Bill and Melinda Gates Foundation, and Johnson & Johnson for supporting us in this work.