RESTORE-ing Confidence in Data-Driven Decision Making with New QA Tool
Dr. Andriy Miranskyy (right) and Dr. Lei Zhang (left)
Data might not be tangible, but the effect it has on our lives is very real. Statistics Canada relies on data to determine how billions of dollars in federal funds are distributed to local communities for education, healthcare and employment services. In the private sector, most companies use Big Data to establish competitive advantage and create strategic business plans. Data is arguably the most valuable resource of the 21st century, but unlike non-renewable fossil fuels, data is an exponentially growing resource.
By 2020, the accumulated volume of Big Data is expected to reach 44 trillion GB, while the business of Big Data (its collection, analysis, and sale) is projected to grow into a $203 billion industry. With data-driven decision making now the status quo across the public and private sectors, the chance of faulty data analysis, and of the social and economic harm that can result from it, is high.
In its ongoing quest to develop the most robust data analytics solutions, Toronto-based Environics Analytics (EA), one of North America’s leading data, analytics and marketing services companies, partnered with Andriy Miranskyy, Associate Professor in the Department of Computer Science at Ryerson, and his post-doctoral fellow Lei Zhang. Miranskyy and Zhang set out to tackle a specific problem that was causing inefficiencies at EA: regression testing of datasets.
With the proliferation of new data, datasets need to be refreshed periodically to reflect new or updated information. During a refresh, the old dataset is replaced by a new one, but ensuring the quality and accuracy of the new dataset can be challenging. One small error in the new data could spoil the results of the entire analysis. “Take the housing market for example,” explains Zhang, “house values may need to be updated monthly, but what if during a dataset update one property jumps from $1 million to $1 billion? This would clearly be a red flag to any human, but not a computer.” So how do analysts distinguish reasonable real-world changes from errors introduced during data capture or data transformation, in a time- and cost-efficient manner?
Since there is no off-the-shelf tool to compare datasets, Miranskyy and Zhang devised an automated testing tool that generates a report comparing two datasets. With EA, they created RESTORE: REgreSsion Testing tool fOR datasEts. Using a comprehensive set of tests to detect abnormalities in a refreshed dataset, the Ryerson researchers designed an automated test harness that ensures data meets a reasonable standard of quality before it is loaded into a software system for analysis.
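To make the idea concrete, here is a minimal sketch in Python of one kind of check a dataset regression test might run, using Zhang's housing example. It is not RESTORE's code or interface (RESTORE is an R package). The function name `compare_datasets`, the `Finding` record, the dictionary layout and the tenfold threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    key: str       # record identifier shared by both datasets
    column: str    # numeric column that changed suspiciously
    old: float
    new: float
    ratio: float   # how many times larger (or smaller) the value became


def compare_datasets(old, new, max_ratio=10.0):
    """Flag numeric values that changed by more than max_ratio between refreshes.

    old, new: dict mapping a record key to a dict of column -> number.
    max_ratio is a hypothetical threshold; a real check would be tuned per column.
    """
    findings = []
    # Only records and columns present in both datasets can be compared.
    for key in sorted(old.keys() & new.keys()):
        for column in sorted(old[key].keys() & new[key].keys()):
            before, after = old[key][column], new[key][column]
            if before == 0 or after == 0:
                # A ratio is undefined here; flag only a change to or from zero.
                if before != after:
                    findings.append(Finding(key, column, before, after, float("inf")))
                continue
            ratio = max(abs(after / before), abs(before / after))
            if ratio > max_ratio:
                findings.append(Finding(key, column, before, after, ratio))
    return findings


# Illustrative data only: one plausible monthly change and one implausible jump.
old_data = {"house-1": {"value": 1_000_000}, "house-2": {"value": 750_000}}
new_data = {"house-1": {"value": 1_000_000_000}, "house-2": {"value": 765_000}}

findings = compare_datasets(old_data, new_data)
for f in findings:
    print(f"{f.key}.{f.column}: {f.old} -> {f.new} (changed {f.ratio:.0f}x)")
```

In this sketch the modest rise in the second property passes quietly, while the jump from $1 million to $1 billion is reported for a human to review. A fixed ratio is only the simplest possible rule; a fuller test suite would likely also look at missing records, changed column types and shifts in overall distributions.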
And the results? According to Sean Howard, Senior Vice President of Product Development at EA, “The RESTORE tool helps our research compare datasets within minutes and quickly narrow down the components of the dataset that need further investigation. Since implementing RESTORE as a standard part of our data quality control,” he explains, “we have not had a major data error reported in our production systems.” Zhang notes that increased automation and an improved approach to data testing will also significantly reduce the cost of delivering products to market. “We estimate that the time of getting data to market can be reduced, which means lower company costs and improved customer satisfaction,” he says.
While the primary goal of this collaborative project was to create a tool for EA, both partners agreed that there was value in also making it generic and public. RESTORE is now available on GitHub as an open-source package for the R language, so it can be used by analysts around the world working with all types of data structures. “The benefit of publishing this online is that others can use it and give us feedback on how to advance and improve the tool, which is already happening,” says Zhang.
From policy makers in Parliament to money makers in marketing, data can make our lives better, but only if it is read, interpreted, and analyzed correctly. The collaboration between Miranskyy, Zhang and EA serves as a building block in the movement to bring lightweight software engineering practices into data science and so improve the quality of data science products. It brings us one step closer to reliable data that can be analyzed more accurately and, hopefully, put to use by businesses and government to better serve the needs of Canadians.