Open data that never retires: the new data lifecycle

‘Open data’ is data that can be freely used, shared and built on by anyone, anywhere, for any purpose. Open data plays a role in creating transparency, driving insight and innovation, and supporting data-driven decisions and service delivery. To meet its innovation potential, however, open data must be better defined and include larger volumes of historical data. This means refreshing traditional data lifecycle management by designing data to be shared from the outset and accepting that the archiving and destruction of data are no longer on the cards: open data never retires.

The government’s open data agenda continues to tick, with an increased mandate for data sharing supported by the prime minister’s public data policy statement to make non-sensitive data ‘open by default’ and the step changes proposed in the recent Productivity Commission report on Data Availability and Use. And it is not just the public sector that is focussing on open data. Data sharing by private enterprise equally supports social innovation, transparency, and economic growth. The Competition Policy Review (“Harper Review”) includes a focus on making customer data openly available to the individual whom that data is about, for instance by requiring banks to provide transparency to their customers on the data they hold about them. Further, and more recently, the notion of data commercialisation is gaining interest as both public and private organisations look to generate new revenue streams from their data. This includes seeking new platforms for housing and linking data in a public marketplace, while preserving the privacy and integrity of the data. Open data is here to stay.

Traditional data lifecycle management addresses the ‘back office’ capabilities necessary to enable data to be captured, held, processed and maintained. It was designed for data sets that were used for internal operations and that, at some point, would be considered no longer useful (or too costly to keep) and would therefore be retired, archived or destroyed. But in 2017, data needs to be made broadly available, including on open platforms, and it needs to be left there. This is necessary to give app developers, industry groups, entrepreneurs or the general public the best chance at leveraging open data to create innovative solutions that make an economic or social impact. One example is in agriculture and farming, where insurers have combined generations of crop yield data with weather and satellite data to optimise insurance offerings.

For these benefits to be broadly realised, traditional capabilities and policies must now be complemented by ‘front end’ capabilities for publication and engagement. The good news is that, with the reduced cost of storage and with next-generation data architectures and trading platforms that provide ready access to large volumes of data, it’s increasingly feasible to retain large volumes of data in the public domain.

We’re letting go of the archiving aspect of traditional data lifecycle management, but that’s the easy part. There are a few more old-school ways of data lifecycle management that need revamping for the open data context. The areas to focus on for next-generation data lifecycle management are data design & discoverability, data quality & maintenance, data protection, and feedback loops:

Data design & discoverability

Data is generally poorly described at the point of creation with limited focus on ensuring accurate interpretation and prevention of misuse when it’s opened to new contexts and users.

Data design must include consideration for sharing. Data should be defined in terms that allow the public (as well as internal stakeholders) to easily discover it, correctly interpret its meaning, and readily link it with other data sets. Standards are key in driving consistency and repeatability and in enabling linking of data.

Data that has been created without considering these standards requires effort-intensive transformation to meet accepted standards and improve linkages. A process of use case discovery, prioritisation and remediation must be adopted to enable publication of poorly designed data.
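To make ‘designed to be shared’ a little more concrete, here is a minimal Python sketch of a pre-publication metadata check. The descriptor fields (title, description, licence, update_frequency, columns) and the sample data set are hypothetical choices for illustration only; they are not drawn from any particular metadata standard, and an organisation would substitute the standard it has adopted.

```python
from dataclasses import dataclass, field


# Hypothetical descriptor: the field names are illustrative assumptions,
# not taken from a specific metadata standard.
@dataclass
class DatasetDescriptor:
    title: str
    description: str
    licence: str
    update_frequency: str
    # Maps each column name to a plain-language definition.
    columns: dict[str, str] = field(default_factory=dict)


def missing_metadata(descriptor: DatasetDescriptor, data_columns: list[str]) -> list[str]:
    """List the gaps that would make a data set hard to discover or interpret."""
    gaps = []
    for name in ("title", "description", "licence", "update_frequency"):
        if not getattr(descriptor, name).strip():
            gaps.append(f"missing {name}")
    for column in data_columns:
        if not descriptor.columns.get(column, "").strip():
            gaps.append(f"undefined column: {column}")
    return gaps


# Made-up example: the licence is blank and the 'year' column has no definition.
descriptor = DatasetDescriptor(
    title="Crop yield by region",
    description="Annual crop yield in tonnes per hectare.",
    licence="",
    update_frequency="annual",
    columns={
        "region": "Statistical region name",
        "yield_t_ha": "Yield in tonnes per hectare",
    },
)
gaps = missing_metadata(descriptor, ["region", "year", "yield_t_ha"])
print(gaps)
```

A check like this could sit at the point of data creation, so that gaps are found before publication rather than during later remediation.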

Data quality & maintenance

Low-quality data, when published, leaves the door open to inappropriate interpretation and use, and increases the risk of inadvertent disclosure.

A data maintenance approach is needed that includes establishing a quality position for published data sets that can readily be shared with an external audience. This doesn’t require that all data shared externally be rigorously cleansed; rather, it should be assessed and published with an authoritative quality rating and an explanation of any known compromises, entrusting the end-user to determine whether the quality will suffice for their purpose.
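As an illustration of publishing a quality rating alongside known compromises, the following Python sketch derives a simple quality statement from one measure only: completeness of required fields. The rating thresholds (0.95 and 0.80), the function names and the sample rows are all assumptions made for this example; a real quality position would likely weigh other dimensions such as accuracy and timeliness.

```python
def completeness(rows, required_fields):
    """Share of required values that are present (neither None nor empty string)."""
    total = len(rows) * len(required_fields)
    if total == 0:
        return 0.0
    present = sum(
        1
        for row in rows
        for name in required_fields
        if row.get(name) not in (None, "")
    )
    return present / total


def quality_statement(rows, required_fields, known_issues):
    """Build a publishable statement: a rating plus the known compromises."""
    score = completeness(rows, required_fields)
    # Thresholds are illustrative assumptions, not an established standard.
    if score >= 0.95:
        rating = "high"
    elif score >= 0.80:
        rating = "medium"
    else:
        rating = "low"
    return {
        "rating": rating,
        "completeness": round(score, 2),
        "known_issues": list(known_issues),
    }


# Made-up sample rows: two of the nine required values are missing.
rows = [
    {"region": "North", "year": 2016, "yield_t_ha": 2.1},
    {"region": "South", "year": 2016, "yield_t_ha": None},
    {"region": "", "year": 2016, "yield_t_ha": 1.8},
]
statement = quality_statement(
    rows,
    ["region", "year", "yield_t_ha"],
    ["Some regions did not report yield."],
)
print(statement)
```

The point of the sketch is that the data is published as-is with its rating and caveats attached, leaving the fitness-for-purpose decision to the end-user.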

Data protection

Complex and changing legislative constraints, an absence of data sharing frameworks, and old, ingrained work habits contribute to a culture of sharing data on a ‘need to know’ basis. Even when anonymised, sensitive data sets can be troublesome to share, as shown by the recent Medicare MBS debacle, where anonymised data was readily re-identified, albeit without malicious intent.

It goes without saying that the right internal and external access controls are required for all sources of data. Data processing and anonymisation techniques that minimise the risk of re-identification must be implemented. This also requires appropriate governance structures to enable decisions on data release and publication, and to quickly resolve data protection threats.

Risk of exposure needs to be carefully understood and mitigated, both at the time of publication and on an ongoing basis. This requires a continual effort to understand the data, periodic assessment of re-identification risk, and an up-to-date understanding of the current legislative focus on data sharing and release.
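One way a periodic re-identification assessment might start is with a k-anonymity style check, sketched below in Python. This technique is our choice for illustration and is not prescribed above: it counts how many records share each combination of ‘quasi-identifiers’ (attributes that could be linked to outside data) and flags combinations shared by fewer than k records. The column names, sample records and the threshold k=2 are assumptions, and passing such a check does not by itself show that a data set is safe to release.

```python
from collections import Counter


def group_sizes(rows, quasi_identifiers):
    """Count the records that share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def smallest_group_size(rows, quasi_identifiers):
    """The 'k' of the data set: size of its rarest quasi-identifier combination."""
    sizes = group_sizes(rows, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def risky_groups(rows, quasi_identifiers, k):
    """Combinations shared by fewer than k records, which may be easier to re-identify."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [combo for combo, count in sizes.items() if count < k]


# Made-up records; the column names are illustrative assumptions.
records = [
    {"postcode": "2000", "birth_year": 1980, "sex": "F"},
    {"postcode": "2000", "birth_year": 1980, "sex": "F"},
    {"postcode": "2000", "birth_year": 1975, "sex": "M"},
    {"postcode": "3000", "birth_year": 1990, "sex": "F"},
    {"postcode": "3000", "birth_year": 1990, "sex": "F"},
]
quasi_identifiers = ["postcode", "birth_year", "sex"]

k_value = smallest_group_size(records, quasi_identifiers)
flagged = risky_groups(records, quasi_identifiers, k=2)
print(k_value, flagged)
```

Because new external data sets can make previously safe combinations linkable, a check of this kind would need to be re-run over time rather than only at publication.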

Feedback loops

Without a user-friendly feedback loop, there’s a missed opportunity to build on and enhance published data. Open data publication is still new enough that there are few mature processes for tapping into end-users as a source of feedback and innovation.

Feedback loops must be established to understand how open data is being used, to ‘crowd-source’ improvements to the quality of published data over time, and to ensure that future investments in public data focus on high-value data sets.
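A feedback loop can start very simply. The Python sketch below assumes a hypothetical feedback record with a data set name and a kind (‘use_case’ or ‘quality_issue’), tallies both per data set, and ranks data sets by reported use as one rough signal for where to invest. The record shape, the kinds and the sample entries are illustrative assumptions only.

```python
from collections import defaultdict


def summarise_feedback(feedback):
    """Tally reported use cases and quality issues for each data set."""
    summary = defaultdict(lambda: {"use_cases": 0, "quality_issues": 0})
    for item in feedback:
        if item["kind"] == "use_case":
            summary[item["dataset"]]["use_cases"] += 1
        elif item["kind"] == "quality_issue":
            summary[item["dataset"]]["quality_issues"] += 1
    return dict(summary)


def rank_by_reported_use(summary):
    """Data set names ordered by reported use cases (most first), ties broken by name."""
    return sorted(summary, key=lambda name: (-summary[name]["use_cases"], name))


# Made-up feedback entries for illustration.
feedback = [
    {"dataset": "crop-yield", "kind": "use_case"},
    {"dataset": "crop-yield", "kind": "use_case"},
    {"dataset": "crop-yield", "kind": "quality_issue"},
    {"dataset": "rainfall", "kind": "use_case"},
    {"dataset": "rainfall", "kind": "quality_issue"},
]
summary = summarise_feedback(feedback)
ranking = rank_by_reported_use(summary)
print(summary)
print(ranking)
```

Reported use is only a proxy for value, so a ranking like this would inform, rather than replace, the investment decisions of the data decision makers.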

With these challenges in mind, the open data agenda demands that data be made permanently available and managed well beyond its original purpose. To deliver on open data’s potential, a new lens on data lifecycle management is needed.

For organisations to move forward with their open data agenda, they must:
  1. Establish who the decision makers are.
    We’ve provided a number of requirements for next-generation data lifecycle management. To act on any of this, initial and ongoing investment decisions will need to be made by the right people in the organisation. In the public sector, new data governance arrangements have been proposed in the Productivity Commission report on Data Availability and Use, including National Data Custodian and Accredited Release Authority roles. Internally, however, organisations will require decision-making structures to establish data standards and provide approval for data release.
  2. Empower data decision makers.
    Not only do the decision makers need to be identified, they must have the authority and resources to drive the open data agenda in their organisation. Statements of intention won’t lead to an active program of publication unless there is also support for action. This means ensuring that the effort taken to publish and maintain public data is recognised as part of people’s roles and isn’t simply added on as another administrative task.
  3. Focus on the most valuable data.
    Most organisations are swimming in data and it is hard to know where to start. For organisations seeking to publish their data to open platforms, a structured approach is needed to ensure a focus on the most valuable data first. We’ve described an approach to prioritise data investments in one of our previous blog posts, and it’s a crucial step to getting the most value out of open data investments.
  4. Check your skill sets and leverage what you’ve got.
    The core skill sets required for next generation data lifecycle management may already exist across the organisation, although perhaps not in the most effective structures. Organisations with a current data or information management function should test the next generation requirements against current capabilities and tools, and seek to leverage or augment these to deliver the requirements and priorities of open data.