We often hear buzzwords about data, with terms like predictive analytics or digital value propositions. But how do they fit together? This guide focuses on data fluency: a shared understanding of how data is disclosed, manipulated and consumed—and the implications thereof. This scannable format is designed to help you quickly understand how developers think, what business stakeholders need to know to be effective in digital efforts, and the critical questions to answer at each stage of a data 'supply chain.' This is part of a larger conversation about raising digital literacy in yourself and your organization (you can read more at www.causeit.org/what-is-digital-fluency).

Let's start with how to think about data. In the technology world we normally think of data in terms of its utility—data as a thing that we have, waiting for us to modify it. To innovate, we need to start thinking of digital and data as a capability that expands what's possible, instead of something that merely supports our existing analog work. Data is increasingly core to value creation, and it needs to live where it can evolve and stay up to date without 'file owners'.

Causeit talks a lot about shifting from an existing or default mental model to a new mental model. That doesn't mean the current mental model is bad; leaders just need to be aware of which mental model they're applying. Some parts of our businesses have very specific requirements that keep us in an existing analog model for the moment, while other areas allow for newer thinking. We'll talk a lot about these mindset shifts throughout the article: for example, changing the way we think of documents from 'attachments' to files that live in the cloud, from data at rest to data in motion, and from spreadsheets as the place where calculation occurs to algorithms that evolve over time.

Decomposition: breaking down a complex problem into several simpler problems

Abstraction: a model of a system which leaves out unnecessary parts

Patterns: using reusable components to minimize error and work

Algorithms: a series of unambiguous instructions to process data, make decisions and/or solve problems

Programs: algorithms converted to programming languages; sometimes called applications

Computational thinking is a key set of mental models for working with technology and technologists to solve real-world problems. Now, when you talk to coders, you'll have some genuinely useful buzzwords that make you sound pretty cool and, more importantly, let you discuss tech meaningfully.

When you talk to someone who works in programming and formulas, they’re thinking of problems as computations. But business stakeholders usually think in terms of solutions or packages. So, to collaborate effectively we have to do something here called decomposition: breaking a complex problem down into several simpler problems.

Take something seemingly complex, like drawing a face, and try to break it into step-by-step instructions. First, draw a circle, then draw the lips and the eyes. If there is a variable, like hair, you might have parallel steps. Is the hair spiky or smooth? Based on the choices you make, the steps go a little differently. 
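If it helps to see that in code, here's a minimal sketch in Python (the step descriptions and the draw_face function are purely illustrative) that decomposes 'draw a face' into simple, ordered steps with one branch for the hair variable:

```python
# A hypothetical decomposition of "draw a face" into simpler, ordered steps.
def draw_face(hair_style: str) -> list[str]:
    steps = [
        "draw a circle for the head",
        "draw two eyes",
        "draw the lips",
    ]
    # The 'hair' variable creates parallel paths through the instructions.
    if hair_style == "spiky":
        steps.append("draw short, jagged strokes along the top of the circle")
    else:
        steps.append("draw long, smooth curves along the top of the circle")
    return steps

print(draw_face("spiky"))
```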

As we decompose problems into their constituent steps, we are using abstraction thinking: creating models of systems which leave out unnecessary parts, while allowing us to see how different pieces fit together. In this example, a face has turned into a collection of features. And each feature may be broken down into lines, shapes or strokes of a pen.

When we do the mental work of abstraction and decomposition, and then translate that into something a machine can use, we're more likely to build functional algorithms. Without abstraction, we just have a lot of disparate data that don't really map to each other. The other thing that programmers and developers do is look for patterns that are reusable in multiple contexts, like building blocks. As organizations go digital, they often find many parallel systems that were custom built from scratch, which is quite expensive and complex. The challenge now is to standardize parts of that complexity without constraining anyone, so people can focus on what they each do best.

These building blocks are part of what we call algorithms. Algorithms are a series of unambiguous instructions to process data, make decisions, or solve problems. Sometimes, we can reuse components of one algorithm for another. And then, finally we have programs or applications, which can be thought of as collections of algorithms that work together to process data.
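As a rough, hedged illustration of how these terms relate, the sketch below reuses two small building blocks (patterns) inside an algorithm, and then a tiny 'program' that strings the pieces together. All names are hypothetical:

```python
from typing import Optional

# Reusable building blocks (patterns): each does one small, well-defined job.
def clean(value: str) -> str:
    return value.strip().lower()

def is_valid_email(value: str) -> bool:
    return "@" in value and "." in value

# An algorithm: a series of unambiguous steps that reuses the blocks above.
def normalize_contact(raw_email: str) -> Optional[str]:
    email = clean(raw_email)
    return email if is_valid_email(email) else None

# A "program": algorithms working together to process data.
def process_signups(raw_emails: list) -> list:
    return [e for e in (normalize_contact(r) for r in raw_emails) if e]

print(process_signups(["  Ada@Example.com ", "not-an-email"]))
```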

Data Supply Chain

To understand how data works, we need to understand the data supply chain, so we're going to apply some computational thinking to help us think through how all of these pieces fit together. There are three stages of the data supply chain: 1) disclosure, whether by a human or a sensor or a system; 2) manipulation, which is where we process data and understand what's possible with it or analyze it in some way; and 3) consumption, where data is used by a business stakeholder or fed back to clients as insight about themselves.
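A minimal sketch of those three stages as plain functions (names and values are made up, just to show the hand-offs):

```python
# 1) Disclosure: a sensor, system, or person produces raw data.
def disclose() -> dict:
    return {"speed_kph": 62.0, "timestamp": "2021-04-01T09:30:00Z"}

# 2) Manipulation: process or analyze the raw data into something meaningful.
def manipulate(raw: dict) -> dict:
    return {**raw, "speeding": raw["speed_kph"] > 50}

# 3) Consumption: a stakeholder or client uses the resulting insight.
def consume(insight: dict) -> None:
    print("Driver speeding?", insight["speeding"])

consume(manipulate(disclose()))
```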

• What data do you have now?

• What data will you need in the future?

• Where will you get it from?

 

Example: An autonomous vehicle captures raw data from its on-board sensors.

The first stage is data acquisition—data is collected from sensors, systems, and humans. For the purposes of this article, let’s use the example of a driverless car or autonomous vehicle as the context for data’s journey through the supply chain.

In the acquisition stage, the car captures raw data from its on-board sensors, like cameras or speed sensors. At this point it's just bits and bytes; no processing or analysis has been applied yet.

What’s the mental model here? When we're talking about sources of data, we may think about ‘data entry’, like rows in a spreadsheet. We can also think about data that are more passive, like market performance data, website analytics or social media posts, news stories, Google surveys, even video streams. These are all data sources. 

    • What data could you gather? 

    • Where will you get it from?


Concept: Metadata

Metadata is data about another piece of data. Metadata is used to validate data sets and increase their utility. For example, the data in an MP3 is a recording of music, but the information about the artist and song name is metadata—additional data about the core data.

Other common examples of metadata include the send and receive dates of emails, the unique address of a computer which acts as a server, or info about which app was used to post a particular message to Twitter. This is how, for example, it was found that Donald Trump was using an insecure phone to post things to Twitter—because Twitter shows which app was used to post a tweet. Other examples of metadata are the number of times a post has been viewed on social media or a song has been played on Spotify.
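Here's a simple, illustrative way to picture the split between core data and metadata for a single song file (the field names are examples, not a standard):

```python
# Core data: the recording itself (placeholder bytes standing in for real audio).
song_bytes = b"\xff\xfb\x90D"  # raw MP3 audio would go here

# Metadata: data *about* the core data, which makes it findable and comparable.
song_metadata = {
    "artist": "Example Artist",
    "title": "Example Song",
    "duration_seconds": 215,
    "codec": "MP3",
    "play_count": 1042,  # usage metadata, like a play count on a streaming service
}

print(song_metadata["artist"], "-", song_metadata["title"])
```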

 

Concept: Alternative Data

Sometimes an adjacent dataset can be used to infer something about a ‘traditional’ dataset. For example, parking lot occupancy as seen from satellites can be cross-referenced to expected sales from retail stores using those parking lots. If we just ask the basic question of “What will a retail store’s holiday sales numbers be?” we might not see the actual cause of those sales—like increased numbers of shoppers. We need to decompose the end result (sales) into the steps leading up to it (number of potential customers in stores, which can be inferred by how full parking lots are). Such a mindset lets us get an idea of what sales volume we might expect before we get the quarterly report that happens after the holidays.
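A toy sketch of that inference, with made-up numbers and a simple linear fit in numpy, shows the shape of the idea:

```python
import numpy as np

# Made-up history: average parking-lot occupancy (%) vs. quarterly sales ($M).
occupancy = np.array([55, 62, 70, 78, 85])
sales_musd = np.array([4.1, 4.6, 5.2, 5.9, 6.5])

# Fit a simple linear relationship between the adjacent dataset and the one we care about.
slope, intercept = np.polyfit(occupancy, sales_musd, deg=1)

# Use this season's observed occupancy to estimate sales before the quarterly report lands.
holiday_occupancy = 81
estimated_sales = slope * holiday_occupancy + intercept
print(f"Estimated holiday sales: ${estimated_sales:.1f}M")
```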

Machine learning and spotting patterns become really important here—because it's not cost-effective to build these capabilities from scratch. At this point, we also need to record data's provenance (or source), as well as consent for use, whenever possible. Authenticating these details assures us that the data is correct and that we have sufficient metadata to compare data sets.

For example, you'd want to make sure that the satellite images you might get from 10 different providers all have the right metadata. You need to know exactly what time and date they were captured and that the location is accurate (e.g., not a similar-looking store next door). Otherwise, the formulas you apply later won't be very accurate, or you might find that you're projecting based on out-of-date facts.
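A minimal sketch of that kind of metadata check (the fields are hypothetical) might filter out images whose capture date or location can't be trusted before any formulas are applied:

```python
from datetime import datetime

# Hypothetical records from different satellite-image providers.
images = [
    {"provider": "A", "captured_at": "2019-12-21T14:05:00", "store_id": "store-001"},
    {"provider": "B", "captured_at": "2018-11-02T10:00:00", "store_id": "store-001"},  # out of date
    {"provider": "C", "captured_at": "2019-12-22T09:30:00", "store_id": "store-999"},  # wrong location
]

target_store = "store-001"
cutoff = datetime(2019, 11, 1)  # ignore captures older than this

def is_usable(img: dict) -> bool:
    captured = datetime.fromisoformat(img["captured_at"])
    return img["store_id"] == target_store and captured >= cutoff

usable = [img for img in images if is_usable(img)]
print(f"{len(usable)} of {len(images)} images have trustworthy metadata")
```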

 
• Where will you store your data?

 
 

Example: Raw data is stored in unprocessed form in the vehicle's local memory.

The next stage of the data supply chain is storage—recording data to a trusted location, which is both secure and easily accessible for further manipulation. Storage is often some version of the cloud, or perhaps a specific server. Sometimes it’s a flash drive or local memory on a sensor. But wherever we store data, we need to make sure that we understand how that's going to connect to other systems. In the automotive case, raw data might be stored in an unprocessed form in the vehicle's local memory. Think of it like a hard drive in the car.

A key shift here is in thinking from data at rest to data in motion. Most of us have been trained to think about data like a file in a file cabinet or a row in a spreadsheet. When data is at rest, we think of it as static, staying the same until a user modifies it. Normally when you secure data at rest, you secure the perimeter. This is like locking a file cabinet or password-protecting a spreadsheet. The problem is that data at rest is hard to synchronize across multiple systems. Imagine ten co-workers, all with their own copies of tomorrow’s presentation and sales figures. That’s a lot of distinct versions. As we start to shift our thinking to data in motion, a good metaphor is water—a ‘stream’ like a live video feed or a ‘flow’ of stock market data.

Data in motion is dynamic and it's usually secured multiple ways. You have to secure the ‘pipes’ that the data passes through, just like locking the file cabinet in the prior example. You also need to make sure that you're authenticating the users that access it or even providing live encryption using blockchain or other technologies. This is akin to signing a document or sealing an envelope. 

Data architectures need to be built to synchronize across systems. However, many organizations, even tech-centric ones, didn’t build infrastructures with the enormity of today’s data in mind. Speed and robustness are hard to have ready in advance. 

Data in motion is much harder to wrap our heads around than data at rest. But it’s a more useful mental model, because as we go forward, we don't just want information about things which happened months ago—we want to look at exactly what's happening in the present and get insight into what future opportunities exist.
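One way to feel the difference in code: the hedged sketch below contrasts reading a finished file (data at rest) with a generator that processes readings as they 'flow' in (data in motion). The sensor is simulated:

```python
import random
import time

# Data at rest: a complete file, read once and analyzed after the fact.
def read_monthly_report(path: str) -> list:
    with open(path) as f:
        return f.readlines()

# Data in motion: a stream of readings, processed as each one arrives.
def speed_sensor_stream():
    while True:
        yield random.uniform(0, 120)  # stand-in for a live sensor reading
        time.sleep(0.1)

for i, speed in enumerate(speed_sensor_stream()):
    if speed > 100:
        print(f"Reading {i}: {speed:.0f} km/h - act on it now, not next quarter")
    if i >= 20:  # stop the demo after a couple of seconds
        break
```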


Concept: Data Hygiene

How do you manage, synchronize, and pass data back and forth between different systems where there might be duplicate or similar data?

Data hygiene is paramount at each stage, especially in aggregation. Have you assured that data is accurate? Is it standardized or normalized in some way that permits comparison to other datasets? Have you created data catalogs that let you know what data sets are available inside your organization or from partners, and are there taxonomies for how that data fits together?

For example, what is the exact format of the date and time? Are names stored in the same format in each dataset? One key element of every data aggregation effort is the establishment of unique identifiers. While a simple name may be enough in a small business context, an email ID, phone number or tax identification number might be needed in a larger dataset. The more complex the data, the more specific the taxonomy needs to be. While it’s tempting to avoid tedious taxonomy discussions, it’s important to plan for them as you go, or you might end up with a mountain of un-sortable, un-verifiable data.
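A small sketch of that kind of standardization, normalizing names and date formats and settling on email as the unique identifier (illustrative records only):

```python
from datetime import datetime

raw_records = [
    {"name": "Anna MÜLLER ", "signup": "11/20/2019", "email": "Anna@Example.com"},
    {"name": "anna müller",  "signup": "2019-11-20", "email": "anna@example.com "},
]

def normalize(record: dict) -> dict:
    # Standardize name casing/whitespace, the date format, and the unique identifier.
    name = " ".join(record["name"].split()).title()
    raw_date = record["signup"]
    fmt = "%m/%d/%Y" if "/" in raw_date else "%Y-%m-%d"
    signup = datetime.strptime(raw_date, fmt).date().isoformat()
    email = record["email"].strip().lower()  # the unique identifier
    return {"name": name, "signup": signup, "email": email}

# De-duplicate on the unique identifier once everything is comparable.
cleaned = {r["email"]: r for r in map(normalize, raw_records)}
print(cleaned)
```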

We also need to correct for errors. There is a critical opportunity at this stage for us to attempt to anonymize or de-identify data before it's stored permanently, but that’s much harder than it sounds. Going back to the autonomous car example, some pre-processing would actually happen on the car before uploading to a central server. This reduces the size of the dataset that's transmitted, but also can protect the users’ privacy by de-identifying key elements before data can be stored on the server.
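And a minimal sketch of that pre-processing step, replacing a direct identifier with a salted hash before upload (a simplified stand-in for real de-identification techniques):

```python
import hashlib

SALT = "rotate-and-protect-this-value"  # in practice, managed in a secrets store

def pseudonymize(vin: str) -> str:
    # One-way hash: lets records be joined later without exposing the raw identifier.
    return hashlib.sha256((SALT + vin).encode()).hexdigest()[:16]

trip_record = {
    "vin": "1HGCM82633A004352",  # direct identifier captured on the vehicle
    "avg_speed_kph": 38.2,
    "hard_braking_events": 1,
}

# Pre-process on the car before uploading: drop the VIN, keep only a pseudonym.
upload = {"vehicle_id": pseudonymize(trip_record.pop("vin")), **trip_record}
print(upload)
```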

 

Concept: Data Sovereignty

Where, exactly, is your data stored? What systems does it pass through? Even if those systems aren't technically storing that data, or we're not sure whether they are, we should be mindful of increasing concern about the 'citizenship' of data. Different rights might apply to that data depending on where it was ingested, manipulated, and/or consumed. Sometimes those rights are even applied retroactively by policymakers who didn't anticipate particular usages. This is where examples like GDPR (the EU's General Data Protection Regulation) or California's privacy laws come in. Because those policies require that certain rights be afforded to the generators of the data, and impose restrictions on that data's use, companies who did not store metadata on where data came from, or who did not establish consent or have the ability to reaffirm user consent, had to abandon or destroy data in their systems. Some jurisdictions grant their citizens a "right to be forgotten" or to revoke their data. But what if you're Google and you have search results data about someone spread across thousands or millions of different locations around the world? How do you revoke that data, bring it back, and properly delete or dispose of it? While there are technical solutions, they only work if there is a good data taxonomy and logging. The key question to ask here is, "What are the sovereignties through which our data is passing, and what are their laws?"

 
• How will you combine data from different sources and types?

 
 

Example: When the car ‘syncs’ it uploads raw data to the manufacturer’s server.

When we aggregate data, we combine disparate data sets to create a larger data set that's greater than the sum of its parts. This is the fun part, but also the more complicated part. In the driverless car example, the aggregation stage begins when the car gets to its owner's home and syncs, uploading its raw data to the manufacturer's server. How will the manufacturer combine this data from different sources and types?

As awareness of COVID-19 was rising, many of us were obsessively refreshing dashboards hosted by Johns Hopkins or the New York Times. They gathered information from many different countries' public health systems and other data sources, normalized that data, and transparently de-duplicated reported cases to get as accurate a case count as possible. The COVID-19 pandemic is a critical example of a situation where aggregation has to be done carefully, so that double-counting doesn't produce the wrong result.
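Here's a hedged sketch of that kind of aggregation using pandas and made-up figures, de-duplicating overlapping reports before summing case counts:

```python
import pandas as pd

# Made-up case reports from two sources, with one overlapping record.
reports = pd.DataFrame([
    {"source": "ministry_a", "region": "North", "date": "2020-04-01", "cases": 120},
    {"source": "ministry_a", "region": "South", "date": "2020-04-01", "cases": 80},
    {"source": "agency_b",   "region": "North", "date": "2020-04-01", "cases": 120},  # duplicate report
    {"source": "agency_b",   "region": "East",  "date": "2020-04-01", "cases": 45},
])

# De-duplicate on region + date so the same report isn't counted twice...
deduped = reports.drop_duplicates(subset=["region", "date"])

# ...then aggregate into a single total.
print("Total cases (de-duplicated):", deduped["cases"].sum())
```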


Concept: Data Rivers & Data Lakes

One of the most useful metaphors, when we think about data in motion, is data rivers and data lakes.

First, data rivers or streams, which you can also think about mechanistically as pipelines, are flows of data from many different places. Data lakes (closely related to the more structured data warehouses) are aggregations of data stored so that we can find it more easily and analyze it as a whole.

If you look at, for example, the ‘social graph’ that exists on the backend of Facebook for each user, that's pulling data from location information on your phone, message history, comments, likes—you name it. Photos, articles you shared, and even things you've done in third party apps (that you logged into with Facebook), are all brought together and carefully cross-referenced. Data lakes can get a little creepy.

Facebook can also run future analyses on these data lakes, even if they don't know what questions they're going to ask of the data when they collect or store it. This is because they have very robust metadata, which allows them to ask new questions of old data. A big question comes up from a business perspective here, because it costs a lot of money to have these data lakes or to subscribe to these data rivers. This is one reason why big data companies were, for the most part, funded by venture capital—it takes a long time to gather a critical mass of data and even longer to find (and sell) use cases for it. 

One tactic is to save every last bit of data you have access to, so you can run analysis on it in the future—as the organization's thinking or the analytical technologies catch up. Other organizations, if they're not really focused on becoming a data company, stick to the data they absolutely need, keep costs low, and let other people do that work.

• What questions do we want our data to help us answer?

• What insights could we help our customers, end users and partners get from their data?

• Will you change it or add to the data in any way?

 

Example: The company’s big data algorithms analyze the raw data from all vehicles, and compare it to map and traffic data.

Once data is collected, stored, and aggregated it’s time to analyze it. We examine data—and sometimes transform it—to extract information and discover new insights. In the automotive example, a company's big data algorithms analyze raw data from all vehicles in an area and compare it to map and traffic data to see how their cars are faring in different parts of the city. 

The analyze stage is the point in the journey where data turns into information. This is a crucial shift. When we talk about data, we may sometimes think of it as being the same all the way through the supply chain, but data is largely useless until you apply the right analysis to it. Think about the difference between the raw data generated by your banking transactions vs. fraud alerts that come from your bank analyzing it effectively.
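To make that shift from data to information concrete, here's a toy sketch that flags unusually large transactions with a simple statistical rule (real fraud models are far more sophisticated):

```python
from statistics import mean, stdev

# Raw data: a customer's recent transaction amounts.
amounts = [12.50, 8.99, 45.00, 23.10, 9.75, 31.40, 980.00, 15.25]

avg, spread = mean(amounts), stdev(amounts)

# Information: the subset of transactions worth alerting on.
alerts = [a for a in amounts if a > avg + 2 * spread]
print("Flag for review:", alerts)
```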

• What analysis will you do of the data?

• What is the question you would ask of the data if it were a person?

• Will you change it or add to it in any way?

 

Concept: Algorithm

We've probably all heard the term algorithm, but what does it really mean? An algorithm is a step-by-step method for solving a problem, expressed as a series of decisions—like a flow chart. Advanced algorithms can update their 'flow charts' because they contain a mathematical model of key trends or systems that adapts over time, in contrast to algorithms that are static. Think "How long will it take to drive the kids to school based on current traffic?" versus "What do I do when I encounter a stop sign?"

Another use case for self-updating algorithms is image recognition—algorithms that ‘learn’ the more that we bring in new data. This is something that goes beyond the scope of this guidebook, but for further reading, look up terms like ‘supervised learning’ versus ‘unsupervised learning.’
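A tiny sketch of the difference: a static rule sits next to an estimate that nudges itself toward each new observation (a crude stand-in for real machine learning):

```python
# Static algorithm: the 'flow chart' never changes.
def at_stop_sign(cross_traffic: bool) -> str:
    return "wait" if cross_traffic else "proceed"

# Self-updating estimate: the model adjusts as new data arrives.
class TravelTimeModel:
    def __init__(self, initial_minutes: float):
        self.estimate = initial_minutes

    def update(self, observed_minutes: float, learning_rate: float = 0.2) -> None:
        # Move the estimate a little toward each new observation.
        self.estimate += learning_rate * (observed_minutes - self.estimate)

model = TravelTimeModel(initial_minutes=15)
for observed in [18, 22, 17, 25]:  # school-run times observed on recent mornings
    model.update(observed)
print(f"Current prediction: {model.estimate:.1f} minutes")
```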

Make sure that you are asking questions about the models encapsulated within your algorithms as you're working with technical stakeholders so that they can partner with you to select the right strategy and tools for machine learning.

 

Concept: Predictive Analytics

Predictive analytics uses data, statistical algorithms and machine learning to calculate the likelihood of future outcomes based on past data. The goal is to go beyond knowing what has happened to providing a best assessment of what will happen in the future.

For example, when you get a notification predicting a flight delay while traveling, a system is comparing data about the current and past flights and referencing other algorithms processing weather data. Machines are analyzing and learning constantly.
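As a hedged illustration of the idea, the sketch below uses scikit-learn and made-up historical flights to estimate the probability of a delay from two features:

```python
from sklearn.linear_model import LogisticRegression

# Made-up history: [departure_hour, bad_weather (0/1)] -> delayed (0/1)
X = [[7, 0], [9, 0], [17, 1], [18, 1], [12, 0], [20, 1], [8, 0], [19, 0]]
y = [0, 0, 1, 1, 0, 1, 0, 1]

model = LogisticRegression().fit(X, y)

# Likelihood that tonight's 19:00 departure in bad weather will be delayed.
prob_delay = model.predict_proba([[19, 1]])[0][1]
print(f"Estimated delay probability: {prob_delay:.0%}")
```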

Inside established organizations, we often run into a cultural issue which we call 'not invented here'. Organizations insist on developing their own algorithms because they can't figure out the technical or legal challenges of working with third parties to access their data and algorithms. When developing inside an established organization, discuss the use of off-the-shelf algorithms and datasets from third parties, and what infrastructure and agreements would be needed for that. 

Algorithms are easy to think of as math formulas. Some developers may subconsciously assume that their worldview is complete and logical, so they don't have to pay too much attention to bias. But while algorithms are formulas, when they are mathematical abstractions of human values (like trust, for example), the people coding them incorporate their own biases. Those biases can be really, really hard to remove, especially once a complex set of systems is established. Therefore, when developing predictive analytics, make sure that you are managing bias in the process.

This bias issue surfaced when Goldman Sachs, which runs the Apple Card's credit infrastructure, developed an algorithm that screens applicants for creditworthiness. The algorithm somehow came to the erroneous conclusion that women were less creditworthy than men. In one notable case, a couple with completely joint finances received very different offers: the male member of the couple received a credit limit 20 times higher than his wife's, despite her superior credit score. It was a scandal, and with the interdependence of the two firms involved, it was perhaps harder to identify the responsible party. When an algorithm is complex and not visible to others, it's hard to do the forensic analysis to determine the source(s) of the unintended outcome: was it the accuracy of the data fed into the algorithm, or the algorithm itself? With feedback loops between the data and 'self-teaching' algorithms, the causation of such unintended outcomes is a black box for end users—and a deserved black eye on the faces of major brands. Causeit wrote an extensive guide to the topic of Data Ethics in conjunction with leading researchers, including Accenture. You can dive deeper here.

To avoid unintended consequences, you need to integrate active feedback loops from customers. In the ‘consumer’ setting, you might see this in the form of a recommendation from Amazon for a product coupled with a question asking you, “Did we recommend the right product?” Facebook might ask you if the face it identified in the photo you uploaded is accurate. A key part of raising data fluency is to help humans—no matter their level of technical inclination—better speak to each other and their machine counterparts.

 

Concept: Natural Language Processing & Sentiment Analysis

Natural language processing is another form of analysis. It's a set of algorithms designed to capture verbal communication, make mathematical representations of it, and then analyze the sentiment within. Causeit uses video conferencing software called Uberconference with our team, which provides a great example of this type of data analysis.


Uberconference offers a 'value-added' feature where machines transcribe everything we say during a call and then highlight action items, questions, and key moments of discussion to make it easier to review the transcript. This is a huge boost to our productivity, and it surfaces important information we may even have missed. You can think of natural language processing as applying anywhere verbal communication (written or spoken) happens. You might analyze social media posts for word choices to see if people are frustrated with a company, or you might analyze news stories to see if there are disturbing trends showing up about a company that you're invested in.
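As a deliberately simplified sketch (real natural language processing uses trained language models, not word lists), here's the gist of scoring sentiment in social posts:

```python
NEGATIVE = {"frustrated", "broken", "refund", "worst", "angry"}
POSITIVE = {"love", "great", "fast", "helpful", "recommend"}

def sentiment_score(post: str) -> int:
    # Positive words add to the score; negative words subtract from it.
    words = set(post.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

posts = [
    "Love the new app and support was helpful",
    "Still broken after the update and I want a refund",
]
for post in posts:
    print(sentiment_score(post), "->", post)
```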

 
• What will you do with the findings of your analysis?

 
 

Example: Collision-avoidance and navigation route algorithms are updated across all vehicles.

How can you apply the insights gained from data analysis and use them to make better decisions that effect change or otherwise help you deliver a product or service? In the automotive example, collision-avoidance and navigation algorithms might be updated across all vehicles based on the raw data coming in from the various cars out on the road.

What will you do with the findings of your analysis? What switch will you flip? What choice might you alter? What investment might you make—or revoke?

 

Concept: APIs

One of the most important building blocks of data-driven value propositions is the API. API stands for application programming interface: a standardized way to pass data and commands between various systems to facilitate stable, secure functionality. APIs enable internal or third-party developers to work without reinventing the wheel or merging different operations between teams. APIs require robust architecture and clarity about how the data is structured; otherwise, apps that read or alter the data in different ways could have unintended results.
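In code, using an API usually looks like a structured request and a structured (often JSON) response. A minimal sketch with Python's requests library and a hypothetical traffic-data endpoint:

```python
import requests

# Hypothetical endpoint: ask a traffic service for congestion data in a structured way.
response = requests.get(
    "https://api.example.com/v1/traffic",
    params={"city": "Portland", "window": "15min"},
    headers={"Authorization": "Bearer YOUR_API_TOKEN"},  # authentication
    timeout=10,
)
response.raise_for_status()

data = response.json()  # a predictable structure both sides agreed on in advance
print(data.get("congestion_level"))
```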

For example, anytime you use Facebook to log into another website, like Airbnb, that's an authentication API. Facebook confirms to Airbnb that you're you, without you having to set up a separate username and password. There are rules about what data can and cannot be passed back and forth between systems with Facebook—which can either serve to protect or compromise your privacy, depending on the situation. 

Third-party apps don't necessarily get to access everything about your Facebook profile, for example, and might only be able to see your email address. 

APIs also enable integrations related to 'smart' or connected homes. Many connected home users use Amazon Alexa, Google Assistant, or Apple HomeKit (Siri) to provide a unified interface, including voice commands, to control their home and provide multi-step automations like 'turn on the lights whenever the security camera detects motion.' The precursor to Google Nest/Home—called 'Works With Nest'—made this great video explaining how their APIs coordinated Nest devices and third-party devices to make the 'connected home' easier.

If you use smart home gadgets, you may have noticed for example that the proprietary app for your special, colored lights has more features in it than the generic controls accessed through a hub system like Apple or Google. Sometimes, this is because not all functions are easy to turn into an API. Other times, it may be a ‘special sauce’ feature that the manufacturer is reserving for themselves to stay differentiated. Similarly, data which is passed to third party services may be limited. For reasons of privacy, proprietary insight or data sovereignty, not all data should be made available through ‘external’ APIs.

APIs are used inside organizations, too, or between key partners only. This is usually called a private API. APIs can accelerate innovation inside organizations, but many organizations didn't have the foresight, resources or alignment to use API thinking when building their technology stack or infrastructure. It's only later on that they try to develop APIs—which is why it can be very hard to catch up to other organizations which have had a higher level of digital literacy for longer. Amazon is famous for saying you can use any technology you want, as long as it connects to everything—a decision principle which operationalized API and data-centric thinking long before incumbent competitors of theirs were even thinking about e-commerce seriously. This is part of why it’s important to have shared language about the data supply chain—so that you and your colleagues are thinking as early as possible about the ways that your data will connect with everything else.

Another example of APIs' power and usefulness is a do-it-yourself automation tool called Zapier. Zapier started as a relatively basic API integration platform, but its capabilities are approaching enterprise-level robustness. Zapier organizes APIs into 'triggers' and 'actions' which can be connected in 'zaps.' Currently, there are over 2,000 apps in its ecosystem. For example, a user might create a simple 'zap' where every time they send a message through Gmail that meets certain criteria, Zapier is 'triggered' to save that email to a row in an Excel spreadsheet. Zapier is a great way to experiment with your own API-backed functions. Even if you can't access it at work for security reasons, you can play with it at home, making a simple zap between a Google account and a Facebook account to see how you would move data from one system to another automatically—in other words, how you would build a simple algorithm.
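The trigger/action pattern itself is easy to sketch outside of Zapier. Here's a hypothetical 'zap' written as plain Python, where an email matching certain criteria triggers an action that appends a spreadsheet row:

```python
import csv

def trigger_matches(email: dict) -> bool:
    # Trigger: a sent email that meets certain criteria.
    return "invoice" in email["subject"].lower()

def action_append_row(email: dict, path: str = "invoices.csv") -> None:
    # Action: save the email's details as a new spreadsheet row.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([email["to"], email["subject"], email["sent_at"]])

outbox = [
    {"to": "client@example.com", "subject": "Invoice #42", "sent_at": "2020-09-01"},
    {"to": "friend@example.com", "subject": "Lunch next week?", "sent_at": "2020-09-01"},
]
for email in outbox:
    if trigger_matches(email):
        action_append_row(email)
```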

You can also integrate building blocks of more complex algorithms. At our company we used Zapier and Google to automate the process of recording new contacts. If we scan a business card and upload the image to our customer relationship management (CRM) database, Zapier passes that image to Google's image-recognition API. Google analyzes the card image, extracts text and logos as well as specific companies and names, and passes that information back to our CRM. This allows us to quickly make scans searchable without paying someone to do all of that rote work—instead, they can focus on more human-centric tasks. In this way, APIs let you leverage more complex machine learning that someone else has developed, rather than trying to do it all 'in-house.'

 

Example: Plaid

Another more specific example from the finance world is a service called Plaid. Plaid helps developers make financial apps without developing a lot of specific partnerships. You may have already used Plaid without ever realizing it. In fact, if you use an app like Acorns or Venmo that integrates with your bank account, you’ve used Plaid. You are able to log in and link these services to your bank because Plaid has normalized, or abstracted, the APIs of many, many financial institutions.

Plaid provides several narrow financial data services, often referred to generically as microservices. Each function does just one thing, but it does it really well and really quickly. Plaid allows a developer to rapidly access an end user’s income, detect fraud, verify employment, view assets, check creditworthiness, verify identity, and much more. This means that when you have ‘decomposed’ a financial app need like “get a user’s net worth” into many specific questions, you can send those various queries to Plaid and get reliable results back. 
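To illustrate the 'decomposed questions' idea in code (the endpoint and field names below are illustrative, not Plaid's actual API), a developer might send a few narrow queries and recombine the answers:

```python
import requests

BASE = "https://api.plaid.example/v1"  # illustrative base URL, not the real service
HEADERS = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}

def query(endpoint: str, payload: dict) -> dict:
    # Each call answers one narrow, well-defined question (microservice-style).
    resp = requests.post(f"{BASE}/{endpoint}", json=payload, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json()

user = {"user_id": "user-123"}
assets = query("assets/get", user)["total_assets"]
liabilities = query("liabilities/get", user)["total_liabilities"]

# "Get a user's net worth," decomposed into narrow questions and recombined.
print("Estimated net worth:", assets - liabilities)
```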


When Plaid has passed data through to a developer, the developer can then do what they do best—which might have little to do with connecting with financial institutions. For example, developers of a personal budgeting app may not be in the business of verifying credit, but need to know enough to make recommendations about re-financing or create a helpful infographic. Plaid and other API normalization services are taking the heavy lifting out of the app development process. The shift to focus on here is the move from having to build digital value from scratch to assembling off-the-shelf building blocks. This is often the only way an ‘incumbent’ competitor who has not had much digital momentum can catch up with more digital-first ‘disruptors’ coming into their space.

 
• Will you share or sell any of the data to other parties?

• Will you share back to the source?

 
 

Example: The vehicle-maker sells subsets of the data and/or insights to other manufacturers, map services and regulators.

Now we get to the million-dollar question: Can we ethically share or sell data? 

How do you provide access to datasets—or the insights from those data sets, which are not the same thing—to other people, organizations or systems? In the example of the automotive data supply chain, the vehicle maker might offer subsets of their data, or data-derived insights like "these are the parts of the city that have a lot of traffic," to other manufacturers, map services, or even regulators. This creates a feedback loop to influence the world around us. However, sharing and selling data is the stuff of headlines for a reason—if our driving data is telling someone exactly where and when we drive, do we really want that shared with other manufacturers or map services?

There are data ethics and informed consent considerations all the way through the data supply chain. For one, good modeling of a 'data supply journey' is a critical part of operating mindfully and ethically, which in the long run is always better business sense. Secondly, the analysis phase of the data supply chain is a vital one for ethics considerations. Sometimes we can analyze data and get insights without ever passing on raw data. If we can do analysis at the point of data ingestion into the data supply chain, we may never need to pass private information along at all. On-device analysis characterizes Apple's approach: they prefer to do much of their seemingly magical processing (facial identification, voice recognition, etc.) with powerful and secure processors on users' phones, rather than passing raw data (like face biometrics or voice recordings) to the cloud to the extent that competitors like Amazon and Google do.

 
• How will you protect the data?

• Will you dispose of the data when you are finished?

 
 

Example: Raw data about individual vehicles is deleted from the manufacturer's central servers.

Disposal of data is an important consideration at the end of the data supply chain. It’s never the most exciting, and it doesn’t generate any immediate value, so it’s often overlooked. However, regulatory requirements and common decency both demand that we think through how data will be disposed of when it’s no longer useful. While whole disciplines of cybersecurity are devoted to these details, business stakeholders need to consider one key concept: centralization vs. federation. With data at rest—‘files in a file cabinet’—we can imagine shredding files when they’re not needed anymore. If we’re honest about it, most organizations only got rid of paper files when they ran out of space. Digital storage, however, isn’t visible to us, and has such marginal cost that we might not even realize what data we still have laying around gathering proverbial dust.

When we think about ‘data in motion’ and the many distributed and synchronized, or ‘federated,’ copies of data out there, it may be nearly impossible to find and delete all instances of the data a user has disclosed or had generated about them. As mentioned earlier, if we can avoid storing the data in the first place—as Apple does by doing face recognition on a user’s own phone rather than in the cloud—we may not have as much to deal with. Therefore, it’s critical to consider how data deletion and disposal will occur, even if just to prompt us to not unnecessarily store sensitive data in the first place.

Computational Thinking and the Data Supply Chain are complex topics which can generate fractally-complex discussions. Wise leaders will balance learning about the technical elements with the need to think critically about data strategies. 

100-day plan: 

  • Self-assess your teams’ digital fluency 

  • Adopt or align on the definitions of each stage of the data supply chain

  • Determine key stakeholders for each stage—both for ideation and implementation

  • Educate yourself on the business models around data-centric offerings, such as multi-sided platforms

  • Learn about the basics of data ethics and create data principles in consultation with user advocates, legal and compliance experts, business stakeholders and lay users 

  • Run ideation and design sessions to craft ‘data stories’ which go through each stage of the data supply chain

  • Create a digital fluency plan for your part of the organization

  • Identify digital ‘advocates’ and ‘champions’ throughout the organization

365-day plan: 

  • Publish a data catalog which details available datasets and APIs

  • Commit to data principles throughout the organization

  • Create an ‘options management’ system that helps cross-reference value propositions, datasets, and APIs into business models

  • Matchmake digital advocates, and gently coordinate their efforts with an acting ‘entrepreneurial czar’ who has a high degree of digital fluency and strong connections with established leaders, innovators, and strategic partners

  • Implement basic ‘digital factory’ functions in your strategy and budget