The Evolution of Data Products
"The purpose of (scientific) computing is insight, not numbers." - Richard Hamming
A data product is a software application or tool that leverages data to deliver actionable insights, predictions, or automation, typically through the use of analytics, machine learning, or statistical models. These products integrate data collection, processing, and analysis into a usable form for end users, often in real time or near real time, providing value by enabling decision-making, optimizing processes, or enhancing user experiences.
In practice, data products can range from dashboards and recommendation systems to AI-powered applications like fraud detection systems, personalized learning platforms, or even predictive maintenance tools. They are often characterized by the following key components:
1. Data Ingestion: Collecting data from various sources (e.g., databases, APIs, sensors).
2. Data Processing: Cleaning, transforming, and organizing data into a usable format.
3. Analytics or Models: Applying algorithms, statistical methods, or machine learning models to extract insights or make predictions.
4. User Interface: Presenting the insights or results in a user-friendly format, often via dashboards, APIs, or applications.
In essence, data products convert raw data into useful applications that provide value to businesses, consumers, or organizations.
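The four components listed above can be sketched as a tiny end-to-end pipeline. This is only a minimal illustration under stated assumptions: the sensor data, the function names (`ingest`, `process`, `analyze`, `present`), and the 90 C alert threshold are all hypothetical, chosen to resemble a toy predictive maintenance tool rather than any real system.

```python
import csv
import io
import statistics

# Hypothetical sensor export; a real product would read from a database, API, or device feed.
RAW_CSV = """machine_id,temperature_c
m1,70.5
m1,71.0
m2,
m2,98.2
m1,not_a_number
m2,97.8
"""


def ingest(text):
    """Data ingestion: parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def process(rows):
    """Data processing: keep only rows whose temperature parses as a number."""
    clean = []
    for row in rows:
        try:
            clean.append((row["machine_id"], float(row["temperature_c"])))
        except (TypeError, ValueError):
            continue  # skip missing or malformed readings
    return clean


def analyze(readings, threshold_c=90.0):
    """Analytics: compute each machine's mean temperature and flag hot machines."""
    by_machine = {}
    for machine_id, temp in readings:
        by_machine.setdefault(machine_id, []).append(temp)
    report = {}
    for machine_id, temps in by_machine.items():
        mean_c = statistics.mean(temps)
        report[machine_id] = {"mean_c": mean_c, "alert": mean_c > threshold_c}
    return report


def present(report):
    """User interface: render the report as plain text (a stand-in for a dashboard)."""
    lines = []
    for machine_id in sorted(report):
        entry = report[machine_id]
        status = "CHECK SOON" if entry["alert"] else "ok"
        lines.append(f"{machine_id}: mean {entry['mean_c']:.1f} C -> {status}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(present(analyze(process(ingest(RAW_CSV)))))
```

Keeping each stage as a separate function mirrors the list above: any one stage (say, swapping the threshold rule for a trained model) can change without touching the others.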
The academic discipline of computer science emerged in the 1960s, with a focus on programming languages, compilers, operating systems, and the mathematical theories supporting these areas. Theoretical computer science courses covered finite automata, regular expressions, context-free languages, and computability. In the 1970s, algorithm studies became a significant component of the field, emphasizing the practical applications of computers.
Individuals with a computer science (CS) background are trained in algorithmic thinking, where they learn to solve problems using computation. Algorithms are the fundamental unit of programming and computer science. Avi Wigderson states, "Algorithms are a common language for nature, humans, and computers." The field of CS is relatively new, and increasingly, we are starting to realize that algorithmic thinking is a universal framework that can be applied to the hard sciences as well. Robert Sedgewick, who teaches algorithms at Princeton, says that computational models are replacing mathematical models in scientific inquiry and discovery.
This means algorithms give us a common language for understanding nature: we can simulate a phenomenon in order to understand it better. The commercial implications are immense. Software has gone from being merely an IT expense to something that is eating the world and reorganizing it.
Today, however, there is a fundamental shift towards a wide range of applications. Numerous factors contribute to this change, including the convergence of computing and communication technologies, the increased capacity to observe, collect, and store data in various fields, and the rise of the internet and social networks as integral parts of daily life. These developments present both opportunities and challenges for theoretical computer science. While traditional areas remain crucial, future researchers will increasingly focus on using computers to extract valuable information from the massive datasets that applications generate, rather than solely on solving well-defined problems. Consequently, there is a need to build a toolbox (a repertoire of techniques) covering the theory expected to be relevant over the next 40 years, just as an understanding of automata theory, algorithms, and related topics provided an advantage over the past 40 years. One notable shift is the increasing emphasis on probability, statistics, and numerical methods, which are needed for algorithms that extract value from large amounts of data.
Data products:
"A data product is something that can only exist because of the analysis of some underlying data set. Maybe you can argue with that, or maybe you want to, but I think the point I’m trying to make here is that we’re able now in 2019 to build an entirely new class of application. And we’ve been able to do this for a while. It was just much more expensive. It was out of reach."
- Hilary Mason
Few people know that one of the reasons computers exist today is that tabulating the census of 1880 took eight years of effort, much of it tedious, monotonous work.
"John Shaw Billings, a physician assigned to assist the Census Office with compiling health statistics, had closely observed the immense tabulation efforts required to deal with the raw data of 1880. He expressed his concerns to a young mechanical engineer assisting with the census, Herman Hollerith, a recent graduate of the Columbia School of Mines. On September 23, 1884, the U.S. Patent Office recorded a submission from the 24-year-old Hollerith, titled “Art of Compiling Statistics.” By progressively improving the ideas of this initial submission, Hollerith would decisively win an 1889 competition to improve the processing of the 1890 census. The technological solutions devised by Hollerith involved a suite of mechanical and electrical devices. The first crucial innovation was to translate data on handwritten census tally sheets to patterns of holes punched in cards. As Hollerith phrased it, in the 1889 revision of his patent application. This process required developing special machinery to ensure that holes could be punched with accuracy and efficiency." - Computers have an unlikely origin story: the 1890 census
Large companies have typically been storing data for a long time, since data is often generated as a side effect of running a business. Identifying data-related technical problems and solving them should preferably be done in-house, because developing such core capabilities leads to a competitive advantage in the long run. In contrast, purchasing a product or solution (a common route in enterprise settings) often fails to deliver. An interesting case study is the well-known content curation app Toutiao:
"Simply put, the more users use your product, the more data they contribute. The more data they contribute, the smarter your product becomes. The smarter your product is (e.g., better personalization, recommendations), the better it serves your users and they are more likely to come back often and contribute more data, thus creating a virtuous cycle. By building an addictive product, Toutiao generates engagement data from their users. That data is fed into Toutiao’s algorithms, which in turn further refines the products’ quality. Ultimately, the company plans to use this virtuous cycle to optimize every stage of what they call the “content lifecycle”: Creation, Curation, Recommendation and Interaction." - The Hidden Forces Behind Toutiao: China’s Content King
A decade ago, the conversation primarily revolved around the rise of big data and organizations developing ways to leverage it for extracting business value and making effective decisions. This entailed building centralized data warehouse solutions and constructing business intelligence solutions on top of them. Under the hood, many of these solutions performed functions that involved computing descriptive statistics to answer questions and better understand the reality of business operations. For early adopters, this marked a fundamental shift in how businesses were conducted. Instead of relying on management lessons from business school or gut feelings, businesses began to make decisions by quantifying and monitoring various aspects of their operations. This led to increased efficiency for many businesses. The next natural step for organizations was to develop more sophisticated modeling and prediction capabilities. As we began to think beyond dashboards and visualizations, the field of data science was born.
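To make the "descriptive statistics" behind such business intelligence reports concrete, here is a minimal sketch using Python's standard `statistics` module. The `describe` helper and the daily order counts are hypothetical, invented purely for illustration.

```python
import statistics


def describe(values):
    """Summarize a numeric series the way a basic BI report might."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points that split the data into quartiles
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "q1": q1,
        "median": statistics.median(ordered),
        "q3": q3,
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
    }


if __name__ == "__main__":
    # Hypothetical daily order counts for one week, including one promotion-day spike.
    daily_orders = [120, 135, 128, 310, 142, 131, 125]
    for name, value in describe(daily_orders).items():
        print(f"{name:>6}: {value:.1f}")
```

Even this small summary answers operational questions: in the made-up data, the mean sits well above the median, which hints that a single unusual day is skewing the average.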
In many user-facing products, a natural feedback loop now exists in the form of actions and outcomes. It is now common to observe user behavior by collecting relevant data, make decisions to steer product direction, and then observe the outcomes. Implemented correctly, this iterative process enables companies to make course corrections in a timely fashion. Moreover, machine learning techniques, reinforcement learning in particular, are quite effective at taking advantage of such feedback loops (more on that in another post). A simple example is a weather app that goes beyond weather reports by recommending what clothes to wear or pack for a vacation and whether an umbrella is needed, all based on data such as location and available clothing.
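One simple way to exploit an action-and-outcome feedback loop is an epsilon-greedy bandit, a basic technique from reinforcement learning. The sketch below is an illustration chosen for this post, not a description of how any particular product works: the function name, the three recommendation variants, and their click rates are hypothetical, and user behavior is simulated with a seeded random number generator.

```python
import random


def run_feedback_loop(true_click_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate a decide -> observe -> learn loop with an epsilon-greedy policy.

    true_click_rates: hypothetical probability that a user clicks each variant.
    Returns (pulls, estimates): how often each variant was shown and its
    estimated click rate learned from the observed outcomes.
    """
    rng = random.Random(seed)
    n = len(true_click_rates)
    pulls = [0] * n
    estimates = [0.0] * n
    for _ in range(rounds):
        # Decide: explore a random variant with probability epsilon,
        # otherwise exploit the variant with the best estimate so far.
        if rng.random() < epsilon:
            arm = rng.randrange(n)
        else:
            arm = max(range(n), key=lambda i: estimates[i])
        # Observe: a simulated user either clicks (1.0) or does not (0.0).
        reward = 1.0 if rng.random() < true_click_rates[arm] else 0.0
        # Learn: update the running mean of rewards for the chosen variant.
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return pulls, estimates


if __name__ == "__main__":
    pulls, estimates = run_feedback_loop([0.1, 0.3, 0.7])
    for i, (count, estimate) in enumerate(zip(pulls, estimates)):
        print(f"variant {i}: shown {count} times, estimated click rate {estimate:.2f}")
```

The small exploration rate is the design choice that matters here: without it, the loop could lock onto an early lucky variant and never collect the data needed to correct course.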
People with expertise in this space go by different titles. Because the field is new and its tools, technologies, and practices change rapidly, there is not yet a commonly agreed-upon title, skill set, or job description. Large teams often have dedicated individuals who specialize in one aspect of building such products (e.g., data engineers are skilled at constructing data pipelines and MLOps).
In short, successful data products combine four ingredients: data, context, wisdom, and a great team. By bringing these together, businesses can harness data to drive growth and innovation.
In future posts and primers, I will explore enduring concepts and ideas that have withstood the test of time. I will investigate intriguing and relatively straightforward notions that form the foundations of data algorithms for creating intelligent data products. These concepts are, to some extent, non-obvious, even to well-trained computer scientists, as they are not typically found in traditional computer science textbooks. Together we will explore mathematically rigorous definitions for some of the central problems, ideas, and algorithms for building intelligent systems and data products.
Where does the AI buzz come into all of this? Well, that's a long topic of discussion. But let me close with a remark: computers have always been bicycles for the mind, and technology can have a net positive impact on society and create abundance.