
Matching the AI Data Lens to the Data Sources

Scientific data includes context information (metadata) so that it can be reused in downstream analysis. This metadata comes with a vocabulary that is helpful in understanding how business data – that is, data from the systems and activity of an organization – can be selected to feed an AI application.

In a previous post, I described the [data lens] that characterises the data required to feed an AI application, both during its development and once it is operational. The data lens identifies the different dimensions that the data needs to match – such as time, business unit, subject area and quality – in order to be relevant to the AI project.

Matching the data lens to actual data sources is often tricky because, unlike scientific data, most business data does not include the additional metadata that explains its scope, its collection method, how frequently it was updated, how granular the data values are, and the level of accuracy (uncertainty) associated with the way they were captured. These are, of course, exactly the values needed to match a data source to the data lens.

Most systems only capture the data they need to operate properly. The context information is embedded in the system code and the manner in which it is used; it is not stored as metadata in the database.

For example, consider an application that manages sales information. The code for the application determines which table in its database is used for which purpose. You may be lucky that the database schema gives a hint of the design in its table and column naming – but do not bank on it. Anyone involved in application development will tell you that the schema may be pretty self-explanatory in the early releases, but as new features are added, the way data is managed in the schema becomes more distorted as the team aims to avoid the data migration needed to move a deployed system from release to release. Older systems may have been restricted in the number of characters available to form a name, so you may have esoteric column names such as ADATE, DDATE, IDATE etc. to contend with. The users of the system also develop their own approaches to work around inadequacies in the system's capabilities. For example, they may use the fifth line of a delivery address to add messages to the postman (beware of the dog, put parcel by gate etc.) if the system does not support delivery instructions.

So all of the data you need is probably in the database, but it takes more than looking at the schema to understand what data you have.
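To make this concrete, here is a minimal sketch of the kind of context that lives only in the application code rather than in the schema. The column meanings, field names and helper function are hypothetical, invented purely for illustration:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class SalesOrderRow:
    """One row from a hypothetical SALES_ORDER table with terse column names."""
    ADATE: date   # hypothetical meaning: order acceptance date
    DDATE: date   # hypothetical meaning: despatch date
    IDATE: date   # hypothetical meaning: invoice date
    ADDR5: str    # nominally delivery address line 5, used by convention for delivery notes

def delivery_instructions(row: SalesOrderRow) -> str:
    """Return the 'beware of the dog' style notes that users type into ADDR5.

    Nothing in the schema records this convention; it exists only in the
    application code and in the users' heads, which is exactly the context
    an AI project needs but cannot read from the database.
    """
    return row.ADDR5.strip()
```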

Looking further, our example sales application may have been deployed three times in an organization to cover sales from the Americas, sales from Europe and Africa, and sales from Asia. Each database for these three application instances has the same type of data, structured in the same way, but the geographical scope of the data within them is different. If the data lens calls for global sales data, the data from all three sales applications is needed to match the requirement.
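As a sketch of what matching a global requirement involves, the snippet below combines extracts from the three regional deployments and tags each with its geographical scope. The file names and region labels are hypothetical; the point is that the scope has to be supplied by whoever knows which deployment produced which extract, because it is not recorded in the data itself.

```python
import pandas as pd

# Hypothetical extract files, one per deployment of the sales application.
regional_extracts = {
    "americas": "sales_americas.csv",
    "emea": "sales_emea.csv",   # the Europe and Africa deployment
    "asia": "sales_asia.csv",
}

frames = []
for region, path in regional_extracts.items():
    df = pd.read_csv(path)
    df["sales_region"] = region   # scope metadata supplied by the integrator, not the source
    frames.append(df)

# Global sales data, as called for by a data lens with worldwide scope.
global_sales = pd.concat(frames, ignore_index=True)
```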

The exceptions to these observations are the data sources that have been carefully constructed to combine data from multiple systems for reporting and distribution to other systems, such as:

  • Operational data stores, data warehouses and data marts. Here, more of the scientific mindset has been added, with metadata embedded in the data design. This is because the data needs to be sliced and diced to support different reports and use cases. These stores also aim to provide a historical record of the organization's operations, which could be significant for training AI models. The mechanisms that create these data sources (data pipelines) may also need to be examined to understand the original source, update frequency, and transformation logic that may introduce uncertainty in the resulting values (a simple provenance record of this kind is sketched after this list).
  • Data lakes and lakehouses are built with a different philosophy in mind. They aim to capture data from the operational systems with as little transformation as possible, leaving consuming projects responsible for combining and formatting the data as needed. The captured operational system data begins to include some of the metadata we need for matching to the data lens – for example, the time that the data was extracted. However, most context information will still need to come from the source system. On a more positive note, this data is closer to the operations of the organization, which could be significant for real-time AI context.
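The sketch below shows the sort of provenance record that could be assembled by reading a pipeline definition. The field names and example values are hypothetical, not taken from any particular tool, but they capture the original source, update frequency and transformation logic mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class PipelineProvenance:
    """Context recovered by reading a data pipeline's definition (hypothetical fields)."""
    source_system: str            # where the data originally came from
    extracted_at: datetime        # when this load was pulled from the source
    update_frequency: timedelta   # how often the pipeline refreshes the target
    transformations: list[str]    # steps that may introduce uncertainty in the values

# Example values are invented for illustration only.
daily_sales_load = PipelineProvenance(
    source_system="sales application (EMEA instance)",
    extracted_at=datetime(2024, 6, 1, 2, 0),
    update_frequency=timedelta(days=1),
    transformations=["currency converted to USD", "orders aggregated by day"],
)
```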

The context information that is missing from business data needs to be captured in a way that makes it accessible to the processing that assesses data sources for an AI project. The values in this additional metadata ideally need to match the information in the data lens. If they use the same data types, valid values and data collection rules as the data lens, the matching process becomes trivial and, most importantly, automatable 🙂
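As a minimal sketch of that matching process (all of the names below are my own placeholders, not a published schema), describing both the lens and the candidate sources with the same vocabulary reduces the comparison to a simple lookup:

```python
from dataclasses import dataclass

@dataclass
class LensRequirement:
    """One dimension of the data lens, e.g. subject area, business unit or time period."""
    dimension: str
    required_value: str

@dataclass
class SourceDescription:
    """The equivalent metadata captured about a candidate data source."""
    dimension: str
    actual_value: str

def source_matches(requirement: LensRequirement, source: list[SourceDescription]) -> bool:
    """Matching becomes a trivial, automatable check once both sides share a vocabulary."""
    return any(
        d.dimension == requirement.dimension and d.actual_value == requirement.required_value
        for d in source
    )

# Example: does this source cover European sales?
need = LensRequirement(dimension="business unit", required_value="European sales")
described_source = [
    SourceDescription(dimension="business unit", actual_value="European sales"),
    SourceDescription(dimension="subject area", actual_value="orders"),
]
assert source_matches(need, described_source)
```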

Look out for my next post to discover what this additional metadata includes and how to capture it.
