Data Profiling

(English version only)


In the previous weekly posts, Data Governance and Data Quality were introduced as criteria not only for the earlier stages of data processing, but also for later analysis and for the consistency of data sources in the data warehouse. One related term remains, however, and it is the subject of this post: Data Profiling, the statistical assessment of raw or processed data.

As a reminder, whether terms like Data Governance, Data Quality and Data Profiling are applied depends on the needs of each organization.

Some organizations adopt all of them before a project is initiated; others adopt only one or two, or none at all.

Conducting an overall data profiling assessment helps architects design a better solution and reduces later issues and risks by quickly identifying and pinning down potential data problems.

The following list covers the key factors in creating an overall data profiling assessment.

 

  • Distinct count and percent

Analyzing the number of distinct values in each column helps identify possible unique keys in the source data. Identifying natural keys is a fundamental requirement for database and ETL architecture, especially when processing inserts and updates.
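
As a minimal sketch of this check, assuming the source table has already been loaded into a pandas DataFrame (the name `df` is a placeholder, not anything from the original post), the distinct count and percent per column can be gathered like this:

```python
import pandas as pd

def distinct_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Distinct count and distinct percent for every column."""
    total_rows = len(df)
    distinct_counts = df.nunique()
    return pd.DataFrame({
        "distinct_count": distinct_counts,
        "distinct_percent": distinct_counts / total_rows * 100,
    })

# Columns with a distinct percent of 100 are candidate natural keys.
# candidates = distinct_profile(df).query("distinct_percent == 100")
```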

 

  • Zero, blank and NULL percent

Analyzing each column for missing or unknown data helps identify potential data issues.
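
A hedged sketch of this check in pandas, again with `df` as a placeholder for the loaded source table:

```python
import pandas as pd

def missing_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Percent of NULL, blank-string and zero values in each column."""
    return pd.DataFrame({
        "null_percent": df.isna().mean() * 100,
        "blank_percent": df.apply(
            lambda col: (col.astype(str).str.strip() == "").mean() * 100
        ),
        "zero_percent": df.apply(lambda col: (col == 0).mean() * 100),
    })
```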

 

  • Minimum, maximum and average string length

Analyzing the string lengths of the source data is a valuable step in selecting the most appropriate data types and sizes in the target database, and it also matters for performance.
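
A minimal sketch, assuming text columns are stored with pandas `object` dtype; the resulting statistics feed the choice of column sizes (e.g. VARCHAR lengths) in the target schema:

```python
import pandas as pd

def string_length_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Minimum, maximum and average string length for each text column."""
    stats = {}
    for col in df.select_dtypes(include="object").columns:
        lengths = df[col].dropna().astype(str).str.len()
        stats[col] = {
            "min_length": lengths.min(),
            "max_length": lengths.max(),
            "avg_length": lengths.mean(),
        }
    return pd.DataFrame.from_dict(stats, orient="index")
```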

 

  • Numerical and date range analysis

Gathering the minimum and maximum numerical and date values helps database architects choose data types that balance storage and performance requirements.
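
A short sketch of the same idea, assuming numeric and date columns already carry proper numeric or datetime dtypes after loading:

```python
import pandas as pd

def range_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Minimum and maximum values for numeric and datetime columns."""
    cols = df.select_dtypes(include=["number", "datetime"])
    return pd.DataFrame({"min_value": cols.min(), "max_value": cols.max()})

# For example, if max_value of an integer column fits within 32 bits,
# INT may be a better target type than BIGINT for storage and performance.
```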

 

  • Key integrity

Once the natural keys have been identified, check their overall integrity by applying the “Zero, blank and NULL percent” analysis to the data set. Checking keys against related data sets is also extremely important to reduce downstream issues, especially checking for orphan keys.
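
As an illustration of the orphan-key check, here is a minimal sketch; the `orders`/`customers` DataFrames and the `customer_id` key are hypothetical names, not from the original post:

```python
import pandas as pd

def orphan_keys(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> pd.DataFrame:
    """Rows in the child data set whose key value has no match in the parent."""
    return child[~child[key].isin(parent[key])]

# Hypothetical usage: orders referencing a customer that does not exist.
# orphans = orphan_keys(orders, customers, key="customer_id")
# print(f"{len(orphans)} orphan rows out of {len(orders)}")
```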

 

  • Cardinality

“Cardinality” refers to the relationship (one-to-one, one-to-many, many-to-many, etc.) between data sets. It is vital for database (DB) modeling and for Business Intelligence (BI) tool set-up.
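
A minimal sketch of how the cardinality between two key columns could be estimated from a combined (joined) data set; the column names passed in are placeholders:

```python
import pandas as pd

def cardinality(df: pd.DataFrame, left: str, right: str) -> str:
    """Classify the relationship between two key columns."""
    left_to_right = df.groupby(left)[right].nunique().max()
    right_to_left = df.groupby(right)[left].nunique().max()
    if left_to_right > 1 and right_to_left > 1:
        return "many-to-many"
    if left_to_right > 1:
        return "one-to-many"
    if right_to_left > 1:
        return "many-to-one"
    return "one-to-one"

# Hypothetical usage on a joined data set:
# print(cardinality(joined, "customer_id", "order_id"))
```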

 

  • Pattern, frequency distributions, and domain analysis

Examining patterns means checking whether data fields are well-formed, in other words, formatted correctly as defined in the specification. Frequency distributions and domain analysis then show how often each value occurs and whether it falls within the expected domain.
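
As a minimal sketch of a pattern check, assuming a hypothetical spec rule that a product code must look like "AB-1234" (two uppercase letters, a dash, four digits); the column name and the rule are illustrative only:

```python
import pandas as pd

def pattern_profile(series: pd.Series, regex: str) -> pd.Series:
    """Percent of well-formed vs. malformed values against a format rule."""
    well_formed = series.dropna().astype(str).str.match(regex)
    return well_formed.value_counts(normalize=True) * 100

# Hypothetical usage against a spec-defined product-code format:
# print(pattern_profile(df["product_code"], r"^[A-Z]{2}-\d{4}$"))
```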

 

As mentioned at the beginning, the techniques behind “Data Governance,” “Data Quality,” and “Data Profiling” are not always adopted by organizations at project initiation. The reasons are the time required to identify the fields in the source data, the uncertainty about how best to use them at the beginning, and, finally, the low volume of data.

 

However, if the three practices mentioned above are well scoped in a data-related project, they will tremendously reduce debugging time, speed up the turnaround of results from the data source, and help the organization make better marketing decisions.
