Data profiling supports an effective data quality program and is an important first step for many information technology initiatives. Data profiling should be conducted periodically for critical data stores and, on an event-driven basis, as an important activity in evaluating data quality for a specific purpose. For example, organizations may profile data prior to finalizing the design of a data warehouse, populating a metadata repository, performing data conversion or migration, planning for data store consolidation, or whenever a data store is modified.

Data profiling is a discovery task, revealing what is stored in databases and how physical values may differ from expected allowed values listed in metadata repositories or data store documentation. Profiling typically examines values, ranges, frequency distributions, divergent metadata, nonstandard record formats, etc. It also may include testing the accuracy of business rules and analysis of known issues.
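The kind of discovery described above can be illustrated with a minimal sketch. It assumes a single column held as a Python list and a documented set of allowed values; the function name `profile_column`, the sample data, and the `documented_domain` set are hypothetical and chosen only for illustration, not taken from any particular profiling tool.

```python
from collections import Counter


def profile_column(values, allowed=None):
    """Summarize one column: counts, range, frequencies, unexpected values."""
    # Treat None as a missing value and exclude it from value statistics.
    present = [v for v in values if v is not None]
    freq = Counter(present)
    profile = {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(freq),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "frequencies": dict(freq),
    }
    if allowed is not None:
        # Values physically stored but absent from the documented domain.
        profile["unexpected"] = sorted(v for v in freq if v not in allowed)
    return profile


# Hypothetical status codes; the documentation lists only "A" and "I".
status_values = ["A", "I", "A", None, "X", "A", "a"]
documented_domain = {"A", "I"}

result = profile_column(status_values, allowed=documented_domain)
print(result)
```

In this sketch the `unexpected` entry surfaces stored values that diverge from the documented domain, which is the sort of finding that would feed a metadata update or a data fix.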

As a result of data profiling, organizations may pursue remediation activities such as fixing data, updating metadata descriptions, enhancing ETL scripts, changing data structures, or adding or refining quality rules and application business rules. What is learned through data profiling often serves as a key input to the development of rules, statistics, metrics, content standards, or business process redesign for improving the quality of the organization’s data assets.

Data profiling differs from Data Quality Assessment in that profiling activities result in a series of conclusions about the data set, whereas the assessment process evaluates how well the data meets specific quality requirements. Data profiling is typically the first step in conducting data quality assessments.
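To make the distinction concrete, the following sketch shows what an assessment step might look like: it measures one observation (the share of populated values that fall within an allowed set) and then compares it against a stated requirement. The function `assess_validity`, the sample values, and the 95% threshold are assumptions made for illustration only.

```python
def assess_validity(values, allowed, required_ratio):
    """Compare the observed share of valid values to a required minimum."""
    # Only populated values are judged; None is treated as missing.
    present = [v for v in values if v is not None]
    if not present:
        return {"valid_ratio": None, "meets_requirement": False}
    valid = sum(1 for v in present if v in allowed)
    ratio = valid / len(present)
    return {"valid_ratio": ratio, "meets_requirement": ratio >= required_ratio}


# Hypothetical requirement: at least 95% of populated codes must be valid.
outcome = assess_validity(["A", "I", "A", None, "X", "A"], {"A", "I"}, 0.95)
print(outcome)
```

Here the measured ratio alone would be a profiling conclusion; the comparison against `required_ratio` is what turns it into an assessment result.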

It is recommended that an organization survey the projects and data stores for which data profiling is currently performed, establish standard criteria for when data profiling should be performed, and implement a standard process for its execution. Prioritization of candidate data stores and data sets (internal or external) for profiling is based on business needs and objectives expressed in the data quality strategy.