The purpose of this process is to confirm that all records in a tabular data set conform to the schema identified during Data Discovery.
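
As a rough illustration of what a per-record conformance check can look like, the sketch below validates each record against a schema expressed as a mapping of column names to expected types. This is not the platform's implementation: the schema layout, the column names, and the helpers `validate_record` and `validate_dataset` are all hypothetical and exist only for this example.

```python
from datetime import date

# Hypothetical schema: column name -> expected Python type.
# The real schema comes from Data Discovery; this layout is an assumption.
SCHEMA = {"id": int, "name": str, "signup_date": date}


def validate_record(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    errors = []
    # Every schema column must be present and hold a value of the expected type.
    for column, expected_type in schema.items():
        if column not in record:
            errors.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            actual = type(record[column]).__name__
            errors.append(f"{column}: expected {expected_type.__name__}, got {actual}")
    # Columns that the schema does not know about are also reported.
    for column in record:
        if column not in schema:
            errors.append(f"unexpected column: {column}")
    return errors


def validate_dataset(records, schema):
    """Map row index -> list of errors for every non-conforming record."""
    failures = {}
    for index, record in enumerate(records):
        errors = validate_record(record, schema)
        if errors:
            failures[index] = errors
    return failures
```

Calling `validate_dataset(records, SCHEMA)` returns an empty dictionary when every record conforms; otherwise each key is the index of a failing row, which makes it straightforward to report exactly which records need attention.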

During the process, the platform provisions cloud compute resources to check each record against the schema. Provisioning can take a few minutes; this is a fixed overhead and is not proportional to the size of the data being processed.

All data sets are converted to ORC (Optimized Row Columnar) for downstream use, as this is the most performant format for querying the data.

A random sample of 25 rows is taken from the initial data load to provide the sample data for any previews. This sample is not refreshed when the product is later updated.
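
The sampling behaviour described above can be sketched as follows. This is a minimal illustration, not the platform's code: the function name `take_preview_sample`, the `seed` parameter, and the choice to return all rows when fewer than 25 exist are assumptions made for the example.

```python
import random

SAMPLE_SIZE = 25  # number of preview rows, as stated in the text


def take_preview_sample(rows, sample_size=SAMPLE_SIZE, seed=None):
    """Pick a random sample of rows, without replacement, for previews.

    Assumption: if the load has fewer rows than sample_size,
    all rows are returned unchanged.
    """
    rows = list(rows)
    if len(rows) <= sample_size:
        return rows
    rng = random.Random(seed)  # a seed is only useful for reproducible tests
    return rng.sample(rows, sample_size)


# The sample is taken once, from the initial load only.
initial_load = [{"row": i} for i in range(1000)]
preview_rows = take_preview_sample(initial_load, seed=0)

# A later update to the product does not touch preview_rows,
# so previews continue to show rows from the initial load.
```

Because the sample is computed once and stored, previews may show rows that no longer reflect the current contents of the product after an update.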