Data Discovery

The Data Discovery process copies data from a desktop or cloud endpoint into a location on the platform created specifically for the data product being built.

For CSV files, a schema is extracted from a 2 MB subset of the data. Because the schema is based on only a subset, some inferred data types may be wrong, so the user is given the option to correct them.
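
To illustrate why a sample-based schema can be wrong, here is a minimal sketch of type inference over a leading sample of a CSV file. It is not the platform's implementation: the function names (infer_type, infer_schema), the type labels (integer, double, string), and the inference rules are all assumptions made for the example. Only the 2 MB sample size comes from the description above.

```python
import csv
import io

SAMPLE_BYTES = 2 * 1024 * 1024  # 2 MB sample, matching the size described above


def infer_type(values):
    """Return the narrowest hypothetical type label that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for name, cast in (("integer", int), ("double", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(raw, sample_bytes=SAMPLE_BYTES):
    """Infer {column name: type label} from the first sample_bytes of CSV data."""
    sample = raw[:sample_bytes]
    if len(raw) > sample_bytes:
        # The cut may fall mid-row, so drop the trailing partial line.
        cut = sample.rfind(b"\n")
        if cut != -1:
            sample = sample[:cut + 1]
    text = sample.decode("utf-8", errors="replace")
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return {}
    header, body = rows[0], rows[1:]
    schema = {}
    for i, name in enumerate(header):
        # Skip rows that are too short to contain this column.
        column = [row[i] for row in body if i < len(row)]
        schema[name] = infer_type(column)
    return schema


# The non-numeric id "A17" appears only after the first two data rows.
raw = b"id,amount\n1,2.5\n2,3\nA17,4\n"
full_schema = infer_schema(raw)                     # sees every row
sampled_schema = infer_schema(raw, sample_bytes=20)  # sees only the first two rows
print(full_schema)
print(sampled_schema)
```

In this example the full data yields a string type for the id column, while the truncated sample yields integer, because the non-numeric value lies outside the sample. This is the kind of mismatch the user would correct after discovery.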

For other tabular file formats (ORC, Avro, Parquet), there is no need to correct the schema, because these formats store their schema alongside the data.

For non-tabular file formats, this process is not required.