Extracts are snapshots of data that are stored and loaded into the platform’s data engine (query engine) server memory as highly optimized and compressed files. Data extracts help improve performance and support large datasets. When you use extracts, you remove live connections to the databases and consequently reduce the load on these databases.
The main advantages for using data extracts for datasets:
Overall dataset performance
Reduced load on databases
High performance with column aggregations
Optimized pre-computed calculations
Extracts in columnar format
Data extracts are stored as Apache Parquet files. Apache Parquet is a columnar storage format optimized for the Write Once Read Many (WORM) approach. In other words, it takes time to create an extract, but it is especially fast to read the data from the extract. Extracts reduce the time to access and aggregate the column values, which makes them essential for analytics and data discovery. To minimize the extract size, you can also trim the total amount of data to be saved by adding all the possible filters, removing unnecessary columns, and choosing the suitable extract structure (single table or multiple tables).
Data extract structure types
When creating a data extract, you can choose the way how to store the extract:
Single table – The data is stored in a file as a single-table structure; all tables are merged and stored in a single Parquet file.
In this case, joins are applied at the time when an extract is created. For example, if we make a dataset by joining three tables like Sales (100 rows), Products (5 rows), Locations (10 rows), we have a single Parquet file containing the result of all three tables joined (100 rows). Thus, when you send a query to get a list of distinct products, the system will parse all rows from all joined tables (100 rows to get 5 distinct products).
Multiple tables – The data is stored as multiple tables. Each data source table is stored in a separate Parquet file, reflecting the database structure.
In this case, joins are applied at the time of running a query. Such extract files may be faster to generate and be smaller in size. They are faster with non-complex queries, but it may take a longer time to perform complex visualization requests. For example, if we create a dataset by joining three tables like Sales (100 rows), Products (5 rows), Locations (10 rows), we have three Parquet files, each containing data from a single table. Thus, when you send a query to get a list of distinct products, the system will parse only the Products file (5 rows to get 5 distinct products).
You can notice the performance differences between extract storage types with large amounts of data. You can experiment with both, and then determine which one gives you the best performance and size benefits.
For information on how to create an extract, see Extract data for a dataset.
Refresh data extracts
After creating an extract, you can refresh it with the latest data from the original data source connections.
You can refresh an extract in two ways:
Full refresh – Replaces the current extract with new data, where all the Parquet files are fully re-created.
Incremental refresh – Adds only those rows that are new since the previous refresh. However, it does not account for updated rows but rather appended rows to a data source. As a result, if you run incremental refresh, you should still occasionally run full refresh. For example, your data source is updated daily with new transactions. Instead of rebuilding the entire extract every day, you can just add the recent transactions added that day. Then once a month, you can rebuild the whole extract to account for new columns.
You can manage the extract the same way as any dataset. Consider the following details:
If you modify an extract dataset, it needs to be fully regenerated.
If you export an extract, the exported file contains only metadata but not the actual Parquet files. The extract files are created after you import the file.
If you duplicate the extract, you duplicate all the underlying Parquet files.
Be cautious when you schedule a full refresh for an extract. Depending on the data size, it may take a while before you can use the extract.
For information on how to manage datasets, see Datasets pane.
Permission for working with extracts
To create and manage datasets extracts, your administrator needs to assign the Dataset Extract Creator role to your user in Access Manager. It includes the dataprep.dataset.extract permission that allows you to manage extracts.