In the following query results, we can assess the following characteristics for a given table with a defined DISTKEY: Now that the table exists, it’s trivial to review the distribution. Identify the table SELECT 'lineitem_dk_l_partkey'::regclass::oid Here, it’s much more efficient to materialize a single-column temporary table, rather than redistributing the entire table only to find out there was nonuniform distribution: - Materialize a single column to check distributionĬREATE TEMP TABLE lineitem_dk_l_partkey DISTKEY (l_partkey) AS Suppose that you think that you’ve identified a column that should result in uniform distribution but want to confirm this. When distributing on a given column, it is desirable to have a nearly consistent number of rows/blocks on each slice. In the example following, l_orderkey stands out as a poor option that you can eliminate as a potential DISTKEY column: SELECT l_orderkey, COUNT(*) You can easily identify columns that contain “heavy hitters” or introduce “hot spots” by using some simple SQL code to review the dataset. Distributing on a column, customer_id, where a minority of your customers are responsible for the majority of your data.Distributing on a column containing a significant percentage of NULL values.A nonuniformly distributed data profile occurs in scenarios such as these: If the hashed column values don’t enable uniform distribution of data to the cluster slices, then you’ll end with both data skew at rest and data skew in flight (during query processing)-which results in a performance hit due to an unevenly parallelized workload. If we find that no columns are acceptable DISTKEY columns, then we can eliminate DISTSTYLE KEY as a potential DISTSTYLE option for this table.ĭoes the column data have a uniformly distributed data profile? To do so, first determine if the table contains any columns that would appropriately distribute the table data if they were specified as a DISTKEY. Phase 1 seeks to determine if KEY distribution is appropriate. Phase 1: Identifying Appropriate DISTKEY Columns This post presents a two-phase flow chart that will guide you through questions to ask of your data profile to arrive at the ideal DISTSTYLE and DISTKEY for your scenario. Which style is most appropriate for your table is determined by several criteria. ALL distribution stores a full copy of the table on the first slice of each node. On ingest, Amazon Redshift hashes each DISTKEY column value, and route hashes to the same slice consistently. KEY requires a single column to be defined as a DISTKEY.EVEN will do a round-robin distribution of data.If you’re unfamiliar with these table properties, you can watch my presentation at the 2016 AWS Santa Clara Summit, where I discussed the basics of distribution starting at the 17-minute mark. To achieve minimization, data should be located where it needs to be before the query is executed.Ī table might be defined with a DISTSTYLE of EVEN, KEY, or ALL. The secondary goal in selecting a table’s DISTSTYLE is to minimize the cost of data movement necessary for query processing. When you execute a query, the query optimizer might redistribute or broadcast the intermediate tuples throughout the cluster to facilitate any join or aggregation operations. The primary goal in selecting a table’s DISTSTYLE is to evenly distribute the data throughout the cluster for parallel processing. During query execution, all slices process the rows that they’ve had assigned in parallel. You can think of a slice like a virtual compute node. Depending on node type, each compute node contains 2, 16, or 32 slices. Within each compute node, the rows are assigned to a cluster slice. When you load data into a table, Amazon Redshift distributes the rows to each of the compute nodes according to the table’s DISTSTYLE. This blog installment presents a methodology to guide you through the identification of optimal DISTSTYLEs and DISTKEYs for your unique workload. The first table and column properties we discuss in this blog series are table distribution styles (DISTSTYLE) and distribution keys (DISTKEY). Part 3: Compound and Interleaved Sort Keys Part 2: Distribution Styles and Distribution Keys ( Translated into Japanese) Part 1: Preamble, Prerequisites, and Prioritization
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |