
上QQ阅读APP看书,第一时间看更新
Remove duplicates
The objective of this example is to prevent refresh failures due to duplicate source values in the relationship column of a dimension table. Additionally, the duplicates are to be isolated for further inspection and troubleshooting:
- Access a dimension table query from an unstructured data source such as an Excel Workbook.

Figure 71: Product Query Preview
- The source is an Excel table maintained by business users.

Figure 72: Excel Data Source
- The Product Name column is used for the relationship to the Sales fact table; therefore it must uniquely identify each row.

Figure 73: Product to Sales Relationship
- Given the one-to-many relationship, any duplicate values in the Product Name column of the Products table will result in refresh failures.

Figure 74: Duplicate Refresh Failure
- Add the following four M expressions to the Products query per Figure 75:
- Remove any leading and trailing empty spaces in the Product Name column with a Text.Trim() function
- Create a duplicate column of the Product Name key column with the Table.DuplicateColumn() function
- Add an expression to the Products query with the Table.Distinct() function to remove duplicate rows
- Add another Table.Distinct() expression to specifically remove duplicate values from the Product Name column
- As an unstructured source, the column types were defined explicitly in the query via Table.TransformColumnTypes().

Figure 75: Duplicated key and distinct expressions
- The query is still vulnerable to mixed cases such as Fender Set and Fender set.

Figure 76: Distinct cases in key values
- The M engine considers the values unique but the data model engine doesn't.
- Add an expression to force uppercase on the Product Name column via the Table.TransformColumns() function. This new expression must be applied before the duplicate removal expressions are applied.

Figure 77: Uppercase expression inserted into the query
- The query is now resilient to duplicate values and rows, mixed cases, and spaces. However, the Product Name column is now in the uppercase format.

Figure 78: All caps after duplicate removal
- Add two final expressions to replace the Product Name column with the duplicate column created in step 2.

Figure 79: Product Name Column Replacement
- The capitalized Product Name column is dropped via Table.RemoveColumns(), and Table.RenameColumns() is used convert the duplicate column into the column loaded to the data model for the Product-to-Sales relationship.
- To support troubleshooting, create a query that accesses the same source table and retrieves the values from the Product Name column with more than one row.

Figure 80: 'Duplicate Products' query expression retrieves the Product Names with more than one row
- The Product Name column is selected, grouped, and then filtered to always retrieve any duplicate key values. It also accounts for mixed casing of values.

Figure 81: The Duplicates Query
- Disable the load of the query retrieving duplicate product names--Duplicate Products in this example.