Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update BQ deduplicate macro to support partition pruning downstream #929

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

austinclooney
Copy link

@austinclooney austinclooney commented Jul 1, 2024

resolves #928

Problem

The current BQ deduplicate macro will not work for a view as it doesn't allow partition pruning downstream of the macro.

Solution

Add an an optional parameter that explicitly selects the partition by columns outside of the array_agg

Checklist

  • This code is associated with an issue which has been triaged and accepted for development.
  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the README.md (if applicable)

@@ -93,12 +93,19 @@
-- clause in BigQuery:
-- https://github.com/dbt-labs/dbt-utils/issues/335#issuecomment-788157572
#}
{%- macro bigquery__deduplicate(relation, partition_by, order_by) -%}
{%- macro bigquery__deduplicate(relation, partition_by, order_by, partition_by_pass_thru=false) -%}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising this PR @austinclooney !

I'd prefer not to introduce a new parameter if we can avoid it.

BigQuery supports the qualify clause. Is there any reason we shouldn't / couldn't just use qualify in the same way as Snowflake and Databricks?

The original implementation for BigQuery was for performance reasons. But I'm not sure if those performance reasons still exist or if working for views outweighs those considerations.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my knowledge the memory issue still exists, yes. Deduplicating with array_agg uses fewer resources than with qualify which allows it to be done on larger datasets before hitting memory issues. The reason I added a parameter here is because explicitly selecting the partition_by by default will change the column order in the output so I think it would technically be a breaking change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several different scenarios to consider for the relation being deduplicated:

  • table
  • view
  • materialized view
  • CTE

What would be the consequences if we did not add a new parameter named partition_by_pass_thru and just treated it as always being true instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the only difference between the current method and the changes added in this PR is this PR would move the partition_by columns to the end of the output (or the beginning if preferred) and the current output retains the original column order.

Since we're only explicitly selecting the partitioning keys it won't affect the actual deduplication as those are the groups we're deduplicating within.

Copy link
Contributor

@dbeatty10 dbeatty10 Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Edit: I just re-read the prior message and update the question below]

So the net consequence would be the following?

  • partition pruning allowing faster queries for less cost
  • output relation has the same columns but possibly in a different order than before

And these consequences would equally apply to all of tables, views, materialized views, and CTE?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you think of any way to do the following?

  • output relation with the same columns in the same order as before

If so, that would allow us to ship these performance improvements without any breaking changes or new parameters.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've thought about this for a while but I don't believe it's possible without running another query since the input to the macro can be a CTE.

Something like this would work:

{%- macro bigquery__deduplicate(relation, partition_by, order_by) -%}
    {%- set columns_query -%}
        select * from {{ relation }} limit 1
    {%- endset -%}

    {%- set columns_result = run_query(columns_query) -%}
    {%- set columns = columns_result.column_names -%}

    select
        {%- for column in columns %}
            unique.{{ column }}{%- if not loop.last %},{% endif %}
        {%- endfor %}
    from (
        select
            {{ partition_by }},
            array_agg(
                original
                order by {{ order_by }}
                limit 1
            )[offset(0)] unique
        from {{ relation }} original
        group by {{ partition_by }}
    )
{%- endmacro -%}

But I don't think we'd want to run an external query to get the macro to work.

@dbeatty10 dbeatty10 changed the title Update BQ deduplicate macro to support partition pruning downstream. Update BQ deduplicate macro to support partition pruning downstream Jul 2, 2024
@dbeatty10 dbeatty10 added the enhancement New feature or request label Jul 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BigQuery deduplicate macro doesn't support downstream partition pruning
2 participants