Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Unable to unroll loop" error compiling generated HLSL code that writes to a matrix row in a loop #4436

Open
pixelspark opened this issue May 23, 2023 · 8 comments
Labels
area: naga back-end Outputs of naga shader conversion FXC bug lang: HLSL D3D Shading Language naga Shader Translator type: bug Something isn't working

Comments

@pixelspark
Copy link

pixelspark commented May 23, 2023

We are using wgpu in wonnx to run machine learning models using compute shaders - as I understand it the WGSL code is translated to HLSL (and MSL, etc) through naga. All of this has been working great so far, your work is much appreciated!

While testing code that works fine on macOS I nevertheless encountered an issue on Windows (webonnx/wonnx#166). The error is the following:

In Device::create_compute_pipeline
Internal error: FXC D3DCompile error (0x80004005): (81,21-35): warning X3550: array reference cannot be used as an l-value; not natively addressable, forcing loop to unroll
(35,5-15): error X3511: unable to unroll loop, loop does not appear to terminate in a timely manner (329 iterations) or unrolled 

The linked issue contains both the input WGSL code as well as the generated HLSL code. It appears the issue is the following:

// tmpsum and product are both matNxN where N = kernel_size
for(var k: u32 = 0u; k < {{ k_chunks }}u; k = k + 1u) {
  // ...
 for(var index_mat: u32 = 0u; index_mat < {{ kernel_size }}u; index_mat = index_mat + 1u) {
    tmpsum[index_mat] = tmpsum[index_mat] + product[index_mat];
 }
}

Our system replaces {{k_chunks}} and {{kernel_size}} at runtime with constant integers. The kernel_size is typically 4 while k_chunks can get quite large.

Apparently HLSL requires the inner loop to be unrolled as it apparently disallows indexing into matNxN dynamically (while WGSL allows this). It helpfully applies a workaround (unroll the loop) but inadvertently seems to unroll both the inner and outer loop.

I managed to resolve it in this case (see webonnx/wonnx@c772653) by replacing the inner loop with simply product = mat_right * mat_left;. This of course is better regardless of this issue. Nevertheless you may be interested to know about this error as I suspect it may occur in different situations as well.

(Note that I am now seeing some other error, see the linked issue. If anyone here happens to have an idea about that one, I'd be very interested to hear)

@teoxoy teoxoy added kind: bug area: naga back-end Outputs of naga shader conversion lang: HLSL D3D Shading Language labels May 25, 2023
@teoxoy
Copy link
Member

teoxoy commented May 25, 2023

Apparently HLSL requires the inner loop to be unrolled as it apparently disallows indexing into matNxN dynamically (while WGSL allows this). It helpfully applies a workaround (unroll the loop) but inadvertently seems to unroll both the inner and outer loop.

Well, this is unfortunate... We might have to come up with a workaround for this (unroll the inner loop ourselves?) but I don't think it's high priority right now.

Possibly related #4460.

@pixelspark
Copy link
Author

Yeah, not sure about what the right solution would be in this case. Fundamentally WGSL is allowing something that HLSL does not support (dynamic indexing of a matrix row). Notifying the user by throwing an error in this case seems appropriate to me. Working around the issue entirely may not be possible nor worthwhile.

Perhaps the HLSL compiler can be tricked into unrolling the inner loop but not the outer one by application of attributes (see HLSL documentation - 'unroll' is one of the attributes). That would fix this specific case but dynamic indexing in matrixes (and perhaps vectors) could also occur in other usage scenarios.

@jimblandy
Copy link
Member

@pixelspark Could you pull out a reduced test case that you could put in this issue?

@cwfitzgerald cwfitzgerald transferred this issue from gfx-rs/naga Oct 25, 2023
@cwfitzgerald cwfitzgerald added naga Shader Translator type: bug Something isn't working and removed kind: bug labels Oct 25, 2023
@teoxoy teoxoy added this to the WebGPU Specification V1 milestone Nov 3, 2023
@FL33TW00D
Copy link
Contributor

Seeing this still on DXC

In Device::create_compute_pipeline
      note: label = `1d_conv3_scalar`
    Internal error: FXC D3DCompile error (0x80004005): D:\a\1d_conv3_scalar(69,21-32): warning X3550: array reference cannot be used as an l-value; not natively addressable, forcing loop to unroll
D:\a\1d_conv3_scalar(36,5-15): error X3511: unable to unroll loop, loop does not appear to terminate in a timely manner (497 iterations) or unrolled loop is too large, use the [unroll(n)] attribute to force an exact higher number

Full shader for context:

//Each workgroup is responsible for a single filter.
//Each thread computes a single element of the output.
//Each thread places the 3 column wide filter over the input, and multiplies and accumulates the values
//to a SINGLE output element.
@group(0) @binding(0)
var<storage, read> X: array<f32>;

@group(0) @binding(1)
var<storage, read> W: array<f32>;

@group(0) @binding(2)
var<storage, read> B: array<f32>;

@group(0) @binding(3)
var<storage, read_write> Y: array<f32>;

struct Meta {
    padding: u32,
    stride: u32,
    Cin: u32,
    Lin: u32,
    KS: u32,
    F_numel: u32,
    W_numel: u32,
    Lout: u32,
    Fperthread: u32,
}

@group(1) @binding(0)
var<uniform> metadata: Meta;

var<workgroup> F: array<f32, {{ F_numel }}u>;

fn inner(input_index: u32, filter_index: u32, output_index: u32, bias_index: u32, start: u32, end: u32) {
    var inp = vec3<f32>(0f);
    var kernel = vec3<f32>(0f);
    var acc = vec3<f32>(0f); 
    for(var i = 0u; i < metadata.Cin; i++) {
        let input_start = input_index + (i * metadata.Lin) - 1u; //-1 is for padding
        //We only populate the input between the provided indices, used for padding
        for(var j = start; j <= end; j++) {
            inp[j] = X[input_start + j];
        }

        let filter_start = i * metadata.KS;
        kernel.x = F[filter_start];
        kernel.y = F[filter_start + 1u];
        kernel.z = F[filter_start + 2u];

        acc = fma(inp, kernel, acc);
    }        
    Y[output_index] = acc.x + acc.y + acc.z + B[bias_index];
}


//Each thread may load more than 1 element into shared memory
fn load_filters_into_smem(local_id: vec3<u32>, filter_index: u32) {
    let windex = filter_index + (local_id.x * metadata.Fperthread);
    let findex = (local_id.x * metadata.Fperthread);
    for(var i=0u; i < metadata.Fperthread; i++) {
        if findex + i < metadata.F_numel {
            F[findex + i] = W[windex + i];
        }
    }
}

//Doesn't support dynamic padding
@compute @workgroup_size({{ workgroup_size_x }}, {{ workgroup_size_y }}, {{ workgroup_size_z }})
fn main(@builtin(global_invocation_id) global_id: vec3<u32>,
        @builtin(local_invocation_id) local_id: vec3<u32>,
        @builtin(workgroup_id) workgroup_id: vec3<u32>) {

    let input_index = (workgroup_id.x * {{ workgroup_size_x }}u + local_id.x) * metadata.stride;
    let filter_index = (workgroup_id.y * {{ F_numel }}u); 
    
    load_filters_into_smem(local_id, filter_index);
    workgroupBarrier();

    if input_index >= metadata.Lin {
        //Break after loading because all threads may be needed for loading F
        return;
    }

    let output_index = (workgroup_id.x * {{ workgroup_size_x }}u + local_id.x) + (workgroup_id.y * metadata.Lout);
    let bias_index = workgroup_id.y;

    //TODO: dynamic padding
    if input_index == metadata.Lin - metadata.padding {
        inner(input_index, filter_index, output_index, bias_index, 0u, 1u);
    } else if input_index == 0u {
        inner(input_index, filter_index, output_index, bias_index, 1u, 2u);
    } else {
        inner(input_index, filter_index, output_index, bias_index, 0u, 2u);
    }
}

Obviously the inner function is causing issues.

@cwfitzgerald
Copy link
Member

Seeing this still on DXC

The error is from FXC

@FL33TW00D
Copy link
Contributor

Seeing this still on DXC

The error is from FXC

You're correct - using DXC fixes this.

@jimblandy
Copy link
Member

I believe the Naga bug here is that we don't generate valid HLSL for dynamically-indexed vectors and matrices. If that's the case, then getting that right would probably stop provoking FXC into trying to fix the bug with "not natively addressable, forcing loop to unroll", and things would be fine.

We should definitely look at what code Tint generates for these things.

@jimblandy
Copy link
Member

See also #4337

@teoxoy teoxoy added the FXC bug label Jan 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: naga back-end Outputs of naga shader conversion FXC bug lang: HLSL D3D Shading Language naga Shader Translator type: bug Something isn't working
Projects
Status: No status
Development

No branches or pull requests

5 participants