Non-parametric Permutation Testing

Why this matters

Because LAION-fMRI contains only 5 subjects, all significance testing must be done at the single-subject level. If the study you are replicating used single-subject statistics, stick to the method used in the paper. However, if it used between-subject statistics (e.g., a t-test or ANOVA across subjects), you need to switch to non-parametric permutation testing and report the results for each subject. Below you will find a guide and concrete examples.

Step-by-step

1. Compute your test statistic

Choose the statistic that directly reflects the hypothesis you are testing. Compute it once on the real data for each subject. This could be a model's performance, a mean activation, or any other measure on which your finding is based.
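As an illustration for the encoding-model case used in the examples further down, here is a minimal sketch of computing the observed statistic once on the real data. The data here are synthetic stand-ins for real betas and model predictions; the function name is our own, not from any particular library.

```python
import numpy as np

def mean_voxelwise_pearson_r(pred, actual):
    """Mean Pearson r across voxels; inputs are (n_images, n_voxels) arrays."""
    pred_z = (pred - pred.mean(axis=0)) / pred.std(axis=0)
    actual_z = (actual - actual.mean(axis=0)) / actual.std(axis=0)
    return (pred_z * actual_z).mean(axis=0).mean()  # per-voxel r, then mean

# synthetic stand-in data: 100 images x 50 voxels, per subject
rng = np.random.default_rng(0)
actual = rng.standard_normal((100, 50))          # "measured" betas
pred = actual + rng.standard_normal((100, 50))   # noisy model predictions
observed = mean_voxelwise_pearson_r(pred, actual)
```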

2. Decide what to permute

Permute along the dimension that carries the effect of interest, while keeping the rest of the data structure intact. This step depends entirely on the null hypothesis against which you are testing. Only permute labels or assignments where, if the null hypothesis were true, the permuted data would be just as plausible as the original.

This is the most critical step. Permuting the wrong dimension, or permuting across the data as a whole, produces an invalid null distribution and unreliable p-values.

3. Build the null distribution

Repeat your Step 1 analysis 1,000 to 10,000 times (more is better, if computationally feasible), each time using a different random permutation from Step 2. Each iteration yields one sample of what your test statistic looks like when the association of interest is absent.

4. Compute your p-value

The p-value is the proportion of permuted statistics that are at least as extreme as your observed statistic from Step 1.

One-sided: proportion of permuted results ≥ observed result.

p = (1 + #(T_perm ≥ T_obs)) / (1 + N_perm)

Two-sided: proportion of |permuted results| ≥ |observed result|.

p = (1 + #(|T_perm| ≥ |T_obs|)) / (1 + N_perm)
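Both formulas translate directly into a few lines of code. The function name below is our own shorthand; the "+1" in numerator and denominator counts the observed statistic as one of the permutations, which keeps p strictly above zero:

```python
import numpy as np

def permutation_p_value(observed, null, two_sided=False):
    """p = (1 + number of permutations at least as extreme) / (1 + N_perm)."""
    null = np.asarray(null)
    if two_sided:
        n_extreme = (np.abs(null) >= np.abs(observed)).sum()
    else:
        n_extreme = (null >= observed).sum()
    return (1 + n_extreme) / (1 + null.size)
```

Note that the smallest attainable p-value is 1 / (1 + N_perm), so with 1,000 permutations you cannot report a p below ~0.001.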
5. Correct for multiple comparisons

After computing the p-values for each subject, correct for multiple comparisons in the same way as the original paper. The important thing is that you apply the correction to each subject separately. Note that, because the correction is performed within each subject, it is only needed if your replication yields multiple p-values per subject.
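If the original paper used FDR correction, for example, a minimal per-subject Benjamini-Hochberg sketch could look like this (the subject IDs and p-values are hypothetical; use whichever correction the paper actually used):

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg FDR-adjusted p-values for one subject's tests."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p downward
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0, 1)
    return out

# hypothetical p-values; the correction runs per subject, never pooled
p_per_subject = {"sub-01": [0.001, 0.02, 0.04], "sub-02": [0.03, 0.5, 0.9]}
adjusted = {sub: benjamini_hochberg(p) for sub, p in p_per_subject.items()}
```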

Examples

Both examples below use the same data: a CLIP encoding model yielding a per-voxel Pearson r across voxels in PPA and FFA, and the same base statistic. What changes is the hypothesis, and therefore what must be permuted. Using the wrong shuffle produces an invalid null distribution.

Example 1: Is encoding performance above chance?

Scenario

You want to test whether your CLIP encoding model predicts fMRI responses significantly above chance, i.e., does the model capture something real about the brain's response to images?

Test statistic: Mean Pearson r between model-predicted and actual betas, averaged across all voxels in PPA and FFA (per subject).
What to permute: Randomly shuffle the image indices, reassigning which model prediction is compared to which image. This breaks the correspondence between image identity and brain response.
Null hypothesis: The model has no consistent relationship with any particular image's brain response; observed performance is what random image assignment would produce.
P-value: Proportion of permuted mean model performance ≥ observed mean model performance (one-sided).
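A minimal sketch of this test on synthetic stand-in data (real analyses would load the actual betas and model predictions; dimensions and helper names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_voxels = 200, 80
actual = rng.standard_normal((n_images, n_voxels))         # betas, PPA+FFA voxels
pred = actual + rng.standard_normal((n_images, n_voxels))  # model predictions

def mean_r(pred, actual):
    pz = (pred - pred.mean(0)) / pred.std(0)
    az = (actual - actual.mean(0)) / actual.std(0)
    return (pz * az).mean(0).mean()  # per-voxel Pearson r, averaged

observed = mean_r(pred, actual)

n_perm = 1000
null = np.empty(n_perm)
for i in range(n_perm):
    # shuffle image indices: predictions now face the wrong images
    null[i] = mean_r(pred[rng.permutation(n_images)], actual)

p = (1 + (null >= observed).sum()) / (1 + n_perm)  # one-sided
```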
Example 2: Is encoding performance higher in PPA than FFA?

Scenario

Using the same encoding model and beta matrix, you now want to test whether the model explains more variance in PPA than in FFA, i.e., is there a significant difference in encoding performance between the two regions?

Test statistic: Difference in mean Pearson r: mean r across PPA voxels minus mean r across FFA voxels (per subject).
What to permute: Randomly shuffle the ROI label (PPA vs FFA) across voxels, while keeping image labels and beta values completely unchanged. This breaks any difference in model performance between the regions.
Null hypothesis: There is no systematic difference in encoding performance between PPA and FFA.
P-value: Proportion of absolute permuted r-differences ≥ absolute observed difference (two-sided).
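A minimal sketch of the ROI-label shuffle, again on synthetic per-voxel r values (voxel counts and effect sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_voxels = 120
# hypothetical per-voxel encoding r values; PPA slightly higher on average
roi = np.array(["PPA"] * 60 + ["FFA"] * 60)
r = np.where(roi == "PPA", 0.35, 0.25) + 0.05 * rng.standard_normal(n_voxels)

def roi_diff(roi, r):
    return r[roi == "PPA"].mean() - r[roi == "FFA"].mean()

observed = roi_diff(roi, r)

n_perm = 2000
null = np.empty(n_perm)
for i in range(n_perm):
    # shuffle only the ROI labels; each voxel keeps its own r value
    null[i] = roi_diff(rng.permutation(roi), r)

p = (1 + (np.abs(null) >= abs(observed)).sum()) / (1 + n_perm)  # two-sided
```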