Non-parametric Permutation Testing

Why this matters

Because LAION-fMRI contains only 5 subjects, all significance testing must be done at the single-subject level. If the study you are replicating used single-subject statistics, stick to the method used in the paper. However, if it used group-level statistics (e.g., a t-test or ANOVA across subjects), you need to switch to non-parametric permutation testing and report the results for each subject individually. Below you will find a step-by-step guide and concrete examples.

Note that the data we are using are not time series but already-estimated single-trial betas. All inference is therefore performed on models fitted to these betas, not on the raw time series.

Step-by-step

1

Compute your test statistic

Choose the statistic that directly reflects the hypothesis you are testing. Compute it once on the real data for each subject. This could be a model performance, mean activation, or any other measure on which your finding is based.
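As one possible illustration of Step 1, the sketch below computes a mean per-voxel Pearson r between predicted and measured betas. The function name `mean_voxelwise_pearson`, the array shapes, and the synthetic data are assumptions made for this example only; they are not part of the LAION-fMRI tooling.

```python
import numpy as np

def mean_voxelwise_pearson(pred, betas):
    """Mean over voxels of Pearson r between predicted and measured betas.

    pred, betas: arrays of shape (n_images, n_voxels).
    """
    pred_c = pred - pred.mean(axis=0)      # center each voxel's predictions
    betas_c = betas - betas.mean(axis=0)   # center each voxel's betas
    num = (pred_c * betas_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (betas_c ** 2).sum(axis=0))
    return float(np.mean(num / den))

# Synthetic stand-in for one subject's held-out data (not real LAION-fMRI data).
rng = np.random.default_rng(0)
betas = rng.standard_normal((200, 50))
pred = 0.3 * betas + rng.standard_normal((200, 50))

t_obs = mean_voxelwise_pearson(pred, betas)  # observed statistic for this subject
```

The same function would be called once per subject on the real data, and again on every permuted version of the data.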

2

Decide what to permute

Permute the data along the dimension that carries the effect of interest, while keeping the rest of the data structure intact. This step depends entirely on the null hypothesis against which you are testing. Only permute labels or assignments where, if the null hypothesis were true, the permuted data would be just as plausible as the original.

This is the most critical step. Permuting the wrong dimension, or permuting across the data as a whole, produces an invalid null distribution and unreliable p-values.
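The difference between a structured and an unstructured shuffle can be made concrete with a small sketch. It assumes a hypothetical betas array of shape (n_images, n_voxels) and a null hypothesis under which image order is exchangeable; whether the image axis is the right one to permute depends on the hypothesis being tested.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = rng.standard_normal((6, 4))  # hypothetical (n_images, n_voxels) array

# Structured shuffle: reorder whole rows, so each image's voxel pattern
# stays together and only the image order changes.
perm = rng.permutation(betas.shape[0])
betas_rows_shuffled = betas[perm]

# Unstructured shuffle: scramble every entry. This also moves values
# between voxels, so the permuted data no longer resemble real data.
betas_scrambled = rng.permutation(betas.ravel()).reshape(betas.shape)
```

After the row shuffle, every voxel still holds exactly its own set of values; after the full scramble it does not, which is why the second variant does not give a usable null distribution here.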

3

Build the null distribution

Repeat your Step 1 analysis 1,000 to 10,000 times (more is better, if computationally feasible), each time using a different random permutation from Step 2. Each iteration yields one sample of what your test statistic looks like when the association of interest is absent.
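A minimal sketch of this loop is shown below. The helper `build_null`, its arguments, and the toy Pearson r statistic are illustrative assumptions; in practice the statistic and the shuffle would be whatever Steps 1 and 2 call for.

```python
import numpy as np

def build_null(statistic, x, y, n_perm=1000, seed=0):
    """Recompute `statistic` after shuffling the first axis of `x`, n_perm times."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(x.shape[0])  # a fresh random permutation each time
        null[i] = statistic(x[perm], y)
    return null

def pearson_r(a, b):
    """Toy statistic: Pearson r between two 1-D vectors."""
    return np.corrcoef(a, b)[0, 1]

# Synthetic data with a real association between x and y.
rng = np.random.default_rng(1)
y = rng.standard_normal(100)
x = y + rng.standard_normal(100)

null = build_null(pearson_r, x, y, n_perm=1000)
```

Fixing the seed makes the null distribution reproducible, which helps when the analysis is rerun for each subject.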

4

Compute your p-value

The p-value is the proportion of permuted statistics that are at least as extreme as your observed statistic from Step 1.

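One-sided: Proportion of permuted results ≥ observed result.
$p = \frac{1 + \#(T_{\mathrm{perm}} \geq T_{\mathrm{obs}})}{1 + N_{\mathrm{perm}}}$

Two-sided: Proportion of |permuted results| ≥ |observed result|.
$p = \frac{1 + \#(|T_{\mathrm{perm}}| \geq |T_{\mathrm{obs}}|)}{1 + N_{\mathrm{perm}}}$
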
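Both p-value definitions can be written as one short function. This is a sketch; the name `permutation_p_value` and the toy numbers are made up for illustration.

```python
import numpy as np

def permutation_p_value(t_obs, t_perm, two_sided=False):
    """Permutation p-value with the +1 correction in numerator and denominator."""
    t_perm = np.asarray(t_perm, dtype=float)
    if two_sided:
        n_extreme = np.sum(np.abs(t_perm) >= abs(t_obs))
    else:
        n_extreme = np.sum(t_perm >= t_obs)
    return (1 + n_extreme) / (1 + t_perm.size)

# Toy null distribution with five permuted statistics.
t_perm = np.array([-0.3, -0.1, 0.0, 0.2, 0.4])
p_one = permutation_p_value(0.25, t_perm)                  # (1 + 1) / (1 + 5)
p_two = permutation_p_value(0.25, t_perm, two_sided=True)  # (1 + 2) / (1 + 5)
```

Because of the +1 terms, the smallest attainable p-value is 1 / (1 + N_perm), so the number of permutations sets the resolution of the test.
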
5

Correct for multiple comparisons

After computing the p-values for each subject, correct for multiple comparisons in the same way as the original paper. The important thing is that you apply the correction to each subject separately. Because the correction is performed within each subject, you only need to apply it if your replication yields multiple p-values per subject.
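The sketch below shows what a per-subject correction might look like. It uses Benjamini-Hochberg FDR purely as a stand-in; the method actually applied should be whichever one the original paper used. The subject IDs and p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

# Hypothetical p-values, several per subject (e.g. one per ROI).
p_by_subject = {
    "sub-01": [0.001, 0.04, 0.03],
    "sub-02": [0.20, 0.01, 0.50],
}

# The correction is applied within each subject, never pooled across subjects.
adjusted_by_subject = {s: benjamini_hochberg(p) for s, p in p_by_subject.items()}
```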

Examples

Both examples below test encoding model performance on PPA and FFA voxels using per-voxel Pearson r as the base measure. What changes is the hypothesis — and therefore what must be permuted. Using the wrong shuffle produces an invalid null distribution.

Example 1

Is encoding performance above chance?

Scenario

You want to test whether your CLIP encoding model predicts fMRI responses significantly above chance, i.e., does the model capture something real about the brain's response to images?

Test statistic: Mean Pearson r between model-predicted and actual betas, averaged across all voxels in PPA and FFA (per subject).
What to permute: Randomly shuffle image indices, reassigning which model prediction is compared to which image. This breaks the correspondence between image identity and brain response.
Null hypothesis: The model has no consistent relationship with any particular image's brain response, so the observed encoding performance is no better than performance under random image assignment.
P-value: Proportion of permuted mean model performances ≥ observed mean model performance (one-sided).

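Example 1 could be sketched as follows for a single subject. Everything here is synthetic: the arrays stand in for held-out ROI betas and CLIP-based predictions, and the names `mean_voxelwise_r`, `betas`, and `pred` are assumptions for this illustration.

```python
import numpy as np

def mean_voxelwise_r(pred, betas):
    """Mean over voxels of Pearson r; inputs have shape (n_images, n_voxels)."""
    pc = pred - pred.mean(axis=0)
    bc = betas - betas.mean(axis=0)
    r = (pc * bc).sum(axis=0) / np.sqrt((pc ** 2).sum(axis=0) * (bc ** 2).sum(axis=0))
    return r.mean()

rng = np.random.default_rng(0)
n_images, n_voxels, n_perm = 200, 40, 1000

# Synthetic stand-ins for one subject's ROI betas and model predictions.
betas = rng.standard_normal((n_images, n_voxels))
pred = 0.3 * betas + rng.standard_normal((n_images, n_voxels))

t_obs = mean_voxelwise_r(pred, betas)

t_perm = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(n_images)  # one image order, shared by all voxels
    t_perm[i] = mean_voxelwise_r(pred[shuffled], betas)

# One-sided p-value with the +1 correction.
p = (1 + np.sum(t_perm >= t_obs)) / (1 + n_perm)
```

The whole procedure is then repeated independently for each subject, giving one p-value per subject.
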
Example 2

Does CLIP outperform AlexNet as an encoding model?

Scenario

Using fMRI responses across all voxels in PPA and FFA, you want to test whether CLIP predicts brain activity significantly better than AlexNet, i.e., is there a meaningful difference in encoding performance between the two models?

Test statistic: Mean Pearson r between CLIP predictions and betas, minus mean Pearson r between AlexNet predictions and betas.
What to permute: For each image independently, randomly swap which model's predictions are labelled CLIP vs AlexNet, then recompute the full correlation difference. Under the null, model labels are arbitrary per image, so each swap is equally plausible.
Null hypothesis: CLIP and AlexNet explain each image's brain response equally well: which model produced which prediction is arbitrary.
P-value: Proportion of |permuted correlation differences| ≥ |observed correlation difference| (two-sided)
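A possible sketch of Example 2 for a single subject is given below. The data are synthetic, and the names `pred_clip`, `pred_alex`, and `mean_voxelwise_r` are assumptions for this illustration rather than part of any existing pipeline.

```python
import numpy as np

def mean_voxelwise_r(pred, betas):
    """Mean over voxels of Pearson r; inputs have shape (n_images, n_voxels)."""
    pc = pred - pred.mean(axis=0)
    bc = betas - betas.mean(axis=0)
    r = (pc * bc).sum(axis=0) / np.sqrt((pc ** 2).sum(axis=0) * (bc ** 2).sum(axis=0))
    return r.mean()

rng = np.random.default_rng(0)
n_images, n_voxels, n_perm = 200, 40, 1000

# Synthetic stand-ins: the "CLIP" predictions track the betas more closely.
betas = rng.standard_normal((n_images, n_voxels))
pred_clip = 0.4 * betas + rng.standard_normal((n_images, n_voxels))
pred_alex = 0.2 * betas + rng.standard_normal((n_images, n_voxels))

t_obs = mean_voxelwise_r(pred_clip, betas) - mean_voxelwise_r(pred_alex, betas)

t_perm = np.empty(n_perm)
for i in range(n_perm):
    swap = rng.random(n_images) < 0.5  # per-image coin flip: swap labels or not
    perm_a = np.where(swap[:, None], pred_alex, pred_clip)
    perm_b = np.where(swap[:, None], pred_clip, pred_alex)
    t_perm[i] = mean_voxelwise_r(perm_a, betas) - mean_voxelwise_r(perm_b, betas)

# Two-sided p-value with the +1 correction.
p = (1 + np.sum(np.abs(t_perm) >= abs(t_obs))) / (1 + n_perm)
```

Swapping labels per image keeps each prediction paired with its own image, so only the model assignment is randomized, which matches the null hypothesis stated above.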