Non-parametric Permutation Testing
Why this matters
Because LAION-fMRI contains only five subjects, all significance testing must be done at the single-subject level. If the study you are replicating used single-subject statistics, stick to the method used in the paper. If it used group-level statistics (e.g., a t-test or ANOVA across subjects), switch to non-parametric permutation testing and report the results for each subject individually. Below you will find a step-by-step guide and concrete examples.
Note that the data used here are not time series but already-estimated single-trial betas. All inference is carried out on models fitted to these betas.
Step-by-step
Compute your test statistic
Choose the statistic that directly reflects the hypothesis you are testing, and compute it once on the real data for each subject. This could be a model performance score, a mean activation, or any other measure on which your finding is based.
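As an illustration of this step, the sketch below computes per-voxel Pearson r between model predictions and held-out betas, then averages across voxels to get one number per subject. The function name `pearson_r_per_voxel`, the array shapes, and the synthetic data are assumptions made for the example; they are not part of the LAION-fMRI tooling.

```python
import numpy as np

def pearson_r_per_voxel(predicted, observed):
    """Column-wise Pearson r between two (n_stimuli, n_voxels) arrays."""
    pred_c = predicted - predicted.mean(axis=0)
    obs_c = observed - observed.mean(axis=0)
    numerator = (pred_c * obs_c).sum(axis=0)
    denominator = np.sqrt((pred_c ** 2).sum(axis=0) * (obs_c ** 2).sum(axis=0))
    return numerator / denominator

# Synthetic stand-in for one subject's held-out data (not real LAION-fMRI values).
rng = np.random.default_rng(0)
observed = rng.standard_normal((200, 50))                     # betas: stimuli x voxels
predicted = 0.3 * observed + rng.standard_normal((200, 50))   # fake model predictions

r_per_voxel = pearson_r_per_voxel(predicted, observed)   # one r per voxel
observed_statistic = r_per_voxel.mean()                  # one number for this subject
```

Whether you summarize with the mean, the median, or keep one statistic per voxel depends on what the original paper reported.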
Decide what to permute
Permute the data along the dimension that carries the effect of interest, while keeping the rest of the data structure intact. This step depends entirely on the null hypothesis against which you are testing. Only permute labels or assignments where, if the null hypothesis were true, the permuted data would be just as plausible as the original.
This is the most critical step. Permuting the wrong dimension, or permuting across the data as a whole, produces an invalid null distribution and unreliable p-values.
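To make the distinction concrete, here is a minimal sketch contrasting a structured shuffle with a shuffle across the data as a whole. It assumes a hypothetical `betas` array of shape (stimuli, voxels) and a null hypothesis of "no association between stimuli and responses"; other hypotheses call for a different shuffle.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = rng.standard_normal((6, 4))      # hypothetical: 6 stimuli x 4 voxels

# Structured shuffle: reorder whole rows, so every response pattern stays
# intact across voxels but is detached from its stimulus label.
row_order = rng.permutation(betas.shape[0])
permuted_rows = betas[row_order]

# Unstructured shuffle: every entry moves independently. This also destroys
# the structure across voxels, which the null hypothesis says nothing about.
shuffled_everything = rng.permutation(betas.ravel()).reshape(betas.shape)
```

Both arrays contain the same values as `betas`, but only the first preserves the multi-voxel pattern of each trial.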
Build the null distribution
Repeat your Step 1 analysis 1,000 to 10,000 times (more is better, if computationally feasible), each time using a different random permutation from Step 2. Each iteration yields one sample of what your test statistic looks like when the association of interest is absent.
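The loop itself is usually short. The sketch below uses a deliberately simple case, a difference in mean ROI activation between two hypothetical image categories, and shuffles the category labels on every iteration. The data, the labels, and the effect size are all synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single-subject data: one ROI-averaged beta per trial and a
# binary category label per trial (for instance 1 = scene, 0 = face).
labels = np.repeat([0, 1], 100)
betas = rng.standard_normal(200) + 1.0 * labels

def mean_difference(betas, labels):
    """Mean beta of category 1 minus mean beta of category 0."""
    return betas[labels == 1].mean() - betas[labels == 0].mean()

observed = mean_difference(betas, labels)

n_permutations = 2000
null_distribution = np.empty(n_permutations)
for i in range(n_permutations):
    # Under the null, category labels are exchangeable across trials.
    null_distribution[i] = mean_difference(betas, rng.permutation(labels))
```

Fixing the random seed, as done here, makes the null distribution reproducible across runs.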
Compute your p-value
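The p-value is the proportion of permuted statistics that are at least as extreme as your observed statistic from Step 1. For a one-sided test, "extreme" means at least as large in the hypothesized direction; for a two-sided test, compare absolute values (assuming the null distribution is centered on zero).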
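A minimal sketch of this computation follows. The helper name `permutation_p_value` is hypothetical. It also applies a common convention that goes slightly beyond the definition above: the observed statistic is counted as one member of the null distribution, so the p-value can never be exactly zero. Drop the two `+ 1` terms to recover the plain proportion.

```python
import numpy as np

def permutation_p_value(observed, null_distribution, two_sided=False):
    """Share of null statistics at least as extreme as the observed one."""
    null = np.asarray(null_distribution, dtype=float)
    if two_sided:
        n_extreme = np.sum(np.abs(null) >= abs(observed))
    else:
        n_extreme = np.sum(null >= observed)
    # +1 in numerator and denominator: the observed value counts as one draw.
    return (n_extreme + 1) / (null.size + 1)

null = np.array([0.1, 0.2, 0.3, 0.4])
p_one_sided = permutation_p_value(0.35, null)   # (1 + 1) / (4 + 1) = 0.4

p_two_sided = permutation_p_value(
    -0.35, np.array([-0.4, 0.1, 0.2, 0.5]), two_sided=True
)                                               # (2 + 1) / (4 + 1) = 0.6
```

With this convention the smallest attainable p-value is 1 / (n_permutations + 1), which is one reason to run more permutations when strict thresholds are involved.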
Correct for multiple comparisons
After computing the p-values for each subject, correct for multiple comparisons using the same method as the original paper, but apply the correction to each subject separately. Because the correction is performed within each subject, it is only needed if your replication yields more than one p-value per subject.
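The sketch below shows what "each subject separately" looks like in practice. It assumes, purely for illustration, that the original paper used Benjamini-Hochberg FDR correction; substitute whatever method the paper actually used. The subject IDs and p-values are made up.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return (reject, adjusted_p) using the Benjamini-Hochberg procedure."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted <= alpha, adjusted

# Hypothetical p-values, e.g. one per ROI or per model comparison.
p_values_by_subject = {
    "sub-01": [0.001, 0.02, 0.04, 0.30],
    "sub-02": [0.01, 0.04, 0.03, 0.005],
}

# The correction is run once per subject, never on the pooled p-values.
results = {
    subject: benjamini_hochberg(p_values)
    for subject, p_values in p_values_by_subject.items()
}
```

Pooling all subjects' p-values into a single correction would change the number of comparisons and therefore the thresholds, which is why the loop runs per subject.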
Examples
Both examples below test encoding model performance on PPA and FFA voxels, using per-voxel Pearson r as the base measure. What changes is the hypothesis, and therefore what must be permuted. Using the wrong shuffle produces an invalid null distribution.
Is encoding performance above chance?
Scenario
You want to test whether your CLIP encoding model predicts fMRI responses significantly above chance, i.e., does the model capture something real about the brain's response to images?
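One way to set this up is sketched below. Under the null hypothesis the model's predictions carry no information about which image evoked which response, so the pairing between test images and responses is what gets shuffled, with the same reordering applied to all voxels. The arrays are synthetic stand-ins for CLIP predictions and held-out betas, and shuffling only the test-set pairing (rather than refitting the model on shuffled training data) is a simplifying assumption.

```python
import numpy as np

def voxelwise_r(a, b):
    """Column-wise Pearson r between two (n_images, n_voxels) arrays."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return (a * b).sum(axis=0) / np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))

rng = np.random.default_rng(1)
n_test, n_voxels = 150, 40
responses = rng.standard_normal((n_test, n_voxels))          # held-out betas
clip_predictions = 0.4 * responses + rng.standard_normal((n_test, n_voxels))

observed = voxelwise_r(clip_predictions, responses).mean()

n_permutations = 1000
null = np.empty(n_permutations)
for i in range(n_permutations):
    # Break the image-response pairing; keep each response pattern intact.
    shuffled = responses[rng.permutation(n_test)]
    null[i] = voxelwise_r(clip_predictions, shuffled).mean()

# One-sided: the question is whether performance is *above* chance.
p_value = (np.sum(null >= observed) + 1) / (n_permutations + 1)
```

Run this once per subject and report each subject's observed statistic and p-value individually.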
Does CLIP outperform AlexNet as an encoding model?
Scenario
Using fMRI responses across all voxels in PPA and FFA, you want to test whether CLIP predicts brain activity significantly better than AlexNet, i.e., is there a meaningful difference in encoding performance between the two models?
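Here the null hypothesis is that the two models are interchangeable, so the image-response pairing stays fixed and the model labels are permuted instead. The sketch below shows one possible scheme, which is an assumption rather than a prescribed method: for each test image, a coin flip decides whether the two models' predictions swap places, and the swap is applied across all voxels. All data are synthetic.

```python
import numpy as np

def voxelwise_r(a, b):
    """Column-wise Pearson r between two (n_images, n_voxels) arrays."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return (a * b).sum(axis=0) / np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))

def mean_r_difference(pred_a, pred_b, responses):
    """Mean over voxels of r(model A) minus r(model B)."""
    return (voxelwise_r(pred_a, responses) - voxelwise_r(pred_b, responses)).mean()

rng = np.random.default_rng(2)
n_test, n_voxels = 150, 40
responses = rng.standard_normal((n_test, n_voxels))
clip_pred = 0.5 * responses + rng.standard_normal((n_test, n_voxels))
alexnet_pred = 0.2 * responses + rng.standard_normal((n_test, n_voxels))

observed = mean_r_difference(clip_pred, alexnet_pred, responses)

n_permutations = 1000
null = np.empty(n_permutations)
for i in range(n_permutations):
    swap = rng.random(n_test) < 0.5              # one coin flip per test image
    pred_a = np.where(swap[:, None], alexnet_pred, clip_pred)
    pred_b = np.where(swap[:, None], clip_pred, alexnet_pred)
    null[i] = mean_r_difference(pred_a, pred_b, responses)

# One-sided: is CLIP better than AlexNet?
p_value = (np.sum(null >= observed) + 1) / (n_permutations + 1)
```

Note that shuffling the image-response pairing, as in the above-chance test, would answer a different question: it would destroy both models' performance rather than test the difference between them.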