Non-parametric Permutation Testing
Why this matters
Because LAION-fMRI contains only 5 subjects, all significance testing must be done at the single-subject level. If the study you are replicating used single-subject statistics, stick to the method used in the paper. However, if it used between-subject statistics (e.g., a t-test or ANOVA across subjects), you need to switch to non-parametric permutation testing and report the results for each subject. Below is a step-by-step guide with concrete examples.
Step-by-step
Compute your test statistic
Choose the statistic that directly reflects the hypothesis you are testing. Compute it once on the real data for each subject. This could be model performance, a mean activation, or any other measure on which your finding is based.
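As a minimal sketch of this step, here is a per-voxel Pearson r averaged into one summary statistic per subject. The data are simulated stand-ins; real inputs would be the model's predicted responses and the measured betas, assumed here to be arrays of shape (n_stimuli, n_voxels).

```python
import numpy as np

def pearson_r_per_voxel(pred, actual):
    """Per-voxel Pearson r between predicted and measured responses.

    pred, actual: arrays of shape (n_stimuli, n_voxels).
    """
    pred_z = (pred - pred.mean(0)) / pred.std(0)
    actual_z = (actual - actual.mean(0)) / actual.std(0)
    # Mean of z-score products over stimuli equals Pearson r per voxel.
    return (pred_z * actual_z).mean(0)

rng = np.random.default_rng(0)
actual = rng.standard_normal((100, 50))                # simulated betas
pred = actual + rng.standard_normal((100, 50))         # noisy but correlated predictions
observed = pearson_r_per_voxel(pred, actual).mean()    # one statistic for this subject
```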
Decide what to permute
Permute along the dimension that carries the effect of interest, while keeping the rest of the data structure intact. This step depends entirely on the null hypothesis against which you are testing. Only permute labels or assignments where, if the null hypothesis were true, the permuted data would be just as plausible as the original.
This is the most critical step. Permuting the wrong dimension, or permuting across the data as a whole, produces an invalid null distribution and unreliable p-values.
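To make the distinction concrete, the snippet below contrasts a valid shuffle with an invalid one on a hypothetical (stimuli, voxels) matrix. Shuffling whole rows breaks only the stimulus pairing; shuffling every value independently destroys the voxel covariance structure as well, which invalidates the null distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((100, 50))   # hypothetical (stimuli, voxels) matrix

# Valid: shuffle whole rows. This breaks the stimulus assignment but keeps
# each stimulus's voxel pattern (and the voxel-wise covariance) intact.
valid = data[rng.permutation(data.shape[0])]

# Invalid: shuffling every value independently destroys the covariance
# structure across voxels, not just the effect under test.
invalid = rng.permutation(data.ravel()).reshape(data.shape)
```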
Build the null distribution
Repeat your Step 1 analysis 1,000 to 10,000 times (more is better, if computationally feasible), each time using a different random permutation from Step 2. Each iteration yields one sample of what your test statistic looks like when the association of interest is absent.
Compute your p-value
The p-value is the proportion of permuted statistics that are at least as extreme as your observed statistic from Step 1.
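In code, a common convention adds 1 to both the count and the denominator, which counts the observed statistic as one member of the null distribution and keeps p strictly positive (the smallest attainable p is then 1/(n_perm + 1)). The null distribution below is a stand-in for the one built in Step 3:

```python
import numpy as np

rng = np.random.default_rng(0)
null = rng.standard_normal(10000)   # stand-in for the permutation null
observed = 2.5                      # stand-in for the Step 1 statistic

# One-sided p-value: proportion of permuted statistics at least as
# extreme as the observed one, with the +1 convention.
p = (1 + np.sum(null >= observed)) / (1 + null.size)
```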
Correct for multiple comparisons
After computing the p-values for each subject, correct for multiple comparisons in the same way as the original paper. The important point is that you apply the correction to each subject separately. Note that, because the correction is performed per subject, you only need it if your replication yields multiple p-values for each subject.
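Which correction to use depends on the original paper; as one illustration only, here is a self-contained Benjamini-Hochberg (FDR) adjustment applied per subject, with hypothetical p-values:

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values for one subject's tests."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    # Enforce monotonicity from the largest rank downwards, cap at 1.
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1.0)
    out = np.empty_like(adjusted)
    out[order] = adjusted
    return out

# Correct each subject separately, never pooled across subjects.
pvals_per_subject = {"sub-01": [0.001, 0.02, 0.4], "sub-02": [0.03, 0.5, 0.7]}
adjusted = {s: benjamini_hochberg(p) for s, p in pvals_per_subject.items()}
```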
Examples
Both examples below start from the same data and base statistic: a CLIP encoding model yielding a per-voxel Pearson r across voxels in PPA and FFA. What changes is the hypothesis, and therefore what must be permuted. Using the wrong shuffle produces an invalid null distribution.
Is encoding performance above chance?
Scenario
You want to test whether your CLIP encoding model predicts fMRI responses significantly above chance, i.e., does the model capture something real about the brain's response to images?
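One reasonable sketch for this hypothesis, assuming predictions and measured betas share a stimulus axis: under the null, the model predicts nothing, so any pairing of predicted and measured responses is equally plausible. Shuffling the stimulus order of the predictions therefore yields the null. The data here are simulated stand-ins for the real predictions and betas.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stimuli, n_voxels = 200, 30

actual = rng.standard_normal((n_stimuli, n_voxels))      # simulated betas
pred = actual + rng.standard_normal((n_stimuli, n_voxels))  # simulated predictions

def mean_voxel_r(pred, actual):
    """Mean per-voxel Pearson r between predictions and responses."""
    pz = (pred - pred.mean(0)) / pred.std(0)
    az = (actual - actual.mean(0)) / actual.std(0)
    return (pz * az).mean(0).mean()

observed = mean_voxel_r(pred, actual)

# Null: shuffle the stimulus pairing between predictions and responses.
n_perm = 1000
null = np.array([mean_voxel_r(pred[rng.permutation(n_stimuli)], actual)
                 for _ in range(n_perm)])

p = (1 + np.sum(null >= observed)) / (1 + n_perm)
```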
Is encoding performance higher in PPA than FFA?
Scenario
Using the same encoding model and beta matrix, you now want to test whether the model explains more variance in PPA than in FFA, i.e., is there a significant difference in encoding performance between the two regions?
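For this hypothesis the statistic is the difference in mean r between regions, and the null says region identity is irrelevant: if so, shuffling the PPA/FFA labels across voxels yields equally plausible data. A minimal sketch with simulated per-voxel encoding scores (real inputs would be the per-voxel Pearson r values from the encoding model):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated per-voxel encoding scores; region sizes are hypothetical.
r_ppa = rng.normal(0.30, 0.1, size=80)   # PPA voxels
r_ffa = rng.normal(0.15, 0.1, size=60)   # FFA voxels

observed = r_ppa.mean() - r_ffa.mean()

# Null: shuffle region labels across voxels, keeping group sizes fixed.
pooled = np.concatenate([r_ppa, r_ffa])
n_perm = 5000
null = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    null[i] = shuffled[:80].mean() - shuffled[80:].mean()

p = (1 + np.sum(null >= observed)) / (1 + n_perm)
```

Note that, unlike the first example, shuffling the stimulus order here would not test the hypothesis: it would destroy encoding performance in both regions instead of equalizing it between them.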