machinelearning-classification - Classification based on machine learning using scikit-learn¶
Clinica provides a modular way to perform classification based on machine learning. To build their own classification pipeline, users can combine three modules based on scikit-learn [Pedregosa et al., 2011]:
- Input (e.g. gray matter maps obtained from T1-weighted MR images, FDG PET images)
- Algorithm (e.g. support vector machine, logistic regression, random forest)
- Validation (e.g. K-fold cross validation, repeated K-fold cross validation, repeated hold-out validation)
This combination of modules is wrapped into the machinelearning-classification
command line interface with default values [Samper et al., 2018] for algorithm and validation modules.
If you want to fine-tune these parameters or create your own module(s), please refer to the Going further section.
Prerequisites¶
You need to have performed the t1-volume
pipeline on your T1-weighted MR images and/or the pet-volume
pipeline on your PET images.
Dependencies¶
If you installed the core of Clinica, this pipeline needs no further dependencies.
Running the pipeline¶
The pipeline can be run with the following command line:
clinica run machinelearning-classification [OPTIONS] CAPS_DIRECTORY GROUP_LABEL {VoxelBased|RegionBased} {T1w|PET}
{DualSVM|LogisticRegression|RandomForest} {RepeatedHoldOut|RepeatedKFoldCV} SUBJECTS_VISITS_TSV
DIAGNOSES_TSV OUTPUT_DIRECTORY
where:
- CAPS_DIRECTORY is the folder containing the results of the t1-volume and/or the pet-volume pipeline.
- GROUP_LABEL is a string defining the group label for the current analysis, which helps you keep track of different analyses.
- The third positional argument defines the type of features for classification. It can be:
  - RegionBased: a list of values stored in a TSV file is used as features. This list corresponds to PET or T1 image intensities averaged over a set of regions obtained from a brain parcellation when running the t1-volume and/or pet-volume pipeline.
  - VoxelBased: all the voxels of the image are used as features.
- The fourth positional argument defines the studied modality (T1w or PET).
- The fifth positional argument defines the algorithm. It can be:
  - DualSVM: support vector machine (SVM) algorithm
  - LogisticRegression: logistic regression algorithm
  - RandomForest: random forest algorithm
- The sixth positional argument defines the validation method. It can be:
  - RepeatedHoldOut: repeated hold-out validation
  - RepeatedKFoldCV: repeated K-fold cross validation
- SUBJECTS_VISITS_TSV is a TSV file containing the participant_id and the session_id columns.
- DIAGNOSES_TSV is a TSV file where the diagnosis for each participant (identified by a participant ID) is reported (e.g. AD, CN). It allows the algorithm to perform the binary classification between the two labels reported. Example of a diagnosis TSV file:
participant_id diagnosis
sub-CLNC0001 AD
sub-CLNC0002 CN
sub-CLNC0003 AD
sub-CLNC0004 AD
sub-CLNC0005 CN
- OUTPUT_DIRECTORY is the directory where outputs are saved.
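Before launching the pipeline, it can help to check that the two TSV files agree with each other. The sketch below is not part of Clinica: the helper names read_tsv and check_inputs and the ses-M00 session label are hypothetical, and the only assumptions taken from this page are the column names (participant_id, session_id, diagnosis) and the tab-separated layout.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def check_inputs(subjects_visits_rows, diagnoses_rows):
    """Return (labels, missing): the sorted diagnosis labels, and the
    participants listed in SUBJECTS_VISITS_TSV that have no diagnosis."""
    diagnosis_of = {row["participant_id"]: row["diagnosis"] for row in diagnoses_rows}
    labels = sorted(set(diagnosis_of.values()))
    listed = {row["participant_id"] for row in subjects_visits_rows}
    missing = sorted(listed - set(diagnosis_of))
    return labels, missing


# Hypothetical file contents, written inline to keep the example self-contained.
subjects_visits = read_tsv(
    "participant_id\tsession_id\n"
    "sub-CLNC0001\tses-M00\n"
    "sub-CLNC0002\tses-M00\n"
    "sub-CLNC0006\tses-M00\n"
)
diagnoses = read_tsv(
    "participant_id\tdiagnosis\n"
    "sub-CLNC0001\tAD\n"
    "sub-CLNC0002\tCN\n"
)

labels, missing = check_inputs(subjects_visits, diagnoses)
print(labels)   # the classification expects exactly two distinct labels
print(missing)  # participants that would have no label
```

With real files, the same check applies after reading them with open(path, newline="") instead of io.StringIO.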
Pipeline options if you use region-based inputs:
- --atlas: Name of the atlas used for the brain parcellation generated by the t1-volume and/or the pet-volume pipeline. It can be AAL2, AICHA, Hammers, LPBA40 or Neuromorphometrics, described here.
Pipeline options if you specified PET inputs:
- --acq_label: Name of the label given to the PET acquisition, specifying the tracer used (trc-<acq_label>).
- --suvr_reference_region: Reference region used to perform intensity normalization (i.e. dividing each voxel of the image by the average uptake in this region) resulting in a standardized uptake value ratio (SUVR) map. It can be cerebellumPons (used for amyloid tracers) or pons (used for FDG).
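The intensity normalization described for --suvr_reference_region can be illustrated on a toy example. This is a minimal sketch, not the pet-volume implementation: the function name suvr_map is hypothetical, and the image is represented as a flat list of voxel values with a matching boolean mask for the reference region.

```python
def suvr_map(uptake, in_reference_region):
    """Divide every voxel by the mean uptake inside the reference region."""
    reference_values = [
        value for value, inside in zip(uptake, in_reference_region) if inside
    ]
    if not reference_values:
        raise ValueError("the reference region is empty")
    reference_mean = sum(reference_values) / len(reference_values)
    return [value / reference_mean for value in uptake]


# Toy flattened image of five voxels; the last two form the reference region.
uptake = [2.0, 3.0, 1.0, 4.0, 6.0]
in_reference_region = [False, False, False, True, True]

suvr = suvr_map(uptake, in_reference_region)
print(suvr)  # values are ratios relative to the reference mean (5.0 here)
```

On real images the same division would typically be done on arrays rather than lists, but the arithmetic is identical.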
Output¶
Results are saved in the output folder following this hierarchy:
└── <image-type>
    ├── region_based
    |   └── atlas-<atlas-id>
    |       └── <machine-learning-algorithm>
    |           └── <task1>_vs_<task2>
    |               ├── classifier
    |               |   └── iteration-<iteration-number>
    |               |       ├── mean_results.tsv
    |               |       ├── results.tsv
    |               |       └── subjects.tsv
    |               ├── best_parameters.json
    |               ├── dual_coefficients.txt
    |               ├── intersect.txt
    |               ├── support_vector_indices.json
    |               ├── weights.nii.gz
    |               └── weights.txt
    └── voxel_based
        └── smoothing-<fwhm>
            └── <machine-learning-algorithm>
                └── <task1>_vs_<task2>
                    ├── classifier
                    |   └── iteration-<iteration-number>
                    |       ├── mean_results.tsv
                    |       ├── results.tsv
                    |       └── subjects.tsv
                    ├── best_parameters.json
                    ├── dual_coefficients.txt
                    ├── intersect.txt
                    ├── support_vector_indices.json
                    ├── weights.nii.gz
                    └── weights.txt
If image_type is PET:
└── <image-type>
    └── region_based/voxel_based
        └── pvc-<pvc>
            └── ...
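Because the depth of the hierarchy depends on the options chosen (atlas, smoothing, PVC), one way to gather the results is to search the output folder recursively instead of rebuilding each path by hand. The sketch below is only an illustration: collect_mean_results is a hypothetical helper, and it assumes nothing about the columns of mean_results.tsv beyond the file being tab-separated with a header row.

```python
import csv
from pathlib import Path


def collect_mean_results(output_directory):
    """Map each mean_results.tsv path (relative to the output directory)
    to its rows, read as dictionaries keyed by the file's own header."""
    root = Path(output_directory)
    results = {}
    for tsv_path in sorted(root.rglob("mean_results.tsv")):
        with tsv_path.open(newline="") as handle:
            rows = list(csv.DictReader(handle, delimiter="\t"))
        results[tsv_path.relative_to(root).as_posix()] = rows
    return results
```

Calling collect_mean_results on your OUTPUT_DIRECTORY returns one entry per file found; the relative path in each key tells you which input type, algorithm and task the rows belong to.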
Going further¶
Fine-tune algorithm and validation parameters¶
The machinelearning-classification command uses sensible default options (defined in ml_workflows.py) that were used for classification of patients with Alzheimer’s disease [Samper et al., 2018].
No matter the combination of modules chosen, the algorithm and validation parameters are:
- fwhm: the FWHM value in mm used in the t1-volume and/or the pet-volume pipeline
- modulated: a flag to indicate if, when running the t1-volume pipeline, the image has been modulated or not (on, off)
- use_pvc_data: use PET data with partial volume correction (True/False). By default, PET data with no PVC are used.
- precomputed_kernel: to load the precomputed kernel if it exists
- mask_zeros: a flag to indicate if zero-valued voxels should be taken into account for the classification (True/False)
- n_iterations: number of times a task is repeated
- grid_search_folds: number of folds to use for the hyperparameter grid search (e.g. 10)
- c_range: range of candidate values, spaced on a logarithmic scale, from which the best value of the C parameter is selected
- test_size: fraction (between 0 and 1) of the data used as the test set for each shuffle split
- balanced: option to balance the weights according to the number of samples
- penalty: type of penalty (l2 or l1)
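To make c_range and test_size more concrete, the following sketch shows what a log-spaced range of C candidates and a hold-out split size look like. It is an illustration only: the helper names, the bounds (-6 to 2) and the rounding convention are assumptions made for this example, not Clinica's defaults.

```python
def log_spaced(start_exponent, stop_exponent, count):
    """Return `count` values evenly spaced on a base-10 log scale,
    from 10**start_exponent to 10**stop_exponent inclusive."""
    if count < 2:
        raise ValueError("count must be at least 2")
    step = (stop_exponent - start_exponent) / (count - 1)
    return [10 ** (start_exponent + i * step) for i in range(count)]


def hold_out_sizes(n_samples, test_size):
    """Split n_samples into (n_train, n_test) for a test fraction in (0, 1)."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be strictly between 0 and 1")
    n_test = round(n_samples * test_size)  # rounding rule is an assumption
    return n_samples - n_test, n_test


c_range = log_spaced(-6, 2, 5)  # hypothetical bounds, for illustration only
print(c_range)                  # five candidates, each 100 times the previous one
print(hold_out_sizes(100, 0.2)) # train and test sizes for a 20% test fraction
```

A logarithmic grid is a common choice for C because its useful values often span several orders of magnitude.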
Create or combine a set of modules¶
Tip
Usage examples are available in ml_workflows.py.
Input¶
Two classes corresponding to the voxel-based and the region-based approaches are implemented in input.py:
- CAPSRegionBasedInput: a list of values stored in a TSV file is used as features. This list corresponds to PET or T1 image intensities averaged over a set of regions obtained from a brain parcellation when running the t1-volume and/or pet-volume pipeline.
- CAPSVoxelBasedInput: all the voxels of the image are used as features.
Note
The atlases that can be used for the region-based approaches are listed here.
Algorithm¶
Three classes corresponding to the machine learning-based classification algorithms are implemented in algorithm.py:
- DualSVMAlgorithm: support vector machine (SVM) algorithm (input: all the data available or a kernel that can be pre-computed)
- LogisticReg: logistic regression algorithm (input: all the data available)
- RandomForest: random forest algorithm (input: all the data available)
Each algorithm implements a grid search approach to choose the best parameters for the classification by looking at the value of the balanced accuracy.
The area under the receiver operating characteristic (ROC) curve (AUC) is also reported.
The labels are automatically assigned based on the DIAGNOSES_TSV
file.
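Balanced accuracy is the mean of the per-class recalls, which makes it less sensitive to class imbalance than plain accuracy. The sketch below is a plain-Python illustration of the metric on a toy example, not the code used by Clinica; scikit-learn provides an equivalent function, balanced_accuracy_score, in sklearn.metrics.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Toy imbalanced problem: 2 AD and 8 CN participants, and a classifier
# that always predicts the majority class.
y_true = ["AD"] * 2 + ["CN"] * 8
always_cn = ["CN"] * 10

accuracy = sum(t == p for t, p in zip(y_true, always_cn)) / len(y_true)
print(accuracy)                              # 0.8
print(balanced_accuracy(y_true, always_cn))  # 0.5
```

The majority-class predictor looks good under plain accuracy but scores no better than chance under balanced accuracy, which is why the latter is a reasonable criterion for selecting hyperparameters on imbalanced groups.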
Validation¶
Three classes corresponding to the validation strategies are implemented in validation.py:
- KFoldCV: K-fold cross validation
- RepeatedKFoldCV: repeated K-fold cross validation
- RepeatedHoldOut: repeated hold-out validation
The input is the name of the classification algorithm used.
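The difference between the repeated strategies can be seen from how they generate train/test indices. The sketch below is a simplified, non-stratified illustration written with the standard library only; the function names are hypothetical and this is not the code in validation.py. For real use, scikit-learn offers stratified equivalents such as RepeatedStratifiedKFold and StratifiedShuffleSplit in sklearn.model_selection.

```python
import random


def repeated_k_fold(n_samples, n_splits, n_repeats, seed=0):
    """Yield (train, test) index lists: every repetition reshuffles the
    samples and uses each of the n_splits folds once as the test set."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[k::n_splits] for k in range(n_splits)]
        for k, test in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield sorted(train), sorted(test)


def repeated_hold_out(n_samples, test_size, n_iterations, seed=0):
    """Yield (train, test) index lists: every iteration draws a new random
    test set containing a fraction test_size of the samples."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_size)
    for _ in range(n_iterations):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield sorted(indices[n_test:]), sorted(indices[:n_test])


for train, test in repeated_hold_out(n_samples=10, test_size=0.2, n_iterations=3):
    print("train:", train, "test:", test)
```

In repeated K-fold, each sample is tested exactly once per repetition, whereas in repeated hold-out a sample may appear in the test set several times or not at all.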
Describing this pipeline in your paper¶
Example of paragraph:
These results have been obtained using the machine learning-based classification modules of Clinica [Routier et al., 2021; Samper et al., 2018]. Clinica provides a modular way to perform classification based on machine learning by combining different inputs (e.g. gray matter maps obtained from T1-weighted MR images, FDG PET images), algorithms (e.g. support vector machine, logistic regression, random forest) and validation strategies (e.g. K-fold cross validation, repeated K-fold cross validation, repeated hold-out validation). These modules rely on scikit-learn [Pedregosa et al., 2011].
Support¶
- You can use the Clinica Google Group to ask for help!
- Report an issue on GitHub.