.. _riskscore: ************************* Risk score analysis ************************* CWAS-Plus2 utilizes categorized results to estimate the optimal predictor for the phenotype. It trains a Lasso regression model using the number of variants within each category across samples. After training the model with a subset of samples, the remaining test set is employed to calculate the |R2|. The significance of the |R2| value is determined by calculating it from samples with a randomly shuffled phenotype. The number of regressions (-n_reg) can be set to obtain the average |R2| value from all regressions. .. |R2| replace:: R\ :sup:`2` - -i, --input_file: Path to the categorized zarr directory, resulted from categorization process. - -o_dir, --output_directory: Path to the directory where the output files will be saved. By default, outputs will be saved at ``$CWAS_WORKSPACE``. - -s, --sample_info: Path to the txt file containing the sample information for each sample. This file must have three columns (``SAMPLE``, ``FAMILY``, ``PHENOTYPE``) with the exact name. +----------+--------+-----------+ | SAMPLE | FAMILY | PHENOTYPE | +==========+========+===========+ | 11000.p1 | 11000 | case | +----------+--------+-----------+ | 11000.s1 | 11000 | ctrl | +----------+--------+-----------+ | 11002.p1 | 11002 | case | +----------+--------+-----------+ | 11002.s1 | 11002 | ctrl | +----------+--------+-----------+ - -a, --adjustment_factor: Path to the txt file containing the adjust factors for each sample. This is optional. With this option, CWAS-Plus2 multiplies the number of variants (or carriers, in -u option) with the adjust factor per sample. +----------+--------------+ | SAMPLE | AdjustFactor | +==========+==============+ | 11000.p1 | 0.932 | +----------+--------------+ | 11000.s1 | 1.082 | +----------+--------------+ | 11002.p1 | 0.895 | +----------+--------------+ | 11002.s1 | 1.113 | +----------+--------------+ - -c_info, --category_info: Path to a text file category information (`*.category_info.txt`). - -d, --domain_list: Domain list to filter categories based on GENCODE domain. If 'run_all' is given, all available options will be tested. Available options are `run_all,all,coding,noncoding,ptv,missense,damaging_missense,promoter,noncoding_wo_promoter,intron,intergenic,utr,lincRNA`. By default, all. - -t, --tag: Tag used for the name of the output files. By default, None. - --do_each_one: Use each annotation from functional annotation to calculate risk score. By default, False. - --leave_one_out: Calculate risk score while excluding one annotation from functional annotation. This option is not used when the `--do_each_one` flag is enabled. By default, False. - -u, --use_n_carrier: Enables the sample-level analysis (the use of the number of samples with variants in each category for burden test instead of the number of variants). With this option, CWAS-Plus2 counts the number of samples that carry at least one variant of each category. - -thr, --threshold: The number of variants in controls (or the number of control carriers) used to select rare categories. For example, if set to 3, categories with less than 3 variants in controls will be used for training. By default, 3. - -tf, --train_set_fraction: The fraction of the training set. For example, if set to 0.7, 70% of the samples will be used as training set and 30% will be used as test set. By default, 0.7. - -n_reg, --num_regression: Number of regression trials to calculate a mean of R squares. By default, 10. - -f, --fold: Number of folds for cross-validation. - -n, --n_permute: The number of permutations used to calculate the p-value. By default, 1,000. - --predict_only: If set, only predict the risk score and skip the permutation process. By default, False. - -S, --seed: Seed of random state. By default, 42. - -p, --num_proc: Number of worker processes that will be used for the permutation process. By default, 1. - -fs_group, --feature_selection_group: Specify the list of groups for feature selection. Available options are ``gene_set``, ``functional_score``, ``functional_annotation``. By default, "gene_set,functional_score,functional_annotation". - -pt, --plotsize: Plot size of main histogram plot (width,height in inches, comma-separated). By default, "7,7". - -fs, --fontsize: Font size of main histogram plot. By default, 10. .. code-block:: solidity cwas risk_score -i INPUT.categorization_result.zarr \ -o_dir OUTPUT_DIR \ -s SAMPLE_LIST.txt \ -a ADJUST_FACTOR.txt \ -c_info CATEGORY_SET.txt \ -thr 3 \ -tf 0.7 \ -n_reg 10 \ -f 5 \ -n 1000 \ -p 8 Users can perform two types of risk score analyses in a loop to identify annotations with the best predictive performance and composition within the annotation set. 1. Risk score analysis for categories containing a single annotation within a specific domain .. code-block:: solidity cwas risk_score -i INPUT.categorization_result.zarr \ -o_dir OUTPUT_DIR \ -s SAMPLE_LIST.txt \ -a ADJUST_FACTOR.txt \ -c_info CATEGORY_SET.txt \ -thr 3 \ -tf 0.7 \ -n_reg 10 \ -f 5 \ -n 1000 \ -p 8 \ --do_each_one 2. Risk score analysis for categories with one annotation excluded from the total annotations .. code-block:: solidity cwas risk_score -i INPUT.categorization_result.zarr \ -o_dir OUTPUT_DIR \ -s SAMPLE_LIST.txt \ -a ADJUST_FACTOR.txt \ -c_info CATEGORY_SET.txt \ -thr 3 \ -tf 0.7 \ -n_reg 10 \ -f 5 \ -n 1000 \ -p 8 \ --leave_one_out