{ "cells": [ { "cell_type": "markdown", "id": "116bcc5c", "metadata": {}, "source": [ "# Tutorial\n", "\n", "`GaugeFixer` provides a simple interface for fixing the gauge of parameters of linear models that describe sequence-function relationships. These models are implemented as subclasses of `HierarchicalModel`.\n", "\n", "### Defining a linear sequence-function model\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "23db982a", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "from gaugefixer import AllOrderModel, PairwiseModel\n", "from gaugefixer.utils import get_all_seqs" ] }, { "cell_type": "markdown", "id": "ab3054d0", "metadata": {}, "source": [ "To create the most general `HierarchicalModel` object you must specify the configuration of the sequence space describing the sequence→function relationship:\n", "\n", "- `L`: sequence length (integer).\n", "- `alphabet`: sequence alphabet. Options:\n", " - a named biological alphabet: `'dna'`, `'rna'`, or `'protein'`;\n", " - an explicit list of alleles: `['A','C','G','T']`;\n", " - position-specific alphabets: e.g. `[['A','C'], ['A','C','G']]`.\n", "- `generating_orbits`: set of generating orbits (tuples of position indices). Orbits determine which subsequences share parameters. More specific model classes (e.g., `AllOrderModel`, `KorderModel`, `KadjacentModel`, `PairwiseModel`) provide pre-defined generating orbits for common choices (all orders, up to order k, adjacent-site interactions, pairwise, etc.).\n", "\n", "Example: define a 3-nucleotide sequence space with interactions across all possible position subsets (use the AllOrder model):" ] }, { "cell_type": "code", "execution_count": 2, "id": "1ef8bc62", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AllOrderModel(L=3,alphabet_name=dna,n_features=125,n_orbits=8)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = AllOrderModel(L=3, alphabet_name='dna')\n", "model" ] }, { "cell_type": "markdown", "id": "f036b5a7", "metadata": {}, "source": [ "### See some model properties\n", "\n", "Printing the model already gives a short summary of the model configuration, but we can access other model properties. For instance, we can see the set of generating orbits defining a given hierarchical model\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "7badb1fc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0, 1, 2)]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.get_generating_orbits()" ] }, { "cell_type": "markdown", "id": "19c8b4e0", "metadata": {}, "source": [ "We can also access the complete set of orbits and see what is the pattern of interactions across positions in this linear model. In this case, the orbits show that there are interactions of all orders across all possible sets of positions." ] }, { "cell_type": "code", "execution_count": 4, "id": "b98499c9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.get_orbits()" ] }, { "cell_type": "markdown", "id": "99dc96c0", "metadata": {}, "source": [ "Then, we can extract the sequence features that are associated to each of the orbits. Features are defined by a tuple containing a tuple of the subset of positions and the subsequence at that position, e.g., `((0, 2), 'AG')`. As there are many features, lets just show a few of them." ] }, { "cell_type": "code", "execution_count": 5, "id": "8475a761", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[((), ''),\n", " ((0,), 'A'),\n", " ((0,), 'C'),\n", " ((0,), 'G'),\n", " ((0,), 'T'),\n", " ((1,), 'A'),\n", " ((1,), 'C'),\n", " ((1,), 'G'),\n", " ((1,), 'T'),\n", " ((2,), 'A'),\n", " ((2,), 'C'),\n", " ((2,), 'G'),\n", " ((2,), 'T'),\n", " ((0, 1), 'AA'),\n", " ((0, 1), 'AC')]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.get_features()[:15]" ] }, { "cell_type": "markdown", "id": "1ed99077", "metadata": {}, "source": [ "### Defining model parameters\n", "\n", "The next step in model definition is to define the values of the model parameters $\\theta$. For simplicity, lets just create some random parameter values which have to be provided as a `pd.Series` with the parameter values indexed by the model features. " ] }, { "cell_type": "code", "execution_count": 6, "id": "2f20355c", "metadata": {}, "outputs": [], "source": [ "model.set_random_params()" ] }, { "cell_type": "markdown", "id": "ac7834e8", "metadata": {}, "source": [ "Once the model parameters are defined, we can evaluate it at any sequence of interest by calling the model directly" ] }, { "cell_type": "code", "execution_count": 7, "id": "656b1f20", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-2.00383214])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model(['ACG'])" ] }, { "cell_type": "markdown", "id": "3990827b", "metadata": {}, "source": [ "In the case of an `AllOrderModel`, we can also define the model parameters from the function values across all possible sequences. Lets initialize the model with some random function values `f` and evaluate the function at some sequences:" ] }, { "cell_type": "code", "execution_count": 8, "id": "c9fc5665", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([0.32196665]), np.float64(0.32196664793539864))" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f_values = np.random.normal(size=model.size)\n", "seqs = get_all_seqs(model.alphabet_list)\n", "f = pd.Series(f_values, index=seqs)\n", "model.set_landscape(f)\n", "model(['ACG']), f.loc['ACG']" ] }, { "cell_type": "markdown", "id": "31f91b6f", "metadata": {}, "source": [ "\n", "### Fixing the Gauge\n", "\n", "Now that the model is completely specified, we may want to interpret the parameter values. However, there are subspaces of parameter values that encode the same sequence-to-function model. Fixing the gauge means choosing one among the many set of parameter values that encode the same model by specifying certain properties of the parameter values e.g. in the `hierarchical` gauge the parameters associated to low order subsequences explain as much variance as possible.\n", "\n", "We can do this operation easily by using the method `fix_gauge`. The gauge defined either by their specific names, e.g., `wild-type`, `zero-sum`, or `hierarchical`, or by directly defining the $\\lambda$ value as `lda` and the site- and allele-specific probabilities `pi_lc`, which define a specific gauge in the $\\lambda-\\pi$ family of linear gauges.\n", "\n", "For instance, lets express our parameters in the commonly used `zero-sum` gauge.\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "ab8382e0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((), ) -0.039335\n", "((0,), A) -0.106147\n", "((0,), C) -0.469653\n", "((0,), G) 0.174317\n", "((0,), T) 0.401483\n", " ... \n", "((0, 1, 2), TGT) 0.524674\n", "((0, 1, 2), TTA) -0.636448\n", "((0, 1, 2), TTC) 0.091942\n", "((0, 1, 2), TTG) 1.196552\n", "((0, 1, 2), TTT) -0.652047\n", "Length: 125, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "theta_fixed = model.get_fixed_params(gauge='zero-sum')\n", "theta_fixed" ] }, { "cell_type": "markdown", "id": "faa105e3", "metadata": {}, "source": [ "We can verify that, while the new parameter values are different, the two models encode the same function" ] }, { "cell_type": "code", "execution_count": 10, "id": "3b74993e", "metadata": {}, "outputs": [], "source": [ "f1 = model(seqs)\n", "theta = model.theta.copy()\n", "model.set_params(theta_fixed)\n", "f2 = model(seqs)\n", "assert not np.allclose(theta, theta_fixed)\n", "assert np.allclose(f1, f2)" ] }, { "cell_type": "markdown", "id": "ab4bf5e7", "metadata": {}, "source": [ "Now the parameters can be more easily interpreted. For example, the parameter for the empty subsequence represents the average across all posible sequences " ] }, { "cell_type": "code", "execution_count": 11, "id": "8b5e09cf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((), ) -0.039335\n", "dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "theta_fixed.loc[[((), '')]]" ] }, { "cell_type": "markdown", "id": "7d1ff7df", "metadata": {}, "source": [ "whereas the parameter associated to nucleotide A at position 3 represents the average effect of placing an A at that position across all possible sequence. " ] }, { "cell_type": "code", "execution_count": 12, "id": "f1410d8b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((2,), A) 0.055764\n", "dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "theta_fixed.loc[[((2,), 'A')]]" ] }, { "cell_type": "markdown", "id": "889ff1d6", "metadata": {}, "source": [ "But maybe we are interested in the effect of placing an A at position 2 only in sequences with A or G at position one. We can do this easily using the `hierarchical` Gauge with a under a site-independent probability distribution `pi_lc` " ] }, { "cell_type": "code", "execution_count": 13, "id": "a1765b77", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((2,), A) 0.026694\n", "dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pi_lc = [np.array([0.5, 0, 0.5, 0]), np.array([0.25, 0.25, 0.25, 0.25]), np.array([0.25, 0.25, 0.25, 0.25])]\n", "theta_fixed = model.get_fixed_params(gauge='hierarchical', pi_lc=pi_lc)\n", "theta_fixed.loc[[((2,), \"A\")]]" ] }, { "cell_type": "markdown", "id": "c1391e8a", "metadata": {}, "source": [ "### Hierarchical models\n", "\n", "In the previous case, we have fixed the Gauge in a model in which we can enumerate all possible sequences in the space and have parameter values of all possible orders, which is possible only for sequences of limited length. When having longer sequences, we often fit models that only contain additive, pairwise or, more generally, interactions up to order k, or restrict interactions within a window of adjacent sites in the sequence. These models are specific cases of `HierarchicalModel` with predefined sets of orbits and are implemented as specific classes in `GaugeFixer`, such as `KorderModel` or `KadjacentModel`, which are defined by the highest order interaction `k`.\n", "\n", "Let's try with a pairwise model for a 20 aminoacid sequence" ] }, { "cell_type": "code", "execution_count": 14, "id": "f0d3b98b", "metadata": {}, "outputs": [], "source": [ "model = PairwiseModel(L=20, alphabet_name='protein')" ] }, { "cell_type": "markdown", "id": "f0383861", "metadata": {}, "source": [ "As before, we can generate random parameter values" ] }, { "cell_type": "code", "execution_count": 15, "id": "f5fde0bc", "metadata": {}, "outputs": [], "source": [ "model.set_random_params()" ] }, { "cell_type": "markdown", "id": "c548e0df", "metadata": {}, "source": [ "and express this model in the `zero-sum` Gauge" ] }, { "cell_type": "code", "execution_count": 16, "id": "e7bb86c3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((), ) -0.940365\n", "((0,), A) -1.810552\n", "((0,), C) 0.478341\n", "((0,), D) 1.185330\n", "((0,), E) 3.573782\n", " ... \n", "((18, 19), YS) -0.610906\n", "((18, 19), YT) 0.617389\n", "((18, 19), YV) -0.678849\n", "((18, 19), YW) 0.504154\n", "((18, 19), YY) 0.920631\n", "Length: 76401, dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "theta_fixed = model.get_fixed_params(gauge='zero-sum')\n", "theta_fixed" ] }, { "cell_type": "markdown", "id": "9482a87b", "metadata": {}, "source": [ "`GaugeFixer`'s new algorithm enables efficient Gauge fixing even in models with hundreds of thousands to few millions parameters. Let's see how long it takes to fix the Gauge of a pairwise model on 100-aminoacid sequences with nearly 2 million parameters." ] }, { "cell_type": "code", "execution_count": 17, "id": "8ee57219", "metadata": {}, "outputs": [], "source": [ "model = PairwiseModel(L=100, alphabet_name=\"protein\")\n", "model.set_random_params()" ] }, { "cell_type": "code", "execution_count": 18, "id": "512d91e6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.11 s, sys: 23.3 ms, total: 2.13 s\n", "Wall time: 2.13 s\n" ] } ], "source": [ "%time theta_fixed = model.get_fixed_params(gauge='zero-sum')" ] } ], "metadata": { "kernelspec": { "display_name": "gaugefixed", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.18" } }, "nbformat": 4, "nbformat_minor": 5 }