stats

Statistical methods used for GWAS processing

The functions below are used during the harmonisation of summary statistics: effect size, p-value, confidence interval and standard error calculations.

NOTE: Because p-values can be extremely small (underflowing a double), the functions work with the p-value in one of two formats (see the sketch below):

  • as a negative log10 p-value (neglogpval)
  • as a mantissa and exponent (2 separate columns)
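A minimal sketch of how the two representations relate, in plain Python (the Spark helpers below implement the same identities column-wise):

from math import ceil, log10

mantissa, exponent = 4.2, -45                 # p = 4.2e-45

# mantissa/exponent -> negative log10 p-value:
neglog_pval = -(log10(mantissa) + exponent)   # ~44.377

# negative log10 p-value -> mantissa/exponent (as in pvalue_from_neglogpval):
exp_back = -ceil(neglog_pval)                 # -45
mant_back = 10 ** (-exp_back - neglog_pval)   # ~4.2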

gentropy.common.stats

Statistic calculations.

chi2_from_pvalue(p_value_mantissa: Column, p_value_exponent: Column) -> Column

Calculate chi2 from p-value.

This function calculates the chi2 value from the p-value mantissa and exponent. In case the p-value is very small (exponent < -300), it uses an approximation based on a linear regression model: chi2 = 4.596 * neglog_pval - 5.367, where neglog_pval is the negative log10 of the p-value.

Parameters:

  • p_value_mantissa (Column): Mantissa of the p-value (float). Required.
  • p_value_exponent (Column): Exponent of the p-value (integer). Required.

Returns:

  • Column: Chi2 value (float)

Examples:

>>> data = [(5.0, -8), (9.0, -300), (9.0, -301)]
>>> schema = "pValueMantissa FLOAT, pValueExponent INT"
>>> df = spark.createDataFrame(data, schema)
>>> df.show()
+--------------+--------------+
|pValueMantissa|pValueExponent|
+--------------+--------------+
|           5.0|            -8|
|           9.0|          -300|
|           9.0|          -301|
+--------------+--------------+
>>> mantissa = f.col("pValueMantissa")
>>> exponent = f.col("pValueExponent")
>>> chi2 = f.round(chi2_from_pvalue(mantissa, exponent), 2).alias("chi2")
>>> df2 = df.select(mantissa, exponent, chi2)
>>> df2.show()
+--------------+--------------+-------+
|pValueMantissa|pValueExponent|   chi2|
+--------------+--------------+-------+
|           5.0|            -8|  29.72|
|           9.0|          -300|1369.48|
|           9.0|          -301|1373.64|
+--------------+--------------+-------+
Source code in src/gentropy/common/stats.py
def chi2_from_pvalue(p_value_mantissa: Column, p_value_exponent: Column) -> Column:
    """Calculate chi2 from p-value.

    This function calculates the chi2 value from the p-value mantissa and exponent.
    In case the p-value is very small (exponent < -300), it uses an approximation based on a linear regression model.
    The approximation is based on the formula: 4.596 * neglog_pval - 5.367, where neglog_pval is the negative log10 of the p-value.


    Args:
        p_value_mantissa (Column): Mantissa of the p-value (float)
        p_value_exponent (Column): Exponent of the p-value (integer)

    Returns:
        Column: Chi2 value (float)

    Examples:
        >>> data = [(5.0, -8), (9.0, -300), (9.0, -301)]
        >>> schema = "pValueMantissa FLOAT, pValueExponent INT"
        >>> df = spark.createDataFrame(data, schema)
        >>> df.show()
        +--------------+--------------+
        |pValueMantissa|pValueExponent|
        +--------------+--------------+
        |           5.0|            -8|
        |           9.0|          -300|
        |           9.0|          -301|
        +--------------+--------------+
        <BLANKLINE>

        >>> mantissa = f.col("pValueMantissa")
        >>> exponent = f.col("pValueExponent")
        >>> chi2 = f.round(chi2_from_pvalue(mantissa, exponent), 2).alias("chi2")
        >>> df2 = df.select(mantissa, exponent, chi2)
        >>> df2.show()
        +--------------+--------------+-------+
        |pValueMantissa|pValueExponent|   chi2|
        +--------------+--------------+-------+
        |           5.0|            -8|  29.72|
        |           9.0|          -300|1369.48|
        |           9.0|          -301|1373.64|
        +--------------+--------------+-------+
        <BLANKLINE>
    """
    PVAL_EXP_THRESHOLD = f.lit(-300)
    APPROX_INTERCEPT = f.lit(-5.367)
    APPROX_COEF = f.lit(4.596)
    neglog_pval = neglogpval_from_pvalue(p_value_mantissa, p_value_exponent)
    p_value = p_value_mantissa * f.pow(10, p_value_exponent)
    neglog_approx = (neglog_pval * APPROX_COEF + APPROX_INTERCEPT).cast(t.DoubleType())

    return (
        f.when(p_value_exponent < PVAL_EXP_THRESHOLD, neglog_approx)
        .otherwise(chi2_inverse_survival_function(p_value))
        .alias("chi2")
    )
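
For orientation: the exact branch corresponds to the chi2 inverse survival function with one degree of freedom, and the approximation branch is plain arithmetic. A sketch reproducing the example values with scipy (assuming chi2_inverse_survival_function matches scipy.stats.chi2.isf, which the outputs above are consistent with):

from math import log10
from scipy.stats import chi2

chi2.isf(5e-8, df=1)              # ~29.72: exact branch (exponent >= -300)

neglog = -(log10(9.0) + (-301))   # ~300.05 for p = 9e-301
4.596 * neglog - 5.367            # ~1373.64: approximation branch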

ci(pvalue_mantissa: Column, pvalue_exponent: Column, beta: Column, standard_error: Column) -> tuple[Column, Column]

Calculate the confidence interval for the effect based on the p-value and the effect size.

If the standard error is already available, it is not re-calculated from the p-value.

Parameters:

  • pvalue_mantissa (Column): p-value mantissa (float). Required.
  • pvalue_exponent (Column): p-value exponent (integer). Required.
  • beta (Column): effect size in beta (float). Required.
  • standard_error (Column): standard error. Required.

Returns:

  • tuple[Column, Column]: betaConfidenceIntervalLower (float), betaConfidenceIntervalUpper (float)

Examples:

>>> df = spark.createDataFrame([
...     (2.5, -10, 0.5, 0.2),
...     (3.0, -5, 1.0, None),
...     (1.5, -8, -0.2, 0.1)
...     ], ["pvalue_mantissa", "pvalue_exponent", "beta", "standard_error"]
... )
>>> df.select("*", *ci(f.col("pvalue_mantissa"), f.col("pvalue_exponent"), f.col("beta"), f.col("standard_error"))).show()
+---------------+---------------+----+--------------+---------------------------+---------------------------+
|pvalue_mantissa|pvalue_exponent|beta|standard_error|betaConfidenceIntervalLower|betaConfidenceIntervalUpper|
+---------------+---------------+----+--------------+---------------------------+---------------------------+
|            2.5|            -10| 0.5|           0.2|        0.10799999999999998|                      0.892|
|            3.0|             -5| 1.0|          NULL|         0.5303664052547075|         1.4696335947452925|
|            1.5|             -8|-0.2|           0.1|                     -0.396|       -0.00400000000000...|
+---------------+---------------+----+--------------+---------------------------+---------------------------+
Source code in src/gentropy/common/stats.py
def ci(
    pvalue_mantissa: Column,
    pvalue_exponent: Column,
    beta: Column,
    standard_error: Column,
) -> tuple[Column, Column]:
    """Calculate the confidence interval for the effect based on the p-value and the effect size.

    If the standard error is already available, it is not re-calculated from the p-value.

    Args:
        pvalue_mantissa (Column): p-value mantissa (float)
        pvalue_exponent (Column): p-value exponent (integer)
        beta (Column): effect size in beta (float)
        standard_error (Column): standard error.

    Returns:
        tuple[Column, Column]: betaConfidenceIntervalLower (float), betaConfidenceIntervalUpper (float)

    Examples:
        >>> df = spark.createDataFrame([
        ...     (2.5, -10, 0.5, 0.2),
        ...     (3.0, -5, 1.0, None),
        ...     (1.5, -8, -0.2, 0.1)
        ...     ], ["pvalue_mantissa", "pvalue_exponent", "beta", "standard_error"]
        ... )
        >>> df.select("*", *ci(f.col("pvalue_mantissa"), f.col("pvalue_exponent"), f.col("beta"), f.col("standard_error"))).show()
        +---------------+---------------+----+--------------+---------------------------+---------------------------+
        |pvalue_mantissa|pvalue_exponent|beta|standard_error|betaConfidenceIntervalLower|betaConfidenceIntervalUpper|
        +---------------+---------------+----+--------------+---------------------------+---------------------------+
        |            2.5|            -10| 0.5|           0.2|        0.10799999999999998|                      0.892|
        |            3.0|             -5| 1.0|          NULL|         0.5303664052547075|         1.4696335947452925|
        |            1.5|             -8|-0.2|           0.1|                     -0.396|       -0.00400000000000...|
        +---------------+---------------+----+--------------+---------------------------+---------------------------+
        <BLANKLINE>
    """
    # Calculate p-value from mantissa and exponent:
    pvalue = pvalue_mantissa * f.pow(10, pvalue_exponent)

    # Fix p-value underflow:
    pvalue = f.when(pvalue == 0, sys.float_info.min).otherwise(pvalue)

    # Compute missing standard error:
    standard_error = f.when(
        standard_error.isNull(), f.abs(beta) / f.abs(zscore_from_pvalue(pvalue, beta))
    ).otherwise(standard_error)

    # Calculate upper and lower confidence interval:
    z_score_095 = 1.96
    ci_lower = (beta - z_score_095 * standard_error).alias(
        "betaConfidenceIntervalLower"
    )
    ci_upper = (beta + z_score_095 * standard_error).alias(
        "betaConfidenceIntervalUpper"
    )

    return (ci_lower, ci_upper)
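
A quick arithmetic check of the first example row: with beta = 0.5 and a standard error of 0.2, the bounds are 0.5 ± 1.96 × 0.2, i.e. (0.108, 0.892), matching the table above.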

get_logsum(arr: NDArray[np.float64]) -> float

Calculates the logarithm of the sum of exponentials of a vector. The maximum is subtracted first so that the intermediate sum does not overflow to Inf.

This function emulates scipy's logsumexp.

Parameters:

  • arr (NDArray[float64]): input array. Required.

Returns:

  • float: logsumexp of the input array

Examples:

>>> l = [0.2, 0.1, 0.05, 0]
>>> round(get_logsum(l), 6)
1.476557
Source code in src/gentropy/common/stats.py
def get_logsum(arr: NDArray[np.float64]) -> float:
    """Calculates logarithm of the sum of exponents of a vector. The max is extracted to ensure that the sum is not Inf.

    This function emulates scipy's logsumexp expression.

    Args:
        arr (NDArray[np.float64]): input array

    Returns:
        float: logsumexp of the input array

    Examples:
        >>> l = [0.2, 0.1, 0.05, 0]
        >>> round(get_logsum(l), 6)
        1.476557
    """
    MAX = np.max(arr)
    result = MAX + np.log(np.sum(np.exp(arr - MAX)))
    return float(result)
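
The max-extraction step is what makes this numerically safe: a naive log(sum(exp(x))) overflows for large inputs. A small sketch comparing against scipy.special.logsumexp:

import numpy as np
from scipy.special import logsumexp

arr = np.array([1000.0, 1000.0])
np.log(np.sum(np.exp(arr)))   # inf: exp(1000) overflows float64
get_logsum(arr)               # ~1000.693 (= 1000 + log(2)), finite
logsumexp(arr)                # same value from scipy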

neglogpval_from_pvalue(p_value_mantissa: Column, p_value_exponent: Column) -> Column

Compute the negative log10 p-value.

Parameters:

  • p_value_mantissa (Column): P-value mantissa. Required.
  • p_value_exponent (Column): P-value exponent. Required.

Returns:

  • Column: Negative log10 p-value

Examples:

>>> d = [(1, 1), (5, -2), (1, -1000)]
>>> df = spark.createDataFrame(d).toDF("p_value_mantissa", "p_value_exponent")
>>> df.withColumn("neg_log_p", neglogpval_from_pvalue(f.col("p_value_mantissa"), f.col("p_value_exponent"))).show()
+----------------+----------------+------------------+
|p_value_mantissa|p_value_exponent|         neg_log_p|
+----------------+----------------+------------------+
|               1|               1|              -1.0|
|               5|              -2|1.3010299956639813|
|               1|           -1000|            1000.0|
+----------------+----------------+------------------+
Source code in src/gentropy/common/stats.py
def neglogpval_from_pvalue(
    p_value_mantissa: Column, p_value_exponent: Column
) -> Column:
    """Compute the negative log p-value.

    Args:
        p_value_mantissa (Column): P-value mantissa
        p_value_exponent (Column): P-value exponent

    Returns:
        Column: Negative log p-value

    Examples:
        >>> d = [(1, 1), (5, -2), (1, -1000)]
        >>> df = spark.createDataFrame(d).toDF("p_value_mantissa", "p_value_exponent")
        >>> df.withColumn("neg_log_p", neglogpval_from_pvalue(f.col("p_value_mantissa"), f.col("p_value_exponent"))).show()
        +----------------+----------------+------------------+
        |p_value_mantissa|p_value_exponent|         neg_log_p|
        +----------------+----------------+------------------+
        |               1|               1|              -1.0|
        |               5|              -2|1.3010299956639813|
        |               1|           -1000|            1000.0|
        +----------------+----------------+------------------+
        <BLANKLINE>
    """
    return -1 * (f.log10(p_value_mantissa) + p_value_exponent)

neglogpval_from_z2(z2: Column) -> Column

Calculate the negative log10 p-value from a squared Z-score.

The squared Z-score follows a chi2 distribution with 1 degree of freedom.

In case of very large Z-score (very small corresponding p-value), the function uses a linear approximation.

Parameters:

  • z2 (Column): Z-score squared. Required.

Returns:

  • Column: negative log10 of the p-value

Examples:

>>> data = [(1.0,), (2000.0,)]
>>> schema = "z2 FLOAT"
>>> df = spark.createDataFrame(data, schema)
>>> df.show()
+------+
|    z2|
+------+
|   1.0|
|2000.0|
+------+
>>> neglogpval = f.round(neglogpval_from_z2(f.col("z2")), 2).alias("neglogpval")
>>> df2 = df.select(f.col("z2"), neglogpval)
>>> df2.show()
+------+----------+
|    z2|neglogpval|
+------+----------+
|   1.0|       0.5|
|2000.0|    436.02|
+------+----------+
Source code in src/gentropy/common/stats.py
def neglogpval_from_z2(z2: Column) -> Column:
    """Calculate negative log10 of p-value from squared Z-score following chi2 distribution.

    **The Z-score^2 is equal to the chi2 with 1 degree of freedom.**

    In case of very large Z-score (very small corresponding p-value), the function uses a linear approximation.

    Args:
        z2 (Column): Z-score squared.

    Returns:
        Column:  negative log of p-value.

    Examples:
        >>> data = [(1.0,), (2000.0,)]
        >>> schema = "z2 FLOAT"
        >>> df = spark.createDataFrame(data, schema)
        >>> df.show()
        +------+
        |    z2|
        +------+
        |   1.0|
        |2000.0|
        +------+
        <BLANKLINE>

        >>> neglogpval = f.round(neglogpval_from_z2(f.col("z2")), 2).alias("neglogpval")
        >>> df2 = df.select(f.col("z2"), neglogpval)
        >>> df2.show()
        +------+----------+
        |    z2|neglogpval|
        +------+----------+
        |   1.0|       0.5|
        |2000.0|    436.02|
        +------+----------+
        <BLANKLINE>
    """
    MAX_EXACT_Z2 = f.lit(1400)
    APPROX_INTERCEPT = f.lit(1.4190)
    APPROX_COEFF = f.lit(0.2173)
    approximate_neglogpval_from_z2 = APPROX_INTERCEPT + APPROX_COEFF * z2
    computed_neglogpval_from_z2 = -1 * f.log10(chi2_survival_function(z2))
    return f.when(z2 <= MAX_EXACT_Z2, computed_neglogpval_from_z2).otherwise(
        approximate_neglogpval_from_z2
    )
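
Both branches can be reproduced with plain arithmetic and scipy (assuming chi2_survival_function corresponds to scipy.stats.chi2.sf with one degree of freedom, consistent with the output above):

from math import log10
from scipy.stats import chi2

-log10(chi2.sf(1.0, df=1))   # ~0.50: exact branch (z2 <= 1400)
1.4190 + 0.2173 * 2000.0     # 436.019: approximation branch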

normalise_gwas_statistics(beta: Column, odds_ratio: Column, standard_error: Column, ci_upper: Column, ci_lower: Column, mantissa: Column, exponent: Column) -> GWASEffect

Normalise beta and standard error from given values.

This function attempts to harmonise Effect and Standard Error given various inputs.

Note

Effect (beta) harmonisation:

  • If beta is not null, it is kept as is.
  • If beta is null but the odds ratio is not null, the odds ratio is converted to beta.

Note

Effect standard error (std(beta)) harmonisation. Calculation from the p-value and beta is preferred where available, as the confidence interval is usually rounded and may lose precision:

  • If the standard error is not null, it is kept as is.
  • If the standard error is null but beta, p-value mantissa and p-value exponent are not null, the p-value components and beta are converted to a standard error.
  • If the standard error is null but ci-upper and ci-lower are not null and derive from an odds ratio, they are converted to a standard error.

Parameters:

  • beta (Column): Effect in beta. Required.
  • odds_ratio (Column): Effect in odds ratio. Required.
  • standard_error (Column): Standard error of the effect. Required.
  • ci_upper (Column): Upper bound of the confidence interval. Required.
  • ci_lower (Column): Lower bound of the confidence interval. Required.
  • mantissa (Column): Mantissa of the p-value. Required.
  • exponent (Column): Exponent of the p-value. Required.

Returns:

  • GWASEffect: named tuple with standardError and beta columns.

Examples:

>>> x1 = (0.1, 1.1, 0.1, None, None, 9.0, -100) # keep beta, keep std error
>>> x2 = (None, 1.1, 0.1, None, None, 9.0, -100) # convert odds ratio to beta, keep std error
>>> x3 = (None, 1.1, None, 1.30, 0.90, None, None) # convert odds ratio to beta, convert ci to standard error
>>> x4 = (0.1, 1.1, None, 1.30, 0.90, None, None) # keep beta, convert ci to standard error
>>> x5 = (None, 1.1, None, 1.30, 0.90, 9.0, -100) # convert odds ratio to beta, convert p-value and beta to standard error
>>> x6 = (0.1, None, None, None, None, 9.0, -100) # keep beta, convert p-value and beta to standard error
>>> x7 = (None, None, None, 1.3, 0.9, 9.0, -100) # keep beta NULL, without beta we do not want to compute the standard error
>>> data = [x1, x2, x3, x4, x5, x6, x7]
>>> schema = "beta FLOAT, oddsRatio FLOAT, standardError FLOAT, ci_upper FLOAT, ci_lower FLOAT, mantissa FLOAT, exp INT"
>>> df = spark.createDataFrame(data, schema)
>>> df.show()
+----+---------+-------------+--------+--------+--------+----+
|beta|oddsRatio|standardError|ci_upper|ci_lower|mantissa| exp|
+----+---------+-------------+--------+--------+--------+----+
| 0.1|      1.1|          0.1|    NULL|    NULL|     9.0|-100|
|NULL|      1.1|          0.1|    NULL|    NULL|     9.0|-100|
|NULL|      1.1|         NULL|     1.3|     0.9|    NULL|NULL|
| 0.1|      1.1|         NULL|     1.3|     0.9|    NULL|NULL|
|NULL|      1.1|         NULL|     1.3|     0.9|     9.0|-100|
| 0.1|     NULL|         NULL|    NULL|    NULL|     9.0|-100|
|NULL|     NULL|         NULL|     1.3|     0.9|     9.0|-100|
+----+---------+-------------+--------+--------+--------+----+
>>> beta = f.col("beta")
>>> odds_ratio = f.col("oddsRatio")
>>> se = f.col("standardError")
>>> ci_upper = f.col("ci_upper")
>>> ci_lower = f.col("ci_lower")
>>> mantissa = f.col("mantissa")
>>> exponent = f.col("exp")
>>> cols = normalise_gwas_statistics(
...     beta, odds_ratio, se, ci_upper, ci_lower, mantissa, exponent
... )
>>> beta_computed = f.round(cols.beta, 2).alias("beta")
>>> standard_error_computed = f.round(cols.standard_error, 2).alias("standardError")
>>> df.select(beta_computed, standard_error_computed).show()
+----+-------------+
|beta|standardError|
+----+-------------+
| 0.1|          0.1|
| 0.1|          0.1|
| 0.1|         0.09|
| 0.1|         0.09|
| 0.1|          0.0|
| 0.1|          0.0|
|NULL|         NULL|
+----+-------------+
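
Note that the 0.0 standard errors in the fifth and sixth rows are a rounding artefact of this example: at p = 9e-100 the chi2 value is roughly 450, so the derived standard error |beta| / sqrt(chi2) is about 0.1 / 21 ≈ 0.005, which rounds to 0.0 at two decimal places.
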
Source code in src/gentropy/common/stats.py
def normalise_gwas_statistics(
    beta: Column,
    odds_ratio: Column,
    standard_error: Column,
    ci_upper: Column,
    ci_lower: Column,
    mantissa: Column,
    exponent: Column,
) -> GWASEffect:
    """Normalise beta and standard error from given values.

    This function attempts to harmonise Effect and Standard Error given various inputs.

    Note:
        Effect (Beta) harmonisation:
        - If beta is not null, it is kept as is.
        - If beta is null, but odds ratio is not null, odds ratio is converted to beta

    Note:
        Effect Standard Error (std(beta)) harmonisation
        **Prefer calculation from p-value and beta, if available, as the confidence interval is usually rounded and may lead to loss of precision**:
        - If standard error is not null, it is kept as is.
        - If standard error is null, but beta, pval-mantissa, pval-exponent are not null, convert pval components and beta to standard error
        - If standard error is null, but ci-upper and ci-lower are not null and they come from odds ratio, convert them to standard error.


    Args:
        beta (Column): Effect in beta.
        odds_ratio (Column): Effect in odds ratio.
        standard_error (Column): Standard error of the effect.
        ci_upper (Column): Upper bound of the confidence interval.
        ci_lower (Column): Lower bound of the confidence interval.
        mantissa (Column): Mantissa of the p-value.
        exponent (Column): Exponent of the p-value.

    Returns:
        GWASEffect: named tuple with standardError and beta columns.

    Examples:
        >>> x1 = (0.1, 1.1, 0.1, None, None, 9.0, -100) # keep beta, keep std error
        >>> x2 = (None, 1.1, 0.1, None, None, 9.0, -100) # convert odds ratio to beta, keep std error
        >>> x3 = (None, 1.1, None, 1.30, 0.90, None, None) # convert odds ratio to beta, convert ci to standard error
        >>> x4 = (0.1, 1.1, None, 1.30, 0.90, None, None) # keep beta, convert ci to standard error
        >>> x5 = (None, 1.1, None, 1.30, 0.90, 9.0, -100) # convert odds ratio to beta, convert p-value and beta to standard error
        >>> x6 = (0.1, None, None, None, None, 9.0, -100) # keep beta, convert p-value and beta to standard error
        >>> x7 = (None, None, None, 1.3, 0.9, 9.0, -100) # keep beta NULL, without beta we do not want to compute the standard error
        >>> data = [x1, x2, x3, x4, x5, x6, x7]

        >>> schema = "beta FLOAT, oddsRatio FLOAT, standardError FLOAT, ci_upper FLOAT, ci_lower FLOAT, mantissa FLOAT, exp INT"
        >>> df = spark.createDataFrame(data, schema)
        >>> df.show()
        +----+---------+-------------+--------+--------+--------+----+
        |beta|oddsRatio|standardError|ci_upper|ci_lower|mantissa| exp|
        +----+---------+-------------+--------+--------+--------+----+
        | 0.1|      1.1|          0.1|    NULL|    NULL|     9.0|-100|
        |NULL|      1.1|          0.1|    NULL|    NULL|     9.0|-100|
        |NULL|      1.1|         NULL|     1.3|     0.9|    NULL|NULL|
        | 0.1|      1.1|         NULL|     1.3|     0.9|    NULL|NULL|
        |NULL|      1.1|         NULL|     1.3|     0.9|     9.0|-100|
        | 0.1|     NULL|         NULL|    NULL|    NULL|     9.0|-100|
        |NULL|     NULL|         NULL|     1.3|     0.9|     9.0|-100|
        +----+---------+-------------+--------+--------+--------+----+
        <BLANKLINE>

        >>> beta = f.col("beta")
        >>> odds_ratio = f.col("oddsRatio")
        >>> se = f.col("standardError")
        >>> ci_upper = f.col("ci_upper")
        >>> ci_lower = f.col("ci_lower")
        >>> mantissa = f.col("mantissa")
        >>> exponent = f.col("exp")
        >>> cols = normalise_gwas_statistics(
        ...     beta, odds_ratio, se, ci_upper, ci_lower, mantissa, exponent
        ... )
        >>> beta_computed = f.round(cols.beta, 2).alias("beta")
        >>> standard_error_computed = f.round(cols.standard_error, 2).alias("standardError")
        >>> df.select(beta_computed, standard_error_computed).show()
        +----+-------------+
        |beta|standardError|
        +----+-------------+
        | 0.1|          0.1|
        | 0.1|          0.1|
        | 0.1|         0.09|
        | 0.1|         0.09|
        | 0.1|          0.0|
        | 0.1|          0.0|
        |NULL|         NULL|
        +----+-------------+
        <BLANKLINE>
    """
    beta = (
        f.when(beta.isNotNull(), beta)
        .when(odds_ratio.isNotNull(), f.log(odds_ratio))
        .otherwise(f.lit(None))
        .alias("beta")
    )
    chi2 = chi2_from_pvalue(mantissa, exponent)

    standard_error = (
        f.when(standard_error.isNotNull(), standard_error)
        .when(
            standard_error.isNull()
            & mantissa.isNotNull()
            & exponent.isNotNull()
            & beta.isNotNull(),
            stderr_from_chi2_and_effect_size(chi2, beta),
        )
        .when(
            standard_error.isNull()
            & ci_lower.isNotNull()
            & ci_upper.isNotNull()
            & odds_ratio.isNotNull(),
            stderr_from_ci(ci_upper, ci_lower),
        )
        .otherwise(f.lit(None))
        .alias("standardError")
    )

    return GWASEffect(standard_error=standard_error, beta=beta)

pvalue_from_neglogpval(p_value: Column) -> PValComponents

Compute the p-value mantissa and exponent from the negative base-10 logarithm of the p-value.

Parameters:

  • p_value (Column): Negative log10 p-value (float). Required.

Returns:

  • PValComponents: mantissa and exponent of the p-value

Examples:

>>> (
... spark.createDataFrame([(4.56, 'a'),(2109.23, 'b')], ['negLogPv', 'label'])
... .select('negLogPv',*pvalue_from_neglogpval(f.col('negLogPv')))
... .show()
... )
+--------+--------------+--------------+
|negLogPv|pValueMantissa|pValueExponent|
+--------+--------------+--------------+
|    4.56|     2.7542286|            -5|
| 2109.23|     5.8884363|         -2110|
+--------+--------------+--------------+
Source code in src/gentropy/common/stats.py
def pvalue_from_neglogpval(p_value: Column) -> PValComponents:
    """Computing p-value mantissa and exponent based on the negative 10 based logarithm of the p-value.

    Args:
        p_value (Column): Neg-log p-value (string)

    Returns:
        PValComponents: mantissa and exponent of the p-value

    Examples:
        >>> (
        ... spark.createDataFrame([(4.56, 'a'),(2109.23, 'b')], ['negLogPv', 'label'])
        ... .select('negLogPv',*pvalue_from_neglogpval(f.col('negLogPv')))
        ... .show()
        ... )
        +--------+--------------+--------------+
        |negLogPv|pValueMantissa|pValueExponent|
        +--------+--------------+--------------+
        |    4.56|     2.7542286|            -5|
        | 2109.23|     5.8884363|         -2110|
        +--------+--------------+--------------+
        <BLANKLINE>
    """
    exponent: Column = f.ceil(p_value)
    mantissa: Column = f.pow(f.lit(10), (exponent - p_value))

    return PValComponents(
        mantissa=mantissa.cast(t.FloatType()).alias("pValueMantissa"),
        exponent=(-1 * exponent).cast(t.IntegerType()).alias("pValueExponent"),
    )
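
A worked check of the first example row in plain Python:

from math import ceil

neglog = 4.56
exponent = ceil(neglog)               # 5
mantissa = 10 ** (exponent - neglog)  # 10**0.44 ~ 2.754
# reported as pValueMantissa = 2.754..., pValueExponent = -5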

split_pvalue(pvalue: float) -> tuple[float, int]

Convert a float into a base-10 mantissa and exponent.

Parameters:

  • pvalue (float): p-value. Required.

Returns:

  • tuple[float, int]: Tuple with mantissa and exponent

Raises:

  • ValueError: If the p-value is not between 0 and 1

Examples:

>>> split_pvalue(0.00001234)
(1.234, -5)
>>> split_pvalue(1)
(1.0, 0)
>>> split_pvalue(0.123)
(1.23, -1)
>>> split_pvalue(0.99)
(9.9, -1)
Source code in src/gentropy/common/stats.py
def split_pvalue(pvalue: float) -> tuple[float, int]:
    """Convert a float to 10 based exponent and mantissa.

    Args:
        pvalue (float): p-value

    Returns:
        tuple[float, int]: Tuple with mantissa and exponent

    Raises:
        ValueError: If p-value is not between 0 and 1

    Examples:
        >>> split_pvalue(0.00001234)
        (1.234, -5)

        >>> split_pvalue(1)
        (1.0, 0)

        >>> split_pvalue(0.123)
        (1.23, -1)

        >>> split_pvalue(0.99)
        (9.9, -1)
    """
    if pvalue < 0.0 or pvalue > 1.0:
        raise ValueError("P-value must be between 0 and 1")

    exponent = floor(log10(pvalue)) if pvalue != 0 else 0
    mantissa = round(pvalue / 10**exponent, 3)
    return (mantissa, exponent)

split_pvalue_column(pv: Column) -> PValComponents

Split a p-value string into two columns: mantissa (float) and exponent (integer).

Parameters:

  • pv (Column): P-value as string. Required.

Returns:

  • PValComponents: pValueMantissa (float), pValueExponent (integer)

Examples:

>>> d = [("0.01",),("4.2E-45",),("43.2E5",),("0",),("1",)]
>>> spark.createDataFrame(d, ['pval']).select('pval',*split_pvalue_column(f.col('pval'))).show()
+-------+--------------+--------------+
|   pval|pValueMantissa|pValueExponent|
+-------+--------------+--------------+
|   0.01|           1.0|            -2|
|4.2E-45|           4.2|           -45|
| 43.2E5|          43.2|             5|
|      0|         2.225|          -308|
|      1|           1.0|             0|
+-------+--------------+--------------+
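
The "0" input row illustrates the underflow guard: a literal "0" is replaced with sys.float_info.min (about 2.225e-308) before splitting, which is why it yields (2.225, -308).
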
Source code in src/gentropy/common/stats.py
def split_pvalue_column(pv: Column) -> PValComponents:
    """This function takes a p-value string and returns two columns mantissa (float), exponent (integer).

    Args:
        pv (Column): P-value as string

    Returns:
        PValComponents: pValueMantissa (float), pValueExponent (integer)

    Examples:
        >>> d = [("0.01",),("4.2E-45",),("43.2E5",),("0",),("1",)]
        >>> spark.createDataFrame(d, ['pval']).select('pval',*split_pvalue_column(f.col('pval'))).show()
        +-------+--------------+--------------+
        |   pval|pValueMantissa|pValueExponent|
        +-------+--------------+--------------+
        |   0.01|           1.0|            -2|
        |4.2E-45|           4.2|           -45|
        | 43.2E5|          43.2|             5|
        |      0|         2.225|          -308|
        |      1|           1.0|             0|
        +-------+--------------+--------------+
        <BLANKLINE>
    """
    # Making sure there's a number in the string:
    pv = f.when(
        pv == f.lit("0"), f.lit(sys.float_info.min).cast(t.StringType())
    ).otherwise(pv)

    # Get exponent:
    exponent = f.when(
        f.upper(pv).contains("E"),
        f.split(f.upper(pv), "E").getItem(1),
    ).otherwise(f.floor(f.log10(pv)))

    # Get mantissa:
    mantissa = f.when(
        f.upper(pv).contains("E"),
        f.split(f.upper(pv), "E").getItem(0),
    ).otherwise(pv / (10**exponent))

    # Round value:
    mantissa = f.round(mantissa, 3)

    return PValComponents(
        mantissa=mantissa.cast(t.FloatType()).alias("pValueMantissa"),
        exponent=exponent.cast(t.IntegerType()).alias("pValueExponent"),
    )

stderr_from_chi2_and_effect_size(chi2_col: Column, beta: Column) -> Column

Calculate standard error from chi2 and beta.

This function computes |beta| / sqrt(chi2), which follows from z = beta / standard_error and chi2 = z^2 with one degree of freedom.

Parameters:

  • chi2_col (Column): Chi2 value (float). Required.
  • beta (Column): Beta value (float). Required.

Returns:

  • Column: Standard error (float)

Examples:

>>> data = [(29.72, 3.0), (3.84, 1.0)]
>>> schema = "chi2 FLOAT, beta FLOAT"
>>> df = spark.createDataFrame(data, schema)
>>> df.show()
+-----+----+
| chi2|beta|
+-----+----+
|29.72| 3.0|
| 3.84| 1.0|
+-----+----+
>>> chi2_col = f.col("chi2")
>>> beta = f.col("beta")
>>> standard_error = f.round(stderr_from_chi2_and_effect_size(chi2_col, beta), 2).alias("standardError")
>>> df2 = df.select(chi2_col, beta, standard_error)
>>> df2.show()
+-----+----+-------------+
| chi2|beta|standardError|
+-----+----+-------------+
|29.72| 3.0|         0.55|
| 3.84| 1.0|         0.51|
+-----+----+-------------+
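
As a quick check of the first row: sqrt(29.72) ≈ 5.45, and 3.0 / 5.45 ≈ 0.55.
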
Source code in src/gentropy/common/stats.py
def stderr_from_chi2_and_effect_size(chi2_col: Column, beta: Column) -> Column:
    """Calculate standard error from chi2 and beta.

    This function calculates the standard error from the chi2 value and beta.

    Args:
        chi2_col (Column): Chi2 value (float)
        beta (Column): Beta value (float)

    Returns:
        Column: Standard error (float)

    Examples:
        >>> data = [(29.72, 3.0), (3.84, 1.0)]
        >>> schema = "chi2 FLOAT, beta FLOAT"
        >>> df = spark.createDataFrame(data, schema)
        >>> df.show()
        +-----+----+
        | chi2|beta|
        +-----+----+
        |29.72| 3.0|
        | 3.84| 1.0|
        +-----+----+
        <BLANKLINE>

        >>> chi2_col = f.col("chi2")
        >>> beta = f.col("beta")
        >>> standard_error = f.round(stderr_from_chi2_and_effect_size(chi2_col, beta), 2).alias("standardError")
        >>> df2 = df.select(chi2_col, beta, standard_error)
        >>> df2.show()
        +-----+----+-------------+
        | chi2|beta|standardError|
        +-----+----+-------------+
        |29.72| 3.0|         0.55|
        | 3.84| 1.0|         0.51|
        +-----+----+-------------+
        <BLANKLINE>

    """
    return (f.abs(beta) / f.sqrt(chi2_col)).alias("standardError")

stderr_from_ci(ci_upper: Column, ci_lower: Column, odds_ratio_based: bool = True) -> Column

Calculate standard error from confidence interval.

This function calculates the standard error from the upper and lower bounds of a 95% confidence interval: since the interval spans 2 × 1.96 standard errors, the standard error is the (log) difference of the bounds divided by 2 × 1.96.

Parameters:

  • ci_upper (Column): Upper bound of the confidence interval (float). Required.
  • ci_lower (Column): Lower bound of the confidence interval (float). Required.
  • odds_ratio_based (bool): If True (default), the confidence interval is assumed to be on the odds-ratio scale and the log difference of the bounds is used; if False, the interval is assumed to be on the beta scale. Default: True.

Returns:

  • Column: Standard error (float)

Note

Absolute value of the log difference is used to ensure that the standard error is always positive, even if the ci bounds are inverted.

Examples:

>>> data = [(0.5, 0.1), (1.0, 0.5)]
>>> schema = "ci_upper FLOAT, ci_lower FLOAT"
>>> df = spark.createDataFrame(data, schema)
>>> df.show()
+--------+--------+
|ci_upper|ci_lower|
+--------+--------+
|     0.5|     0.1|
|     1.0|     0.5|
+--------+--------+
>>> ci_upper = f.col("ci_upper")
>>> ci_lower = f.col("ci_lower")
>>> standard_error = f.round(stderr_from_ci(ci_upper, ci_lower), 2).alias("standardError")
>>> df2 = df.select(ci_upper, ci_lower, standard_error)
>>> df2.show()
+--------+--------+-------------+
|ci_upper|ci_lower|standardError|
+--------+--------+-------------+
|     0.5|     0.1|         0.41|
|     1.0|     0.5|         0.18|
+--------+--------+-------------+
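
As a quick check of the first row: |ln(0.5) - ln(0.1)| = ln(5) ≈ 1.609, and 1.609 / (2 × 1.96) ≈ 0.41.
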
Source code in src/gentropy/common/stats.py
def stderr_from_ci(
    ci_upper: Column, ci_lower: Column, odds_ratio_based: bool = True
) -> Column:
    """Calculate standard error from confidence interval.

    This function calculates the standard error from the confidence interval upper and lower bounds.

    Args:
        ci_upper (Column): Upper bound of the confidence interval (float)
        ci_lower (Column): Lower bound of the confidence interval (float)
        odds_ratio_based (bool): If True (default), the confidence interval is assumed to be on the odds-ratio
            scale and the log difference of the bounds is used; if False, it is assumed to be on the beta scale.

    Returns:
        Column: Standard error (float)

    Note:
        Absolute value of the log difference is used to ensure that the standard error is always positive, even if the ci bounds are inverted.


    Examples:
        >>> data = [(0.5, 0.1), (1.0, 0.5)]
        >>> schema = "ci_upper FLOAT, ci_lower FLOAT"
        >>> df = spark.createDataFrame(data, schema)
        >>> df.show()
        +--------+--------+
        |ci_upper|ci_lower|
        +--------+--------+
        |     0.5|     0.1|
        |     1.0|     0.5|
        +--------+--------+
        <BLANKLINE>

        >>> ci_upper = f.col("ci_upper")
        >>> ci_lower = f.col("ci_lower")
        >>> standard_error = f.round(stderr_from_ci(ci_upper, ci_lower), 2).alias("standardError")
        >>> df2 = df.select(ci_upper, ci_lower, standard_error)
        >>> df2.show()
        +--------+--------+-------------+
        |ci_upper|ci_lower|standardError|
        +--------+--------+-------------+
        |     0.5|     0.1|         0.41|
        |     1.0|     0.5|         0.18|
        +--------+--------+-------------+
        <BLANKLINE>
    """
    if odds_ratio_based:
        return (f.abs(f.log(ci_upper) - f.log(ci_lower)) / (2 * 1.96)).alias(
            "standardError"
        )
    return (f.abs(ci_upper - ci_lower) / (2 * 1.96)).alias("standardError")

zscore_from_pvalue(pval_col: Column, beta: Column) -> Column

Convert p-value column to z-score column.

Parameters:

  • pval_col (Column): p-value. Required.
  • beta (Column): Effect size in beta, used to derive the sign of the z-score. Required.

Returns:

  • Column: p-values transformed to z-scores

Examples:

>>> data = [("1.0", -1.0), ("0.9", -1.0), ("0.05", 1.0), ("1e-300", 1.0), ("1e-1000", None), (None, 1.0)]
>>> schema = "pval STRING, beta FLOAT"
>>> df = spark.createDataFrame(data, schema)
>>> df.show()
+-------+----+
|   pval|beta|
+-------+----+
|    1.0|-1.0|
|    0.9|-1.0|
|   0.05| 1.0|
| 1e-300| 1.0|
|1e-1000|NULL|
|   NULL| 1.0|
+-------+----+
>>> df.withColumn("zscore", zscore_from_pvalue(f.col("pval"), f.col("beta"))).show()
+-------+----+--------------------+
|   pval|beta|              zscore|
+-------+----+--------------------+
|    1.0|-1.0|                -0.0|
|    0.9|-1.0|-0.12566134685507405|
|   0.05| 1.0|   1.959963984540055|
| 1e-300| 1.0|  37.065787880772135|
|1e-1000|NULL|   67.75421020128564|
|   NULL| 1.0|                NULL|
+-------+----+--------------------+
Source code in src/gentropy/common/stats.py
def zscore_from_pvalue(pval_col: Column, beta: Column) -> Column:
    """Convert p-value column to z-score column.

    Args:
        pval_col (Column): p-value
        beta (Column): Effect size in beta - used to derive the sign of the z-score.

    Returns:
        Column: p-values transformed to z-scores

    Examples:
        >>> data = [("1.0", -1.0), ("0.9", -1.0), ("0.05", 1.0), ("1e-300", 1.0), ("1e-1000", None), (None, 1.0)]
        >>> schema = "pval STRING, beta FLOAT"
        >>> df = spark.createDataFrame(data, schema)
        >>> df.show()
        +-------+----+
        |   pval|beta|
        +-------+----+
        |    1.0|-1.0|
        |    0.9|-1.0|
        |   0.05| 1.0|
        | 1e-300| 1.0|
        |1e-1000|NULL|
        |   NULL| 1.0|
        +-------+----+
        <BLANKLINE>

        >>> df.withColumn("zscore", zscore_from_pvalue(f.col("pval"), f.col("beta"))).show()
        +-------+----+--------------------+
        |   pval|beta|              zscore|
        +-------+----+--------------------+
        |    1.0|-1.0|                -0.0|
        |    0.9|-1.0|-0.12566134685507405|
        |   0.05| 1.0|   1.959963984540055|
        | 1e-300| 1.0|  37.065787880772135|
        |1e-1000|NULL|   67.75421020128564|
        |   NULL| 1.0|                NULL|
        +-------+----+--------------------+
        <BLANKLINE>

    """
    mantissa, exponent = split_pvalue_column(pval_col)
    sign = (
        f.when(beta > 0, f.lit(1))
        .when(beta < 0, f.lit(-1))
        .when(beta.isNull(), f.lit(1))
    )
    return (sign * f.sqrt(chi2_from_pvalue(mantissa, exponent))).alias("zscore")
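
Since a chi2 with one degree of freedom is the square of a standard normal, the magnitude of the z-score equals the two-sided normal quantile. A scipy check of the pval = 0.05 row:

from scipy.stats import norm

norm.isf(0.05 / 2)   # ~1.95996, matching the zscore for pval = 0.05, beta > 0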

gentropy.common.stats.get_logsum(arr: NDArray[np.float64]) -> float

Calculates logarithm of the sum of exponents of a vector. The max is extracted to ensure that the sum is not Inf.

This function emulates scipy's logsumexp expression.

Parameters:

Name Type Description Default
arr NDArray[float64]

input array

required

Returns:

Name Type Description
float float

logsumexp of the input array

Examples:

>>> l = [0.2, 0.1, 0.05, 0]
>>> round(get_logsum(l), 6)
1.476557
Source code in src/gentropy/common/stats.py
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
def get_logsum(arr: NDArray[np.float64]) -> float:
    """Calculates logarithm of the sum of exponents of a vector. The max is extracted to ensure that the sum is not Inf.

    This function emulates scipy's logsumexp expression.

    Args:
        arr (NDArray[np.float64]): input array

    Returns:
        float: logsumexp of the input array

    Examples:
        >>> l = [0.2, 0.1, 0.05, 0]
        >>> round(get_logsum(l), 6)
        1.476557
    """
    MAX = np.max(arr)
    result = MAX + np.log(np.sum(np.exp(arr - MAX)))
    return float(result)

gentropy.common.stats.split_pvalue(pvalue: float) -> tuple[float, int]

Convert a float to 10 based exponent and mantissa.

Parameters:

Name Type Description Default
pvalue float

p-value

required

Returns:

Type Description
tuple[float, int]

tuple[float, int]: Tuple with mantissa and exponent

Raises:

Type Description
ValueError

If p-value is not between 0 and 1

Examples:

>>> split_pvalue(0.00001234)
(1.234, -5)
>>> split_pvalue(1)
(1.0, 0)
>>> split_pvalue(0.123)
(1.23, -1)
>>> split_pvalue(0.99)
(9.9, -1)
Source code in src/gentropy/common/stats.py
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
def split_pvalue(pvalue: float) -> tuple[float, int]:
    """Convert a float to 10 based exponent and mantissa.

    Args:
        pvalue (float): p-value

    Returns:
        tuple[float, int]: Tuple with mantissa and exponent

    Raises:
        ValueError: If p-value is not between 0 and 1

    Examples:
        >>> split_pvalue(0.00001234)
        (1.234, -5)

        >>> split_pvalue(1)
        (1.0, 0)

        >>> split_pvalue(0.123)
        (1.23, -1)

        >>> split_pvalue(0.99)
        (9.9, -1)
    """
    if pvalue < 0.0 or pvalue > 1.0:
        raise ValueError("P-value must be between 0 and 1")

    exponent = floor(log10(pvalue)) if pvalue != 0 else 0
    mantissa = round(pvalue / 10**exponent, 3)
    return (mantissa, exponent)

gentropy.common.stats.chi2_from_pvalue(p_value_mantissa: Column, p_value_exponent: Column) -> Column

Calculate chi2 from p-value.

This function calculates the chi2 value from the p-value mantissa and exponent. In case the p-value is very small (exponent < -300), it uses an approximation based on a linear regression model. The approximation is based on the formula: -5.367 * neglog_pval + 4.596, where neglog_pval is the negative log10 of the p-value mantissa.

Parameters:

Name Type Description Default
p_value_mantissa Column

Mantissa of the p-value (float)

required
p_value_exponent Column

Exponent of the p-value (integer)

required

Returns:

Name Type Description
Column Column

Chi2 value (float)

Examples:

>>> data = [(5.0, -8), (9.0, -300), (9.0, -301)]
>>> schema = "pValueMantissa FLOAT, pValueExponent INT"
>>> df = spark.createDataFrame(data, schema)
>>> df.show()
+--------------+--------------+
|pValueMantissa|pValueExponent|
+--------------+--------------+
|           5.0|            -8|
|           9.0|          -300|
|           9.0|          -301|
+--------------+--------------+
>>> mantissa = f.col("pValueMantissa")
>>> exponent = f.col("pValueExponent")
>>> chi2 = f.round(chi2_from_pvalue(mantissa, exponent), 2).alias("chi2")
>>> df2 = df.select(mantissa, exponent, chi2)
>>> df2.show()
+--------------+--------------+-------+
|pValueMantissa|pValueExponent|   chi2|
+--------------+--------------+-------+
|           5.0|            -8|  29.72|
|           9.0|          -300|1369.48|
|           9.0|          -301|1373.64|
+--------------+--------------+-------+
Source code in src/gentropy/common/stats.py
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
def chi2_from_pvalue(p_value_mantissa: Column, p_value_exponent: Column) -> Column:
    """Calculate chi2 from p-value.

    This function calculates the chi2 value from the p-value mantissa and exponent.
    In case the p-value is very small (exponent < -300), it uses an approximation based on a linear regression model.
    The approximation is based on the formula: -5.367 * neglog_pval + 4.596, where neglog_pval is the negative log10 of the p-value mantissa.


    Args:
        p_value_mantissa (Column): Mantissa of the p-value (float)
        p_value_exponent (Column): Exponent of the p-value (integer)

    Returns:
        Column: Chi2 value (float)

    Examples:
        >>> data = [(5.0, -8), (9.0, -300), (9.0, -301)]
        >>> schema = "pValueMantissa FLOAT, pValueExponent INT"
        >>> df = spark.createDataFrame(data, schema)
        >>> df.show()
        +--------------+--------------+
        |pValueMantissa|pValueExponent|
        +--------------+--------------+
        |           5.0|            -8|
        |           9.0|          -300|
        |           9.0|          -301|
        +--------------+--------------+
        <BLANKLINE>

        >>> mantissa = f.col("pValueMantissa")
        >>> exponent = f.col("pValueExponent")
        >>> chi2 = f.round(chi2_from_pvalue(mantissa, exponent), 2).alias("chi2")
        >>> df2 = df.select(mantissa, exponent, chi2)
        >>> df2.show()
        +--------------+--------------+-------+
        |pValueMantissa|pValueExponent|   chi2|
        +--------------+--------------+-------+
        |           5.0|            -8|  29.72|
        |           9.0|          -300|1369.48|
        |           9.0|          -301|1373.64|
        +--------------+--------------+-------+
        <BLANKLINE>
    """
    PVAL_EXP_THRESHOLD = f.lit(-300)
    APPROX_INTERCEPT = f.lit(-5.367)
    APPROX_COEF = f.lit(4.596)
    neglog_pval = neglogpval_from_pvalue(p_value_mantissa, p_value_exponent)
    p_value = p_value_mantissa * f.pow(10, p_value_exponent)
    neglog_approx = (neglog_pval * APPROX_COEF + APPROX_INTERCEPT).cast(t.DoubleType())

    return (
        f.when(p_value_exponent < PVAL_EXP_THRESHOLD, neglog_approx)
        .otherwise(chi2_inverse_survival_function(p_value))
        .alias("chi2")
    )

gentropy.common.stats.ci(pvalue_mantissa: Column, pvalue_exponent: Column, beta: Column, standard_error: Column) -> tuple[Column, Column]

Calculate the confidence interval for the effect based on the p-value and the effect size.

If the standard error already available, don't re-calculate from p-value.

Parameters:

Name Type Description Default
pvalue_mantissa Column

p-value mantissa (float)

required
pvalue_exponent Column

p-value exponent (integer)

required
beta Column

effect size in beta (float)

required
standard_error Column

standard error.

required

Returns:

Type Description
tuple[Column, Column]

tuple[Column, Column]: betaConfidenceIntervalLower (float), betaConfidenceIntervalUpper (float)

Examples:

>>> df = spark.createDataFrame([
...     (2.5, -10, 0.5, 0.2),
...     (3.0, -5, 1.0, None),
...     (1.5, -8, -0.2, 0.1)
...     ], ["pvalue_mantissa", "pvalue_exponent", "beta", "standard_error"]
... )
>>> df.select("*", *ci(f.col("pvalue_mantissa"), f.col("pvalue_exponent"), f.col("beta"), f.col("standard_error"))).show()
+---------------+---------------+----+--------------+---------------------------+---------------------------+
|pvalue_mantissa|pvalue_exponent|beta|standard_error|betaConfidenceIntervalLower|betaConfidenceIntervalUpper|
+---------------+---------------+----+--------------+---------------------------+---------------------------+
|            2.5|            -10| 0.5|           0.2|        0.10799999999999998|                      0.892|
|            3.0|             -5| 1.0|          NULL|         0.5303664052547075|         1.4696335947452925|
|            1.5|             -8|-0.2|           0.1|                     -0.396|       -0.00400000000000...|
+---------------+---------------+----+--------------+---------------------------+---------------------------+
Source code in src/gentropy/common/stats.py
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
def ci(
    pvalue_mantissa: Column,
    pvalue_exponent: Column,
    beta: Column,
    standard_error: Column,
) -> tuple[Column, Column]:
    """Calculate the confidence interval for the effect based on the p-value and the effect size.

    If the standard error already available, don't re-calculate from p-value.

    Args:
        pvalue_mantissa (Column): p-value mantissa (float)
        pvalue_exponent (Column): p-value exponent (integer)
        beta (Column): effect size in beta (float)
        standard_error (Column): standard error.

    Returns:
        tuple[Column, Column]: betaConfidenceIntervalLower (float), betaConfidenceIntervalUpper (float)

    Examples:
        >>> df = spark.createDataFrame([
        ...     (2.5, -10, 0.5, 0.2),
        ...     (3.0, -5, 1.0, None),
        ...     (1.5, -8, -0.2, 0.1)
        ...     ], ["pvalue_mantissa", "pvalue_exponent", "beta", "standard_error"]
        ... )
        >>> df.select("*", *ci(f.col("pvalue_mantissa"), f.col("pvalue_exponent"), f.col("beta"), f.col("standard_error"))).show()
        +---------------+---------------+----+--------------+---------------------------+---------------------------+
        |pvalue_mantissa|pvalue_exponent|beta|standard_error|betaConfidenceIntervalLower|betaConfidenceIntervalUpper|
        +---------------+---------------+----+--------------+---------------------------+---------------------------+
        |            2.5|            -10| 0.5|           0.2|        0.10799999999999998|                      0.892|
        |            3.0|             -5| 1.0|          NULL|         0.5303664052547075|         1.4696335947452925|
        |            1.5|             -8|-0.2|           0.1|                     -0.396|       -0.00400000000000...|
        +---------------+---------------+----+--------------+---------------------------+---------------------------+
        <BLANKLINE>
    """
    # Calculate p-value from mantissa and exponent:
    pvalue = pvalue_mantissa * f.pow(10, pvalue_exponent)

    # Fix p-value underflow:
    pvalue = f.when(pvalue == 0, sys.float_info.min).otherwise(pvalue)

    # Compute missing standard error:
    standard_error = f.when(
        standard_error.isNull(), f.abs(beta) / f.abs(zscore_from_pvalue(pvalue, beta))
    ).otherwise(standard_error)

    # Calculate upper and lower confidence interval:
    z_score_095 = 1.96
    ci_lower = (beta - z_score_095 * standard_error).alias(
        "betaConfidenceIntervalLower"
    )
    ci_upper = (beta + z_score_095 * standard_error).alias(
        "betaConfidenceIntervalUpper"
    )

    return (ci_lower, ci_upper)

gentropy.common.stats.neglogpval_from_z2(z2: Column) -> Column

Calculate negative log10 of p-value from squared Z-score following chi2 distribution.

The Z-score^2 is equal to the chi2 with 1 degree of freedom.

In case of very large Z-score (very small corresponding p-value), the function uses a linear approximation.

Parameters:

Name Type Description Default
z2 Column

Z-score squared.

required

Returns:

Name Type Description
Column Column

negative log of p-value.

Examples:

>>> data = [(1.0,), (2000.0,)]
>>> schema = "z2 FLOAT"
>>> df = spark.createDataFrame(data, schema)
>>> df.show()
+------+
|    z2|
+------+
|   1.0|
|2000.0|
+------+
>>> neglogpval = f.round(neglogpval_from_z2(f.col("z2")), 2).alias("neglogpval")
>>> df2 = df.select(f.col("z2"), neglogpval)
>>> df2.show()
+------+----------+
|    z2|neglogpval|
+------+----------+
|   1.0|       0.5|
|2000.0|    436.02|
+------+----------+
Source code in src/gentropy/common/stats.py
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
def neglogpval_from_z2(z2: Column) -> Column:
    """Calculate negative log10 of p-value from squared Z-score following chi2 distribution.

    **The Z-score^2 is equal to the chi2 with 1 degree of freedom.**

    In case of very large Z-score (very small corresponding p-value), the function uses a linear approximation.

    Args:
        z2 (Column): Z-score squared.

    Returns:
        Column:  negative log of p-value.

    Examples:
        >>> data = [(1.0,), (2000.0,)]
        >>> schema = "z2 FLOAT"
        >>> df = spark.createDataFrame(data, schema)
        >>> df.show()
        +------+
        |    z2|
        +------+
        |   1.0|
        |2000.0|
        +------+
        <BLANKLINE>

        >>> neglogpval = f.round(neglogpval_from_z2(f.col("z2")), 2).alias("neglogpval")
        >>> df2 = df.select(f.col("z2"), neglogpval)
        >>> df2.show()
        +------+----------+
        |    z2|neglogpval|
        +------+----------+
        |   1.0|       0.5|
        |2000.0|    436.02|
        +------+----------+
        <BLANKLINE>
    """
    MAX_EXACT_Z2 = f.lit(1400)
    APPROX_INTERCEPT = f.lit(1.4190)
    APPROX_COEFF = f.lit(0.2173)
    approximate_neglogpval_from_z2 = APPROX_INTERCEPT + APPROX_COEFF * z2
    computed_neglogpval_from_z2 = -1 * f.log10(chi2_survival_function(z2))
    return f.when(z2 <= MAX_EXACT_Z2, computed_neglogpval_from_z2).otherwise(
        approximate_neglogpval_from_z2
    )

gentropy.common.stats.neglogpval_from_pvalue(p_value_mantissa: Column, p_value_exponent: Column) -> Column

Compute the negative log p-value.

Parameters:

Name Type Description Default
p_value_mantissa Column

P-value mantissa

required
p_value_exponent Column

P-value exponent

required

Returns:

Name Type Description
Column Column

Negative log p-value

Examples:

>>> d = [(1, 1), (5, -2), (1, -1000)]
>>> df = spark.createDataFrame(d).toDF("p_value_mantissa", "p_value_exponent")
>>> df.withColumn("neg_log_p", neglogpval_from_pvalue(f.col("p_value_mantissa"), f.col("p_value_exponent"))).show()
+----------------+----------------+------------------+
|p_value_mantissa|p_value_exponent|         neg_log_p|
+----------------+----------------+------------------+
|               1|               1|              -1.0|
|               5|              -2|1.3010299956639813|
|               1|           -1000|            1000.0|
+----------------+----------------+------------------+
Source code in src/gentropy/common/stats.py
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
def neglogpval_from_pvalue(
    p_value_mantissa: Column, p_value_exponent: Column
) -> Column:
    """Compute the negative log p-value.

    Args:
        p_value_mantissa (Column): P-value mantissa
        p_value_exponent (Column): P-value exponent

    Returns:
        Column: Negative log p-value

    Examples:
        >>> d = [(1, 1), (5, -2), (1, -1000)]
        >>> df = spark.createDataFrame(d).toDF("p_value_mantissa", "p_value_exponent")
        >>> df.withColumn("neg_log_p", neglogpval_from_pvalue(f.col("p_value_mantissa"), f.col("p_value_exponent"))).show()
        +----------------+----------------+------------------+
        |p_value_mantissa|p_value_exponent|         neg_log_p|
        +----------------+----------------+------------------+
        |               1|               1|              -1.0|
        |               5|              -2|1.3010299956639813|
        |               1|           -1000|            1000.0|
        +----------------+----------------+------------------+
        <BLANKLINE>
    """
    return -1 * (f.log10(p_value_mantissa) + p_value_exponent)

gentropy.common.stats.normalise_gwas_statistics(beta: Column, odds_ratio: Column, standard_error: Column, ci_upper: Column, ci_lower: Column, mantissa: Column, exponent: Column) -> GWASEffect

Normalise beta and standard error from given values.

This function attempts to harmonise Effect and Standard Error given various inputs.

Note

Effect (Beta) harmonisation:

- If beta is not null, it is kept as is.
- If beta is null but the odds ratio is not null, the odds ratio is converted to beta.

Note

Effect Standard Error (std(beta)) harmonisation. Prefer calculation from p-value and beta when available, as the confidence interval is usually rounded and may lead to loss of precision:

- If standard error is not null, it is kept as is.
- If standard error is null but beta, p-value mantissa and p-value exponent are not null, convert the p-value components and beta to standard error.
- If standard error is null but ci-upper and ci-lower are not null and they come from an odds ratio, convert them to standard error.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `beta` | `Column` | Effect in beta. | required |
| `odds_ratio` | `Column` | Effect in odds ratio. | required |
| `standard_error` | `Column` | Standard error of the effect. | required |
| `ci_upper` | `Column` | Upper bound of the confidence interval. | required |
| `ci_lower` | `Column` | Lower bound of the confidence interval. | required |
| `mantissa` | `Column` | Mantissa of the p-value. | required |
| `exponent` | `Column` | Exponent of the p-value. | required |

Returns:

| Type | Description |
|------|-------------|
| `GWASEffect` | Named tuple with standardError and beta columns. |

Examples:

>>> x1 = (0.1, 1.1, 0.1, None, None, 9.0, -100) # keep beta, keep std error
>>> x2 = (None, 1.1, 0.1, None, None, 9.0, -100) # convert odds ratio to beta, keep std error
>>> x3 = (None, 1.1, None, 1.30, 0.90, None, None) # convert odds ratio to beta, convert ci to standard error
>>> x4 = (0.1, 1.1, None, 1.30, 0.90, None, None) # keep beta, convert ci to standard error
>>> x5 = (None, 1.1, None, 1.30, 0.90, 9.0, -100) # convert odds ratio to beta, convert p-value and beta to standard error
>>> x6 = (0.1, None, None, None, None, 9.0, -100) # keep beta, convert p-value and beta to standard error
>>> x7 = (None, None, None, 1.3, 0.9, 9.0, -100) # keep beta NULL, without beta we do not want to compute the standard error
>>> data = [x1, x2, x3, x4, x5, x6, x7]
>>> schema = "beta FLOAT, oddsRatio FLOAT, standardError FLOAT, ci_upper FLOAT, ci_lower FLOAT, mantissa FLOAT, exp INT"
>>> df = spark.createDataFrame(data, schema)
>>> df.show()
+----+---------+-------------+--------+--------+--------+----+
|beta|oddsRatio|standardError|ci_upper|ci_lower|mantissa| exp|
+----+---------+-------------+--------+--------+--------+----+
| 0.1|      1.1|          0.1|    NULL|    NULL|     9.0|-100|
|NULL|      1.1|          0.1|    NULL|    NULL|     9.0|-100|
|NULL|      1.1|         NULL|     1.3|     0.9|    NULL|NULL|
| 0.1|      1.1|         NULL|     1.3|     0.9|    NULL|NULL|
|NULL|      1.1|         NULL|     1.3|     0.9|     9.0|-100|
| 0.1|     NULL|         NULL|    NULL|    NULL|     9.0|-100|
|NULL|     NULL|         NULL|     1.3|     0.9|     9.0|-100|
+----+---------+-------------+--------+--------+--------+----+
>>> beta = f.col("beta")
>>> odds_ratio = f.col("oddsRatio")
>>> se = f.col("standardError")
>>> ci_upper = f.col("ci_upper")
>>> ci_lower = f.col("ci_lower")
>>> mantissa = f.col("mantissa")
>>> exponent = f.col("exp")
>>> cols = normalise_gwas_statistics(
...     beta, odds_ratio, se, ci_upper, ci_lower, mantissa, exponent
... )
>>> beta_computed = f.round(cols.beta, 2).alias("beta")
>>> standard_error_computed = f.round(cols.standard_error, 2).alias("standardError")
>>> df.select(beta_computed, standard_error_computed).show()
+----+-------------+
|beta|standardError|
+----+-------------+
| 0.1|          0.1|
| 0.1|          0.1|
| 0.1|         0.09|
| 0.1|         0.09|
| 0.1|          0.0|
| 0.1|          0.0|
|NULL|         NULL|
+----+-------------+
Source code in src/gentropy/common/stats.py
def normalise_gwas_statistics(
    beta: Column,
    odds_ratio: Column,
    standard_error: Column,
    ci_upper: Column,
    ci_lower: Column,
    mantissa: Column,
    exponent: Column,
) -> GWASEffect:
    """Normalise beta and standard error from given values.

    This function attempts to harmonise Effect and Standard Error given various inputs.

    Note:
        Effect (Beta) harmonisation:
        - If beta is not null, it is kept as is.
        - If beta is null, but odds ratio is not null, odds ratio is converted to beta

    Note:
        Effect Standard Error (std(beta)) harmonisation
        **Prefer calculation from p-value and beta, if available, as the confidence interval is usually rounded and may lead to loss of precision**:
        - If standard error is not null, it is kept as is.
        - If standard error is null, but beta, pval-mantissa, pval-exponent are not null, convert pval components and beta to standard error
        - If standard error is null, but ci-upper and ci-lower are not null and they come from odds ratio, convert them to standard error.


    Args:
        beta (Column): Effect in beta.
        odds_ratio (Column): Effect in odds ratio.
        standard_error (Column): Standard error of the effect.
        ci_upper (Column): Upper bound of the confidence interval.
        ci_lower (Column): Lower bound of the confidence interval.
        mantissa (Column): Mantissa of the p-value.
        exponent (Column): Exponent of the p-value.

    Returns:
        GWASEffect: named tuple with standardError and beta columns.

    Examples:
        >>> x1 = (0.1, 1.1, 0.1, None, None, 9.0, -100) # keep beta, keep std error
        >>> x2 = (None, 1.1, 0.1, None, None, 9.0, -100) # convert odds ratio to beta, keep std error
        >>> x3 = (None, 1.1, None, 1.30, 0.90, None, None) # convert odds ratio to beta, convert ci to standard error
        >>> x4 = (0.1, 1.1, None, 1.30, 0.90, None, None) # keep beta, convert ci to standard error
        >>> x5 = (None, 1.1, None, 1.30, 0.90, 9.0, -100) # convert odds ratio to beta, convert p-value and beta to standard error
        >>> x6 = (0.1, None, None, None, None, 9.0, -100) # keep beta, convert p-value and beta to standard error
        >>> x7 = (None, None, None, 1.3, 0.9, 9.0, -100) # keep beta NULL, without beta we do not want to compute the standard error
        >>> data = [x1, x2, x3, x4, x5, x6, x7]

        >>> schema = "beta FLOAT, oddsRatio FLOAT, standardError FLOAT, ci_upper FLOAT, ci_lower FLOAT, mantissa FLOAT, exp INT"
        >>> df = spark.createDataFrame(data, schema)
        >>> df.show()
        +----+---------+-------------+--------+--------+--------+----+
        |beta|oddsRatio|standardError|ci_upper|ci_lower|mantissa| exp|
        +----+---------+-------------+--------+--------+--------+----+
        | 0.1|      1.1|          0.1|    NULL|    NULL|     9.0|-100|
        |NULL|      1.1|          0.1|    NULL|    NULL|     9.0|-100|
        |NULL|      1.1|         NULL|     1.3|     0.9|    NULL|NULL|
        | 0.1|      1.1|         NULL|     1.3|     0.9|    NULL|NULL|
        |NULL|      1.1|         NULL|     1.3|     0.9|     9.0|-100|
        | 0.1|     NULL|         NULL|    NULL|    NULL|     9.0|-100|
        |NULL|     NULL|         NULL|     1.3|     0.9|     9.0|-100|
        +----+---------+-------------+--------+--------+--------+----+
        <BLANKLINE>

        >>> beta = f.col("beta")
        >>> odds_ratio = f.col("oddsRatio")
        >>> se = f.col("standardError")
        >>> ci_upper = f.col("ci_upper")
        >>> ci_lower = f.col("ci_lower")
        >>> mantissa = f.col("mantissa")
        >>> exponent = f.col("exp")
        >>> cols = normalise_gwas_statistics(
        ...     beta, odds_ratio, se, ci_upper, ci_lower, mantissa, exponent
        ... )
        >>> beta_computed = f.round(cols.beta, 2).alias("beta")
        >>> standard_error_computed = f.round(cols.standard_error, 2).alias("standardError")
        >>> df.select(beta_computed, standard_error_computed).show()
        +----+-------------+
        |beta|standardError|
        +----+-------------+
        | 0.1|          0.1|
        | 0.1|          0.1|
        | 0.1|         0.09|
        | 0.1|         0.09|
        | 0.1|          0.0|
        | 0.1|          0.0|
        |NULL|         NULL|
        +----+-------------+
        <BLANKLINE>
    """
    beta = (
        f.when(beta.isNotNull(), beta)
        .when(odds_ratio.isNotNull(), f.log(odds_ratio))
        .otherwise(f.lit(None))
        .alias("beta")
    )
    chi2 = chi2_from_pvalue(mantissa, exponent)

    standard_error = (
        f.when(standard_error.isNotNull(), standard_error)
        .when(
            standard_error.isNull()
            & mantissa.isNotNull()
            & exponent.isNotNull()
            & beta.isNotNull(),
            stderr_from_chi2_and_effect_size(chi2, beta),
        )
        .when(
            standard_error.isNull()
            & ci_lower.isNotNull()
            & ci_upper.isNotNull()
            & odds_ratio.isNotNull(),
            stderr_from_ci(ci_upper, ci_lower),
        )
        .otherwise(f.lit(None))
        .alias("standardError")
    )

    return GWASEffect(standard_error=standard_error, beta=beta)
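
For intuition, doctest row x3 above can be reproduced by hand: the odds ratio is converted with beta = ln(OR), and the standard error comes from the log difference of the CI bounds. A stdlib-only sketch (not part of gentropy):

```python
# Hand check of doctest row x3 (odds-ratio path), stdlib only:
import math

beta = math.log(1.1)                                    # ~0.0953, rounds to 0.1
se = abs(math.log(1.30) - math.log(0.90)) / (2 * 1.96)  # ~0.0938, rounds to 0.09
print(round(beta, 2), round(se, 2))                     # 0.1 0.09, matching the output
```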

gentropy.common.stats.pvalue_from_neglogpval(p_value: Column) -> PValComponents

Compute the p-value mantissa and exponent from the negative base-10 logarithm of the p-value.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `p_value` | `Column` | Neg-log p-value (float) | required |

Returns:

| Type | Description |
|------|-------------|
| `PValComponents` | Mantissa and exponent of the p-value |

Examples:

>>> (
... spark.createDataFrame([(4.56, 'a'),(2109.23, 'b')], ['negLogPv', 'label'])
... .select('negLogPv',*pvalue_from_neglogpval(f.col('negLogPv')))
... .show()
... )
+--------+--------------+--------------+
|negLogPv|pValueMantissa|pValueExponent|
+--------+--------------+--------------+
|    4.56|     2.7542286|            -5|
| 2109.23|     5.8884363|         -2110|
+--------+--------------+--------------+
Source code in src/gentropy/common/stats.py
def pvalue_from_neglogpval(p_value: Column) -> PValComponents:
    """Computing p-value mantissa and exponent based on the negative 10 based logarithm of the p-value.

    Args:
        p_value (Column): Neg-log p-value (string)

    Returns:
        PValComponents: mantissa and exponent of the p-value

    Examples:
        >>> (
        ... spark.createDataFrame([(4.56, 'a'),(2109.23, 'b')], ['negLogPv', 'label'])
        ... .select('negLogPv',*pvalue_from_neglogpval(f.col('negLogPv')))
        ... .show()
        ... )
        +--------+--------------+--------------+
        |negLogPv|pValueMantissa|pValueExponent|
        +--------+--------------+--------------+
        |    4.56|     2.7542286|            -5|
        | 2109.23|     5.8884363|         -2110|
        +--------+--------------+--------------+
        <BLANKLINE>
    """
    exponent: Column = f.ceil(p_value)
    mantissa: Column = f.pow(f.lit(10), (exponent - p_value))

    return PValComponents(
        mantissa=mantissa.cast(t.FloatType()).alias("pValueMantissa"),
        exponent=(-1 * exponent).cast(t.IntegerType()).alias("pValueExponent"),
    )
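
The decomposition takes the ceiling of the neg-log p-value as the (negated) exponent and folds the fractional remainder into the mantissa. A stdlib check of the first doctest row (not part of gentropy):

```python
# Stdlib check of the decomposition above for negLogPv = 4.56:
import math

neglog = 4.56
exp = math.ceil(neglog)          # 5
mantissa = 10 ** (exp - neglog)  # 10**0.44 ~ 2.7542
print(mantissa, -exp)            # p ~ 2.7542e-5, i.e. -log10(p) = 4.56
```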

gentropy.common.stats.split_pvalue_column(pv: Column) -> PValComponents

This function takes a p-value string and returns two columns: mantissa (float) and exponent (integer).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `pv` | `Column` | P-value as string | required |

Returns:

| Type | Description |
|------|-------------|
| `PValComponents` | pValueMantissa (float), pValueExponent (integer) |

Examples:

>>> d = [("0.01",),("4.2E-45",),("43.2E5",),("0",),("1",)]
>>> spark.createDataFrame(d, ['pval']).select('pval',*split_pvalue_column(f.col('pval'))).show()
+-------+--------------+--------------+
|   pval|pValueMantissa|pValueExponent|
+-------+--------------+--------------+
|   0.01|           1.0|            -2|
|4.2E-45|           4.2|           -45|
| 43.2E5|          43.2|             5|
|      0|         2.225|          -308|
|      1|           1.0|             0|
+-------+--------------+--------------+
Source code in src/gentropy/common/stats.py
def split_pvalue_column(pv: Column) -> PValComponents:
    """This function takes a p-value string and returns two columns mantissa (float), exponent (integer).

    Args:
        pv (Column): P-value as string

    Returns:
        PValComponents: pValueMantissa (float), pValueExponent (integer)

    Examples:
        >>> d = [("0.01",),("4.2E-45",),("43.2E5",),("0",),("1",)]
        >>> spark.createDataFrame(d, ['pval']).select('pval',*split_pvalue_column(f.col('pval'))).show()
        +-------+--------------+--------------+
        |   pval|pValueMantissa|pValueExponent|
        +-------+--------------+--------------+
        |   0.01|           1.0|            -2|
        |4.2E-45|           4.2|           -45|
        | 43.2E5|          43.2|             5|
        |      0|         2.225|          -308|
        |      1|           1.0|             0|
        +-------+--------------+--------------+
        <BLANKLINE>
    """
    # Making sure there's a number in the string:
    pv = f.when(
        pv == f.lit("0"), f.lit(sys.float_info.min).cast(t.StringType())
    ).otherwise(pv)

    # Get exponent:
    exponent = f.when(
        f.upper(pv).contains("E"),
        f.split(f.upper(pv), "E").getItem(1),
    ).otherwise(f.floor(f.log10(pv)))

    # Get mantissa:
    mantissa = f.when(
        f.upper(pv).contains("E"),
        f.split(f.upper(pv), "E").getItem(0),
    ).otherwise(pv / (10**exponent))

    # Round value:
    mantissa = f.round(mantissa, 3)

    return PValComponents(
        mantissa=mantissa.cast(t.FloatType()).alias("pValueMantissa"),
        exponent=exponent.cast(t.IntegerType()).alias("pValueExponent"),
    )
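
The literal "0" cannot be split directly (its logarithm is undefined), so it is first replaced with the smallest positive normalised double, which is where the 2.225e-308 row comes from. A stdlib check (not part of gentropy):

```python
# Why the "0" row becomes 2.225e-308: sys.float_info.min is the smallest
# positive normalised double, substituted so log10 stays defined (stdlib only):
import sys
print(sys.float_info.min)  # 2.2250738585072014e-308 -> mantissa 2.225, exponent -308
```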

gentropy.common.stats.stderr_from_chi2_and_effect_size(chi2_col: Column, beta: Column) -> Column

Calculate standard error from chi2 and beta.

This function calculates the standard error from the chi2 value and beta via SE = |beta| / sqrt(chi2), since chi2 equals the squared Wald statistic (beta / SE)^2.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `chi2_col` | `Column` | Chi2 value (float) | required |
| `beta` | `Column` | Beta value (float) | required |

Returns:

| Type | Description |
|------|-------------|
| `Column` | Standard error (float) |

Examples:

>>> data = [(29.72, 3.0), (3.84, 1.0)]
>>> schema = "chi2 FLOAT, beta FLOAT"
>>> df = spark.createDataFrame(data, schema)
>>> df.show()
+-----+----+
| chi2|beta|
+-----+----+
|29.72| 3.0|
| 3.84| 1.0|
+-----+----+
>>> chi2_col = f.col("chi2")
>>> beta = f.col("beta")
>>> standard_error = f.round(stderr_from_chi2_and_effect_size(chi2_col, beta), 2).alias("standardError")
>>> df2 = df.select(chi2_col, beta, standard_error)
>>> df2.show()
+-----+----+-------------+
| chi2|beta|standardError|
+-----+----+-------------+
|29.72| 3.0|         0.55|
| 3.84| 1.0|         0.51|
+-----+----+-------------+
Source code in src/gentropy/common/stats.py
def stderr_from_chi2_and_effect_size(chi2_col: Column, beta: Column) -> Column:
    """Calculate standard error from chi2 and beta.

    This function calculates the standard error from the chi2 value and beta via SE = |beta| / sqrt(chi2), since chi2 equals the squared Wald statistic (beta / SE)^2.

    Args:
        chi2_col (Column): Chi2 value (float)
        beta (Column): Beta value (float)

    Returns:
        Column: Standard error (float)

    Examples:
        >>> data = [(29.72, 3.0), (3.84, 1.0)]
        >>> schema = "chi2 FLOAT, beta FLOAT"
        >>> df = spark.createDataFrame(data, schema)
        >>> df.show()
        +-----+----+
        | chi2|beta|
        +-----+----+
        |29.72| 3.0|
        | 3.84| 1.0|
        +-----+----+
        <BLANKLINE>

        >>> chi2_col = f.col("chi2")
        >>> beta = f.col("beta")
        >>> standard_error = f.round(stderr_from_chi2_and_effect_size(chi2_col, beta), 2).alias("standardError")
        >>> df2 = df.select(chi2_col, beta, standard_error)
        >>> df2.show()
        +-----+----+-------------+
        | chi2|beta|standardError|
        +-----+----+-------------+
        |29.72| 3.0|         0.55|
        | 3.84| 1.0|         0.51|
        +-----+----+-------------+
        <BLANKLINE>

    """
    return (f.abs(beta) / f.sqrt(chi2_col)).alias("standardError")
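
The formula inverts the Wald statistic: with z = beta / SE and chi2 = z^2, the standard error is |beta| / sqrt(chi2). A stdlib check against the doctest rows (not part of gentropy):

```python
# Stdlib check of SE = |beta| / sqrt(chi2) against the doctest rows:
import math

print(round(3.0 / math.sqrt(29.72), 2))  # 0.55
print(round(1.0 / math.sqrt(3.84), 2))   # 0.51
```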

gentropy.common.stats.stderr_from_ci(ci_upper: Column, ci_lower: Column, odds_ratio_based: bool = True) -> Column

Calculate standard error from confidence interval.

This function calculates the standard error from the confidence interval upper and lower bounds.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `ci_upper` | `Column` | Upper bound of the confidence interval (float) | required |
| `ci_lower` | `Column` | Lower bound of the confidence interval (float) | required |
| `odds_ratio_based` | `bool` | If True (default), assume the confidence interval is based on an odds ratio and use the log difference of the bounds; otherwise assume it is based on beta and use the plain difference. | `True` |

Returns:

| Type | Description |
|------|-------------|
| `Column` | Standard error (float) |

Note

Absolute value of the log difference is used to ensure that the standard error is always positive, even if the ci bounds are inverted.

Examples:

>>> data = [(0.5, 0.1), (1.0, 0.5)]
>>> schema = "ci_upper FLOAT, ci_lower FLOAT"
>>> df = spark.createDataFrame(data, schema)
>>> df.show()
+--------+--------+
|ci_upper|ci_lower|
+--------+--------+
|     0.5|     0.1|
|     1.0|     0.5|
+--------+--------+
>>> ci_upper = f.col("ci_upper")
>>> ci_lower = f.col("ci_lower")
>>> standard_error = f.round(stderr_from_ci(ci_upper, ci_lower), 2).alias("standardError")
>>> df2 = df.select(ci_upper, ci_lower, standard_error)
>>> df2.show()
+--------+--------+-------------+
|ci_upper|ci_lower|standardError|
+--------+--------+-------------+
|     0.5|     0.1|         0.41|
|     1.0|     0.5|         0.18|
+--------+--------+-------------+
Source code in src/gentropy/common/stats.py
def stderr_from_ci(
    ci_upper: Column, ci_lower: Column, odds_ratio_based: bool = True
) -> Column:
    """Calculate standard error from confidence interval.

    This function calculates the standard error from the confidence interval upper and lower bounds.

    Args:
        ci_upper (Column): Upper bound of the confidence interval (float)
        ci_lower (Column): Lower bound of the confidence interval (float)
        odds_ratio_based (bool): If True (default), assume the confidence interval is based on
            an odds ratio and use the log difference of the bounds; otherwise assume it is based on beta.

    Returns:
        Column: Standard error (float)

    Note:
        Absolute value of the log difference is used to ensure that the standard error is always positive, even if the ci bounds are inverted.


    Examples:
        >>> data = [(0.5, 0.1), (1.0, 0.5)]
        >>> schema = "ci_upper FLOAT, ci_lower FLOAT"
        >>> df = spark.createDataFrame(data, schema)
        >>> df.show()
        +--------+--------+
        |ci_upper|ci_lower|
        +--------+--------+
        |     0.5|     0.1|
        |     1.0|     0.5|
        +--------+--------+
        <BLANKLINE>

        >>> ci_upper = f.col("ci_upper")
        >>> ci_lower = f.col("ci_lower")
        >>> standard_error = f.round(stderr_from_ci(ci_upper, ci_lower), 2).alias("standardError")
        >>> df2 = df.select(ci_upper, ci_lower, standard_error)
        >>> df2.show()
        +--------+--------+-------------+
        |ci_upper|ci_lower|standardError|
        +--------+--------+-------------+
        |     0.5|     0.1|         0.41|
        |     1.0|     0.5|         0.18|
        +--------+--------+-------------+
        <BLANKLINE>
    """
    # 1.96 is the two-sided 95% critical value, so the CI spans 2 * 1.96 standard errors.
    if odds_ratio_based:
        return (f.abs(f.log(ci_upper) - f.log(ci_lower)) / (2 * 1.96)).alias(
            "standardError"
        )
    return (f.abs(ci_upper - ci_lower) / (2 * 1.96)).alias("standardError")
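
Since a 95% confidence interval spans the estimate ± 1.96 standard errors, the difference between the bounds equals 2 × 1.96 standard errors (on the log scale for odds ratios). A stdlib check of the first doctest row (not part of gentropy):

```python
# Stdlib check of the first doctest row (odds-ratio based):
import math

se = abs(math.log(0.5) - math.log(0.1)) / (2 * 1.96)
print(round(se, 2))  # 0.41, matching the doctest
```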

gentropy.common.stats.zscore_from_pvalue(pval_col: Column, beta: Column) -> Column

Convert p-value column to z-score column.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `pval_col` | `Column` | P-value | required |
| `beta` | `Column` | Effect size in beta, used to derive the sign of the z-score. | required |

Returns:

| Type | Description |
|------|-------------|
| `Column` | P-values transformed to z-scores |

Examples:

>>> data = [("1.0", -1.0), ("0.9", -1.0), ("0.05", 1.0), ("1e-300", 1.0), ("1e-1000", None), (None, 1.0)]
>>> schema = "pval STRING, beta FLOAT"
>>> df = spark.createDataFrame(data, schema)
>>> df.show()
+-------+----+
|   pval|beta|
+-------+----+
|    1.0|-1.0|
|    0.9|-1.0|
|   0.05| 1.0|
| 1e-300| 1.0|
|1e-1000|NULL|
|   NULL| 1.0|
+-------+----+
>>> df.withColumn("zscore", zscore_from_pvalue(f.col("pval"), f.col("beta"))).show()
+-------+----+--------------------+
|   pval|beta|              zscore|
+-------+----+--------------------+
|    1.0|-1.0|                -0.0|
|    0.9|-1.0|-0.12566134685507405|
|   0.05| 1.0|   1.959963984540055|
| 1e-300| 1.0|  37.065787880772135|
|1e-1000|NULL|   67.75421020128564|
|   NULL| 1.0|                NULL|
+-------+----+--------------------+
Source code in src/gentropy/common/stats.py
def zscore_from_pvalue(pval_col: Column, beta: Column) -> Column:
    """Convert p-value column to z-score column.

    Args:
        pval_col (Column): p-value
        beta (Column): Effect size in beta - used to derive the sign of the z-score.

    Returns:
        Column: p-values transformed to z-scores

    Examples:
        >>> data = [("1.0", -1.0), ("0.9", -1.0), ("0.05", 1.0), ("1e-300", 1.0), ("1e-1000", None), (None, 1.0)]
        >>> schema = "pval STRING, beta FLOAT"
        >>> df = spark.createDataFrame(data, schema)
        >>> df.show()
        +-------+----+
        |   pval|beta|
        +-------+----+
        |    1.0|-1.0|
        |    0.9|-1.0|
        |   0.05| 1.0|
        | 1e-300| 1.0|
        |1e-1000|NULL|
        |   NULL| 1.0|
        +-------+----+
        <BLANKLINE>

        >>> df.withColumn("zscore", zscore_from_pvalue(f.col("pval"), f.col("beta"))).show()
        +-------+----+--------------------+
        |   pval|beta|              zscore|
        +-------+----+--------------------+
        |    1.0|-1.0|                -0.0|
        |    0.9|-1.0|-0.12566134685507405|
        |   0.05| 1.0|   1.959963984540055|
        | 1e-300| 1.0|  37.065787880772135|
        |1e-1000|NULL|   67.75421020128564|
        |   NULL| 1.0|                NULL|
        +-------+----+--------------------+
        <BLANKLINE>

    """
    mantissa, exponent = split_pvalue_column(pval_col)
    # The z-score sign follows the sign of beta; a null beta defaults to +1.
    sign = (
        f.when(beta > 0, f.lit(1))
        .when(beta < 0, f.lit(-1))
        .when(beta.isNull(), f.lit(1))
    )
    return (sign * f.sqrt(chi2_from_pvalue(mantissa, exponent))).alias("zscore")
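
The magnitude of the returned z-score is sqrt(chi2), so for a two-sided p = 0.05 it should match the 97.5th percentile of the standard normal. A stdlib cross-check (not part of gentropy):

```python
# Stdlib cross-check of the p = 0.05 row above (two-sided test):
from statistics import NormalDist

print(NormalDist().inv_cdf(1 - 0.05 / 2))  # ~1.9599639845, matching the 0.05 doctest row
```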