# SNOWFLAKE analysis of Virtual Screening Results

**SNOWFLAKE** is a data-driven approach to interpreting virtual screening results under a set of simple statistical assumptions, where the null hypothesis is that the screen contains only inactive molecules. Deviations from the expected behavior are then indicative of likely active molecules. See this presentation by Paul Hawkins for more details.

**Snowflake Assumptions**

*Null Hypothesis*

All scores are distributed at random.

This could occur because A) your scoring method is crap, or B) you didn't have any active molecules in your screen.
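A quick simulation (an illustrative sketch, assuming purely random scores) shows what the null hypothesis looks like in practice: when every score is just noise, the count of scores above any Z-value tracks the normal survival function.

```python
import numpy as np
from scipy.stats import norm, zscore

# Hypothetical all-inactive screen: every score is pure noise
rng = np.random.default_rng(42)
noise_scores = rng.normal(size=10000)
zs = zscore(noise_scores)

# Under the null, the observed exceedance count hovers near n * sf(z)
z = 3.0
observed = np.sum(zs >= z)
expected = len(zs) * norm.sf(z)
print(observed, round(expected, 1))
```

If the observed count sits far above the expected one, the null hypothesis is in trouble.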

If the null hypothesis seems to break down, especially at the high end of the score distribution, then perhaps it is time to examine the:

*Alternate Hypothesis*

There may be present within this set a small number of actives. If our screening method is not crap, these active molecules will tend to have better scores than the inactives.

Let's generate some hypothetical data. We often use bounded scores (e.g. Tanimoto) for VS methods, so we'll generate some hypothetical screening scores below using beta distributions. In this case we'll assume our screening method is successful at separating actives from decoys, but of course there are far fewer actives than decoys among the molecules screened.

```
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_context("poster")
```

```
from scipy.stats import beta
# You can adjust these distributions to determine the accuracy of our imaginary screening method
active_dist = beta(30,15)
inactive_dist = beta(10,15)
_ = np.linspace(0,1,100)
plt.plot( _ , active_dist.pdf(_) , label="Active Score PDF")
plt.plot( _ , inactive_dist.pdf(_), label="Inactive Score PDF")
plt.legend(loc="upper left");
```

Now we'll draw some sample data from these distributions. We'll assume we screened 10000 molecules, of which only 200 were actives.

```
n_inactives = 9800
n_actives = 200
n_screened = n_actives+n_inactives
act_scores = active_dist.rvs(n_actives)
inact_scores = inactive_dist.rvs(n_inactives)
plt.hist(act_scores , bins=50, histtype='bar', alpha=0.4, label="Actives")
plt.hist(inact_scores , bins=50, histtype='stepfilled', alpha=0.4, label="Decoys")
plt.legend(loc="upper left");
```

Of course in a true VS we don't know the distribution of actives, so all we know is we get a distribution like this:

```
scores = np.concatenate((act_scores, inact_scores))
plt.hist(scores , bins=50, histtype='stepfilled', color='k', alpha=0.4, label="Scores")
plt.legend(loc="upper left");
```

#### Applying SNOWFLAKE

Now we have our VS scores, a small percentage of which are actives. We'll apply **SNOWFLAKE**, but first we need to transform our scores. **SNOWFLAKE** assumes a Gaussian distribution of scores for inactives, so we'll convert our bounded scores to unbounded ones using the logit transform.

```
from scipy.special import logit, expit
from scipy.stats import norm, zscore
tr_scores = logit(scores)
```
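As a quick sanity check (a sketch with made-up example values), the logit/expit pair round-trips cleanly, which is what lets us map a cutoff from transformed space back into score space later on.

```python
import numpy as np
from scipy.special import logit, expit

# logit maps bounded (0, 1) scores onto the whole real line;
# expit is its inverse and maps them back
s = np.array([0.1, 0.5, 0.9])
t = logit(s)
assert np.allclose(expit(t), s)  # round-trips to the originals
print(t)  # logit(0.5) is exactly 0; the transform is symmetric about it
```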

Now we'll fit a Gaussian distribution to our scores and assume this is the distribution of inactives. Then we'll compare the Z-scores of the observed scores to the expected normal distribution. We'll assume any deviation from normality at the high end of the scores is due to a small sample of active molecules.

```
fitmean, fitstd = norm.fit(tr_scores)
zs = np.sort(zscore(tr_scores))
x = np.linspace(-5,5,1000)
plt.plot(x,norm.pdf(x),label="Normal")
plt.hist(zs , bins=50, histtype='stepfilled', density=True, color='k', alpha=0.2, label="Observed")
plt.xlabel("Z-scores")
plt.legend(loc="upper left");
```

We determine the **SNOWFLAKE** cutoff as the highest Z-score value where the number of observed scores exceeds the expected distribution.

```
from scipy.stats import probplot
plt.figure()
_ = np.linspace(-4,4)
plt.plot(_,_,'k--',alpha=0.3)
osm, osr = probplot(zs, fit=False, plot=None)
plt.plot(osm, osr, 'b.', alpha=0.4)
diff = osr-osm
stop_pts = np.where(diff < 0.0)[0]
qq_cutoff = osm[stop_pts[-1]]
plt.vlines(qq_cutoff, 0, qq_cutoff*2, linestyles="--", linewidth=4, label="SNOWFLAKE Z: {:.2f}".format(qq_cutoff))
plt.title("QQ Plot")
plt.legend(loc="upper left");
```

```
# Determine 95% CI cutoff
obs_count = np.asarray([np.sum(zs >= i) for i in x])
binom_std = np.sqrt(n_screened*norm.sf(x)*(1-norm.sf(x)))
diff = obs_count - (norm.sf(x)*n_screened + 2*binom_std)
x_idx = np.argwhere(x < qq_cutoff)[-1][0]
stop_pts = np.where(diff[x_idx:] > 0.0)[0]
cutoff95 = x[x_idx+stop_pts[0]]
plt.semilogy(x,norm.sf(x)*n_screened, label="SF")
plt.fill_between(x,norm.sf(x)*n_screened - 2*binom_std,
norm.sf(x)*n_screened + 2*binom_std,
alpha=0.3, label="95% CI")
plt.semilogy(x,obs_count, label="Obsv >= z")
plt.vlines(qq_cutoff, 1, norm.sf(qq_cutoff)*n_screened, linestyles="--", linewidth=4, label="SNOWFLAKE z: {:.2f}".format(qq_cutoff))
plt.vlines(cutoff95, 1, np.sum(zs >= cutoff95), linestyles=":", linewidth=4, label="SNOWFLAKE95 z: {:.2f}".format(cutoff95))
# plt.xlim(qq_cutoff-0.25, qq_cutoff+0.25)
plt.xlim(qq_cutoff-1.5*(cutoff95-qq_cutoff), qq_cutoff+2*(cutoff95-qq_cutoff))
cut_oom = np.floor(np.log10(norm.sf(qq_cutoff)*n_screened))
plt.ylim(10**(cut_oom),10**(cut_oom+1))
plt.xlabel("Z-value")
plt.ylabel("Count")
plt.legend();
```

We now have the **SNOWFLAKE** Z-cutoff for this virtual screen: the Z-score above which we found more scores than would be expected were this a normally distributed set of inactive scores. Now we can use the binomial test to explore the null hypothesis of randomly observing this many scores above the **SNOWFLAKE** cutoff in a screen of this size.

Finally, we'll convert our snowflake cutoff from a Z-score back to a VS score for use directly with the screening results.

```
from scipy.stats import binomtest
n_above_cutoff = np.sum(zs >= qq_cutoff)
prob = norm.sf(qq_cutoff)
score_cutoff = expit(qq_cutoff*fitstd+fitmean)
print("SNOWFLAKE cutoff: {:.3f}".format(score_cutoff))
print("Probability of an inactive scoring above SNOWFLAKE: {:.4f}".format(prob))
print("{} scores above the cutoff out of {}".format(n_above_cutoff,len(zs)))
print("p-value of One-sided Binomial Test: {:.4f}".format(binomtest(n_above_cutoff, len(zs), prob, alternative="greater").pvalue))
print("\n")
n_above_cutoff = np.sum(zs >= cutoff95)
prob = norm.sf(cutoff95)
score_cutoff95 = expit(cutoff95*fitstd+fitmean)
print("SNOWFLAKE95 cutoff: {:.3f}".format(score_cutoff95))
print("Probability of an inactive scoring above SNOWFLAKE95: {:.4f}".format(prob))
print("{} scores above the 95% cutoff out of {}".format(n_above_cutoff,len(zs)))
print("p-value of One-sided Binomial Test: {:.4f}".format(binomtest(n_above_cutoff, len(zs), prob, alternative="greater").pvalue))
print("")
```

```
plt.hist(scores , bins=50, histtype='stepfilled', color='k', alpha=0.4, label="Scores")
plt.vlines(score_cutoff, 0, len(scores)*0.05, linestyles="--", linewidth=4, label="SNOWFLAKE: {:.3f}".format(score_cutoff))
plt.vlines(score_cutoff95, 0, len(scores)*0.05, linestyles=":", linewidth=4, label="SNOWFLAKE95: {:.3f}".format(score_cutoff95))
plt.legend(loc="upper left");
```

If we go back to our hidden knowledge about the scores of actives and decoys, we can see how the **SNOWFLAKE** cutoffs would do at selecting a subset enriched with actives:

```
acts_selected = np.sum(act_scores >= score_cutoff)
inacts_selected = np.sum(inact_scores >= score_cutoff)
n_above_cutoff = np.sum(zs >= qq_cutoff)
print("Actives Selected: {} of {}".format(acts_selected, len(act_scores)))
print("Inactives Selected: {} of {}".format(inacts_selected, len(inact_scores)))
print("Actives in Selected Scores {} of {} PPV: {:.3f}".format(acts_selected, n_above_cutoff,1.*acts_selected/n_above_cutoff))
print("\n")
acts_selected = np.sum(act_scores >= score_cutoff95)
inacts_selected = np.sum(inact_scores >= score_cutoff95)
n_above_cutoff = np.sum(zs >= cutoff95)
print("Actives Selected: {} of {}".format(acts_selected, len(act_scores)))
print("Inactives Selected: {} of {}".format(inacts_selected, len(inact_scores)))
print("Actives in Selected Scores {} of {} PPV: {:.3f}".format(acts_selected, n_above_cutoff,1.*acts_selected/n_above_cutoff))
plt.hist(act_scores , bins=50, histtype='stepfilled', alpha=0.4, label="Actives")
plt.hist(inact_scores , bins=50, histtype='stepfilled', alpha=0.4, label="Inactives")
plt.vlines(score_cutoff, 0, len(scores)*0.05, linestyles="--", linewidth=4, label="SNOWFLAKE: {:.2f}".format(score_cutoff))
plt.vlines(score_cutoff95, 0, len(scores)*0.05, linestyles=":", linewidth=4, label="SNOWFLAKE95: {:.2f}".format(score_cutoff95))
plt.legend(loc="upper left");
```

### Snowflake Function

Below we wrap all of snowflake into one function which performs the analysis and optionally generates the plots.

```
from snowflake import snowflake
```
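The `snowflake` module here is local to this post. For readers without it, a minimal sketch of such a function, assembled from the steps walked through above (the actual module may differ in details such as plotting and edge-case handling), might look like:

```python
import numpy as np
from scipy.special import logit, expit
from scipy.stats import norm, zscore, probplot

def snowflake_sketch(scores):
    """Hypothetical re-implementation of the analysis above.

    Takes bounded (0, 1) scores; returns (snowflake_cutoff,
    snowflake95_cutoff) mapped back into score space.
    """
    tr = logit(scores)                       # bounded -> unbounded
    fitmean, fitstd = norm.fit(tr)
    zs = np.sort(zscore(tr))
    # QQ cutoff: highest Z where observed quantiles still fall below expected
    osm, osr = probplot(zs, fit=False, plot=None)
    qq_cutoff = osm[np.where(osr - osm < 0.0)[0][-1]]
    # 95% cutoff: where the observed exceedance count leaves the binomial CI
    x = np.linspace(-5, 5, 1000)
    n = len(zs)
    obs_count = np.asarray([np.sum(zs >= i) for i in x])
    binom_std = np.sqrt(n * norm.sf(x) * (1 - norm.sf(x)))
    diff = obs_count - (norm.sf(x) * n + 2 * binom_std)
    x_idx = np.argwhere(x < qq_cutoff)[-1][0]
    cutoff95 = x[x_idx + np.where(diff[x_idx:] > 0.0)[0][0]]
    # Map the Z cutoffs back into score space via the fitted Gaussian
    return (expit(qq_cutoff * fitstd + fitmean),
            expit(cutoff95 * fitstd + fitmean))
```

This is only a sketch under the assumptions stated in the post (bounded scores, a roughly Gaussian inactive distribution after the logit transform); the empty-array cases are not guarded.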

```
snowflake_cutoff, snowflake95_cutoff = snowflake(scores)
```

### Real World Example

We can now apply SNOWFLAKE to real data, in this case a VS from a DUDE validation exercise using ROCS performed on just over 9 million molecules.

**Loading and working with data of this size may take up to a minute on some systems.**

```
import pandas as pd
big = pd.read_table("/Users/mattg/Downloads/Data_fits/fak1_decoys_default_xray_1.rpt", usecols=['TanimotoCombo'])
```

```
snow_cut, snow95_cut = snowflake(big.TanimotoCombo.values/2., do_plot=True)
```

This post was written in a Jupyter Notebook. You can find the notebook here with instructions for downloading and running it yourself.