Wednesday, November 28, 2007

Biclusters and support threshold.

Each relation was biclustered at several different support values.
The support thresholds chosen for the relations are:
1) Gene - metabolite = 1
In this relation, we excluded the following metabolites because they occur in too many reactions:

GDP, Hydrogen peroxide, 2-Oxoglutarate, Ammonium, Acetyl-CoA, L-Glutamate, O2, Nicotinamide adenine dinucleotide - reduced, AMP, Nicotinamide adenine dinucleotide, CO2, Coenzyme A, Nicotinamide adenine dinucleotide phosphate - reduced, Nicotinamide adenine dinucleotide phosphate, Diphosphate, Phosphate, ADP, ATP, H2O, H+

Removing these metabolites reduces the total number of genes in the relation from 748 to 627.
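
A minimal sketch of this filtering step, assuming the relation is held as (gene, metabolite) pairs. The gene IDs and pairs are made-up toy data, and EXCLUDED lists only a few of the metabolites above for brevity. It shows why genes disappear: a gene drops out only when every metabolite it touches is excluded.

```python
# Toy subset of the excluded metabolites listed above (not the full list).
EXCLUDED = {"ATP", "ADP", "H2O", "H+"}

def filter_relation(pairs, excluded):
    """Keep only the (gene, metabolite) pairs whose metabolite is not excluded."""
    return [(gene, met) for gene, met in pairs if met not in excluded]

def genes_in(pairs):
    """Set of genes that still appear in the relation."""
    return {gene for gene, _ in pairs}

# Hypothetical pairs; real gene IDs would be Ynumbers.
pairs = [
    ("gene_a", "ATP"),
    ("gene_a", "Pyruvate"),
    ("gene_b", "D-Glucose 6-phosphate"),
    ("gene_c", "ATP"),
    ("gene_c", "H+"),
]

kept = filter_relation(pairs, EXCLUDED)
lost = genes_in(pairs) - genes_in(kept)  # genes whose only partners were excluded
print(len(genes_in(pairs)), "->", len(genes_in(kept)), "genes; lost:", sorted(lost))
```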

2) GO - Bio - Gene = 1
3) GO - Cel - Gene = 1
4) GO - Mol - Gene = 1
5) Gene - Biochem pathway = 1

6) DNABinding
We have 14 different DNA-binding relations. The support threshold for each relation is:
Acid - 1
Alpha - 1
BUT14 - 1
BUT90 - 1
GAL - 1
H2O2 Hi - 1
H2O2 Lo - 1
HEAT - 1
Pi - 1
RAFF - 1
RAPA - 1
SM - 1
Thi - 1
YPD - 15

YPD was a huge relation. The number of biclusters at
support 1 = 61431
support 10 = 27156
support 11 = 22225
support 12 = 18081
support 13 = 14594
support 14 = 11752
support 15 = 9510
support 16 = 7752
support 17 = 6413
support 20 = 3874
support 25 = 1950
support 30 = 1094
support 35 = 670
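
As a rough sketch, the counts above can be screened against a capacity limit. The 10,000 figure is the number of biclusters the CDM pipeline is said to handle; treating it as a hard cap is my assumption. The function name is hypothetical. This covers only the count-based part of the decision, not the bicluster sizes.

```python
# Bicluster counts for YPD, copied from the list above (support -> count).
counts = {
    1: 61431, 10: 27156, 11: 22225, 12: 18081, 13: 14594,
    14: 11752, 15: 9510, 16: 7752, 17: 6413,
    20: 3874, 25: 1950, 30: 1094, 35: 670,
}

LIMIT = 10_000  # assumed hard cap on what the pipeline can handle

def smallest_feasible_support(counts, limit):
    """Lowest support whose bicluster count fits under the limit, or None."""
    feasible = [support for support, n in counts.items() if n <= limit]
    return min(feasible) if feasible else None

best = smallest_feasible_support(counts, LIMIT)
print("smallest support under the limit:", best, "with", counts[best], "biclusters")
```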

Based on the number of biclusters alone, support 1 and support 10 were eliminated. For the rest, I plotted the size of the biclusters (rows x columns) against their support.

Since 10,000 is a number of biclusters that the CDM pipeline can handle, the choice came down to support 14, 15 and 16. I plotted the following graphs to judge how good the bicluster sizes are:

1) Number of genes x Number of TFs
2) Number of genes x Number of biclusters with Y number of genes
3) Number of TFs x Number of biclusters with Y number of TFs
4) Number of genes x Number of TFs x Number of biclusters with X genes and Y TFs

I chose 15 as the support threshold for YPD because at this threshold the number of biclusters is feasible to handle and the number of biclusters with large numbers of rows and columns is considerably high.
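
The tallies behind size plots like these can be sketched as follows, assuming each bicluster is stored as a (set of genes, set of TFs) pair. The biclusters are made-up toy data and the variable names are hypothetical; only the counting is shown, not the plotting.

```python
from collections import Counter

# Hypothetical biclusters: (genes, TFs) pairs.
biclusters = [
    ({"g1", "g2", "g3"}, {"tf1", "tf2"}),
    ({"g1", "g2"}, {"tf1", "tf2", "tf3"}),
    ({"g4", "g5", "g6"}, {"tf2", "tf4"}),
]

# (number of genes, number of TFs) for every bicluster.
shapes = [(len(genes), len(tfs)) for genes, tfs in biclusters]

by_genes = Counter(n_genes for n_genes, _ in shapes)  # biclusters per gene count
by_tfs = Counter(n_tfs for _, n_tfs in shapes)        # biclusters per TF count
by_shape = Counter(shapes)                            # biclusters per (genes, TFs) shape

print(dict(by_genes), dict(by_tfs), dict(by_shape))
```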

Saturday, November 24, 2007

CDM over yeast data

To generate chains in the yeast data, we are considering the following information. Each yeast gene is identified by its Ynumber. The mapping of Ynumbers to other IDs was downloaded from SGD.

1) Gene - gene
I am using the yeast transcriptional network published by Harbison et al. in Nature 431: 99-104, 2004. This data has been split according to different stress conditions. There are 6229 genes. Under each stress condition, we test which genes particular TFs have an effect on. So for each condition we have a 2D matrix with genes along the rows and the TFs of interest along the columns. I have 14 such conditions.
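
A minimal sketch of one such per-condition matrix, assuming the binding calls for a condition are already available as (gene, TF) pairs. How a binding call is made from the raw data is not covered here. All identifiers and data are hypothetical.

```python
def build_matrix(genes, tfs, bound_pairs):
    """Return a 0/1 matrix with genes along the rows and TFs along the columns."""
    row = {gene: i for i, gene in enumerate(genes)}
    col = {tf: j for j, tf in enumerate(tfs)}
    matrix = [[0] * len(tfs) for _ in genes]
    for gene, tf in bound_pairs:
        matrix[row[gene]][col[tf]] = 1
    return matrix

# Toy data for a single condition; real rows would be Ynumbers.
genes = ["gene_a", "gene_b", "gene_c"]
tfs = ["tf1", "tf2"]
bound_pairs = [("gene_a", "tf1"), ("gene_c", "tf1"), ("gene_c", "tf2")]

matrix = build_matrix(genes, tfs, bound_pairs)
for gene, values in zip(genes, matrix):
    print(gene, values)
```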

2) Gene - protein
I downloaded protein - protein interaction data for yeast from CYGD, SGD and DIP. I extracted the interacting gene IDs and mapped them to their respective Ynumbers. All three sets of interactions were combined. In total, I have 5799 genes and 76125 interactions.
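
One way to combine the three sets is sketched below. It assumes that interactions are undirected and that the same pair reported by more than one database should count once; neither assumption is stated above. The function names and the toy pairs are hypothetical.

```python
def normalise(a, b):
    """Order the two gene IDs so that (a, b) and (b, a) become the same edge."""
    return tuple(sorted((a, b)))

def merge_interactions(*sources):
    """Union of all interaction lists, with duplicate edges collapsed."""
    edges = set()
    for source in sources:
        for a, b in source:
            edges.add(normalise(a, b))
    return edges

# Toy interaction lists standing in for the three downloads.
cygd = [("A", "B"), ("B", "C")]
sgd = [("B", "A"), ("C", "D")]
dip = [("D", "C"), ("A", "D")]

edges = merge_interactions(cygd, sgd, dip)
genes = {gene for edge in edges for gene in edge}
print(len(genes), "genes,", len(edges), "interactions")
```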

3) Gene - metabolite
I used the iND750 yeast metabolic network model published by Duarte NC et al., Genome Res. 2004 Jul;14(7):1298-309. We considered the following ways to connect two genes sharing a metabolite:
a) connect 2 genes acting on the same metabolite in the same pathway
b) connect all genes in one pathway
c)

4) Gene - biochemical pathway
This data was downloaded from SGD and was also biclustered.

5) Gene - GO annotations
The data was downloaded from the GO website. This file was deposited on 10/12/2007.
I separated the molecular function, biological process and cellular component relations.
The three relations were then biclustered.