GENOMICS
PROTEOMICS &
BIOINFORMATICS
www.sciencedirect.com/science/journal/16720229
Article
Mining Functional Gene Modules Linked with Rheumatoid
Arthritis Using a SNP-SNP Network
Lin Hua, Hui Lin, Dongguo Li, Lin Li*, and Zhicheng Liu
Biomedical Engineering Institute, Capital Medical University, Beijing 100069, China.
Genomics Proteomics Bioinformatics 2012 Feb; 10(1): 23-34 DOI: 10.1016/S1672-0229(11)60030-2
Received: Mar 11, 2011; Accepted: Aug 31, 2011
Abstract
The identification of functional gene modules that are derived from integration of information from different types
of networks is a powerful strategy for interpreting the etiology of complex diseases such as rheumatoid arthritis
(RA). Genetic variants are known to increase the risk of developing RA. Here, a novel method, the construction of
a genetic network, was used to mine functional gene modules linked with RA. A polymorphism interaction analy-
sis (PIA) algorithm was used to obtain cooperating single nucleotide polymorphisms (SNPs) that contribute to RA
disease. The acquired SNP pairs were used to construct a SNP-SNP network. Sub-networks defined by hub SNPs
were then extracted and turned into gene modules by mapping SNPs to genes using dbSNP database. We per-
formed Gene Ontology (GO) analysis on each gene module, and some GO terms enriched in the gene modules
can be used to investigate clustered gene function for better understanding RA pathogenesis. This method was
applied to the Genetic Analysis Workshop 15 (GAW 15) RA dataset. The results show that genes involved in func-
tional gene modules, such as CD160 (rs744877) and RUNX1 (rs2051179), are especially relevant to RA, which is
supported by previous reports. Furthermore, the 43 SNPs involved in the identified gene modules were found to
be the best classifiers when used as variables for sample classification.
Key words: polymorphism interaction analysis, hub SNP, sub-networks, GO enrichment analysis
Introduction
It is well-recognized that complex diseases are caused
by multiple gene-gene interactions, in which each
gene may have a small effect on disease development,
rather than by single gene defects (1). As high-density
single nucleotide polymorphism (SNP) arrays and
subsequent genome-wide association studies (GWAS)
were developed, the study of complex diseases has
*Corresponding author.
E-mail: lil@ccmu.edu.cn
© 2012 Beijing Institute of Genomics. All rights reserved.
become of widespread interest for researchers. Tradi-
tional methods of genetic analysis are often weak
when applied to some complex diseases, which are
most likely to be both genetically multifactorial and
phenotypically heterogeneous. It is therefore sug-
gested that the study of complex diseases should not
be restricted to single gene identification, but should
focus on gene interaction studies. Recently, there have
been several studies exploring gene-gene interactions
in different ways (2-5).
Furthermore, more and more evidence shows that
investigating gene-gene interactions may lead to the
development of a functional network and functional
Hua et al. / Mining Gene Modules Using SNP-SNP Network
Genomics Proteomics Bioinformatics 2012 Feb; 10(1): 23-34 24
modules (6). Iossifov et al (7) predicted pathways or
networks of interacting genes that contribute to com-
mon heritable disorders by combining standard ge-
netic linkage formalism with whole-genome molecu-
lar interaction data. Similarly, Wang et al (8) demon-
strated pathway-based approaches, which jointly con-
sidered multiple contributing factors in the same
pathway. In addition, Franke et al (9) developed a
functional human gene network that integrated infor-
mation on genes and the functional relationships be-
tween genes based on multiple databases. They used
the network to identify important candidate genes
from numerous loci on the basis of their functional
interactions and reduced the cost of pinpointing true
disease genes in the analyses of disorders. In summary,
molecular networks can be obtained from many levels
including co-expression (10), co-regulation (11) or
protein-protein interactions (6). Depending on the
different networks, a variety of methods have been
suggested for the mining of useful functional infor-
mation, such as clustering genes that show high cor-
relation coefficients between gene expression profiles
(12, 13), identifying functional modules based on the
structure of transcriptional regulation (14), or pre-
dicting functional modules encoded in a microbial
genome (15). With the rapid development of GWAS,
the construction methods of molecular networks pro-
vided us with a potential strategy for obtaining a net-
work at the genetic level by using predicted interac-
tions between SNPs. Networks like this may show
special features due to the genetic component and
may aid in the explanation of complex diseases. Ac-
cordingly, by introducing disease information into
such a network and further analyzing functional gene
modules, we can learn more about the functional cha-
racteristics of disease etiology.
As we know, rheumatoid arthritis (RA) is a chronic
disease that leads to inflammation of the joints and
surrounding tissues. Recent studies have indicated
that genetic factors play important roles in the in-
creased risk of developing RA. In the present study,
we present a novel method for mining functional gene
modules linked with RA. First, we carried out the
Haseman-Elston (H-E) test (16) and Random Forest
(RF) algorithm (17) to screen out disease-related
SNPs from a whole-genome dataset. Secondly, can-
didate SNPs shared by the H-E test and RF algorithm
were used to construct the SNP-SNP network with
polymorphism interaction analysis (PIA) algorithm
(18), which was developed as a new method to iden-
tify the synergistic contribution of SNPs to diseases.
Then, sub-networks were extracted by analyzing the
structure of the SNP-SNP network. Further, using the
dbSNP database, all of sub-networks were mapped
onto gene modules. We used a permutation-based
procedure to evaluate the significance of associated
SNP pairs. For the five gene modules we discovered,
Gene Ontology (GO) analysis indicated that genes
within a common module were likely to be enriched
on some RA-related GO terms. Furthermore, the 43
SNPs involved in the identified gene modules, were
found to be the best classifiers when used as variables
for sample classification. Finally, we compared the
results of our method to existing tools including
GRAIL (19) and GSEA-SNP (20) to evaluate the
similarity and novelty of our results.
Results
Construction of RA-specific SNP-SNP net-
work
In this study, we defined a total score for each SNP-
SNP pair as described in the Materials and Methods
section. We found when we kept the top 1,000 SNP
pairs obtained using each of seven scores involved in
PIA, only a small number of overlapping SNP pairs
were found. However, it is interesting to note that us-
ing the aforementioned total score, we acquired the
maximum number of overlapping SNP pairs with all
of seven measures involved in PIA. Therefore, total
score was a more reasonable measure for evaluating
cooperating SNP pairs contributing to disease. As a
result, we used total score to evaluate the interaction
strength of each SNP-SNP pair (Table S1).
According to our permutation test as described in
the Materials and Methods section, the empirical dis-
tribution of total scores was formed from 1,000,000
scores, and a threshold value of total score
( 1.3394S ) was considered as a cut-off value at a
significance level (P=0.05) to screen out SNP pairs
(Figure 1). Among the top 1,000 SNP pairs acquired
with the original dataset, we found that the total
Hua et al. / Mining Gene Modules Using SNP-SNP Network
Genomics Proteomics Bioinformatics 2012 Feb; 10(1): 23-34 25
scores of the top 100 SNP pairs were all greater than
the threshold value. We used these significant SNP
pairs to construct a SNP-SNP network specific to RA.
This network contains 110 SNPs and 100 edges,
where an edge indicated a SNP pair. The total scores
and their corresponding P-values for the top 100 SNP
pairs are shown in Table S2. The SNP-SNP network
shown in Figure S1 was generated with MAVisto
software (21).
Figure 1 The empirical distribution of total scores. By per-
muting sample labels 1,000 times, the PIA algorithm is per-
formed repeatedly for 1,000 new datasets. The empirical dis-
tribution of total scores is formed from all above results. The
threshold value of 1.3394 corresponds to a significance level of
0.045.
Identification of functional gene modules
According to our rule for extracting hub SNP, five
hub SNPs with a degree greater than 5 were extracted:
rs1424903 (degree=18, P=1.5×10-13), rs744877 (de-
gree=9, P=2.3×105), rs164466 (degree=5, P=0.010),
rs1004531 (degree=5, P=0.010) and rs759382 (de-
gree=5, P=0.010). Then, five sub-networks defined by
hub SNPs were extracted. To mine functional gene
modules and high risk genes linked with RA, we
mapped the SNPs onto genes using a dbSNP database.
We calculated the distances of all SNPs to the splice
variants of their nearest genes along chromosomes.
The highest frequency occurs in the range from 0 to
4,000 base pairs. This result closely agrees with pre-
vious reports, in which SNPs that are >500 kb away
from any gene are not considered because most en-
hancers and repressors are <500 kb away from genes,
and most linkage disequilibrium blocks are <500 kb
(8). This mapping method allowed us to identify
genes implicated by SNPs of sub-networks to obtain
gene-gene interaction modules. As a result, 59 genes
associated with RA were identified from 110 SNPs
involved in the SNP-SNP network. A total of 12 genes
were associated with 19 SNPs in the
rs1424903-related gene module. The rs744877
(CD160)-related gene module contained seven genes
associated with 10 SNPs. five genes corresponding to
six SNPs were included in the rs1004531 (TNFAIP8)
related gene module. The rs164466-related and
rs759382 (SLC9A4)-related gene modules each con-
tained two genes corresponding to six SNPs.
We identified 59 genes involved in the SNP-SNP
network as the background set, and genes involved in
5 gene modules as the test sets. Using a significance
level of 0.05, most enrichment results were found in
the rs1424903-related and rs744877-related gene
modules. At a significance level of 0.1, two enrich-
ment GO terms occurred in the rs1004531-related
gene module: GO: 0005515 (P=0.08591) and GO:
0005886 (P=0.0702). However, no distinct enrich-
ment phenomena were seen in other gene modules
owing to their low number of genes. Those gene
modules with significant GO terms were considered
functional gene modules relevant to RA. Two gene
modules particularly enriched in GO terms (the
rs1424903-related and rs744877-related gene modules)
are shown in Table 1. In addition, the enrichment re-
sults for five gene modules are shown in the heat map
generated by Cytoscape software (22) in Figure S2.
We found that SNPs [rs2051179 (RUNX1),
rs164466, rs1424903, rs744877 (CD160) and
rs759382 (SLC9A4)] involved in functional gene
modules were previously identified as susceptibility
loci in a study using ensemble decision trees (4) and
another study using the BGTA algorithm (14). As
shown by “Molecular Function” in Table 1, the
rs1424903-related gene module was relevant to pro-
tein binding (GO: 0005515). Kristensen has previ-
ously reported that there seems to be a qualitative
rather than a quantitative change in 3H-imipramine
binding in patients with RA (23). As shown by the
dimension “Biological Process", a significant GO
term was regulation of transcription (GO: 0006355).
Hua et al. / Mining Gene Modules Using SNP-SNP Network
Genomics Proteomics Bioinformatics 2012 Feb; 10(1): 23-34 26
Table 1 Enriched GO terms with P<0.1 in the rs1424903-related and rs744877-related gene modules
Gene module Category GO term P n# m# Description
rs1424903-related MF GO:0005515 0.0550 16 5 Protein-binding
GO:0003700 0.0523 5 2 Transcriptional activator activity
GO:0008270 0.0461 8 3 Zinc ion-binding
GO:0005524 0.0729 9 3 ATP-binding
BP GO:0006355 0.0461 8 3 Regulation of transcription
CC GO:0005634 0.0729 9 3 Nucleus
GO:0005622 0.0461 8 3 Intracellular
GO:0005737 0.0360 11 4 Cytoplasm
GO:0016021 0.0729 9 3 Integral to membrane
GO:0005886 0.0919 6 2 Plasma membrane
rs744877-related MF GO:0005524 0.0401 9 2 ATP-binding
BP GO:0007165 0.0182 7 2 Signal transduction
CC GO:0016021 0.0401 9 2 Integral to membrane
GO:0005886 0.0109 6 2 Plasma membrane
Note: n#, Number of genes contained in a category counted using 59 background genes. m#, Number of genes contained in a category counted using
12 genes and 6 genes for the rs1424903-related and rs744877-related gene modules, respectively. MF stands for Molecular Function; BP and CC
stand for Biological Process and Cellular Component, respectively. Enriched GO terms with P<0.05 are in bold.
Redlich et al have previously found that overexpres-
sion of Ets-1 in RA synovial tissue may be due to tu-
mor necrosis factor-alpha (TNF-) and interleukin 1
(IL-1). Therefore, they suggested that Ets-1 may be an
important transcription factor in the cytokine-mediated
inflammatory pathway and destructive cascade char-
acteristic of RA (24). Aud and Peng also investigated
whether transcription factors have important roles in
the pathogenesis of inflammatory arthritis, and they
have proposed several targets for anti-inflammatory
therapies to modulate transcription factor activity (25).
Based on the dimension “Cellular Component”, there
is also evidence to support the significance of cyto-
plasm (GO: 0005737). Anti-neutrophil cytoplasm an-
tibodies (ANCA) occur occasionally in RA, but their
incidence and clinical significance are unknown. Sa-
vige et al demonstrated that ANCA may be associated
with systemic vasculitis, and there is an incomplete
correlation between indirect immunofluorescence
patterns and antibody specificity in enzyme-linked
immunosorbent assay (ELISA) systems (26).
For the rs744877-related gene module, four sig-
nificant GO terms were found. As shown by “Mo-
lecular Function”, GO: 0005524 (ATP binding) was
related to the inflammatory response (27). Schimitz et
al (28) demonstrated that the ATP-binding cassette
(ABC) transporter, ABCA1, was induced during dif-
ferentiation of human monocytes into macrophages,
and there was a dual regulatory function for ABCA1
in macrophage lipid metabolism and inflammation. As
shown by 'Biological Process', the significance of GO:
0007165 (signal transduction) is also supported by
previous studies. Extracellular signals are transduced
intracellularly via multiple pathways, resulting in al-
terations in the transcription and translation of spe-
cific proteins. Some of these signaling pathways re-
sult in the production of proteins, including cytokines
and matrix metalloproteinases, which are implicated
in the pathogenesis of RA (29).
Further analysis revealed more valuable informa-
tion in the functional gene modules. Among the 12
genes included in the rs1424903-related gene module,
zinc-finger protein 238 (ZNF238) is attached to
zinc-finger proteins that can regulate the human im-
munodeficiency virus type 1 (HIV-1) long terminal
Hua et al. / Mining Gene Modules Using SNP-SNP Network
Genomics Proteomics Bioinformatics 2012 Feb; 10(1): 23-34 27
repeat (LTR) (30). In the rs744877-related gene mod-
ule, hub gene CD160 is a potential RA association
gene. The CD160 receptor represents a unique trig-
gering surface molecule that is expressed by cytotoxic
NK cells, participates in the inflammatory response
and determines the type of subsequent specific immu-
nity (31).
In addition, it is interesting to observe two links
among five hub SNPs. One pair was rs164466-
rs1004531 (TNFAIP8), and the other was rs1424903
rs759382 (SLC9A4). This may suggest that functional
gene modules cooperate to affect RA and highlight the
need for further study.
Comparison with GRAIL
We also sought to compare our method with another
SNP analytical tool, GRAIL. In the GRAIL program,
we took all query regions involved in the whole net-
work as the input to GRAIL. Interestingly, common
gene cliques were found between gene groups ac-
quired with GRAIL and the functional gene modules
identified with our methods (Table 2). For example,
MYH9, CTSB, ELOVL6 and PHACTR1 in the
rs1424903-related gene module were also included in
the gene group obtained with GRAIL (Ptext=0.0026).
GRAIL and our study are two methods based on dif-
ferent paths for mining gene groups associated with
disease. GRAIL can extract similar genes from all
query regions that the user is attempting to evaluate.
Compared to those gene groups acquired by GRAIL,
functional gene modules linked with RA identified
using our method are more conservative and represent
higher risk because these modules are prioritized layer
by layer, and include gene-gene interactions associ-
ated with disease.
Comparison with Gene Set Enrichment Anal-
ysis-SNP (GSEA-SNP) on five sub-networks
defined by hub SNPs
GSEA-SNP programs were performed for five
sub-networks. An enrichment score (ES) was com-
puted for each sub-network. Using a permutation test,
we obtained the threshold value of ES at a signifi-
cance level of 0.05 for each sub-network (Figure S3).
Those sub-networks with P<0.05 were extracted as
enrichment sub-networks associated with disease. The
results showed that three sub-networks were signifi-
cant: the rs1424903-related (P=0.0020), rs164466-
related (P=0.0020) and rs759382-related sub-netw-
orks (P<0.0001) (Table 3). It is worth noting that the
rs1424903-related gene module was also the functional
gene module with the most significant GO terms,
Table 2 The comparison between gene modules identified by our method and similar genes acquired with GRAIL
Gene modules identified by our method Similar genes acquired with GRAIL
Gene module (sub-network) Genes included in gene module Pmodulea Similar genes Ptext b
rs1424903-related
ZNF238, NDEL1, GMDS, RUNX1,
MYH9, ELOVL6, CTSB,
PHACTR1, BHMT2, SLC9A4,
LOC339977, ANKH
1.5E-13
PRKCB1, PLEK,
IQGAP2, MYH9,
ELOVL6, CTSB,
PHACTR1
0.0026
rs744877 (CD160)-related
CD160, C21orf34, LHFP, GRPEL2,
CNTN4, CSNK2A2, WDR62
2.3E-5
CTSB, CASP6,
PRKCB1,GRPEL2,
CNTN4, CSNK2A2
0.0131
rs1004531 (TNFAIP8)-related TNFAIP8, PLAU, RIMS1, C10orf55, ELF1 0.010
BTBD9, CASP6, CSNK2A2,
ELF1, TNFATP8, RIMS1,
PLAU
0.0249
rs164466-related TNFAIP8, MTMR9 0.010
MYH9, NLRP7,
CSNK2A2, ELOVL6,
PRKCB1
0.0038
rs759382 (SLC9A4)-related SLC9A4, C21orf34 0.010
Note: aPmodule indicates the probability of a hub SNP with >t connections (t is the degree of the hub) in a random network. bPtext is the text-based simi-
larity metric based on GRAIL. Similar genes with Ptext<0.01 by GRAIL analysis are shown. Genes underlined represent common genes shared by two
gene sets, which are gene modules identified by our method and genes acquired with GRAIL.
Hua et al. / Mining Gene Modules Using SNP-SNP Network
Genomics Proteomics Bioinformatics 2012 Feb; 10(1): 23-34 28
Table 3 GSEA-SNP results for five sub-networks defined by hub SNPs
Number of SNPs/genes
Sub-network
SNPs Genes
ES
Number of significant SNPs
by x2-test (P<0.05)
P FDR ES threshold
values
rs1424903-related 19 12 0.6672 9 0.0020 <0.001 0.5807
rs744877-related 10 7 0.5652 3 0.2398 0.002 0.6566
rs164466-related 6 2 0.8032 4 0.0020 0.002 0.7455
rs1004531-related 6 5 0.7493 3 0.0551 0