Zhiliang's Workbench:
Information / progress track

Major works performed: May 20, 2011 - June 15, 2011

Project: Pig array annotations - comparisons across 14 platforms

GOALS: - Blast link array elements to match their annotations
       - Analyze the annotations

DATA PREP:

  1. A-MEXP-693.adf: In-house printed array using the Illumina RefSet human
                     oligonucleotide collection (link)

  2. GPL1881.txt: Qiagen-NRSP-8 porcine oligo array

  3. GPL3594_GPL3585_GPL6173.txt: GPL3585: DIAS_PIG_55K2_v1
                                  GPL3594: DIAS_PIG_27K2_v2
                                  GPL6173: DJF__Pig_55K__v1 
                 (which could be called "combined Danish microarray set")

  4. GPL3764.txt: Porcine oligo microarray version 3 (POM3)

  5. GPL4930.txt: In-house made at "University of Illinois at Urbana-Champaign,
                  Urbana IL" (link)

  6. GPL5622.txt: SLA-RI/NRSP8-13K (which could be called "SLA_PrV porcine
                  DNA/cDNA microarray") this is the GEO name

  7. GPL7151.txt: SLA/Immune Response/NRSP8 Pig 70 mers Oligonucleotides
                  3.8K + 13.3K v1

  8. GPL7435.txt: Swine Protein-Annotated Oligonucleotide Microarray
                  (Illumina Oligo synthesis)

  9. GPL7576.txt: Porcine oligonucleotide microarray version 4 (POM4)
                  (Condensed version)

 10. GPL8448.txt: USDA/APHIS/FADDL Pan-viral 15K v4.2(Agilent)

 11. GPL9710.txt: Operon Pig 14.4K genome microarray v1.0.2 (aka 13K NRSP8 oligo array)
                  and effectively there are the 2 Affy platforms
                  (GPL9710 overlaps almost completely with GPL1881. They are
                  different spottings of the same oligo set from Qiagen-Operon)

The Affy Array:

  2010: - SNOWBALL_array_seqs.fa             -- Already annotated by FIOS
        - miRNAs_array_seqs_v4.fa            |- The three sections of the 2010
        - unique_coding_seqs_for_array_v4.fa |  Affy platform, which will be merged
        - virus_genomes_array_seqs_v4.fa     |  at some point into one "SNOWBALL"
                                             |  platform (Chris Tuggle; 2011).
                                             [2012 update: Freeman et al., 2012]

  2005: - affy_ssc_consensus:  23935 sequences  |- 2005 data; Added on Oct 06, 2011
        - affy_ssc_target:     24123 sequences  |- 2005 data; Added on Oct 06, 2011
        - newAffy_probes:     599981 sequences  |- 2011 data; Added on Oct 10, 2011

APPROACH:

  Since the IPA is the most recent, inclusive data set (combining all
  previously known Affy data sets), and has been well annotated, the basic
  approach is to blast match all "other" affy platforms to IPA to get an idea
  how they link to each other.

PREVIOUS ANNOTATIONS:

  1. annot_Affy_20k.csv (original file name: "Affy_20k_annot.csv"):
     (from Oliver)

  2. annot_Affy_20k_hs.csv (original file name: "Affy_20k_annot_human.csv"):
     (from Oliver)

  3. annot_IPA.csv (original file name: "ipa_annot.csv"):
     (from Oliver, replacing "annot_Affy_20k.csv" and "annot_Affy_20k_hs.csv")

  4. annot_snowball_20110509 (original file name: "snowball_annotation_20110509")
     "snowball" annotation from Dario - snowball is the name of the new pig
     affy chip. The original annotation which I think comes from Affy; not
     sure. The dates can be confusing on these files. This one is 08-03-11
     which means March 8 2011. Dario's dates means May 9 2011 (Chris Tuggle).

WORK LOCATIONS:

  Project dir:      DELL:/home/hu/projects/Tuggle_annot
  Blast dir:        DELL:/cluster/nagrp/run/ or CLUSTER:/storage/nagrp/run/
  IPA Annotations:  MySQL::annotdb::IPAannot (569,378 rows)
  Planned database: MySQL::host_arrayanno::Iblasted (not used)
  Results dir:      DELL:~apache/doc/pig/projects/array_annotatn
  Results db:       MySQL::annotdb::Iblasted

PROGRESS/STATUS:

  1. Blast:

     Initial considerations: try to see the "coverage" - to be conservative
     such that it will be more "inclusive".
     Threshold: E-value cut off: 1e-3; Priming seq length: 15 bp

     May 15, 2011: Initial blast

     FASTA sequence file (platform)        Number of    Sequences with at
                                           sequences   least 1 hit to IPA
     ----------------------------------    ---------   ------------------
     A-MEXP-693.fa                             22800                 7174
     GPL1881.fa                                13677                13132
     GPL3594_GPL3585_GPL6173.fa                26035                 4242
     GPL3764.fa                                 1818                 1764
     GPL4930.fa                                13297                13132
     GPL5622.fa                                17459                16795
     GPL7151.fa                                17459                16795
     GPL7435.fa                                20400                19489
     GPL7576.fa                                  357                  346
     GPL8448.fa                                14985                  864
     GPL9710.fa                                14057                13324
     SNOWBALL_array_seqs.fa                  1091987               752423
     miRNAs_array_seqs_v4.fa                    2370                  336
     unique_coding_seqs_for_array_v4.fa        47769                58481
     virus_genomes_array_seqs_v4.fa              354 [column split unclear]
     IPA.fa                                   639177

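
A minimal sketch of how the two count columns above could be tallied. It assumes the blast output was saved in the standard 12-column tab-delimited format with the query ID in the first column; the log does not state which output format was actually used, and the function names and sample records here are hypothetical.

```python
import io


def count_fasta_records(handle):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


def count_queries_with_hits(handle):
    """Count distinct query IDs (column 1) in tab-delimited blast output."""
    queries = set()
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        queries.add(line.split("\t")[0])
    return len(queries)


# Made-up inputs for illustration only (not data from this project).
fasta_text = ">probe_1\nACGT\n>probe_2\nGGCC\n>probe_3\nTTAA\n"
blast_text = (
    "probe_1\tIPA_0001\t98.5\t60\t1\t0\t1\t60\t1\t60\t1e-25\t110\n"
    "probe_1\tIPA_0002\t91.0\t55\t5\t0\t1\t55\t3\t57\t1e-15\t80\n"
    "probe_3\tIPA_0100\t100.0\t70\t0\t0\t1\t70\t1\t70\t1e-35\t139\n"
)

n_seqs = count_fasta_records(io.StringIO(fasta_text))
n_with_hit = count_queries_with_hits(io.StringIO(blast_text))
print(n_seqs, n_with_hit)
```

Using a set of query IDs means a probe with several hits to IPA is counted once.
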
     May 18, 2011: Decision from the conf call:
                   - Use E-value cut off of 1e-10 (pig to pig; human to refseq)
                   - Leave out "virus_genomes_array_seqs_v4.fa"

     Jun 01, 2011: 2nd results: Blast at 1e-10

  2. Blast results trim (by minimum overlap lengths / % identity):

     Jun 04, 2011: Filter results by Score > 40
                                     Alignment length > 20
                                     Identity > 80%

     Jun 07, 2011: New Blast data summary

     FASTA sequence file (platform)        Number of    Sequences with at
                                           sequences   least 1 hit to IPA
     ----------------------------------    ---------   ------------------
     A-MEXP-693                                22800                 7174
     GPL1881                                   13677                13324
     GPL3594-3585-6173                         26035                34676
     GPL3764                                    1818                 1764
     GPL4930                                   13297                13132
     GPL5622                                   17459                16795
     GPL7151                                   17459                16795
     GPL7435                                   20400                19489
     GPL7576                                     357                  346
     GPL8448                                   14985                  864
     GPL9710                                   14057                13324
     v4_coding.seq                             47769                58217
     v4_miRNAs                                  2370                  336
     v4_SNOWBALL                             1091987               752423
     IPA.fa                                   639177

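
The Jun 04 trimming step (Score > 40, alignment length > 20, identity > 80%) can be sketched as below. This is an illustration under stated assumptions, not the script that was actually run: it assumes 12-column tab-delimited blast output (percent identity in column 3, alignment length in column 4, bit score in column 12) and it treats "Score" as the bit score. The constant and function names and the sample rows are hypothetical.

```python
import csv
import io

# Thresholds taken from the Jun 04, 2011 log entry.
MIN_SCORE = 40.0      # assumed to be the bit score
MIN_ALIGN_LEN = 20
MIN_IDENTITY = 80.0   # percent


def passes_filter(row):
    """Return True if a tabular blast row clears all three thresholds."""
    pident = float(row[2])
    length = int(row[3])
    score = float(row[11])
    return score > MIN_SCORE and length > MIN_ALIGN_LEN and pident > MIN_IDENTITY


def filter_hits(handle):
    """Keep only well-formed rows (12+ columns) that pass the thresholds."""
    reader = csv.reader(handle, delimiter="\t")
    return [row for row in reader if len(row) >= 12 and passes_filter(row)]


# Made-up rows for illustration only.
sample = (
    "probe_1\tIPA_0001\t98.5\t60\t1\t0\t1\t60\t1\t60\t1e-25\t110\n"   # passes
    "probe_2\tIPA_0002\t78.0\t60\t13\t0\t1\t60\t1\t60\t1e-12\t75\n"   # identity too low
    "probe_3\tIPA_0003\t100.0\t18\t0\t0\t1\t18\t1\t18\t1e-11\t45\n"   # alignment too short
    "probe_4\tIPA_0004\t95.0\t30\t1\t0\t1\t30\t1\t30\t1e-10\t38.2\n"  # score too low
)

kept = filter_hits(io.StringIO(sample))
print(len(kept), kept[0][0])
```

The comparisons are strict (">"), matching how the thresholds are written in the log; the E-value cut off of 1e-10 is not re-checked here because it was applied at blast time.
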
  3. Port the blast results to database:

     Jun 08, 2011: done.

  4. Build db queries to integrate the combined annotation matches

     Jun 09, 2011 / Jun 15, 2011:

     Query 1: The match matrix of the 14 platforms' elements to IPA, with
              annotations
              o The output is limited to 200 for preview;
              o Download the entire data set in this format, tab delimited, here

     Query 2: Blast match of the positive blast hits to IPA, with blast scores
              o The output is limited to 200 for preview;
              o Download the entire data set in this format, tab delimited, here

  5. * Fine-tune the thresholds for optimal matches
     * Follow-up work to improve the matches

     Oct 3, 2011:  Added Gene IDs (NCBI) and Symbols to the Query 1 results

     Oct 6, 2011:  Added Affy platform (see updated Query 1 link for results).
                   The overall platforms are now:

                    1. A-MEXP-693
                    2. Affy_consensus / Affy_target   <-- NEW!
                    3. GPL1881
                    4. GPL3594-3585-6173
                    5. GPL3764
                    6. GPL4930
                    7. GPL5622
                    8. GPL7151
                    9. GPL7435
                   10. GPL7576
                   11. GPL8448
                   12. GPL9710
                   13. v4_coding.seq
                   14. v4_miRNAs
                   15. v4_SNOWBALL

     Oct 10, 2011: Added "new Affy" platform (see updated Query 1 link for
                   results). The overall platforms are now:

                    1. A-MEXP-693            7174
                    2. affy_consensus       27897   <-- NEW!
                       affy_target          24717   <-- NEW!
                    3. GPL1881              13324
                    4. GPL3594-3585-6173    34676
                    5. GPL3764               1764
                    6. GPL4930              13132
                    7. GPL5622              16795
                    8. GPL7151              16795
                    9. GPL7435              19489
                   10. GPL7576                346
                   11. GPL8448                864
                   12. GPL9710              13324
                   13. newAffy                252   <-- NEW!
                   14. v4_coding.seq        58217
                   15. v4_miRNAs              336
                   16. v4_SNOWBALL         752423

     x) ...

  6. Wrap up:

     May 22, 2012: The results were further analysed by Bouabid Badaoui at
                   Parco Tecnologico Padano - CERSA, Italy. A manuscript was
                   subsequently developed. The results from this pipeline are
                   used as "Supplementary Data" to the publication (public
                   site):
                   http://www.animalgenome.org/repository/pub/ITALY2013.0312

CONTACT: Chris Tuggle

LAST UPDATE: May 2012
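
For readers who want to reproduce something like the Query 1 match matrix outside the database, the following is a rough sketch only. The real queries ran against MySQL tables whose schema is not described in this log, so the input layout (platform, element ID, IPA ID), the function names, and the sample IDs are all assumptions made for illustration.

```python
from collections import defaultdict


def build_match_matrix(hits, platforms):
    """Group array element IDs by IPA ID and platform.

    hits: iterable of (platform, element_id, ipa_id) tuples.
    Every platform named in hits is expected to appear in platforms.
    """
    matrix = defaultdict(lambda: {p: [] for p in platforms})
    for platform, element_id, ipa_id in hits:
        matrix[ipa_id][platform].append(element_id)
    return matrix


def to_tsv(matrix, platforms):
    """Render the matrix as tab-delimited text, one row per IPA ID."""
    lines = ["\t".join(["IPA_ID"] + list(platforms))]
    for ipa_id in sorted(matrix):
        cells = [",".join(matrix[ipa_id][p]) for p in platforms]
        lines.append("\t".join([ipa_id] + cells))
    return "\n".join(lines)


# Made-up hits for illustration only.
platforms = ["GPL1881", "GPL9710", "GPL3764"]
hits = [
    ("GPL1881", "e1", "IPA_0001"),
    ("GPL9710", "s7", "IPA_0001"),
    ("GPL3764", "p3", "IPA_0002"),
    ("GPL1881", "e2", "IPA_0001"),
]

matrix = build_match_matrix(hits, platforms)
tsv = to_tsv(matrix, platforms)
print(tsv)
```

Each row lists, per platform, the elements that hit the same IPA entry; an empty cell means that platform had no element matching it.
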

Reference:
  • May 20, 2011 - June 15, 2012 • Zhiliang Hu •