From James.Kijas csiro.au Tue Nov 6 16:46:11 2012
From: James.Kijas csiro.au (by way of Jill Maddox <jillian.maddox alumni.unimelb.edu.au>)
To: Multiple Recipients of <sheepmodels animalgenome.org>
Subject: Minutes from Sheep Consortium Call #102
Date: Tue, 06 Nov 2012 16:46:11 -0600
Minutes from ISGC Conference Call #102
Participants: Gwenola Tosser, John McEwan, Rudi Brauning, Shannon Clarke, Kim
Worley, Cindy Lawley, Brian Dalrymple, Jiang Yu, Josh Miller, Noelle Cockett,
James Kijas and Lakshmi Matukumali. The minutes were recorded by the secretary
(JK) from the meeting held at 8am QLD time Nov 6th 2012.
1. Assembly v3.1, RNA seq and plans for Annotation
Brian outlined discussions
from late September held in the UK with the ENSEMBL team around application of
their annotation pipeline for the sheep genome. Since September, v3.1 has been
released by NCBI and the ISGC has agreed not to produce another assembly
version in the near term. The current task involves collecting RNAseq datasets
generated from a distribution of tissues to use in the construction of high
quality gene models. Brian is coordinating this collection of RNAseq
datasets, and sent out an initial email to describe the process. The team
discussed when to set a deadline, after which ENSEMBL will commence
annotation. PAG was seen as an important opportunity to engage researchers who
are not regular ISGC participants, so January 31 2013 was set as the last day.
Brian described the process for submitting datasets, which is primarily via
external hard drives sent to EBI / ENSEMBL. Brian will be sending out
additional information on the process (eg technical preference for stranded
and paired end sequenced libraries) in the coming week.
Denis asked about the publication strategy and the team discussed the
timelines associated with two scenarios. First, the assembly could be
prepared for submission as a manuscript now, prior to the annotation and
associated biological interpretation. The group felt the assembly data alone
was unlikely to yield a high impact paper. The alternative is to include the
outcomes from the annotation (assumes the pipeline at ENSEMBL starts in Feb /
Mar 2013, and will take a couple of months before subsequent analysis can
commence). While this would mean a genome paper may not eventuate until 2014,
many participants on the call felt this was the only shot at a high impact
paper.
John asked if a current list of contacts could be prepared so he can check if
local researchers have been successfully linked in.
2. Sequencing to improve assembly v3.1
The team has been devising a strategy for
improving aspects of v3.1 through additional sequencing:
- there are a large number of individually small gaps (which additional ILMN
sequence is seen as unlikely to resolve)
- there are a comparatively small number of large gaps (where a BAC based
approach has been discussed)
The EU FP7 3SR consortium has funds available to assist with improvement of
the assembly, and their recent board discussion considered a range of options
in consultation with the ISGC. The resulting operational plan involves
identification of BACs spanning 30 – 40 large gaps (> 200 Kb) that appear to
be problems with the sheep genome based on comparative analysis with cattle
and human. A total of 500 BACs that span the large gaps and other difficult to
assemble regions will be identified, before 96 are sequenced at the Roslin
using 3SR funds.
Brian and Noelle noted that the costs associated with pulling and growing the
BACs may be carried by Noelle's USDA ‘finishing the genome’ grant. The group (3SR)
had also considered but abandoned plans to fill the 5 – 10 thousand short gaps
using a targeted (PCR or hybridisation based) sequencing strategy. Instead,
sequencing to understand the prevalence and distribution of sheep CNV will be
commissioned by 3SR with INRA (Thomas Faraut and Gwenola). This might best
include ISGC samples that already have CGH array data and/or 454 sequence
reads. John provided a summary of the activities that have already been
completed in his group looking at CNV. Key points were:
- subject of a PhD program
- custom 2.1M CGH array constructed and used to assay 13 – 15 trios drawn
mainly from the IMF (supported by USDA funds and an ISGC activity)
- some of these sequenced to 10 fold coverage and CNV detection performed
using variation in read depth
- happy to work with the 3SR / INRA team providing it doesn’t compromise the
publication strategy for the PhD student who is scheduled to finish Aug 2013
*John, James, Thomas, Gwenola and others to progress the plan by e-mail
Final point concerned PacBio as a technology to address the large number of
short gaps. Brian asked Kim for an indicative price after she provided
positive information on their experience with the read length.
3. Genome Resequencing and HD chip.
John provided an update around the design for the HD chip. Strong progress
has been made and the team is on schedule to both finalise the design and have
the orders locked in prior to Christmas. Cindy confirmed that manufacture
requires 14 – 16 weeks, which should see chips shipped sometime after mid
April. John and Cindy will be re-engaging with everyone who has indicated
interest in access to the HD product over the coming weeks. James gave an
update on progress with sorting out the sample misidentification situation in
the 75 genomes project. Work is underway at USDA (Mike Heaton and Ted
Kalbfleisch) to match the remaining genomes with their correct identifiers
using amplicon resequencing data. The final point involved the filters used to
finalise the SNP set for the selection and domestication analysis. The team
agreed that, because variants had to be identified independently by two
pipelines, it was sufficient for the alternate allele to be present in at
least two individuals. Details about the pipelines and subsequent filters have
been copied below these minutes.
4. PAG 2013
Abstracts were coordinated by e-mail prior to the call. James to develop
draft agenda, and once finalised it will be uploaded on the PAG website.
Noelle will chair the meeting in January as James will not be present.
5. Other Business
None recorded. James thanked the participants for dialling in. Please contact
the secretary to correct any errors or serious omissions in these minutes.
Cheers
James
James.Kijas csiro.au
.From: Kijas, James (CAFHS, St. Lucia)
.Sent: Monday, 29 October 2012 1:56 PM
.Subject: Agenda for Sheep Consortium Call #102 NEXT WEEK
Hi everyone,
It has been a number of months since the last
consortium-wide call for the sheep genomics
community. I have scheduled one for next week to
update the full group on the activities that have
been ongoing. Hope to speak with you then.
Cheers,
James
Draft Agenda:
1. Assembly v3.1, RNAseq and plans for annotation
Brian and his team announced
completion of v3.1 in October after it was accepted into NCBI. Brian is
seeking to collect RNAseq datasets for the annotation effort. Team to discuss.
2. Sequencing to improve assembly v3.1
Over the last couple of weeks,
discussions have been underway involving 3SR, Noelle’s USDA grant and the ISGC
concerning how best to perform additional sequencing to improve v3.1. Brian,
Noelle and others to summarise the plan going forwards.
3. Genome Resequencing and HD chip.
The key project participants have been meeting almost every week to work on
SNP calling, QC and design for the HD chip using data from 75 genomes. James,
John, Rudi, Kim and others to summarise progress and plan next steps.
4. PAG 2013
Team to review what presentations are planned.
5. Other Business
Time (taken from http://www.timeanddate.com/worldclock/converter.html)
Brisbane (Australia - Queensland) Tuesday, 6 November 2012 at 8:00:00 AM EST UTC+10 hours
Melbourne (Australia - Victoria) Tuesday, 6 November 2012 at 9:00:00 AM EDT UTC+11 hours
Wellington (New Zealand) Tuesday, 6 November 2012 at 11:00:00 AM NZDT UTC+13 hours
Salt Lake City (U.S.A. - Utah) Monday, 5 November 2012 at 3:00:00 PM MST UTC-7 hours
Houston (U.S.A. - Texas) Monday, 5 November 2012 at 4:00:00 PM CST UTC-6 hours
Cardiff (United Kingdom - Wales) Monday, 5 November 2012 at 10:00:00 PM GMT UTC
Paris (France) Monday, 5 November 2012 at 11:00:00 PM CET UTC+1 hour
Dial In Details
To join, you will need to:
1. Dial the TollFree number which corresponds to your location
Tollfree AUSTRALIA:
1800 681583
Tollfree NORTHERN CHINA:
1080 06100311
Tollfree SOUTHERN CHINA:
1080 02610311
Tollfree FRANCE:
0800 907046
Tollfree HONG KONG:
800 900194
Tollfree NZ:
0800 443188
Tollfree UK:
0800 0281738
Tollfree USA:
1877 4974432
2. When prompted by the recorded voice, enter the
ACCOUNT NUMBER and PIN followed by the HASH (#) KEY.
The Account Number is: 76309443
The Pin is: 0863
That should link you into the conference call.
For additional information on the process you can visit:
http://conferencing.telstra.com/solutions/SHphone_Kit.pdf
.From: Kijas, James (CAFHS, St. Lucia)
.Sent: Monday, 5 November 2012 1:11 PM
.To: 'Brauning, Rudiger'; 'hans.daetwyler dpi.vic.gov.au'
.Cc: McEwan, John
.Subject: RE: time to define the final SNP list
and .vcf for furture analysis in sheep
Hi guys,
This all assisted greatly in my understanding of what has been done, and the
numbers to fall out at each step. It sounds like we are very close given what
you have already achieved.
- We have 32,066,168 SNP common to the BCM and DPI pipelines (it would be
great to know how many were identified by each pipeline prior to the
comparison).
- Rudi then imposed the alternate allele frequency >= 0.1 filter to define
18,669,693.
- this means the alternate allele needed to be in at least 7 animals (7 / 74 =
0.1)
I suggest we modify this to be alternate allele frequency >=0.04. This means
the alternate allele is in at least 3 animals (3 / 74 = 0.041).
Once applied to the 32,066,168 SNP I think we might be done. What do you think?
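The arithmetic behind the two thresholds can be checked with a short sketch. It assumes, as in the message above, 74 animals and reads the filter as "fraction of animals carrying the alternate allele"; the function names are hypothetical and not part of either pipeline. Note that 7 / 74 is about 0.095, so a strict >= 0.1 cut-off would need 8 animals, whereas 3 / 74 does clear 0.04.

```python
import math
from fractions import Fraction

def carrier_fraction(n_carriers, n_animals=74):
    """Fraction of animals carrying the alternate allele."""
    return n_carriers / n_animals

def min_carriers(threshold, n_animals=74):
    """Smallest number of animals k with k / n_animals >= threshold.

    The threshold is converted via its decimal string so that values such
    as 0.1 are treated exactly rather than as binary floats.
    """
    return math.ceil(Fraction(str(threshold)) * n_animals)

for k in (3, 7, 8):
    print(k, round(carrier_fraction(k), 4))

print("minimum carriers at 0.04:", min_carriers(0.04))
print("minimum carriers at 0.1:", min_carriers(0.1))
```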
Cheers
James
.From: Brauning, Rudiger [<Rudiger.Brauning agresearch.co.nz>]
.Sent: Friday, 2 November 2012 2:05 PM
.To: 'hans.daetwyler dpi.vic.gov.au'; Kijas, James (CAFHS, St. Lucia)
.Cc: McEwan, John
.Subject: RE: time to define the final SNP list
and .vcf for furture analysis in sheep
Thanks Hans,
That’s a nice, exhaustive set of filters.
Maybe a good time to refresh our memory on our current SNP collection:
My filters were:
· No N in reference
· No more than 2 alleles per locus
· SNP quality (ATLAS) of 60
· SNPs with a depth between 4 and (2*most_frequent_depth)
· Alternative allele frequency >= 0.04
o calculated for BCM as: my_af = (genotypes['1/1'] + genotypes['0/1'] / float(2)) / float(75)
o retrieved from DPI vcf files (AF1=)
· Only accept SNPs common to BCM and DPI.
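As an illustration of the BCM allele frequency calculation in the list above, here is a minimal sketch. It assumes `genotypes` is a dictionary of genotype counts for one SNP and that all 75 animals have a called genotype (missing calls would bias the estimate downwards); the function name and the example counts are made up for illustration.

```python
def alt_allele_frequency(genotypes, n_animals=75):
    """Alternate allele frequency from genotype counts at one SNP.

    Each homozygous-alternate animal contributes one unit and each
    heterozygote half a unit, divided by the number of animals.
    """
    hom_alt = genotypes.get('1/1', 0)
    het = genotypes.get('0/1', 0)
    return (hom_alt + het / 2.0) / float(n_animals)

# Hypothetical counts: 70 homozygous reference, 4 heterozygous, 1 homozygous alternate.
counts = {'0/0': 70, '0/1': 4, '1/1': 1}
af = alt_allele_frequency(counts)
passes = af >= 0.04
print(af, passes)
```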
This adds Baylor’s filters to the list:
o ##FILTER=<ID=low_qual,Description="SNP posterior probability is less than 0.95 (QUAL<22)">
o ##FILTER=<ID=low_VariantReads,Description="Number of variant reads is less than 3">
o ##FILTER=<ID=low_VariantRatio,Description="Variant read ratio is less than 0.1">
o ##FILTER=<ID=single_strand,Description="More than 99% variant reads are in a single strand direction">
o ##FILTER=<ID=low_coverage,Description="Total coverage is less than 6">
· 32,066,168 SNPs common to BCM and DPI
· 18,669,693 of these have an alternative allele frequency >= 0.1
· 6,026,695 SNPs have no variants (SNPs or indels with alternative allele
frequency >= 0.04 as outlined above) / Ns / repeats within 50 bases on at
least one side of the SNP.
These SNPs went to Illumina for scoring.
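The flanking-sequence criterion in the last count can be sketched as follows. This is only an illustration: it assumes a sorted list of positions of every other feature to avoid on one chromosome (other variants, Ns, repeat bases), and the function name is hypothetical rather than taken from the actual pipeline.

```python
import bisect

def has_clear_flank(pos, other_positions, flank=50):
    """True if at least one side of `pos` has no listed feature within `flank` bases.

    `other_positions` must be sorted and should not include `pos` itself.
    """
    # Left flank covers positions pos - flank .. pos - 1.
    lo = bisect.bisect_left(other_positions, pos - flank)
    hi = bisect.bisect_left(other_positions, pos)
    left_clear = (lo == hi)
    # Right flank covers positions pos + 1 .. pos + flank.
    lo = bisect.bisect_right(other_positions, pos)
    hi = bisect.bisect_right(other_positions, pos + flank)
    right_clear = (lo == hi)
    return left_clear or right_clear

others = [100, 130, 300]  # hypothetical positions of nearby features
for snp in (120, 200, 310):
    print(snp, has_clear_flank(snp, others))
```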
Cheers,
Rudiger
.From: hans.daetwyler dpi.vic.gov.au
.Sent: Friday, 2 November 2012 1:31 p.m.
.To: James.Kijas csiro.au
.Cc: McEwan, John; Brauning, Rudiger
.Subject: Re: time to define the final SNP list and .vcf for furture analysis
in sheep
Hi Guys,
Below is what we use in the 1000 bulls. Differences to sheep would be:
* we cannot do the opposing homozygote
filters as we don't have parent/offspring pairs
* I would likely only use a proximity filter
of 2, as we have a very diverse sample of sheep
* variants with two or more alleles can be
removed and analysed separately if desired
Filters used in 1000 Bull pipeline – Run 2.0
2012-10-04
In pipeline order:
Samtools command line:
samtools-0.1.18/samtools mpileup -r Chr1:0-20000000 -P ILLUMINA -BA -ugf ReferenceFile -b BamListFile | bcftools view -N -cvg - > out.vcf
Sites not in reference genome
· Specify -N (passed to bcftools view in the command above) to remove them
All the following filters implemented in python using open source python
parser PyVCF https://github.com/jamescasbon/PyVCF/ to read vcf file.
Removal of variants with 2 or more alternative alleles
· Unfiltered vcf file of variants with 2
or more alternative alleles available upon request
Minimum number of alternative allele observations on forward and reverse
reads
· Threshold used was 1, removes variants
never observed on forward or reverse reads
· Replaces strand bias filters, which were
found to be too dependent on sample size, making
it cumbersome to choose an optimal p-value
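A minimal sketch of this forward/reverse observation filter is given below. It assumes the counts come from the DP4 INFO field written by samtools/bcftools (ref-forward, ref-reverse, alt-forward, alt-reverse); the message does not say which field the pipeline actually reads, and the function name is hypothetical.

```python
def passes_strand_observation(dp4, min_obs=1):
    """Keep a variant only if the alternate allele was seen on both strands.

    `dp4` is a 4-tuple: (ref_forward, ref_reverse, alt_forward, alt_reverse).
    """
    ref_fwd, ref_rev, alt_fwd, alt_rev = dp4
    return alt_fwd >= min_obs and alt_rev >= min_obs

print(passes_strand_observation((10, 12, 3, 2)))  # alternate allele seen on both strands
print(passes_strand_observation((10, 12, 5, 0)))  # alternate allele never seen on reverse reads
```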
Overall quality
· samtools mpileup populates the vcf file QUAL field.
· Threshold chosen QUAL 20 (phred score)
Mapping quality
· Filtered on MQ 30 (phred)
Setting minimum and maximum read depths for filtering
· Set minimum number as 10 across all animals
o Previous 10% threshold judged too stringent
o Not a critical filter with many animals in mpileup
· Set maximum as: median read depth + 3 * standard deviation read depth
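The depth bounds above can be sketched in a few lines. This assumes `depths` holds the total read depth across all animals at each variant site; the use of the population standard deviation is an assumption (the message does not say whether sample or population SD was used), and the function names are hypothetical.

```python
import statistics

def depth_bounds(depths, min_depth=10, n_sd=3):
    """Return (minimum, maximum) allowed depth.

    Maximum is median depth + n_sd * standard deviation of depth.
    """
    median = statistics.median(depths)
    sd = statistics.pstdev(depths)  # assumption: population SD
    return min_depth, median + n_sd * sd

def passes_depth(depth, bounds):
    """True if `depth` lies within the inclusive bounds."""
    lowest, highest = bounds
    return lowest <= depth <= highest

depths = [10, 20, 30, 40, 50]  # hypothetical per-site depths
bounds = depth_bounds(depths)
print(bounds, passes_depth(9, bounds), passes_depth(72, bounds))
```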
Opposing Homozygotes Filter
· Count the number of opposing homozygotes in parent/offspring pairs
o Accumulate information as proportion of
homozygotes per SNP and individual
· Use a threshold to filter variants that produce too many opposing homozygotes
o Used 10%
· Implemented in bash and Fortran90
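For readers unfamiliar with the idea, the per-SNP part of the opposing homozygotes filter might look like the sketch below (in Python rather than the bash and Fortran90 used in the pipeline). The denominator is an assumption: here the rate is taken over parent/offspring pairs in which both animals are homozygous, and all names are hypothetical.

```python
HOMOZYGOUS = {'0/0', '1/1'}

def opposing_homozygote_rate(pairs):
    """Proportion of homozygous parent/offspring pairs that are opposing.

    `pairs` is a list of (parent_genotype, offspring_genotype) for one SNP.
    A pair is opposing when one animal is 0/0 and the other is 1/1, which
    is not possible under Mendelian inheritance without a genotyping error.
    """
    both_hom = [(p, o) for p, o in pairs if p in HOMOZYGOUS and o in HOMOZYGOUS]
    if not both_hom:
        return 0.0
    opposing = sum(1 for p, o in both_hom if p != o)
    return opposing / len(both_hom)

def keep_snp(pairs, max_rate=0.10):
    """Keep the SNP unless it produces too many opposing homozygotes."""
    return opposing_homozygote_rate(pairs) <= max_rate

pairs = [('0/0', '1/1'), ('0/0', '0/0'), ('1/1', '1/1'), ('0/1', '1/1')]
print(opposing_homozygote_rate(pairs), keep_snp(pairs))
```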
Remove variants with same basepair position
· If two variants have the same bp position they are removed
o Resolves issue with SNP and INDEL calls at same position
Filtering INDELs based on proximity to INDELs
· Remove lower QUAL INDEL when closer than 10 basepairs
Filtering all variants based on proximity to each other
· Remove lower QUAL variants if closer than 4bp
Filtering SNP based on proximity to INDELs
· Remove SNP closer than 4bp to INDELs
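The "remove the lower QUAL variant when two are too close" rule can be sketched as a single pass over position-sorted variants. This is an illustration only: it assumes variants on one chromosome given as (position, QUAL) tuples, and how the real pipeline handles ties or chains of close variants is not stated in the message.

```python
def proximity_filter(variants, min_distance=4):
    """Drop the lower-QUAL variant of any pair closer than `min_distance` bp.

    `variants` is a list of (position, qual) tuples on one chromosome.
    On a QUAL tie the earlier variant is kept (an arbitrary choice).
    """
    kept = []
    for pos, qual in sorted(variants):
        if kept and pos - kept[-1][0] < min_distance:
            # Too close to the last kept variant: retain whichever has higher QUAL.
            if qual > kept[-1][1]:
                kept[-1] = (pos, qual)
        else:
            kept.append((pos, qual))
    return kept

calls = [(100, 50), (102, 60), (110, 30), (113, 20), (200, 40)]  # hypothetical
print(proximity_filter(calls))
```

The same function with `min_distance=10` would correspond to the INDEL-to-INDEL rule if applied to INDEL calls alone.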
------------------------------------------------
Dr. Hans D. Daetwyler
Senior Research Scientist Genetics
Biosciences Research Division
Department of Primary Industries
5 Ring Rd.
Bundoora 3083
Victoria, Australia
Phone: +61 (0)3 9032 7037
Paper copies:
https://sites.google.com/site/hansdd/papers/