Archived Post

From James.Kijascsiro.au  Tue Nov  6 16:46:11 2012
From: James.Kijascsiro.au (by way of Jill Maddox <jillian.maddoxalumni.unimelb.edu.au>)
To: Multiple Recipients of <sheepmodelsanimalgenome.org>
Subject: Minutes from Sheep Consortium Call #102
Date: Tue, 06 Nov 2012 16:46:11 -0600

Minutes from ISGC Conference Call #102

Participants: Gwenola Tosser, John McEwan, Rudi Brauning, Shannon Clarke, Kim 
Worley, Cindy Lawley, Brian Dalrymple, Jiang Yu, Josh Miller, Noelle Cockett, 
James Kijas and Lakshmi Matukumali. The minutes were recorded by the secretary 
(JK) from the meeting held at 8am QLD time Nov 6th 2012.

1. Assembly v3.1, RNA seq and plans for Annotation
Brian outlined discussions 
from late September held in the UK with the ENSEMBL team around application of 
their annotation pipeline for the sheep genome. Since September, v3.1 has been 
released by NCBI and the ISGC has agreed not to produce another assembly 
version in the near term. The current task involves collecting RNAseq datasets 
generated from a distribution of tissues to use in the construction of high 
quality gene models. Brian is coordinating this collection of RNAseq 
datasets, and sent out an initial email to describe the process. The team 
discussed when to set a deadline, after which ENSEMBL will commence 
annotation. PAG was seen as an important opportunity to engage researchers who 
are not regular ISGC participants, so January 31 2013 was set as the last day. 
Brian described the process for submitting datasets, which is primarily via 
external hard drives sent to EBI / ENSEMBL. Brian will be sending out 
additional information on the process (e.g. technical preference for stranded 
and paired end sequenced libraries) in the coming week.

Denis asked about the publication strategy and the team discussed the 
timelines associated with two scenarios. First, the assembly could be 
prepared for submission as a manuscript now, prior to the annotation and 
associated biological interpretation. The group felt the assembly data alone 
was unlikely to yield a high impact paper. The alternative is to include the 
outcomes from the annotation (assumes the pipeline at ENSEMBL starts in Feb / 
Mar 2013, and will take a couple of months before subsequent analysis can 
commence). While this would mean a genome paper may not eventuate until 2014, 
many participants on the call felt this was the only shot at a high impact 
paper.

John asked if a current list of contacts could be prepared so he can check if 
local researchers have been successfully linked in.

2. Sequencing to improve assembly v3.1
The team has been devising a strategy for improving aspects of v3.1 through 
additional sequencing:
- there are a large number of individually small gaps (which additional ILMN 
sequence is seen as unlikely to resolve)
- there are a comparatively small number of large gaps (where a BAC based 
approach has been discussed)

The EU FP7 3SR consortium has funds available to assist with improvement of 
the assembly, and their recent board discussion considered a range of options 
in consultation with the ISGC. The resulting operational plan involves 
identification of BACs spanning 30 – 40 large gaps (> 200 Kb) that appear to 
be problems with the sheep genome based on comparative analysis with cattle 
and human. A total of 500 BACs that span the large gaps and other difficult to 
assemble regions will be identified, before 96 are sequenced at the Roslin 
using 3SR funds. 

Brian and Noelle noted that the costs associated with pulling and growing the 
BACs may be carried by her USDA ‘finishing the genome’ grant. The group (3SR) 
had also considered but abandoned plans to fill the 5 – 10 thousand short gaps 
using a targeted (PCR or hybridisation based) sequencing strategy. Instead, 
sequencing to understand the prevalence and distribution of sheep CNV will be 
commissioned by 3SR with INRA (Thomas Faraut and Gwenola). This might best 
include ISGC samples that already have CGH array data and/or 454 sequence 
reads. John provided a summary of the activities that have already been 
completed in his group looking at CNV. Key points were:
- subject of a PhD program
- custom 2.1M CGH array constructed and used to assay 13 – 15 trios drawn 
mainly from the IMF (supported by USDA funds and an ISGC activity)
- some of these sequenced to 10 fold coverage and CNV detection performed 
using variation in read depth
- happy to work with the 3SR / INRA team provided it doesn’t compromise the 
publication strategy for the PhD student who is scheduled to finish Aug 2013
*John, James, Thomas, Gwenola and others to progress the plan by e-mail

Final point concerned PacBio as a technology to address the large number of 
short gaps. Brian asked Kim for an indicative price after she provided 
positive information on their experience with the read length.

3. Genome Resequencing and HD chip.
John provided an update around the design for the HD chip. Strong progress 
has been made and the team is on schedule to both finalise the design and have 
the orders locked in prior to Christmas. Cindy confirmed that manufacture 
requires 14 – 16 weeks, which should see chips shipped sometime after mid 
April. John and Cindy will be re-engaging with everyone who has indicated 
interest in access to the HD product over the coming weeks. James gave an 
update on progress with sorting out the sample misidentification situation in 
the 75 genomes project. Work is underway at USDA (Mike Heaton and Ted 
Kalbfleisch) to match the remaining genomes with their correct identifiers 
using amplicon resequencing data. Final point involved the filters to finalise 
the SNP set for the selection and domestication analysis. The team agreed that, 
because variants had to be independently identified by two pipelines, presence 
of the alternate allele in at least two individuals was sufficient. Details 
about the pipelines and subsequent filters have been copied below these 
minutes.
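
As a rough illustration of the rule agreed on the call, the sketch below keeps a SNP only when both pipelines called it and the alternate allele appears in at least two individuals. The function names and the genotype-string representation ("0/0", "0/1", "1/1") are assumptions made for the example, not the consortium's actual code.

```python
# Minimal sketch; helper names are hypothetical.
# Genotypes are VCF-style strings such as "0/0", "0/1", "1/1" or "./." (missing).

def carriers(genotypes):
    """Count individuals carrying at least one alternate allele."""
    return sum(1 for gt in genotypes if "1" in gt.replace("|", "/").split("/"))

def keep_snp(called_by_bcm, called_by_dpi, genotypes, min_carriers=2):
    """Keep a SNP called by both pipelines and seen in >= min_carriers animals."""
    return called_by_bcm and called_by_dpi and carriers(genotypes) >= min_carriers
```

For example, keep_snp(True, True, ["0/0", "0/1", "1/1"]) is True, while a SNP seen in a single animal, or called by only one pipeline, is dropped.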

4. PAG 2013
Abstracts were coordinated by e-mail prior to the call. James to develop 
draft agenda, and once finalised it will be uploaded on the PAG website.  
Noelle will chair the meeting in January as James will not be present.

5. Other Business
None recorded. James thanked the participants for dialling in. Please contact 
the secretary to correct any errors or serious omissions in these minutes.

Cheers

James
James.Kijascsiro.au


.From: Kijas, James (CAFHS, St. Lucia)
.Sent: Monday, 29 October 2012 1:56 PM
.Subject: Agenda for Sheep Consortium Call #102 NEXT WEEK

Hi everyone,

It has been a number of months since the last 
consortium-wide call for the sheep genomics 
community. I have scheduled one for next week to 
update the full group on the activities that have 
been ongoing. Hope to speak with you then.

Cheers,

James


Draft Agenda:

1. Assembly v3.1, RNAseq and plans for annotation
Brian and his team announced 
completion of v3.1 in October after it was accepted into NCBI. Brian is 
seeking to collect RNAseq datasets for the annotation effort. Team to discuss.

2. Sequencing to improve assembly v3.1
Over the last couple of weeks, 
discussions have been underway involving 3SR, Noelle’s USDA grant and the ISGC 
concerning how best to perform additional sequencing to improve v3.1. Brian, 
Noelle and others to summarise the plan going forwards.

3. Genome Resequencing and HD chip.
The key project participants have been meeting almost every week to work on 
SNP calling, QC and design for the HD chip using data from 75 genomes. James, 
John, Rudi, Kim and others to summarise progress and plan next steps.

4. PAG 2013
Team to review what presentations are planned.

5. Other Business


Time (taken from http://www.timeanddate.com/worldclock/converter.html)

Brisbane (Australia - Queensland) Tuesday, 6 November 2012 at 8:00:00 AM EST UTC+10 hours
Melbourne (Australia - Victoria) Tuesday, 6 November 2012 at 9:00:00 AM EDT UTC+11 hours
Wellington (New Zealand) Tuesday, 6 November 2012 at 11:00:00 AM NZDT UTC+13 hours
Salt Lake City (U.S.A. - Utah) Monday, 5 November 2012 at 3:00:00 PM MST UTC-7 hours
Houston (U.S.A. - Texas) Monday, 5 November 2012 at 4:00:00 PM CST UTC-6 hours
Cardiff (United Kingdom - Wales) Monday, 5 November 2012 at 10:00:00 PM GMT UTC
Paris (France) Monday, 5 November 2012 at 11:00:00 PM CET UTC+1 hour


Dial In Details
To join, you will need to:
1. Dial the TollFree number which corresponds to your location
Tollfree AUSTRALIA:
   1800 681583
Tollfree NORTHERN CHINA:
   1080 06100311
Tollfree SOUTHERN CHINA:
   1080 02610311
Tollfree FRANCE:
   0800 907046
Tollfree HONG KONG:
   800 900194
Tollfree NZ:
   0800 443188
Tollfree UK:
   0800 0281738
Tollfree USA:
   1877 4974432

2. When prompted by the recorded voice, enter the 
ACCOUNT NUMBER and PIN followed by the HASH (#) KEY.
The Account Number is:        76309443
The Pin is:             0863
That should link you into the conference call. 
For additional information on the process you can visit:
http://conferencing.telstra.com/solutions/SHphone_Kit.pdf

.From: Kijas, James (CAFHS, St. Lucia)
.Sent: Monday, 5 November 2012 1:11 PM
.To: 'Brauning, Rudiger'; 'hans.daetwylerdpi.vic.gov.au'
.Cc: McEwan, John
.Subject: RE: time to define the final SNP list 
and .vcf for furture analysis in sheep

Hi guys,

This all assisted greatly in my understanding of what has been done, and the 
numbers to fall out at each step. It sounds like we are very close given what 
you have already achieved.

- We have 32,066,168 SNP common to the BCM and DPI pipelines (it would be 
great to know how many were identified by each pipeline prior to the 
comparison).
- Rudi then imposed the alternate allele frequency >= 0.1 filter to define 
18,669,693.
- this means the alternate allele needed to be in at least 7 animals (7 / 74 = 
0.1)

I suggest we modify this to be alternate allele frequency >=0.04. This means 
the alternate allele is in at least 3 animals (3 / 74 = 0.041).
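
A quick check of the arithmetic above, treating each threshold as a share of the 74 animals in the same way the message does. This is a simplification: the allele frequency formula quoted further down the thread counts heterozygotes as half and divides by 75. The function names are made up for the example.

```python
import math

N_ANIMALS = 74  # number of animals used in the arithmetic above

def share_of_animals(k, n=N_ANIMALS):
    """Fraction of the n animals represented by k animals."""
    return k / n

def min_animals_for(threshold, n=N_ANIMALS):
    """Smallest number of animals whose share of n reaches the threshold."""
    # The small epsilon guards against floating-point noise on exact multiples.
    return math.ceil(threshold * n - 1e-9)

print(round(share_of_animals(7), 3))  # 0.095
print(round(share_of_animals(3), 3))  # 0.041
print(min_animals_for(0.04))          # 3
```

So 7 / 74 is about 0.095, which rounds to the 0.1 quoted, and 3 / 74 is about 0.041, which clears a 0.04 threshold.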

Once applied to the 32,066,168 SNP I think we might be done. What do you think?

Cheers

James


.From: Brauning, Rudiger [<Rudiger.Brauningagresearch.co.nz>]
.Sent: Friday, 2 November 2012 2:05 PM
.To: 'hans.daetwylerdpi.vic.gov.au'; Kijas, James (CAFHS, St. Lucia)
.Cc: McEwan, John
.Subject: RE: time to define the final SNP list 
and .vcf for furture analysis in sheep

Thanks Hans,

That’s a nice, exhaustive set of filters.

Maybe a good time to refresh our memory on our current SNP collection:

My filters were:
· No N in reference
· No more than 2 alleles per locus
· SNP quality (ATLAS) of 60
· SNPs with a depth between 4 and (2*most_frequent_depth)
· Alternative allele frequency >= 0.04
calculated for BCM as: 
my_af = (genotypes['1/1'] + genotypes['0/1'] / float(2)) / float(75)
retrieved from DPI vcf files (AF1=)
· Only accept SNPs common to BCM and DPI. This adds Baylor’s filters to the list:
o ##FILTER=<ID=low_qual,Description="SNP posterior probability is less than 0.95 (QUAL<22)">
o ##FILTER=<ID=low_VariantReads,Description="Number of variant reads is less than 3">
o ##FILTER=<ID=low_VariantRatio,Description="Variant read ratio is less than 0.1">
o ##FILTER=<ID=single_strand,Description="More than 99% variant reads are in a single strand direction">
o ##FILTER=<ID=low_coverage,Description="Total coverage is less than 6">
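
The alternative allele frequency expression in the list above can be wrapped into a small runnable function. This is only a sketch: it assumes `genotypes` is a dictionary of genotype counts per SNP keyed by "0/0", "0/1" and "1/1", which the message implies but does not spell out, and the function names are invented here.

```python
def alt_allele_frequency(genotypes, n_animals=75):
    """(hom-alt count + het count / 2) / n_animals, following the quoted expression."""
    hom_alt = genotypes.get("1/1", 0)
    het = genotypes.get("0/1", 0)
    return (hom_alt + het / 2.0) / float(n_animals)

def passes_af_filter(genotypes, min_af=0.04, n_animals=75):
    """True when the alternative allele frequency reaches min_af."""
    return alt_allele_frequency(genotypes, n_animals) >= min_af

counts = {"0/0": 69, "0/1": 4, "1/1": 2}  # made-up counts for one SNP
print(round(alt_allele_frequency(counts), 4))  # 0.0533
```

With these made-up counts the SNP passes a 0.04 threshold but not a 0.1 threshold.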

=> 32,066,168 SNPs common to BCM and DPI

=> 18,669,693 of these have an alternative allele frequency >= 0.1

=> 6,026,695 SNPs have no variants (SNPs or indels with alternative allele 
frequency >= 0.04 as outlined above) / Ns / repeats within 50 bases on at least 
one side of the SNP.
These SNPs went to Illumina for scoring.
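
The flank condition above (nothing within 50 bases on at least one side) could be checked along these lines. This is a sketch under assumptions: `blocked_positions` is a hypothetical sorted list of positions on the same chromosome that hold other variants, Ns or repeat bases, and it is not the code actually used.

```python
from bisect import bisect_left, bisect_right

def has_clean_flank(pos, blocked_positions, flank=50):
    """True if the left or the right flank of `pos` contains no blocked position.

    blocked_positions must be sorted; the SNP's own position is ignored.
    """
    # Blocked positions in [pos - flank, pos - 1]
    left_hits = bisect_left(blocked_positions, pos) - bisect_left(blocked_positions, pos - flank)
    # Blocked positions in [pos + 1, pos + flank]
    right_hits = bisect_right(blocked_positions, pos + flank) - bisect_right(blocked_positions, pos)
    return left_hits == 0 or right_hits == 0
```

For instance, has_clean_flank(1000, [960, 1030]) is False because both flanks are occupied, while has_clean_flank(1000, [960, 1051]) is True because the right flank is clear.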

Cheers,
Rudiger

.From: hans.daetwylerdpi.vic.gov.au 
.Sent: Friday, 2 November 2012 1:31 p.m.
.To: James.Kijascsiro.au
.Cc: McEwan, John; Brauning, Rudiger
.Subject: Re: time to define the final SNP list and .vcf for furture analysis 
  in sheep

Hi Guys,

Below is what we use in the 1000 bulls.  Differences to sheep would be:
    * we cannot do the opposing homozygote filters as we don't have 
      parent/offspring pairs
    * I would likely only use a proximity filter of 2, as we have a very diverse 
      sample of sheep
    * variants with two or more alleles can be removed and analysed separately 
      if desired

Filters used in 1000 Bull pipeline – Run 2.0

2012-10-04

In pipeline order:
Samtools command line:

samtools-0.1.18/samtools mpileup -r Chr1:0-20000000 -P ILLUMINA -BA -ugf \
ReferenceFile -b BamListFile | bcftools view -N -cvg - > out.vcf

Sites not in reference genome

· Specify -N (given to bcftools view in the command above) to remove them

All the following filters implemented in python using open source python 
parser PyVCF https://github.com/jamescasbon/PyVCF/ to read vcf file.

Removal of variants with 2 or more alternative alleles
· Unfiltered vcf file of variants with 2 or more alternative alleles available 
upon request

Minimum number of alternative allele observations on forward and reverse reads
· Threshold used was 1, removes variants never observed on forward or reverse 
reads

· Replaces strand bias filters, which were found to be too dependent on sample 
size, making it cumbersome to choose an optimal p-value
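
One way to express this filter, assuming the per-strand counts come from the DP4 INFO tag that samtools mpileup writes (ref-forward, ref-reverse, alt-forward, alt-reverse). Whether the pipeline reads DP4 is an assumption, as is the reading that the alternative allele must be seen at least once on each strand. Function names are hypothetical.

```python
def parse_dp4(dp4_text):
    """Split a DP4 value such as "10,12,6,7" into four integer counts."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4_text.split(","))
    return ref_fwd, ref_rev, alt_fwd, alt_rev

def seen_on_both_strands(dp4_text, min_obs=1):
    """True if the alternative allele has >= min_obs reads on each strand."""
    _, _, alt_fwd, alt_rev = parse_dp4(dp4_text)
    return alt_fwd >= min_obs and alt_rev >= min_obs
```

A variant with DP4=10,12,6,0 would be removed under this reading, since the alternative allele was never observed on a reverse read.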

Overall quality
· samtools mpileup populates the vcf file QUAL field.

· Threshold chosen QUAL 20 (phred score)

Mapping quality
· Filtered on MQ 30 (phred)
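
The two thresholds above (QUAL 20 and MQ 30) might be applied to a VCF data line roughly as follows. The sketch reads the line with plain string handling rather than PyVCF so that it stays self-contained, and it assumes that "filtered on" means keeping variants at or above each threshold. The function name is hypothetical.

```python
def passes_quality(vcf_line, min_qual=20.0, min_mq=30.0):
    """True if QUAL (column 6) and INFO/MQ (column 8) both reach their thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = float(fields[5])
    # INFO is a ';'-separated list; flag entries without '=' (e.g. INDEL) are skipped.
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    mq = float(info["MQ"])  # raises KeyError if the record carries no MQ tag
    return qual >= min_qual and mq >= min_mq
```

For example, a record with QUAL 48.0 and MQ=42 passes, while one with QUAL 18.0 or MQ=25 does not.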

Setting minimum and maximum read depths for filtering
· Set minimum number as 10 across all animals

o Previous 10% threshold judged too stringent

o Not a critical filter with many animals in mpileup

· Set maximum as: median read depth + 3 * standard deviation read depth
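
The depth bounds above can be computed in a few lines. This sketch assumes `depths` holds the total read depth (summed across animals) at each variant site and uses the population standard deviation; the message does not say which estimator was used, and the function names are invented here.

```python
import statistics

def depth_bounds(depths, min_depth=10, n_sd=3):
    """Return (minimum, maximum) depth: fixed minimum, median + n_sd * SD maximum."""
    max_depth = statistics.median(depths) + n_sd * statistics.pstdev(depths)
    return min_depth, max_depth

def passes_depth(depth, bounds):
    """True if depth lies within the inclusive (minimum, maximum) bounds."""
    low, high = bounds
    return low <= depth <= high
```

With depths of 10, 20, 30, 40 and 50, the median is 30 and the maximum works out to roughly 72.4, so a site with depth 80 would be filtered out.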

Opposing Homozygotes Filter
· Count the number of opposing homozygotes in parent/offspring pairs

o Accumulate information as proportion of homozygotes per SNP and individual

· Use a threshold to filter variants that produce too many opposing homozygotes

o Used 10%

· Implemented in bash and Fortran90
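
The pipeline implements this step in bash and Fortran90; purely to illustrate the idea, a Python sketch of the per-SNP part is given below. It assumes genotypes are VCF-style strings, that missing calls are skipped, and that a SNP is dropped when the opposing fraction exceeds the 10% threshold. The function names are hypothetical.

```python
def is_opposing(parent_gt, offspring_gt):
    """True when parent and offspring are homozygous for different alleles."""
    return {parent_gt, offspring_gt} == {"0/0", "1/1"}

def opposing_fraction(pairs):
    """Fraction of parent/offspring pairs that are opposing homozygotes at one SNP."""
    called = [(p, o) for p, o in pairs if "." not in p and "." not in o]
    if not called:
        return 0.0
    return sum(is_opposing(p, o) for p, o in called) / len(called)

def keep_variant(pairs, max_fraction=0.10):
    """Keep the SNP unless too many pairs are opposing homozygotes."""
    return opposing_fraction(pairs) <= max_fraction
```

A SNP where 2 of 4 called pairs are opposing homozygotes has a fraction of 0.5 and would be removed.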

Remove variants with same basepair position
· If two variants have the same bp position they are removed

o Resolves issue with SNP and INDEL calls at same position
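
A minimal sketch of the same-position rule, assuming each variant is represented as a dictionary with hypothetical "chrom" and "pos" keys. Following the wording above, every variant that shares its position with another is dropped, not just the duplicate.

```python
from collections import Counter

def drop_shared_positions(variants):
    """Remove all variants whose (chrom, pos) occurs more than once."""
    counts = Counter((v["chrom"], v["pos"]) for v in variants)
    return [v for v in variants if counts[(v["chrom"], v["pos"])] == 1]
```

Given a SNP and an INDEL both called at Chr1:100 plus a SNP at Chr1:250, only the variant at position 250 is returned.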

Filtering INDELs based on proximity to INDELs
· Remove lower QUAL INDEL when closer than 10 basepairs

Filtering all variants based on proximity to each other
· Remove lower QUAL variants if closer than 4bp

Filtering SNP based on proximity to INDELs
· Remove SNP closer than 4bp to INDELs
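
The proximity filters above share one pattern: when two variants sit closer than a minimum distance, the lower-QUAL one goes. A simple greedy sketch of that pattern follows. It assumes variants are dictionaries with hypothetical "chrom", "pos" and "qual" keys, and it compares each variant only with the last one kept, which may differ from how the pipeline resolves runs of several close variants.

```python
def proximity_filter(variants, min_dist=4):
    """Among variants closer than min_dist on the same chromosome, keep the higher QUAL."""
    kept = []
    for v in sorted(variants, key=lambda v: (v["chrom"], v["pos"])):
        last = kept[-1] if kept else None
        if last and last["chrom"] == v["chrom"] and v["pos"] - last["pos"] < min_dist:
            # Too close to the previously kept variant: retain whichever has higher QUAL.
            if v["qual"] > last["qual"]:
                kept[-1] = v
        else:
            kept.append(v)
    return kept
```

Calling it with min_dist=10 on INDELs only would correspond to the INDEL-to-INDEL rule, and min_dist=2 would match the looser setting suggested for sheep at the top of this message.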


------------------------------------------------
Dr. Hans D. Daetwyler
Senior Research Scientist Genetics
Biosciences Research Division
Department of Primary Industries
5 Ring Rd.
Bundoora 3083
Victoria, Australia
Phone: +61 (0)3 9032 7037
Paper copies: 
https://sites.google.com/site/hansdd/papers/