RescueMu GSS Assembly and Analysis

This is a supplementary page of Genome-wide mutagenesis of Zea mays L. using RescueMu. Here you will be able to find the detailed materials and methods that used to assemble and analyze RescueMu GSS.

Materials

Definition of RescueMu GSS sequences that we used for assembly were drawn from six grids of RescueMu-transformed maize plants (G, H, I, K, M, and P; click here to see the corresponding project number).

When RescueMu inserts into maize genome, it usually creates a 9 bp host target site duplication (TSD) next to its terminal inverted repeats (TIRs) at each end of the transposon. After plasmid rescue, sequencing was performed using primers complementary to the left or right TIRs. To re-construct the original genome sequence at a particular RescueMu insertion site, we need to identify the 9 bp TSDs and use them to connect the left and right sequences flanking a RescueMu insertion site. For sequences derived from the same sequencing plasmid, the reverse-complementary of the left should overlap the right sequence starting with the 9 bp TSD. The raw RescueMu GSS contains some TIR sequences, but these were masked with Xs using CrossMatch program. In addition, the raw sequences contain low quality regions (phred quality score < 20) that were trimmed off prior to submission to GenBank.

Why is the number of GenBank sequences larger than the raw sequences count? This occurs because one raw sequence (corresponding to one sequencing read) is sometimes split into two sequences for separate submission into GenBank. This procedure was used to identify sequences that might be made contiguous during the plasmid rescue circularization of a segment of maize genomic DNA. For example, the sequence 1006001B06.x1 in the raw sequence was split into two sequences at GenBank (1006001B06.x1 and 1006001B06.1EL_x1). The split occurs if either BamHI, BglII, or their possible mix ligation site was encountered. More information about the splitting can be found at http://www.mutransposon.org/project/RescueMu/zmdb/library-plate/GridGprep.html. As you can also see, a minor number of RescueMu plasmids didn't submit into GenBank due to low qualities.

Assembly

Step 1 - vector sequence cleaning: The first step of our assembly is to check and clean all the contaminating vector sequences. The above 131,364GenBank sequences were screened against against UniVec and RescueMu plasmid using CROSSMATCH program (-mismatch 12 -penalty -2 -minscore 20). The CROSSMATCH output was processed by a perl script (VectorTrim.pl) to trim off any vector sequences. The resulted total number of vector-free sequences is 130,861 derived from 56,728 plasmids (note: 503 vector sequences were discarded from assembly and subsequently withdrawn from GenBank).

Step 2 - repeat sequence masking: The vast majority of the maize genome was estimated to be repeat elements (e.g., retrotransposons). Our stragety here is to mask the repeat elements to avoid different loci from being clustered together due to their common repeat elements. Therefore, we screened the above 130,861 vector-free sequences against maize repeat elements annotated by TIGR. The screening was performed by the VMATCH program (-d -p -l 50 -h 3 -identity 95). The screened repeat sequences were trimmed off at the ends by a perl script (RepeatTrim.pl) and partitioned into the following two pools:

Step 3 - identify parental sequences: Any given RescueMu-transformed plant contain the parental RescueMu elements that were recovered at a high frequency during sequencing (from every sequenced row or column). Because our goal is to analyze the gene discovery by newly inserted RescueMu (i.e., where the non-parentals inserted into the maize genome), we decide to filter out the parental sequences as much as possible. We designed a fast-filter to quickly identify large duplicated (or near-identical) sequences for each grid based on the property that parental sequences were recovered from every rows/columns of grid plants. The following is how our filter works:

  1. From the above 127,708 repeat-free sequences, extract the left (Y) and right (X) sequences from each grid separately by a perl script (ExtractGridXYseq.pl).

    Sequence count from the above 127,708 repeat-free sequences:

  2. Single-linkage cluster the above X, Y sequences from each grid (GridG_XYseq.fsa.gz, GridH_XYseq.fsa.gz, GridI_XYseq.fsa.gz, GridK_XYseq.fsa.gz, GridM_XYseq.fsa.gz, GridP_XYseq.fsa.gz) by VMATCH program (-d -p -v -l 100 -seedlength 50 -exdrop 1 -identity 99 -dbcluster 98 95). The clustering was performed as the following: given two sequences, first checking if they share 50 bp identical regions (seeds); if so, extend that seeds to both directions, allowing only one mismatch/gap during the extension. Then the two sequences would be considered as "neighbors" if they satisfy the following three criteria: (1) their match region is no less than 100 bp, (2) >=99% identity in the match region; (3) the match region is 98% of the total length of the shorter sequence, 95% of the longer sequence. The criteria clusters together extremely similar (or identical) sequences together. Single-linkage cluster: a member was added into a cluster if this member is a neighbor to at least one other member in the cluster.


  3. Then a perl script (ParentalVmatchCluster.pl) was used to identify parental VMATCH clusters.

    Table 1: Grid Sequencing Information
    GridNum of Sequenced RowsNum of Sequenced Cols
    G451
    H135
    I362
    K273
    M031
    P421

    Some key points of the ParentalVmatchCluster.pl script: the above table (showing the number of sequenced rows/columns for each grid) and the RescueMu sequence grid location information were combined to identify parentals. If a VMATCH cluster contains plasmids that were derived from all the rows (or columns, whichever the one is larger), then this VMATCH cluster is marked as parental cluster in which all the plasmids were marked as parental plasmids. For example, if a grid G cluster has plasmids from 45 different rows, then this is a parental cluster at grid G. Grid P is a special case: the parentals from row 1 to row 37 are different from the rest rows; therefore we only counted clusters with plasmids from row 1 to row 37 as parentals.

    The result: total 16,122 plasmids were identified as parentals plasmids from all six grids, representing total 42,738 repeat-free sequences that were excluded from the subsequent assembly. Note that these 16,122 parental plasmids were identified by very stringent matching criteria. Therefore, some parental sequences may slip through this parental filter. Please read on to see how we calibrate that.

Step 4 - plasmid-level assembly: As stated above, each sequencing plasmid may generate up to four sequences that were separately submitted to the GenBank (X, Y, EL_X, EL_Y). Therefore, before we start assembling sequences from all the plasmids, we first did a plasmid-level assembly to (1) connect X and Y together through 9 bp TSD; (2) assign orientation of EL sequences to X or Y if they share enough sequence overlap (e.g., if EL_Y overlaps with X, it indicates EL_Y and X were on the same side of the RescueMu ligation). The plasmid assembly was carried out by a perl script, RescueMuNonParentalPlasmidAssembly.pl.

Some key points for this perl script:

Result of the plasmid-level assembly:
  1. Sequences thrown away due to the inconsistency: 784 (0.6% of the total sequences to be assembled)
  2. Sequences set aside due to multiple EL sites: 9,797 (7.5% of the total sequences to be assembled)
  3. Xs and Ys from 8,327 plasmids were connected through 9 bp TSD.
  4. The final result of this plasmid-level assembly is 66,905 putative Non-Parental Plasmid Assembly Sequences, derived from 38,526 plasmids.

Step 5: PaCE and CAP3: CAP3 is one of the most popular programs that assembles overlapping sequences and derives consensus sequences from each assembly. However, CAP3 is a computationally expensive program that uses dynamic-programming techniques; these require substantial memory and is very time-consuming for large amount of sequences. Therefore, we first used PaCE, which is a program built on suffix tree data structure for fast string comparison, to pre-cluster the combined sequences (if they share at least 35 identical bases within 40 base overlap region). Then, CAP3 was used to assemble the individual PaCE clusters (CAP3 parameters: -p 90 -s 700).

Parentals re-visit: as stated above, we have identified total 16,122 parental plasmids and excluded them from the PaCE/CAP3 assembly. Here we want to re-visit this issue and answer the following two questions:

In order to answer the above two questions, we did further analysis on all the parental clusters identified at the above Assembly Step 3.3 using a series of scripts.
  1. Build representative contigs: For each grid, a perl script (RandomXYSeqCluster.pl was used to randomly select X and Y sequences from 50 different parental plasmids from each parental cluster. The X and Y were joined together through their 9 bp TSD. Then, all the selected sequences were pooled together for each grid and ran CAP3 (-p 90 -s 700) to generate contigs. For example, there were 2 parental clusters from grid G. Then 50 random sequences from each cluster were selected and combined (100 sequences) for CAP3 to generate a representative 1 contig. Notice that the parental cluster was built using individual X and Y sequences, which means that X and Y from the same plasmids were separated in different cluster because they only shared at most 9 bp TSD region. Now that we joined X and Y, those different cluster can be merged together (that is why we had 2 clusters from G merged into only 1 representative contig).
  2. Extend representative contigs: We took the advantage of the existing huge amount of maize GSS sequences from the Methylation filtration and High Cot library to extend the contigs we built from (1).
  3. Verify parental sequences: The reason to build the above extended parental contigs is to verify all the parental sequences identified by our parental-filter. All the sequences that we claimed to be parental should be covered by such extended contigs. For each grid, we used BLASTN program (-e 1e-4) to compare the extended contigs with all the parental sequences from that grid.
  4. Deep Clean Parentals: Now that we have the parental contigs. We can use those contigs to screen the above PaCE/CAP3 contigs (from the above Step 5) by using the BLASTN program (-e 1e-4). The result further identified 508 parental assemblies (Alignment; representing 16,331 GSS sequences from the PaCE/CAP3 assembly sequences). All the parental sequence Stanford IDs (42,738 from Step 3 + 16,331 from parentals deep clean; total 24,875 plasmids).
    Table 2: Parental plasmids
    GridParental Plasmid Count
    G2,048
    H5,231
    I5,246
    K3,091
    M1,422
    P7,837

Repeats re-visit: as stated above, the GSSs were masked by maize repeat elements. The above contigs were assembled from repeat-trimmed GSS. In order to estimate the repeat contents from the original sequences obtained from RescueMu, we put back the trimmed repeats into the contigs. Specifically, 318 contigs contain repeat-trimmed GSS ans were re-assembled with CAP3 using the original GSS. The re-assembling resulted in 367 repeat-containing contigs. In addition, the above 3,153 repeat-containing GSSs (the minor pool) were also assembled into contigs.

Final Assembly results

Genomic loci: The above PaCE and CAP3 assembly is purely based on sequence overlap (i.e., if sequences share some significant overlap, they will be clustered and assembled into a contig). Then ideally, each assembly would correspond to a unique genomic location on the maize genome. Unfortunately, there are many cases where sequences that were indeed from the same genomic loci DO NOT get clustered together and are separated into different assembly. For example, contigs ZM_RM_GSStuc03-10-31.895 and ZM_RM_GSStuc03-10-31.896 do not share significant sequence overlap. But they both clearly represent the same genomic locus, because they were derived from sequences from the same plasmids. In this particular case, those sequences can not be clustered together because the 9 bp TSD can not be identified to connect the left (Y seq) and right (X seq).

Therefore, in addition to the sequence overlap, our analysis includes another single-linkage clustering using clone-pair constraints. Any contigs that share sequences from the same plasmids will be assigned to the same genomic locus. Such single-linkage clustering was carried out using a perl script. As a result, the above 25,068 contigs were clustered into 14,265 genomic loci (note: 13,187 loci were clustered from repeat-poor contigs and 1,078 loci were clustered from repeat-rich contigs). Single-linkage cluster: a member was added into a cluster if this member is a neighbor to at least one other member in the cluster.

Analysis

EST Matching: The contig sequences were matched to maize and other plant cDNA and EST contigs assembled by PlantGDB. The GeneSeqer program was used to perform spliced-alignment. Only high quality alignment outputs by GeneSeqer (hqPGS line) were considered as a valid match.

Protein Matching: The GSS contigs were BLASTX (<=1e-20) against SPTR protein collection. And GO ID were assigned to each top protein hit.

Ab initio Gene Prediction: As a complementary approach to the EST matching, we also used GENSCAN program with its default maize model. Only the predicated exons with high probablity (>=0.90) were counted.

Table 3: Genic regions
CategoryNo. of genomic loci
Total maize genomic loci discovered14,265
*Number of genic loci identified by:
Maize EST/cDNA7,555
Other Plant EST/cDNA1,253
SPTR protein database84
GENSCAN prediction708
Number of genic loci (percentage of total) 9,600 (67.3%)
*Numbers are cumulative, i.e., GSSs were first matched to maize EST/cDNAs, then the unmatched GSSs were screened against other plant EST/cDNA, and so forth.


tDNA prediction: tRNA genes were predicted from the RescueMu contigs by tRNAscan program with its default parameters.

Repeat Matching: RescueMu contigs were masked aganist TIGR Cereal Repeat Database (we used the version 2 consisting of maize, rice, barley, sorghum, and wheat repeats). The masking was carried out by VMATCH program (-seedlength 14 -hxdrop 3 -l 30 -identity 70). The same parameters gave us 24.5% and 16.2% masking, respectively, on the GSSs generated by methylation filtration (MF) and high Cot selection (HC), which is consistant with TIGR's analysis. Then, a loci was considered as a "repeat loci" if 75% or greater of its nucleotides were masked.

Table 4: Repeat regions
CategoryNo. of genomic loci
Total maize genomic loci discovered14,265
*Number of genic loci matching repeats:
Retrotransposon1,074
DNA transposon212
MITEs193
Centromere-related repeats57
Telomere-related repeats 3
Unknown repeats 221
Other repeats 8
45S rDNA 23
5S rDNA 10
tDNA25
Number of repeat loci (percentage of total)1,113 (8%)
*Numbers are not cumulative, i.e., some loci could match to both retrotransposon and DNA transposon.


If maize genome is 2/3 repetitive elements, then we have an about 8-fold insertion bias by RescueMu. Because RescueMu contigs in general is not very long, we want to determine whether the contig length affects repeat contents (the shorter you sequenced, the less likely you hit into the repeat regions?) As shown below (fig, A, B), when we compared the repeat contents of MF and HC gss contigs (version3.0 from TIGR), which is on average longer than RescueMu, longer sequences in fact show a decrease in repeat contents. Therefore, we are confident to conclude RescueMu approach indeed enrich against repetitive elements.

Figure 1. Repeat content analysis. (A). The percentage of "repeat-rich" contigs (>=75% nucleotides masked) was plotted at each length intervals.(B). Contig length frequency was plotted for HC, MF, and RM.

(A)

(B)

Genetic Mapping: All the ESTs that were genetically mapped (total 1,775 unique ESTs) on the maize genome were downloaded from the MaizeGDB. Their sequences were aligned to the above GSS contigs as described above. Total 131 matched GSS contigs were able to be plotted on IBM neighbor map coordinates system. The result clearly shows that RescueMu jumps to all the maize chromosomes as expected.

Figure 2. RescueMu GSS contigs were plotted on maize genetic map.

RescueMu Insertion Site: The RescueMu tagging approach generates not only maize genomic sequences but also potential maize mutants for functional analysis. We have located 3,999 independent RescueMu insertion sites on 3,688 GSS contigs. Note that in order to locate the RescueMu insertion site, the X and Y sequences from the same plasmids must be available and their 9 bp TSD could also be identified as described above. Since the vast majority of the sequencing plasmids do not have both X and Y sequences, we can not locate all the insertion site. However, these 3,686 GSS contigs contain 3,999 confirmed insertion site and still provide a good sample size for our estimation. Out of all the 3,999 insertion sites, 968 insertion sites are WITHIN confirmed gene structure (exon-intron structure) provided by EST/cDNA matching. 849 insertion sites are within exons (119 insertion sites are within introns), showing a strong perference for RescueMu to insert into exons, which is consistent with Mu being an efficient mutagen.

Figure 3. Sequence logo and position-specific score matrix generated from the TSD sites and its up- and down- stream three bases.

Figure 4. Information contents of TSD region profiles derived from exon, intron and unknown regions.