Sailgene Technology

Genome Survey Analysis: Three Key Parameters You Must Know Before Starting Your Next Genome Project

Release time: 2026-06-11 08:49:34

Before you invest heavily in sequencing a new species, do you know its genome size, heterozygosity, repeat content or ploidy? Guessing can lead to wasted budget, failed assemblies, and wrong tool choices. This guide helps you understand what a genome survey analysis can do for you and introduces our one‑stop service to get it done efficiently.

1. Are You Really Ready to Start a New Genome Project?

Have you ever run into these situations?

You know nothing about the genome size of a new species. How much sequencing data should you order?
Your sample might be highly heterozygous or repeat-rich. Will standard assemblers work?
You suspect polyploidy but aren’t sure which assembly strategy to choose?

If you sequence by rule of thumb alone, you risk:

Insufficient depth, which leads to a fragmented assembly and missing key genes.
Excessive depth, which wastes budget and lengthens turnaround.
The wrong assembler: a diploid assembler run on a polyploid genome gives unusable results.

A simple solution is to perform a genome survey analysis (also called genome profiling) before investing in full-scale sequencing. With just a small amount of data, you can estimate genome size, heterozygosity, repeat content and ploidy – a roadmap for your entire project.

2. Our Solution: 30-50× NGS Data + k-mer Analysis to Estimate Three Core Parameters

Core method: k-mer frequency distribution analysis

Platform: NGS (Illumina or DNBSEQ), 30-50× depth
Cut sequencing reads into overlapping substrings of length k, called k-mers (usually k=19) (Fig. 1).
Count how many times each k-mer appears and plot the frequency histogram.
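
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not how production counters such as Jellyfish or KMC work internally. The function names `count_kmers` and `kmer_histogram` are made up for this example, and the sketch counts k-mers on one strand only, whereas survey pipelines normally count canonical (strand-independent) k-mers.

```python
from collections import Counter

def count_kmers(seq, k):
    """Slide a window of length k over seq and count every k-mer."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_histogram(counts):
    """Map each occurrence count to the number of distinct k-mers that have it."""
    return Counter(counts.values())

# Sequence from Fig. 1, cut into 4-mers.
counts = count_kmers("ACGAGGTACGA", 4)
hist = kmer_histogram(counts)

for depth in sorted(hist):
    print(depth, hist[depth])
```

Here ACGA is the only 4-mer seen twice, and the other six occur once. With real data, the same histogram (depth on the x-axis, number of distinct k-mers on the y-axis) is what gets plotted.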

Figure 1. 4-mer example from the ACGAGGTACGA sequence.

(Source: https://medium.com/swlh/bioinformatics-1-k-mer-counting-8c1283a07e29)

From this plot we obtain estimates of genome size, heterozygosity and repeat content (Fig. 2).

Figure 2. Schematic diagram of k‑mer frequency distribution.

(Source: http://www.zhangzhiyuan.site/archives/kmer-ping-gu-ji-yin-zu-da-xiao)

With only 30–50× short-read data, you get a complete blueprint for your project.

3. How to Do It: Platform Recommendations + Handling Unknown Genome Size

a.Platform & Depth Recommendations

Platform: Illumina or DNBSEQ (paired-end, 150–300 bp insert)
Recommended depth: 30–50× (suitable for most animal/plant genomes)
Library type: short-insert library (~350 bp)
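
As a rough guide to how much data to order, the target depth can be converted into bases and read pairs. The sketch below assumes 2 × 150 bp paired-end reads and a hypothetical 1 Gb genome. `required_bases` and `required_read_pairs` are illustrative names, not part of any tool.

```python
import math

def required_bases(genome_size_bp, depth):
    """Total bases needed to reach the target depth."""
    return genome_size_bp * depth

def required_read_pairs(genome_size_bp, depth, read_length=150):
    """Read pairs needed, assuming two reads of read_length per pair."""
    return math.ceil(required_bases(genome_size_bp, depth) / (2 * read_length))

genome_size_bp = 1_000_000_000  # hypothetical 1 Gb genome
for depth in (30, 50):
    gb = required_bases(genome_size_bp, depth) / 1e9
    pairs = required_read_pairs(genome_size_bp, depth)
    print(f"{depth}x: {gb:.0f} Gb of data, about {pairs / 1e6:.0f} million read pairs")
```

If the genome size is only a guess, running the calculation for the low and high ends of the plausible range shows how much the order would change.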

b.What If the Genome Size Is Completely Unknown?

You can estimate it in two ways:

Method 1. Search related species
Visit public databases for genome sizes of closely related species:

NCBI Genome Database
Ensembl
Plant DNA C-values Database (plants)
Animal Genome Size Database (animals)

Method 2. Experimental estimation
Use flow cytometry or Feulgen densitometry to directly measure nuclear DNA content. This is the most reliable pre-estimation method.
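
Flow cytometry and Feulgen densitometry report DNA content in picograms, so the value has to be converted to base pairs before planning depth. A minimal sketch, using the widely cited conversion 1 pg ≈ 978 Mb. The function name and the 2.5 pg input are hypothetical, and the input is assumed to be the haploid (1C) value.

```python
MB_PER_PG = 978.0  # 1 pg of DNA is approximately 978 Mb

def c_value_to_mb(c_value_pg):
    """Convert a 1C DNA content in picograms to megabases."""
    return c_value_pg * MB_PER_PG

c_value_pg = 2.5  # hypothetical 1C value from flow cytometry
size_mb = c_value_to_mb(c_value_pg)
print(f"{c_value_pg} pg is roughly {size_mb:.0f} Mb")
```

If the measurement is reported as a 2C value, halve it before converting.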

If you are unsure how to proceed, we can help with database searches or arrange flow cytometry services.

4. Case Study Gallery: Real k-mer Plots of Diploid, Triploid, Tetraploid and Hexaploid Genomes

By looking at the shape and peaks of a k-mer distribution plot, you can quickly judge the ploidy and complexity of your species. The following k-mer distribution plots illustrate characteristic patterns observed in diploid, triploid, tetraploid and hexaploid genomes. All plots shown here were generated with GenomeScope2 (2) from previously published data (see references).

Case 1. Diploid Genome – One Major Peak

Example: Unio delphinus (freshwater mussel) (1)

Genome size: 2.31 Gb
Heterozygosity: 0.64%
Repeat content: 46.8%

Figure 3. Diploid k‑mer plot of Unio delphinus (k=25).
(Source: https://doi.org/10.1038/s41597-023-02251-7 )

Case 2. Triploid Genome – Three Peaks

Example: Meloidogyne enterolobii (root-knot nematode)

Genome size: 89.57 Mb
Heterozygosity: 0.935%
Repeat content: 38.1%

Figure 4. Triploid k‑mer plot of Meloidogyne enterolobii (k=25).

(source: https://doi.org/10.1038/s41467-020-14998-3 )

Case 3. Tetraploid Genomes – Allotetraploid vs Autotetraploid

Allotetraploid (e.g. Gossypium barbadense, Sea Island cotton): peak pattern shows aaab < aabb
Autotetraploid (e.g. Solanum tuberosum, potato): peak pattern shows aaab > aabb
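
The comparison above can be written as a simple rule of thumb. The sketch below assumes `aaab` and `aabb` are the two heterozygosity percentages reported for a tetraploid model fit (for example by GenomeScope2). The function name and the example numbers are hypothetical, and a real call should also weigh smudgeplot and cytological evidence.

```python
def tetraploid_type(aaab, aabb):
    """Apply the aaab vs aabb rule to a tetraploid k-mer profile."""
    if aaab < aabb:
        return "allotetraploid-like"
    if aaab > aabb:
        return "autotetraploid-like"
    return "ambiguous"

# Hypothetical percentages, for illustration only.
print(tetraploid_type(aaab=0.4, aabb=5.1))
print(tetraploid_type(aaab=3.2, aabb=0.6))
```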

Figure 5. K‑mer plots of allotetraploid Gossypium barbadense and autotetraploid Solanum tuberosum (k=21).

(source: https://doi.org/10.1038/s41467-020-14998-3)

Case 4. Hexaploid Genome – Broad, Complex Distribution

Example: Triticum aestivum (bread wheat)
Multiple overlapping peaks with a wide frequency range, reflecting high complexity.

Figure 6. Hexaploid k‑mer plot of Triticum aestivum (k=21).

(source: https://doi.org/10.1038/s41467-020-14998-3)

Note: If your k-mer plot looks messy or suggests polyploidy, do not rely solely on GenomeScope2. We recommend smudgeplot or PloidyFrost for more accurate ploidy inference (see next section).

5. Contamination & Polyploid Handling

a.Sample Contamination Assessment

Before interpreting k-mer results, you must verify sample purity.

Method: Align reads against the NCBI NT database
Threshold: If contamination exceeds 5% (bacteria, fungi or other non-target DNA), re-prepare the sample and resequence.
Our survey service includes contamination screening by default.
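
Once reads have been assigned to taxa, the 5% rule is a straightforward check. The sketch below assumes a hypothetical dictionary of read counts per taxonomic group, such as might be tallied from NT alignments of a read subsample. The labels, counts and function names are invented for illustration.

```python
def contamination_fraction(read_counts, target):
    """Fraction of classified reads that do not belong to the target group."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no classified reads")
    return 1 - read_counts.get(target, 0) / total

def passes_purity_check(read_counts, target, threshold=0.05):
    """True when contamination does not exceed the threshold (5% by default)."""
    return contamination_fraction(read_counts, target) <= threshold

# Hypothetical tallies from a subsample of 10,000 reads.
read_counts = {"Mollusca": 9_300, "Bacteria": 550, "Fungi": 150}
frac = contamination_fraction(read_counts, "Mollusca")
print(f"contamination: {frac:.1%}, pass: {passes_purity_check(read_counts, 'Mollusca')}")
```

In this made-up example 7% of reads are non-target, so the sample would be flagged for re-preparation.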

b.How to Handle a Polyploid Genome?

When the k-mer plot suggests polyploidy but there is no literature support, follow a two-step strategy:

Step	Method	Output
Computational prediction	smudgeplot / PloidyFrost on the same k-mer data	Ploidy estimate (triploid, autotetraploid, etc.)
Experimental validation	Flow cytometry + karyotyping	Cytological confirmation

Figure 7. Example of smudgeplot output for ploidy inference.

(source: https://github.com/KamilSJaron/smudgeplot)

We can help design a combined “sequencing + cytogenetic validation” package to ensure your assembly strategy is correct.

6. Frequently Asked Questions (FAQ)

Q1: How exactly are genome size, heterozygosity and repeat content calculated from k-mers?

Genome size ≈ (total number of k-mers, counting every occurrence) / (k-mer depth at the main homozygous peak)
Heterozygosity ≈ (heterozygous k-mers) / (total k-mers) – visible in a diploid as a secondary peak at about half the depth of the main peak
Repeat content ≈ 1 – (unique k-mers / total k-mers)

These calculations are built into tools like GenomeScope2 and GCE, so you don’t need to compute them manually.
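
To make the genome-size arithmetic concrete, here is a minimal sketch that works on a k-mer histogram (depth mapped to the number of distinct k-mers at that depth). It is not the model that GenomeScope2 or GCE fits. It simply drops low-depth k-mers as likely errors, takes the tallest remaining peak as the main peak, and divides the total number of k-mer occurrences by that depth. The `min_depth` cutoff, the function name and the toy histogram are assumptions for illustration.

```python
def estimate_genome_size(hist, min_depth=5):
    """Estimate genome size as total k-mer occurrences / main-peak depth.

    hist maps depth -> number of distinct k-mers observed at that depth.
    Depths below min_depth are treated as sequencing errors and ignored.
    """
    solid = {d: n for d, n in hist.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)              # tallest peak
    total_kmers = sum(d * n for d, n in solid.items())  # every occurrence
    return total_kmers / peak_depth, peak_depth

# Toy histogram: an error spike at depth 1-2, a main peak at 20, repeats at 40.
hist = {1: 1_000_000, 2: 50_000, 19: 100, 20: 1_000, 21: 100, 40: 50}
size, peak = estimate_genome_size(hist)
print(f"main peak at {peak}x, estimated genome size of about {size:.0f} bp")
```

In a highly heterozygous sample the tallest peak may be the heterozygous one, in which case this naive estimate would come out at about twice the true size. That is one reason to inspect the histogram by eye.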

Q2: Can different software (GCE vs GenomeScope2) give different results?

A: Yes, because they use different statistical models. We recommend:

Run both tools and compare results
Visually inspect the raw k-mer histogram to locate the main peak
If needed, calibrate with flow cytometry data

Q3: My k-mer plot looks polyploid but there is no literature support. What should I do?

Run smudgeplot for a computational prediction
If still uncertain, perform flow cytometry and chromosome counting
Use the experimental results to constrain your software parameters

Q4: How do I remove contamination effects from k-mer analysis?

A: Use strict quality trimming (fastp or Trimmomatic) and map to a contaminant database. If contamination >5%, do not proceed – repurify the sample and resequence.

Q5: How to choose k-mer length? Why is it usually odd?

For most animals and plants: k=19 (long enough that most k-mers are unique in the genome, short enough to limit the number of k-mers corrupted by sequencing errors)
Adjustable range: 17, 19, 21
An odd k prevents a k-mer from being its own reverse complement and gives each k-mer a well-defined central base, which some algorithms rely on.
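
The reverse-complement argument can be checked directly. The sketch below enumerates all k-mers for a few small values of k and counts those that equal their own reverse complement. The helper names are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Return the reverse complement of a DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]

def count_self_complementary(k):
    """Count k-mers that are identical to their own reverse complement."""
    return sum(
        1
        for bases in product("ACGT", repeat=k)
        if "".join(bases) == reverse_complement("".join(bases))
    )

for k in (3, 4, 5, 6):
    print(k, count_self_complementary(k))
```

For odd k the count is always zero, because the middle base would have to be its own complement. For even k, sequences such as ACGT read the same on both strands.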

7. Our Service Package: One-Stop Genome Survey Analysis

We offer a standardised Genome Survey Service that saves you time and effort:

Service Component	Description
DNA extraction & QC	Ensure sample quality
Short-read sequencing	Illumina / DNBSEQ, 30–50×, paired-end 150 bp
Standard k-mer analysis	Dual software: GenomeScope2 + GCE
Contamination screening	NT database alignment; alert if >5% contamination
Ploidy inference	Smudgeplot added if needed
Final report	Genome size, heterozygosity, repeat content, recommended assembly strategy
Optional add-ons	Flow cytometry, karyotyping (additional turnaround)

Turnaround time: 2–3 weeks (from sample receipt to report)
Delivery format: PDF report + optional raw data

References:

1. Gomes-Dos-Santos A, et al. PacBio Hi-Fi genome assembly of the Iberian dolphin freshwater mussel Unio delphinus Spengler, 1793. Sci Data. 2023.
2. Ranallo-Benavidez TR, et al. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat Commun. 2020.
