Grep unique reads

12/13/2023

I've chosen this file as my test file, the correct answers being:

Number of reads: 67051220

Next we want to find the fastest way possible to count these. All timings are the average wall-clock time (real) of 10 runs, collected with the bash time builtin on an otherwise unloaded system. The methods compared include zgrep, gzip awk, and Konrad's gzip awk wc variant (fix_base_count()).

Using gzip we gain about another 10 seconds:

    gzip -dc ERR047740_1.

Average run-time is 116.69 seconds. The slowest method has an average run-time of 125.35 seconds.

The zcat to /dev/null reference is the following:

    $ time zcat SRR077487_2.filt.fastq.

Most of the subcommands do not read whole FASTA/Q records into memory, including stat, fq2fa, fx2tab, tab2fx, grep, locate, replace, seq, sliding.

I need to make a list in a selection box of just one name of each make.
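One way to get each make only once is to drop everything after the colon and de-duplicate what is left. This is a minimal sketch, assuming the file holds one make:detail entry per line (for example Audi:Warranty). The helper name unique_makes and the file name makes.txt are hypothetical and do not come from the original post.

```shell
#!/bin/sh
# Print each make once, assuming one "Make:Detail" entry per line.
# unique_makes is a hypothetical helper name used only for illustration.
unique_makes() {
    # cut keeps the text before the first ':'; sort -u drops duplicates.
    # With no file arguments, cut reads standard input.
    cut -d: -f1 "$@" | sort -u
}

# Inline sample data; with a real file this would be: unique_makes makes.txt
printf '%s\n' Audi:Warranty Audi:Pricing Audi:Colors \
              Acura:Warranty Acura:Pricing Acura:Colors | unique_makes
```

Because sort -u also sorts, the makes come out in alphabetical order and not in file order. If the original order matters, awk -F: '!seen[$1]++ { print $1 }' is a common alternative that keeps the first occurrence of each make.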

Grep Unique (posted by macbb1117, 11-30-2010): Hello, I have a file with a list of car makes and specific information for each make. An example is:

Audi:Warranty
Audi:Pricing
Audi:Colors
Acura:Warranty
Acura:Pricing
Acura:Colors

and so on through a bunch of makes.

First off, for benchmarks with FASTQ it's best to use a specific real-world example with a known answer.

It's difficult to get this to go massively quicker, I think: as with this question, working with large gzipped FASTQ files is mostly IO-bound. We could instead focus on making sure we are getting the right answer. People deride them too often, but this is where a well-written parser is worth its weight in gold.

I downloaded the example tarball and modified the example code (excuse my C):

    #include
    fprintf(stderr, "Usage: %s \n", argv[0]);
    seqlen = seqlen + (long)strlen(seq->seq.s);
    printf("Number of sequences: %d\n", seqcount);
    printf("Number of bases in sequences: %ld\n", seqlen);

For my example file (~35m reads of ~75bp) this took: real 0m49.670s
Compared with your example: real 0m43.616s

So, I get pretty close in speed, but am likely to be more standards compliant. Also, this solution gives you more flexibility with what you can do with the data. And my horrible C can almost certainly be optimised.

Same test, with kseq.h from GitHub, as suggested in the comments. My machine is under different load this morning, so I've retested:

Konrad's solution (in my hands): real 0m39.682s
(By the way, just zcat-ing the data file to /dev/null): real 0m38.736s

So the most recent version of kseq.h is faster than simply zcat-ing the file (consistently in my tests).
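The two numbers being benchmarked here, reads and bases, can also be computed with a short awk filter. This is only a sketch. It assumes a well-formed FASTQ file with exactly four lines per record and no wrapped sequence lines, which is the assumption a proper parser such as kseq.h avoids. The helper name count_fastq and the file name reads.fastq.gz are hypothetical.

```shell
#!/bin/sh
# Count reads and bases in FASTQ text arriving on standard input.
# Assumes strict 4-line records: header, sequence, '+', quality.
# count_fastq is a hypothetical helper name used only for illustration.
count_fastq() {
    # Line 2 of every 4-line record is the sequence.
    # %.0f is used so large totals print in full on any awk.
    awk 'NR % 4 == 2 { reads++; bases += length($0) }
         END {
             printf "Number of reads: %.0f\n", reads
             printf "Number of bases in sequences: %.0f\n", bases
         }'
}

# Tiny inline example with two reads of 4 and 6 bases.
# For a real file: gzip -dc reads.fastq.gz | count_fastq
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' | count_fastq
```

Prefixing the real-file pipeline with the bash time builtin gives the wall-clock (real) figure used in the comparisons above.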
