A Java-based FASTQ file generator for bioinformatics testing and development.
FastqGen is a lightweight tool that generates simulated FASTQ files containing random DNA sequences. FASTQ is a text-based format that stores a biological sequence (usually a nucleotide sequence) together with its corresponding quality scores. This tool is particularly useful for:
- Testing bioinformatics pipelines
- Benchmarking sequence analysis tools
- Development and debugging of NGS data processing applications
- Educational purposes in bioinformatics
- Generates valid FASTQ format files
- Command-line configurable sequence length and file size
- Random DNA sequence generation using standard nucleotides (A, T, C, G)
- Phred quality score simulation
- Automatic timestamp-based file naming
The program requires two command-line arguments:
java -jar fastqgen.jar <sequenceLength> <fileSizeInMB>
Parameters:
- sequenceLength: Length of each DNA sequence in base pairs (e.g., 100)
- fileSizeInMB: Desired output file size in megabytes (e.g., 10)
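To give a feel for how the two arguments interact, the sketch below estimates how many records fit into the requested file size. It is not taken from the FastqGen source. The class and method names are hypothetical, and it assumes that 1 MB means 1,048,576 bytes, that line endings are a single '\n', and that identifiers have a fixed width.

```java
final class RecordCountSketch {
    // Assumption: 1 MB = 1024 * 1024 bytes; the tool itself may use 1,000,000.
    static final long BYTES_PER_MB = 1024L * 1024L;

    /** Bytes taken by one four-line record, assuming one-byte '\n' line endings. */
    static long bytesPerRecord(int sequenceLength, int idLength) {
        long idLine = 1L + idLength + 1L;        // '@' + identifier + '\n'
        long sequenceLine = sequenceLength + 1L; // bases + '\n'
        long separatorLine = 2L;                 // '+' + '\n'
        long qualityLine = sequenceLength + 1L;  // one quality char per base + '\n'
        return idLine + sequenceLine + separatorLine + qualityLine;
    }

    /** Number of whole records that fit into the requested size. */
    static long estimateRecords(int sequenceLength, int fileSizeInMB, int idLength) {
        if (sequenceLength <= 0 || fileSizeInMB <= 0 || idLength <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        return (fileSizeInMB * BYTES_PER_MB) / bytesPerRecord(sequenceLength, idLength);
    }

    public static void main(String[] args) {
        // Hypothetical 8-character identifiers, 150 bp reads, 20 MB target.
        System.out.println(estimateRecords(150, 20, 8));
    }
}
```

Under these assumptions, 150 bp reads with 8-character identifiers take 314 bytes per record, so a 20 MB target holds roughly 66,788 records. The actual count depends on how FastqGen names its sequences and measures file size.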
Example:
java -jar fastqgen.jar 150 20 # Generates sequences of 150bp with a total file size of 20MB
The output filename is generated automatically in the format: simulated_YYYYMMDDHHMMSS.fastq
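A name in the simulated_YYYYMMDDHHMMSS.fastq pattern can be produced with the standard java.time API. This is a minimal sketch rather than the tool's own code: the class and method names are hypothetical, and it assumes the timestamp comes from the local system clock.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class OutputNameSketch {
    // yyyyMMddHHmmss: 4-digit year, month, day, 24-hour clock, minutes, seconds.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Builds a file name such as simulated_20240115093005.fastq. */
    static String outputName(LocalDateTime when) {
        return "simulated_" + STAMP.format(when) + ".fastq";
    }

    public static void main(String[] args) {
        // Assumption: local time; the real tool may use a different clock or zone.
        System.out.println(outputName(LocalDateTime.now()));
    }
}
```

Note that a second-resolution timestamp only distinguishes files created at least one second apart.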
Each entry in the generated FASTQ file consists of four lines:
- Sequence identifier (starts with '@')
- Raw sequence letters
- Separator line (starts with '+')
- Quality scores (encoded in ASCII)
Example:
@SEQ1
ATCGATCG...
+
IIIIIII...
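The four-line layout above can be assembled with a small helper like the following. This is an illustrative sketch with hypothetical names, not FastqGen's implementation. It enforces the one constraint the format places on a record: the quality line must have exactly one character per base.

```java
final class FastqRecordSketch {
    /** Formats one FASTQ record: '@' + id, bases, '+', quality string. */
    static String formatRecord(String id, String bases, String quality) {
        if (bases.length() != quality.length()) {
            throw new IllegalArgumentException(
                    "sequence and quality lines must have the same length");
        }
        // Each of the four lines ends with a newline.
        return "@" + id + "\n" + bases + "\n+\n" + quality + "\n";
    }

    public static void main(String[] args) {
        // Prints a four-line record for an 8-base read.
        System.out.print(formatRecord("SEQ1", "ATCGATCG", "IIIIIIII"));
    }
}
```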
- Java JDK 8 or higher
- Maven (for building)
mvn clean package
After building, you can run the application using the generated jar in the target directory:
java -jar fastqgen.jar <sequenceLength> <fileSizeInMB>
Example:
java -jar fastqgen.jar 100 10 # Generates 100bp sequences in a 10MB file
- Quality scores are generated using Phred-like quality values
- Sequences are randomly generated using equal probabilities for A, T, C, G
- Each generated file includes a unique timestamp in its name for easy identification
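The generation notes above can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: the names are hypothetical, quality values are drawn uniformly from 0 to 40, and they are encoded with the common Phred+33 offset (so Q40 becomes 'I'). The source only says the scores are "Phred-like", so the real range and distribution may differ.

```java
import java.util.Random;

final class RandomReadSketch {
    private static final char[] BASES = {'A', 'T', 'C', 'G'};
    static final int PHRED_OFFSET = 33; // assumption: Phred+33 ASCII encoding
    static final int MAX_QUALITY = 40;  // assumption: scores range from 0 to 40

    /** Picks each base independently with probability 1/4. */
    static String randomBases(Random rng, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(BASES[rng.nextInt(BASES.length)]);
        }
        return sb.toString();
    }

    /** Draws one quality score per base and encodes it as an ASCII character. */
    static String randomQualities(Random rng, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int q = rng.nextInt(MAX_QUALITY + 1); // uniform in [0, 40]
            sb.append((char) (q + PHRED_OFFSET)); // '!' (Q0) through 'I' (Q40)
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Random rng = new Random();
        System.out.println(randomBases(rng, 20));
        System.out.println(randomQualities(rng, 20));
    }
}
```

Passing a seeded Random (for example, new Random(42)) makes the output reproducible, which can help when the generated files are used in regression tests.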
MIT License
Contributions are welcome! Please feel free to submit a Pull Request.