Efficient Software for Archiving and Retrieving Results of Massive Bioinformatics Analyses in High-Performance Computing Environments
Time: Friday, 19 November 2021, 9:15am - 9:45am CST
Description: Modern sequencing and computing in the biomedical and agricultural fields generate thousands of samples every day. Bioinformatics workflows produce vast amounts of data, which need to be managed. This creates bottlenecks when multiple workflows run simultaneously and access the file system. The difficulty stems chiefly from the very large number of files, often in a highly nested directory structure, and from a heterogeneous distribution of file sizes dominated by large numbers of very small files. Parallel file systems such as Lustre and GPFS, as well as tape archives, can perform poorly under these circumstances because of the overabundance of metadata. Packaging files into archives eases the I/O burden when the collection is moved or archived. Standard packaging utilities, such as tar and zip, do not scale well with the size of the data for this particular use case. This manuscript reviews three parallel alternatives and shows their performance on high-performance computing systems.