Software Tools for Big-data Analyses


K-shuff is a powerful computer program designed to identify spatial clustering in a given dataset based on the reduced second moment measure, or K-function (Jangid et al., 2016). In essence, K-shuff can be adapted for comparing any data from two (or more) samples to understand their relationship with each other. As an example, we adapt this technique to compare 16S rRNA gene sequence libraries from different environmental samples by treating gene sequences as points in space with hundreds of dimensions. Inspired by Ripley’s K-function for spatial point pattern analysis, the Intra K-function or IKF measures the structural diversity, including both the richness and overall similarity of the sequences, within a library. The Cross K-function or CKF measures the compositional diversity between gene libraries, reflecting both the number of OTUs shared as well as the overall similarity in OTUs. A Monte Carlo testing procedure then enables statistical evaluation of both the structural and compositional diversity between gene libraries.

K-shuff is available for download here

An opportunity is available in the MICROBIAL LAB to develop an online submission tool for K-shuff. If you are interested, get in touch with Kamlesh.

Data Extraction Tools (DETools)

DETools is a set of three scripts (SeqEx, DistEx and CutOff) that extract a small subset of data from a larger dataset. Whether you are working with sequences in the FASTA format or a distance matrix prepared using the PHYLIP package, preparing SAMPLE files for running multiple LIBSHUFF comparisons, or simply wish to know which pairs of organisms in a distance matrix share distance values below or above a user-determined cutoff, these tools are for you. Instead of redoing alignments for every subset, you simply extract the data from one BIG dataset, saving yourself a lot of time and energy.

Version 1 (v1) of these scripts was originally written in SCILAB, the open source platform for numerical computation, by Rajesh and Kamlesh Jangid. We later rewrote them in C++ as v2 for faster and more user-friendly computation.

Where is it available?

The Windows Installer of DETools or separate exe files for the three scripts may be obtained from Kamlesh.

Sequence Extraction (SeqEx)

The SeqEx utility extracts a set of sequences from a larger sequence dataset. It is specifically written for sequences of no more than 7682 characters in length; shorter sequences pose no problem. It works with both aligned and unaligned sequence files in the FASTA format. The length limit makes it usable with the Nearest Alignment Space Termination (NAST) aligner of the Greengenes database.

How does it work?

For the SeqEx utility to work, you must have two INPUT files: 1) the BIG sequence file in FASTA format from which you want to extract a set of sequences, and 2) a list file in TEXT format naming the sequences you wish to extract, with one sequence ID per line. Make sure that all the sequence IDs in the list file are present in the BIG dataset. Execution will be easier if the SeqEx executable is in the same folder as the two input files. Simply double-click on the executable, then enter the name of the list file, the name of the BIG sequence file, and the output file name. Your smaller sequence FASTA file is now ready.
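The extraction step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the SeqEx source code: the function names are hypothetical, the sequence ID is assumed to be the first whitespace-delimited word of each FASTA header, and the output is assumed to follow the order of the list file.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping sequence ID to sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first word of the header line.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {seq_id: "".join(chunks) for seq_id, chunks in records.items()}


def extract_sequences(fasta_text, wanted_ids):
    """Return FASTA text holding only the wanted IDs, in list order."""
    records = read_fasta(fasta_text)
    missing = [seq_id for seq_id in wanted_ids if seq_id not in records]
    if missing:
        # Mirrors the advice that every listed ID must be in the BIG file.
        raise KeyError("IDs not found in the BIG dataset: " + ", ".join(missing))
    return "".join(">" + seq_id + "\n" + records[seq_id] + "\n"
                   for seq_id in wanted_ids)
```

In practice the list file would be read one ID per line and the returned text written to the output file; the sketch does not enforce the 7682-character limit.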

Matrix Extraction (DistEx)

The DistEx utility extracts the distance matrix for a set of sequences that form part of a larger distance matrix. It is very useful when you have to run multiple LIBSHUFF comparisons for a set of libraries in different combinations.

How does it work?

The DistEx utility works similarly to SeqEx, described above, and also takes two INPUT files: 1) the BIG distance matrix file in the PHYLIP format, and 2) a sequence list file in TEXT format. The list file contains the sequence IDs (one ID per line) for which you wish to extract the distance matrix. Make sure that all the sequence IDs in the list file are present in the BIG matrix.

DistEx will generate two output files: a PHYLIP-formatted distance matrix and a LIBSHUFF-compatible SAMPLE file. The SAMPLE file can be used directly as input to LIBSHUFF. The DistEx output follows the order of the input sequence list and is independent of the order in which the sequences appear in the BIG matrix. As with SeqEx, execution will be simpler if the DistEx executable is in the same folder as the two input files. Simply double-click on the executable, then enter the name of the list file, the name of the BIG matrix file, and the output file names. Your smaller distance matrix is now ready.
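The sub-matrix extraction can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the DistEx implementation: the function names are hypothetical, the matrix is assumed to be square with each row on a single line, and names are assumed to be whitespace-delimited rather than fixed-width. The SAMPLE file is not reproduced here because its layout is not described on this page.

```python
def parse_phylip_square(text):
    """Read a square PHYLIP-style distance matrix into (names, rows)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    count = int(lines[0].split()[0])  # first line holds the number of taxa
    names, rows = [], []
    for ln in lines[1:count + 1]:
        fields = ln.split()
        names.append(fields[0])
        rows.append([float(value) for value in fields[1:]])
    return names, rows


def extract_submatrix(names, rows, wanted):
    """Return the distances among the wanted IDs, in the order of the list."""
    index = {name: i for i, name in enumerate(names)}
    missing = [w for w in wanted if w not in index]
    if missing:
        raise KeyError("IDs not found in the BIG matrix: " + ", ".join(missing))
    picks = [index[w] for w in wanted]
    return [[rows[i][j] for j in picks] for i in picks]


def format_phylip(names, rows):
    """Write names and rows back out as a square PHYLIP-style matrix."""
    out = [str(len(names))]
    for name, row in zip(names, rows):
        out.append(name + " " + " ".join("%.4f" % value for value in row))
    return "\n".join(out) + "\n"
```

Because the rows and columns are picked by looking up each listed ID, the output order depends only on the list file, which matches the behaviour described above.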

Sequence CutOff (CutOff)

The CutOff utility works on either a similarity matrix or a distance matrix. It FILTERS OUT those pairs of sequences whose similarity or distance values are lower than the given cutoff, so only pairs that are not below the cutoff remain in the output. Hence, with a similarity matrix as input, the output file contains the most similar sequence pairs. In contrast, with a distance matrix as input, it contains the most dissimilar pairs.

How does it work?

For the CutOff utility to work, all you need is a PHYLIP-generated distance matrix or a similarity matrix. The similarity matrix can be prepared from the PHYLIP distance matrix using the JukesCantor MS Word macro written by Jose Gonzalez. CutOff will generate a single output file in tab-delimited format. Simply double-click on the executable, then enter the name of the matrix file, the cutoff value, and the output file name. Your sequence pairs are now ready.
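The filtering step can be illustrated with a short Python sketch. It is not the CutOff source code: the function names and the column layout of the tab-delimited output (first ID, second ID, value) are assumptions, as is the choice to keep pairs whose value equals the cutoff and to report each pair only once.

```python
def pairs_at_or_above(names, rows, cutoff):
    """Keep each pair of sequences whose matrix value is not below the cutoff."""
    kept = []
    for i in range(len(names)):
        # Visit only j > i so that each unordered pair is reported once
        # and the diagonal (a sequence against itself) is skipped.
        for j in range(i + 1, len(names)):
            if rows[i][j] >= cutoff:
                kept.append((names[i], names[j], rows[i][j]))
    return kept


def to_tsv(pairs):
    """Format the kept pairs as tab-delimited lines: ID, ID, value."""
    return "".join("%s\t%s\t%s\n" % (a, b, value) for a, b, value in pairs)
```

The same function serves both cases: given a similarity matrix it keeps the most similar pairs, and given a distance matrix it keeps the most dissimilar ones.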