MCL clustering for *Base

 

Docs PDF attached below.

Start with all GenBank (.gb) files from a *Base in a directory called "genomes".
Convert them all to fasta:
dir_loop.py -c 'genbank_to_fasta.py -q locus_tag -i' -d ./

Cat them into one file in the parent directory (the remaining steps are run from there):
cat *.fasta > ../features.fasta 

 

Search and replace to remove the trailing "_" from all sequences. I think these are translations of stop codons?
Search and replace all "GI:" with "GI_".
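
The two search-and-replace steps above can also be scripted. A minimal sketch with sed, assuming the unwanted "_" is the last character of a sequence line and never part of a header line; the function name clean_fasta is made up for illustration:

```shell
# Hypothetical helper: reads FASTA on stdin, writes cleaned FASTA on stdout.
clean_fasta() {
    # First expression: on non-header lines, drop a trailing "_".
    # Second expression: replace every "GI:" with "GI_".
    sed -e '/^>/!s/_$//' -e 's/GI:/GI_/g'
}

# Small made-up record to show the effect.
printf '>GI:123 demo\nMKV_\n' | clean_fasta
```

To apply it to the real data, something like `clean_fasta < features.fasta > features.tmp && mv features.tmp features.fasta` should work (features.tmp is just a placeholder name).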

create_blast_databases.py -f features.fasta -o ./ -F -p

Reciprocal Blast (all-vs-all: the features are searched against the database built from the same file):
blastp -query features.fasta -db features_aa -out features.blast -outfmt 6 -num_threads 4

cut -f 1,2,11 features.blast > features.abc
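
With -outfmt 6 and no custom column list, blastp writes 12 tab-separated columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. Fields 1, 2 and 11 are therefore the query ID, the subject ID and the E-value. A quick check on one made-up line (the IDs and numbers are invented):

```shell
# One invented BLAST tabular line in the default -outfmt 6 column order.
abc_line=$(printf 'geneA\tgeneB\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-50\t380\n' | cut -f 1,2,11)

# Shows query ID, subject ID and E-value, which is the "abc" format mcxload reads.
printf '%s\n' "$abc_line"
```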

Process and prepare data for mcl. Transform the E-values into a more useful similarity metric (negative log10, capped at 200).
mcxload -abc features.abc -write-tab features.dict -o features.mci --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' 
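
As a rough illustration of what --stream-neg-log10 followed by -stream-tf 'ceil(200)' does to the numbers, the sketch below applies the same arithmetic with awk to two invented .abc lines. It is not a replacement for mcxload: it does not mirror the edges or write the mci format, and the function name neg_log10_capped is made up.

```shell
# Illustration only: turn the E-value in column 3 into -log10(E), capped at 200.
neg_log10_capped() {
    awk -F '\t' '{
        # An E-value of 0 has no finite logarithm, so it gets the cap directly.
        w = ($3 > 0) ? -log($3) / log(10) : 200
        if (w > 200) w = 200
        printf "%s\t%s\t%.2f\n", $1, $2, w
    }'
}

# Two invented hits: one with E-value 1e-50, one with E-value 0.
printf 'geneA\tgeneB\t1e-50\ngeneA\tgeneC\t0.0\n' | neg_log10_capped
```

The first hit ends up with weight 50.00 and the second with the cap, 200.00, so stronger hits get larger edge weights.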

mcl features.mci -I 1.4 && mcl features.mci -I 2 && mcl features.mci -I 6 && mcl features.mci -I 1.2

For each of the 4 outfiles (I12, I14, I20, I60), dump the clusters with the original sequence labels:
mcxdump -icl out.features.mci.I12 -o dump.features.mci.I12 -tabr features.dict
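
To avoid typing the command four times, the suffixes can be looped over. The sketch below only prints the commands so they can be checked first; the suffix list is taken from the file names used in these notes, and the function name dump_commands is made up.

```shell
# Print (not run) one mcxdump command per clustering.
dump_commands() {
    for suffix in I12 I14 I20 I60; do
        printf 'mcxdump -icl out.features.mci.%s -o dump.features.mci.%s -tabr features.dict\n' \
            "$suffix" "$suffix"
    done
}

dump_commands
```

Once the printed list looks right, `dump_commands | sh` runs it.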

 

If using the dump files for the db, make sure to search and replace "GI_" with "GI:"
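
This reverse replacement can be scripted the same way. A minimal sketch, assuming "GI_" only ever appears where "GI:" was replaced earlier; the function name restore_gi is made up:

```shell
# Hypothetical helper: put the original "GI:" prefix back in a dump file.
restore_gi() {
    sed 's/GI_/GI:/g'
}

# Invented two-member cluster line to show the effect.
printf 'GI_123\tGI_456\n' | restore_gi
```

For a real file this would be run as, for example, `restore_gi < dump.features.mci.I12 > dump.I12.restored` (the output name is a placeholder).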

May want to make the mci cluster files strictly nesting (fine granularity will always conform to coarse granularity):
clm order -prefix P out.features.mci.I{12,14,20,60}

Make a tree:
clm order -o features.mcltree out.features.mci.I{12,14,20,60}

Attachment: mimb.pdf (380.19 KB)