I have a large text file containing FASTA sequences (plain text) for multiple genes. I would like to split it into multiple files, one per gene, each named after that gene's header line.

The structure of the file looks like this:

file1.txt

>PDGFRB|ENST00000522466.1
TCAGTCATCCTTTCCCTCTCTAGCCCCCTACCCTATCCCCAAGCTGAAGTGCTAGTGGCT
GGTGGTGACTTCCCCAGACCTAAGCCAATCTCTCTCTACCAGTGTCATCCATCAACGTCT
CTGTGAACGCAGTGCAGACTGTGGTCCGCCAGGGTGAGAACATCACCCTCATGTGCATTG
TGATCGGGAATGAGGTGGTCAACTTCGAGTGGACATACCCCCGCAAAGAAGTAATGTGGG
GCCAGGCAGGGGTCGGAGGAGGGGCCAGGAACGGGTGGATATCTGGCTTGCAGGCTGATT
TCTCCCCGGCCCCTCCTGATTTGGGGGGCCTGCCCAACCTGTTGCTGCAGAGTGGGCGGC
TGGTGGAGCCGGTGACTGACTTCCTCTTGGATATGCCTTACCACATCCGCTCCATC
>DGAT2|ENST00000604935.5
AGAAAGGCCGGGCGCGGCGAGGCTGGGCGCTGGGCGGCTGCGGCGCGCGGTGCGCGGTGC
GTAGTCTGGAGCTATGGTGGTGGTGGCAGCCGCGCCGAACCCGGCCGACGGGACCCCTAA
AGTTCTGCTTCTGTCGGGGCAGCCCGCCTCCGCCGCCGGAGCCCCGGCCGGCCAGGCCCT
GCCGCTCATGGTGCCAGCCCAGAGAGGGGCCAGCCCGGAGGCAGCGAGCGGGGGGCTGCC
CCAGGCGCGCAAGCGACAGCGCCTCACGCACCTGAGCCCCGAGGAGAAGGCGCTGAGGAG
GTGGGCGAGGGGCCGGGGTCTGGGGCCAGATCTGAAGCCGGGACTAGGGACAGGGGCAGG

I want two files with the following names and contents:

PDGFRB|ENST00000522466.1.txt

>PDGFRB|ENST00000522466.1
TCAGTCATCCTTTCCCTCTCTAGCCCCCTACCCTATCCCCAAGCTGAAGTGCTAGTGGCT
GGTGGTGACTTCCCCAGACCTAAGCCAATCTCTCTCTACCAGTGTCATCCATCAACGTCT
CTGTGAACGCAGTGCAGACTGTGGTCCGCCAGGGTGAGAACATCACCCTCATGTGCATTG
TGATCGGGAATGAGGTGGTCAACTTCGAGTGGACATACCCCCGCAAAGAAGTAATGTGGG
GCCAGGCAGGGGTCGGAGGAGGGGCCAGGAACGGGTGGATATCTGGCTTGCAGGCTGATT
TCTCCCCGGCCCCTCCTGATTTGGGGGGCCTGCCCAACCTGTTGCTGCAGAGTGGGCGGC
TGGTGGAGCCGGTGACTGACTTCCTCTTGGATATGCCTTACCACATCCGCTCCATC

and, DGAT2|ENST00000604935.5.txt

>DGAT2|ENST00000604935.5
AGAAAGGCCGGGCGCGGCGAGGCTGGGCGCTGGGCGGCTGCGGCGCGCGGTGCGCGGTGC
GTAGTCTGGAGCTATGGTGGTGGTGGCAGCCGCGCCGAACCCGGCCGACGGGACCCCTAA
AGTTCTGCTTCTGTCGGGGCAGCCCGCCTCCGCCGCCGGAGCCCCGGCCGGCCAGGCCCT
GCCGCTCATGGTGCCAGCCCAGAGAGGGGCCAGCCCGGAGGCAGCGAGCGGGGGGCTGCC
CCAGGCGCGCAAGCGACAGCGCCTCACGCACCTGAGCCCCGAGGAGAAGGCGCTGAGGAG
GTGGGCGAGGGGCCGGGGTCTGGGGCCAGATCTGAAGCCGGGACTAGGGACAGGGGCAGG

I tried the script below. It splits the input into records but does not save them into separate files named after the genes, and it also fails with the error 'ambiguous redirect'.

#!/bin/bash

IFS=">" read -r -d '' -a my_array < file1.txt

for element in "${my_array[@]}";
do
    gene_name=$(echo "$element" | awk '{print $1}')
    gene_name=$(echo "$gene_name" | cut -d $'\n' -f 1)
    echo "$gene_name"
    echo $"element" > $gene_name.txt
done

asked Nov 19, 2024 at 11:00 by user23441879; edited Nov 19, 2024 at 12:53 by Ed Morton
  • Always check your scripts with shellcheck, as directed by the bash tag you used: "NOTE: Do not ask a bash question until you have copy/pasted your script into shellcheck and fixed all of the issues it tells you about." – Ed Morton Commented Nov 19, 2024 at 12:54
  • I strongly recommend against using a | symbol in your file names. There are many good reasons for sticking with the portable filename character set and restricting your names to A-Za-z0-9._- Personally, I also recommend against uppercase letters, since they will burn you when you move to a case-insensitive filesystem. Putting a pipe symbol in your filenames is just asking for trouble. – William Pursell Commented Nov 19, 2024 at 13:34
  • $"element" should be "$element"; with that fix your code works for me. Although it is not a problem in this case, I'd also wrap the target file in double quotes, i.e. "$gene_name.txt", just to be safe. Invoking 4 subshells on each pass through the loop isn't very efficient, and while there are a few ways to address this in bash, I'd opt for one of the awk solutions. – markp-fuso Commented Nov 19, 2024 at 14:17
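For reference, the fixes suggested in the comments could be combined roughly as follows. This is only a sketch: the function name split_fasta_bash is made up for illustration, and it replaces the awk/cut subshells with a bash parameter expansion, which is one of several ways to get the first line of each record. It reads the whole file into memory, so the awk answers are probably still the better choice for large inputs.

```shell
#!/bin/bash
# Sketch (hypothetical helper name): the question's loop with "$element"
# quoted correctly, the target file quoted, and no per-record subshells.
split_fasta_bash() {
    local input=$1 record header
    local -a records

    # Split the whole file on ">" into an array. read returns non-zero when
    # it reaches end of file before a NUL delimiter, which is expected here.
    IFS='>' read -r -d '' -a records < "$input" || true

    for record in "${records[@]}"; do
        header=${record%%$'\n'*}        # first line of the record
        [ -n "$header" ] || continue    # skip the empty text before the first ">"
        # Put back the ">" that the split removed; the record already ends
        # with a newline, so none is added.
        printf '>%s' "$record" > "$header.txt"
    done
}
```

It would be called as split_fasta_bash file1.txt from the directory where the output files should be written.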
3 Answers

Using any awk:

$ awk -F'>' 'NF>1{ close(out); out=$2".txt" } { print > out }' file1.txt

$ head *\|*
==> DGAT2|ENST00000604935.5.txt <==
>DGAT2|ENST00000604935.5
AGAAAGGCCGGGCGCGGCGAGGCTGGGCGCTGGGCGGCTGCGGCGCGCGGTGCGCGGTGC
GTAGTCTGGAGCTATGGTGGTGGTGGCAGCCGCGCCGAACCCGGCCGACGGGACCCCTAA
AGTTCTGCTTCTGTCGGGGCAGCCCGCCTCCGCCGCCGGAGCCCCGGCCGGCCAGGCCCT
GCCGCTCATGGTGCCAGCCCAGAGAGGGGCCAGCCCGGAGGCAGCGAGCGGGGGGCTGCC
CCAGGCGCGCAAGCGACAGCGCCTCACGCACCTGAGCCCCGAGGAGAAGGCGCTGAGGAG
GTGGGCGAGGGGCCGGGGTCTGGGGCCAGATCTGAAGCCGGGACTAGGGACAGGGGCAGG

==> PDGFRB|ENST00000522466.1.txt <==
>PDGFRB|ENST00000522466.1
TCAGTCATCCTTTCCCTCTCTAGCCCCCTACCCTATCCCCAAGCTGAAGTGCTAGTGGCT
GGTGGTGACTTCCCCAGACCTAAGCCAATCTCTCTCTACCAGTGTCATCCATCAACGTCT
CTGTGAACGCAGTGCAGACTGTGGTCCGCCAGGGTGAGAACATCACCCTCATGTGCATTG
TGATCGGGAATGAGGTGGTCAACTTCGAGTGGACATACCCCCGCAAAGAAGTAATGTGGG
GCCAGGCAGGGGTCGGAGGAGGGGCCAGGAACGGGTGGATATCTGGCTTGCAGGCTGATT
TCTCCCCGGCCCCTCCTGATTTGGGGGGCCTGCCCAACCTGTTGCTGCAGAGTGGGCGGC
TGGTGGAGCCGGTGACTGACTTCCTCTTGGATATGCCTTACCACATCCGCTCCATC

Did you consider awk for this task?

awk -F'\n' -v RS='>' '
    FNR > 1 {
      outFile = $1 ".txt";
      printf("%s", RS $0) > outFile;
      close(outFile);
    }
' file1.txt

The idea is to read the input file using > as the record separator (instead of the newline character). Each record then contains the header (stripped of its leading >) on its first line and the whole sequence on the remaining lines, which makes the processing quite straightforward.

The very first record is expected to be empty (or to contain comments), so it is skipped with the condition FNR > 1.
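To see how the input is cut into records with RS set to >, a small diagnostic like the one below may help. The function name show_records is hypothetical; it only prints each record number with its first field and writes no files.

```shell
# Sketch (hypothetical helper name): list the records awk sees when ">" is
# the record separator and newline is the field separator.
show_records() {
    awk -F'\n' -v RS='>' '
        # $1 is the header line of the record, without its leading ">".
        { printf "record %d header: [%s]\n", NR, $1 }
    ' "$1"
}
```

For the sample file in the question this should report an empty record 1, then PDGFRB|ENST00000522466.1 as record 2 and DGAT2|ENST00000604935.5 as record 3. Each non-empty record keeps its trailing newline, which is why the answer writes it with printf("%s", ...) rather than print. Note that any > appearing elsewhere in the file would also start a new record.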


ASIDE

Not that it is wrong, but do you really want to keep the | in the filenames?
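If you decide to drop the |, one possible variant is sketched below. The function name split_fasta_safe and the choice of _ as the replacement character are assumptions for illustration; the header line inside each file is left unchanged.

```shell
# Sketch (hypothetical helper name): split on header lines, but build the
# file name from the portable character set A-Za-z0-9._- only.
split_fasta_safe() {
    awk '
        /^>/ {
            if (out != "") close(out)             # finish the previous file
            name = substr($0, 2)                  # header without the leading ">"
            gsub(/[^A-Za-z0-9._-]/, "_", name)    # replace anything non-portable
            out = name ".txt"
        }
        out != "" { print > out }                 # ignore lines before the first header
    ' "$1"
}
```

With the sample input, split_fasta_safe file1.txt would write PDGFRB_ENST00000522466.1.txt and DGAT2_ENST00000604935.5.txt. Be aware that two headers that differ only in replaced characters would map to the same file name, and the later one would overwrite the earlier.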

awk is the better way to do this, but if you're going to try using a while/read loop, you probably want to structure it like:

while IFS= read -r line; do
    f=${line#>}
    # A header line is one where stripping the leading ">" changed the string.
    if [ "$f" != "$line" ]; then exec > "$f.txt"; fi
    printf '%s\n' "$line"
done < input

Note that if you run this in an interactive terminal, you'll want to either run it in a subshell or follow up with exec > /dev/tty or similar, because exec redirects the standard output of the shell itself.
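One way to get the subshell behaviour is to wrap the loop in a function whose body is written with parentheses instead of braces. This is a sketch, and the name split_fasta_loop is made up for illustration.

```shell
# Sketch (hypothetical helper name): the body is in ( ), so it runs in a
# subshell and the exec redirections do not affect the calling shell.
split_fasta_loop() (
    while IFS= read -r line; do
        case $line in
            # On each header line, point stdout at a new file named after it.
            '>'*) exec > "${line#>}.txt" ;;
        esac
        printf '%s\n' "$line"
    done < "$1"
)
```

After split_fasta_loop file1.txt returns, the terminal's standard output is untouched, so there is no need for exec > /dev/tty. Any lines that appear before the first header are printed to the caller's standard output.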
