
awk and tabs in input and output

Why does awk incorrectly detect tab delimited data boundaries (input field separator)

The following command will return an empty result instead of the expected third column:

echo '1     2     3     4     5     6' | awk -F'\t' '{ print $3 }'

In this command, the -F'\t' option replaces the default FS (input field separator), which is a single space, with "\t", a tab character.

The problem is that the fields in the input are not actually tab-delimited but separated by multiple spaces. Since the line contains no tabs, awk treats the whole line as a single field, so $3 is empty.

That is, using the -F option is not necessary in the previous command:

echo '1     2     3     4     5     6' | awk '{ print $3 }'

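Even though the data is separated by multiple spaces, you don't need to specify this with the -F option, because awk interprets the input correctly by default. The default FS is a single space, which awk treats specially: fields are separated by runs of whitespace (spaces, tabs, and newlines), and leading and trailing whitespace is ignored. This is close to the regular expression [ \t]+ or, if you use POSIX classes, [[:blank:]]+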
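The default separator and an explicit [ \t]+ differ in one detail that is easy to miss: how leading and trailing whitespace is handled. The following is a minimal sketch with a made-up input line; the variable names are only for illustration.

```shell
# Made-up sample: leading spaces, mixed spaces and tabs, trailing spaces.
line=$(printf '   a \t b\t\tc  ')

# Default FS: runs of whitespace separate fields, leading/trailing runs are dropped.
default=$(printf '%s\n' "$line" | awk '{ print NF ":" $1 ":" $3 }')

# Explicit regular expression: the leading and trailing runs produce empty fields.
regex=$(printf '%s\n' "$line" | awk -F'[ \t]+' '{ print NF ":" $1 ":" $3 }')

echo "$default"   # 3:a:c
echo "$regex"     # 5::b
```

With the explicit regular expression, $1 is empty and every field index shifts by one, so the default separator is usually the safer choice for whitespace-aligned data.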

This is why, even if the data is actually tab delimited, the awk command handles it correctly:

echo '1	2	3	4' | awk '{ print $3 }'

In this case, the -F'\t' option works as expected:

echo '1	2	3	4' | awk -F'\t' '{ print $3 }'

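It should be noted that a field separator longer than one character is treated as a regular expression, while a single character other than a space (such as the tab in -F'\t') is matched literally. With -F'\t', every tab is a field boundary, so two consecutive tabs enclose an empty field. Only the default separator collapses runs of whitespace into a single split between two adjacent fields. To get the same behavior for tabs, use a regular expression such as -F'\t+'.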
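How repeated delimiters are handled is easy to check on a short line. The sketch below uses a made-up input with two consecutive tabs and compares a single-character separator with a regular expression; printf is used so that the tabs are explicit.

```shell
# Made-up sample: two consecutive tabs between "a" and "b".
line=$(printf 'a\t\tb\tc')

# Single-character FS: each tab is a boundary, so $2 is an empty field.
strict=$(printf '%s\n' "$line" | awk -F'\t' '{ print NF ":" $2 ":" $3 }')

# Regular-expression FS: a run of tabs counts as one boundary.
collapsed=$(printf '%s\n' "$line" | awk -F'\t+' '{ print NF ":" $2 ":" $3 }')

echo "$strict"      # 4::b
echo "$collapsed"   # 3:b:c
```

Keeping empty fields is usually what you want for real tab-separated data, because a missing value must not shift the remaining columns.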

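To check which non-printable characters are present in your input, use cat -A (an option of GNU cat). It shows every tab as ^I and marks the end of each line with $, while spaces are printed as they are. For example:

echo '1	2	3    4' | cat -A
1^I2^I3    4$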

How to make awk output fields separated by tabs

The following command will output the third and fourth columns separated by a space:

echo '1	2	3	4	5' | awk '{ print $3,$4 }'
3 4

If you want the output data to be separated by tabs (or any other character), then it must be set as the value of OFS (output field separator). For example:

echo '1	2	3	4	5' | awk 'BEGIN {OFS="\t"}; { print $2,$3 }'
2	3

OFS is inserted only between expressions separated by commas. Expressions written next to each other without a comma are concatenated, so the following command prints neither a tab nor a space between the fields (the output is 23):

echo '1	2	3	4	5' | awk 'BEGIN {OFS="\t"}; { print $2 $3 }'

Instead of changing the OFS (output field separator) value, you can write a tab character directly in the print statement. For example, the following command uses the standard OFS (that is, a space) to separate the second and third fields, and inserts a tab character between the third and fourth fields:

echo '1	2	3	4	5' | awk '{ print $2,$3"\t"$4 }'
2 3	4
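Two related idioms are worth knowing: converting a whole line to tab-separated form, and using printf for full control over the layout. This is a minimal sketch with made-up input, not the only way to do it.

```shell
# Made-up sample: fields separated by a varying number of spaces.
input='1   2  3 4'

# Assigning a field to itself makes awk rebuild $0 with OFS between all fields.
tsv=$(printf '%s\n' "$input" | awk -v OFS='\t' '{ $1 = $1; print }')

# printf ignores OFS entirely; the format string defines the separators.
pair=$(printf '%s\n' "$input" | awk '{ printf "%s\t%s\n", $2, $4 }')

echo "$tsv"    # 1, 2, 3 and 4 separated by single tabs
echo "$pair"   # 2 and 4 separated by a tab
```

The -v OFS='\t' form is equivalent to setting OFS in a BEGIN block; it just keeps the program text shorter.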

How to split a large file (text or binary) into smaller files (SOLVED)

split command instructions

I have a large (by the number of lines) text file that I would like to split into smaller files, also by the number of lines. So if my file has about 2 million lines, I would like to split it into 10 files containing 200k lines, or 100 files containing 20k lines (plus one file with the remainder; evenness of division does not matter).

To do this, you could write a script in PHP or Python, but if you work in a shell such as Bash, you can use the ready-made split utility, which splits both text and binary files into pieces of a specified size. A text file can also be split into files that each contain a given number of lines. This article will show you how to use the split command.

How to split a text file into files with a certain number of lines

To split the file by the number of lines, run a command like this:

split -l NUMBER FILE

For instance:

split -l 200000 filename

will create files, each with 200,000 lines named xaa xab xac …
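To see the naming and the remainder file without a multi-million-line input, the same command can be tried on a tiny file. In this sketch big.txt and the part_ prefix are made-up names, and everything happens in a temporary directory.

```shell
# Work in a throwaway directory so the pieces are easy to remove afterwards.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up sample: 25 numbered lines standing in for a large file.
seq 1 25 > big.txt

# 10 lines per piece; "part_" replaces the default "x" prefix.
split -l 10 big.txt part_

# The glob expands in sorted order, which is also the order of the pieces.
pieces=$(echo part_*)
last_count=$(wc -l < part_ac | tr -d ' ')

echo "$pieces"       # part_aa part_ab part_ac
echo "$last_count"   # 5
```

The last piece holds the 5 lines left over after two full pieces of 10 lines.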

How to split files into volumes of a certain size

If you want to split files by size, use the -C option. It puts as many complete lines into each output file as fit within the given size, so the pieces end at line boundaries and lines are not cut in the middle:

split -C 20m --numeric-suffixes input_filename output_prefix

This command creates files of the form output_prefix00 output_prefix01 output_prefix02 … each with a maximum size of 20 megabytes.
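The effect of -C is easier to see with a size of a few bytes. The sketch below assumes GNU split (the -C and -d options) and uses made-up names, sample.txt and chunk_, in a temporary directory.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up sample: ten lines of 10 bytes each (9 characters plus a newline).
for i in 0 1 2 3 4 5 6 7 8 9; do echo "line-000$i"; done > sample.txt

# At most 35 bytes per piece: only three whole 10-byte lines fit.
split -C 35 -d sample.txt chunk_

first_lines=$(wc -l < chunk_00 | tr -d ' ')
count=$(ls chunk_* | wc -l | tr -d ' ')

echo "$first_lines"   # 3
echo "$count"         # 4
```

Each piece is 30 bytes rather than 35, because split stops at the last line that still fits; the fourth piece holds the one remaining line.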

The split command usually works on lines of input (that is, on a text file). With the -b option, split treats the file as binary input and ignores line boundaries. You specify the size of the output files along with the prefix for their names. The -d option gives the output files numeric suffixes (*.00, *.01, *.02, etc.) instead of the default alphabetic ones (*.aa, *.ab, *.ac, etc.). The -a option specifies the length of the suffix. The command looks like this:

split -d -a NUMBER -b SIZE FILE PREFIX

where NUMBER is the length of the extension (or suffix) that we will use, and SIZE is the size of the resulting files with a unit modifier (K, M, G, etc.). For example, the following command divides a disk image into 4 GB files (the last file holds whatever remains, so it is smaller unless the image size is an exact multiple of the size you choose):

split -d -a 3 -b 4G case1.disk1.raw case1.disk1.split.

This will create a group of files, each up to 4 GB in size, named with the prefix case1.disk1.split. as specified in the command, followed by 000, 001, 002, etc. The -a 3 option makes the suffix exactly 3 digits long. Without -a 3, the suffixes would be 00, 01, 02, and so on. Notice the trailing dot in the output prefix. It is there so that the suffix reads as a file extension (case1.disk1.split.000) rather than being appended directly to the end of the name (case1.disk1.split000).

How to assemble a file divided into parts into one file

The process can be reversed. To merge the split image, use the cat command and redirect the output to a new file. Remember that cat simply prints the specified files to standard output one after another, so redirecting this output concatenates the pieces into one file.

cat case1.disk1.split* > case1.disk1.new.raw

In the command above, we reassembled the split pieces into a new 80GB image file. The original split files are not deleted.
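It is worth checking that the reassembled file is identical to the original before deleting anything. Below is a minimal round-trip sketch on a small random file; the names image.raw, image.split. and image.new.raw are made up, and cmp is just one way to compare (a checksum tool would work as well).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up stand-in for a disk image: 2500 random bytes.
head -c 2500 /dev/urandom > image.raw

# 1000-byte pieces with 3-digit numeric suffixes: .000, .001, .002
split -d -a 3 -b 1000 image.raw image.split.

# The glob expands in sorted order, which matches the order of the pieces.
cat image.split.* > image.new.raw

# cmp -s prints nothing and exits with 0 only if the files are identical.
if cmp -s image.raw image.new.raw; then
    status=identical
else
    status=different
fi
echo "$status"
```

The fixed-width suffix is what makes the glob safe: with suffixes of equal length, the alphabetical order of the file names is the same as the numeric order of the pieces.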