How to split a large file (text or binary) into smaller files (SOLVED)
February 20, 2021
split command instructions
I have a large (by the number of lines) text file that I would like to split into smaller files, also by the number of lines. So if my file has about 2 million lines, I would like to split it into 10 files containing 200k lines, or 100 files containing 20k lines (plus one file with the remainder; evenness of division does not matter).
To do this, you can write a script in PHP or Python, but if you are using Bash, then you can use the ready-made split utility, which can split both text and binary files into pieces of a specified size. If it is a text file, then you can split a large file into files of equal size with a certain number of lines. This article will show you how to use the split command.
How to split a text file into files with a certain number of lines
To split the file by the number of lines, run a command like this:
split -l NUMBER FILE
split -l 200000 filename
For example, the command with 200000 creates files of 200,000 lines each, named xaa, xab, xac, and so on; the last file holds whatever lines remain.
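To see the line-based split on a small scale, here is a minimal sketch. It assumes GNU coreutils, and the names workdir and sample.txt are made up for illustration. It builds a 25-line file and splits it into pieces of at most 10 lines, so the pieces should hold 10, 10, and 5 lines.

```shell
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a 25-line sample file (hypothetical name).
seq 1 25 > sample.txt

# Split into pieces of at most 10 lines each; the default output prefix is "x".
split -l 10 sample.txt

# Show how many lines each piece received.
wc -l xaa xab xac
```

The same command works unchanged on a file with millions of lines; only the number after -l needs to change.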
How to split files into volumes of a certain size
If you want to split a file by size, use the -C option. Each output file is filled with whole lines up to the size limit, so no line is broken across two files:
split -C 20m --numeric-suffixes input_filename output_prefix
This command creates files named output_prefix00, output_prefix01, output_prefix02, and so on (numeric suffixes start at 00), each with a maximum size of 20 megabytes.
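The effect of -C is easier to see with a tiny size limit. The sketch below assumes GNU split; the names input.txt and part_ are hypothetical. Every line is exactly 10 bytes long, and the limit is 255 bytes, so each piece should stop after 25 whole lines (250 bytes) instead of cutting the 26th line in half.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 100 lines of 10 bytes each ("line NNNN" plus newline) = 1000 bytes in total.
for i in $(seq 1 100); do
    printf 'line %04d\n' "$i"
done > input.txt

# At most 255 bytes per piece, but only complete lines are written.
split -C 255 --numeric-suffixes input.txt part_

# Inspect the sizes and the last line of the first piece.
wc -c part_*
tail -n 1 part_00
```

With -b 255 in place of -C 255, the pieces would be exactly 255 bytes and lines would be cut wherever the boundary happened to fall.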
By default, the split command works on lines of input (that is, on a text file). With the -b option, split treats the file as binary input: it cuts at exact byte counts and ignores line boundaries. We can specify the size of the output files along with the prefix we want for their names. The -d option gives the output files numeric suffixes (*.00, *.01, *.02, etc.) rather than the default alphabetic ones (*.aa, *.ab, *.ac, etc.). The -a option specifies the length of the suffix. The command looks like this:
split -d -a NUMBER -b SIZE FILE_FOR_SPLITTING OUTPUT_PREFIX
where NUMBER is the length of the extension (or suffix) that we will use, and SIZE is the size of the resulting files with a unit modifier (K, M, G, etc.). For example, to divide a disk image into 4 GB files, use the following command (the last file holds whatever remains, so it is smaller than the others unless the image size is an exact multiple of the size you choose):
split -d -a 3 -b 4G case1.disk1.raw case1.disk1.split.
This creates a group of files, each 4 GB in size except possibly the last, named with the prefix case1.disk1.split. as specified in the command: case1.disk1.split.000, case1.disk1.split.001, case1.disk1.split.002, etc. The -a 3 option makes the suffix exactly 3 digits long. Without -a 3, the files would end in 00, 01, 02, and so on. Notice the trailing dot in the output prefix. We add it so that the suffix becomes a file extension rather than being appended directly to the end of the name.
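The naming scheme can be checked without a multi-gigabyte image. The following sketch assumes GNU split and uses a hypothetical 2500-byte file, demo.raw, as a stand-in for a disk image; the 1000-byte piece size is chosen only to keep the example small.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A 2500-byte stand-in for a disk image (hypothetical name, random content).
head -c 2500 /dev/urandom > demo.raw

# 1000-byte pieces, numeric suffixes, three digits wide; note the trailing dot in the prefix.
split -d -a 3 -b 1000 demo.raw demo.split.

# List the pieces: two full pieces and one holding the remaining 500 bytes.
ls -l demo.split.*
```

Because 2500 is not a multiple of 1000, the last piece is smaller than the others, just as the last piece of a disk image would be.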
How to assemble a file divided into parts into one file
The process can be reversed. If we want to merge the split image, we can use the cat command and redirect the output to a new file. Remember that cat simply prints the specified files to standard output, one after another. The shell expands the wildcard in sorted order, so with fixed-width suffixes the pieces arrive in the right sequence, and redirecting the output concatenates them into one file.
cat case1.disk1.split* > case1.disk1.new.raw
In the command above, we reassembled the split pieces into a new 80GB image file. The original split files are not deleted.
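It is worth confirming that the reassembled file matches the original. The sketch below is one way to do that, assuming GNU coreutils; the file names and the small sizes are illustrative only. It splits a sample file, joins the pieces with cat, and compares the result with cmp, which succeeds only when the two files are byte-for-byte identical.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create and split a small stand-in image (hypothetical names).
head -c 2500 /dev/urandom > demo.raw
split -d -a 3 -b 1000 demo.raw demo.split.

# Join the pieces; the glob expands in sorted order (000, 001, 002).
cat demo.split.* > demo.new.raw

# cmp -s prints nothing and reports the outcome through its exit status.
if cmp -s demo.raw demo.new.raw; then
    result="identical"
else
    result="different"
fi
echo "$result"
```

For very large images, comparing checksums of the original and the reassembled file (for example with sha256sum) is a common alternative when the two files are not on the same machine.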