Removing duplicate lines from a file by hand is not an efficient way to manage datasets. Cleaning up your datasets is better automated to save time and valuable resources. This is where the uniq command in Linux is very helpful. Paired with other commands such as sort, it lets you do a lot of data processing.
What is uniq Command in Linux?
The uniq command is a tool that filters adjacent matching lines from standard input or a file and writes the result to standard output. Take note that it only detects duplicate lines that are adjacent to each other, so you usually have to sort your input first for the command to catch every duplicate.
You can use uniq to clean up datasets before running a statistical analysis tool or to store your documents more compactly. For better results, run sort on the file content before uniq, and again afterwards if you need the output in a particular order. With this method, you get more accurate data-processing results.
Default uniq Syntax
The uniq syntax is the command itself, followed by any options, the input file (where to read from), and the output file (where to write the results). Both file arguments are optional. It is pretty straightforward unless you plan to combine it with more elaborate commands.
$ uniq [options] [<input file> [<output file>]]
By default, uniq writes its results to standard output. You can use > to redirect the result into a file for later access. Results also vary depending on the options and other commands applied.
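As a rough sketch of the syntax above, the snippet below builds a small sample file and saves the filtered result both ways: through the optional output-file argument and through > redirection. The file names fruits.txt, by_argument.txt, and by_redirect.txt are made up for this illustration.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data with adjacent duplicates.
printf '%s\n' apple apple banana cherry cherry > fruits.txt

# Form 1: name the output file as the second argument.
uniq fruits.txt by_argument.txt

# Form 2: let uniq print to standard output and redirect it with '>'.
uniq fruits.txt > by_redirect.txt

# Both files now hold the same three lines: apple, banana, cherry.
cat by_redirect.txt
```

Either form should work for saving results; redirection is the more common habit because it behaves the same way for every command.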
Relevant Options for uniq Command in Linux
Options for the uniq command mostly affect which lines appear in the output. When no options are included, uniq removes every adjacent duplicate and keeps only the first instance of each repeated line.
-c: Prefixes each output line with the number of times it occurred
-d: Only shows the duplicated lines, once per group (not all instances)
-D: Shows all instances of duplicated lines
-f N: Skips the first N fields (whitespace-separated words) of each line when comparing
-s N: Same as -f, but skips the first N characters instead of fields
-i: Ignores case when comparing lines
-u: Only displays unique lines (omitting duplicated lines altogether)
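To make the difference between -f and -s more concrete, here is a small sketch. The log-style lines and the file name events.txt are invented for this illustration; each line starts with a time field followed by a message.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data: a time stamp field, then a message.
printf '%s\n' \
  '09:01 disk check' \
  '09:02 disk check' \
  '09:03 backup' > events.txt

# Plain uniq keeps all three lines because the time stamps differ.
plain=$(uniq events.txt)

# -f 1 skips the first whitespace-separated field when comparing,
# so the two "disk check" lines now count as duplicates.
by_field=$(uniq -f 1 events.txt)

# -s 5 skips the first five characters (the "HH:MM" stamp) instead.
by_chars=$(uniq -s 5 events.txt)

printf '%s\n' "$by_field"
```

In both cases the skipped part is only ignored for the comparison; the first line of each group is still printed in full.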
Example Uses of uniq Command
Below are some of the fundamental uses of the uniq command. These examples are only basic; you can combine uniq with other commands for more elaborate tasks. Paired with another command-line utility, uniq becomes a powerful tool when used correctly.
Let’s take a look at how the uniq command works through the example below. David Bowie’s “Space Oddity” repeats some of its lyrics line by line. We use the -D option to display every instance of a duplicated line. -D is different from -d because it shows all copies of the duplicated lines, while the latter shows each duplicated line only once.
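A minimal sketch of the difference between -D and -d, using a stand-in file. Only the line “Can you hear me, Major Tom?” comes from this guide; the other two lines are placeholders, and lyrics.txt is an assumed file name. -D is not part of the POSIX specification, so this sketch assumes GNU coreutils.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in file: one repeated line surrounded by placeholder lines.
printf '%s\n' \
  'placeholder opening line' \
  'Can you hear me, Major Tom?' \
  'Can you hear me, Major Tom?' \
  'Can you hear me, Major Tom?' \
  'placeholder closing line' > lyrics.txt

# -D prints every line that belongs to a run of duplicates (3 lines here).
all_copies=$(uniq -D lyrics.txt)

# -d prints one representative line per run of duplicates (1 line here).
one_copy=$(uniq -d lyrics.txt)

printf 'uniq -D:\n%s\n\nuniq -d:\n%s\n' "$all_copies" "$one_copy"
```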
2. Count Number of Repeated Lines
Expanding on the example above, you can also use the -c option to list every line in the file together with its count (even the lines which are not duplicated). Using the same lyrics, we see at a glance that all other lines have one occurrence while the “Can you hear me, Major Tom?” line has three.
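Counting can be sketched with the same kind of stand-in file. As before, the placeholder lines and the name lyrics.txt are assumptions. The width of the padding in front of each count differs between uniq implementations, so the sketch reads the count with awk instead of relying on exact spacing.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' \
  'placeholder opening line' \
  'Can you hear me, Major Tom?' \
  'Can you hear me, Major Tom?' \
  'Can you hear me, Major Tom?' \
  'placeholder closing line' > lyrics.txt

# -c prefixes each output line with how many times it occurred in a row.
counted=$(uniq -c lyrics.txt)
printf '%s\n' "$counted"

# The count is the first whitespace-separated column, so awk can pull it out.
major_tom_count=$(printf '%s\n' "$counted" | awk '/Major Tom/ {print $1}')
printf 'Repeated line count: %s\n' "$major_tom_count"
```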
3. Using uniq with sort
sort can make data cleaning and processing easier, especially when combined with uniq and other text-processing commands. The example below shows a simple use of sort. Listed are some of the ingredients for a BBQ pork roast. As seen below, some items such as onions and ketchup are duplicated but do not sit on adjacent lines.
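A list of this kind can be sketched as follows. Only onions and ketchup come from the description above; the remaining entries and the file name ingredients.txt are assumptions for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical ingredient list: "onions" and "ketchup" each appear twice,
# but never on neighbouring lines.
printf '%s\n' onions ketchup 'pork shoulder' onions 'brown sugar' ketchup > ingredients.txt

# uniq only compares neighbouring lines, so nothing is removed here.
unsorted_result=$(uniq ingredients.txt)
printf '%s\n' "$unsorted_result"
```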
Running uniq on the example above will not give you the intended result, because it treats every line as “unique” when no two identical lines are adjacent. Using the sort command before uniq is a simple way to fix this problem.
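Here is a minimal sketch of that fix, again with an assumed ingredients.txt whose non-duplicated entries are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' onions ketchup 'pork shoulder' onions 'brown sugar' ketchup > ingredients.txt

# Sorting first puts identical lines next to each other,
# so uniq can then collapse them.
deduplicated=$(sort ingredients.txt | uniq)
printf '%s\n' "$deduplicated"
```

When no uniq options are needed, sort -u gives the same result in a single command; the pipe is still the form to use once you want options such as -c or -d.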
In this example, we used sort and piped the result to uniq. This lets uniq process the sorted content directly, all in a single command line. Since the command doesn’t specify additional options, the duplicated lines are simply reduced to one copy each.
4. Skip Characters & Ignore Cases
Lists can become cluttered with additional information which may or may not have a purpose (such as numbers, codes, etc.). However, removing this data manually is not optimal and might affect the integrity of your dataset. When the extra information only gets in the way of spotting duplicates, your best course of action is to have uniq ignore those words or characters.
You can do that with uniq -f, which skips leading fields, or -s, which skips leading characters. Let’s use Taylor Swift’s album tracklist for the example below. In this example, every line starts with a track number.
A simple solution is to use -s to skip the first 2 characters of each line when comparing, which removes the duplicate entries. You can then redirect the result into another file and use sed to remove the numbers. This leaves you with the raw data without duplicates or extra numbers.
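The two steps can be sketched as below. The titles are placeholders rather than a real tracklist, and the file names tracklist.txt, deduplicated.txt, and titles.txt are assumptions. The sketch also assumes each line starts with a single digit and a space, so that the prefix is exactly two characters wide.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Placeholder titles; the "N " prefix is exactly two characters wide.
printf '%s\n' \
  '1 Track Alpha' \
  '2 Track Alpha' \
  '3 Track Beta' \
  '4 Track Gamma' \
  '5 Track Gamma' > tracklist.txt

# Ignore the first 2 characters when comparing, then save the result.
uniq -s 2 tracklist.txt > deduplicated.txt

# Strip the leading number and the space after it.
sed 's/^[0-9][0-9]* //' deduplicated.txt > titles.txt
cat titles.txt
```

A fixed -s 2 stops lining up once the numbering reaches two digits. In that case -f 1, which skips the whole first field, is likely the more robust choice.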
Adjustments are also needed for entries that differ only in letter case. By default, uniq treats such lines as unique even though they have the same content (such as resPeCt and ResPECt), so running plain uniq on them won’t remove anything. You need to use the -i option to disregard case completely.
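A short sketch of the case problem, using the two spellings mentioned above. The file name words.txt is an assumption.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The two spellings from the text, on neighbouring lines.
printf '%s\n' resPeCt ResPECt > words.txt

# Case-sensitive comparison: both lines are kept.
case_sensitive=$(uniq words.txt)

# -i ignores case, so only the first of the two lines is kept.
case_insensitive=$(uniq -i words.txt)

printf 'without -i:\n%s\n\nwith -i:\n%s\n' "$case_sensitive" "$case_insensitive"
```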
5. Show Unique Lines Only
And lastly, you can use the -u option to leave duplicated lines out of the output completely. Even the first instance of a repeated line is omitted when the -u option is applied. This option is very useful if there is a persistent duplicate line that somehow keeps getting included in your datasets.
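As a small sketch of this behavior, with made-up sample data in an assumed file called items.txt:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' alpha beta beta gamma > items.txt

# -u prints only lines that are not repeated; every copy of "beta" is dropped.
only_unique=$(uniq -u items.txt)
printf '%s\n' "$only_unique"

# The input file itself is left untouched; uniq only filters what it prints.
```

Here the output contains alpha and gamma only. To keep the filtered result, redirect it to a new file rather than back onto the input.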
And that’s all for the uniq command in Linux. It doesn’t offer options as complex and customizable as those of a text processor such as sed. But thanks to its simplicity, it is easy to use, especially when integrated into automation for massive datasets.
If you want to know more about other related commands such as sed and sort, you can check out our other Linux Command guides. You can also check our various installation guides within the How-To section. Thank you for reading, and wishing you good luck!