
How to Use uniq Command in Linux

Removing duplicate lines from a file manually is not an efficient way to manage datasets. Automating dataset cleanup saves time and valuable resources. This is where the uniq command in Linux is very helpful. It can handle a lot of data processing when paired with other commands like sort.

What is uniq Command in Linux?

The uniq command is a tool that filters adjacent matching lines from standard input or a file and writes the result to standard output. Note that it only detects duplicate lines that are adjacent to each other, so you usually have to sort your data first for the command to catch every duplicate.
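To make the adjacency rule concrete, here is a minimal sketch. The input is made up for illustration, and uniq reads it from standard input through a pipe.

```shell
# Made-up input: "apple" occurs three times, but only the last two are adjacent.
result=$(printf 'apple\nbanana\napple\napple\n' | uniq)

# Print what uniq kept.
printf '%s\n' "$result"
```

The adjacent pair at the end collapses into one line, while the first "apple", which is separated from the others by "banana", is kept. Three lines remain in the output.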

You can use uniq to clean up datasets before running a statistical analysis tool or to store your documents more compactly. For better results, sort the file content before running uniq, and sort again afterward if you need the output in a particular order. With this method, you get more accurate data-processing results.

Default uniq Syntax

The basic uniq syntax is the command itself, followed by any options, the input file (where to read from), and the output file (where to write the results). Both file arguments are optional. It is straightforward on its own; things only get more involved when you combine it with more elaborate commands such as sed.

$ uniq [options] [input file] [output file]

By default, uniq writes its results to standard output. You can also use > to redirect the result into a file for later access. Results vary depending on the options and other commands applied.
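The two ways of saving the result can be sketched as follows. The file names and contents are hypothetical and live in a temporary directory so nothing else is touched.

```shell
# Hypothetical sample file in a scratch directory.
demo_dir=$(mktemp -d)
printf 'red\nred\nblue\n' > "$demo_dir/colors.txt"

# Form 1: name the output file as the second argument.
uniq "$demo_dir/colors.txt" "$demo_dir/colors_arg.txt"

# Form 2: let uniq write to standard output and redirect it with >.
uniq "$demo_dir/colors.txt" > "$demo_dir/colors_redirect.txt"

# Both files now hold the same de-duplicated content.
cat "$demo_dir/colors_arg.txt"
```

Either form should produce the same file. The redirect form is the one you will see most often, because it works the same way when uniq sits at the end of a pipeline.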

Relevant Options for uniq Command in Linux

Options for the uniq command mostly affect what is shown in the output. When no options are included, uniq removes adjacent duplicate lines, keeping only the first instance of each.

-c: Prefixes each line with the number of times it occurs

-d: Only shows the duplicated lines, once each (not all instances)

-D: Shows all instances of duplicated lines

-f N: Skips the first N fields (whitespace-separated words) when comparing lines

-s N: Same as -f, but skips the first N characters instead of fields

-i: Ignores case when comparing lines

-u: Only displays unique lines (omitting duplicated lines altogether)
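As a small illustration of field skipping with -f, the sketch below uses made-up log lines whose first field is a timestamp. The log format is an assumption for the example, not something uniq requires.

```shell
# Made-up log lines: the first field (a timestamp) differs on every line.
deduped=$(printf '10:01 disk full\n10:02 disk full\n10:03 login ok\n' | uniq -f 1)

# With -f 1 the timestamp is ignored during comparison,
# so the two "disk full" lines count as duplicates.
printf '%s\n' "$deduped"
```

Note that skipped fields are only ignored for the comparison. The line that is printed is still the complete first line of each group, timestamp included.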

Example Uses of uniq Command

Below are some fundamental uses of the uniq command. These examples are basic; you can combine uniq with other commands for more elaborate tasks. Paired with another command-line utility such as sed, uniq becomes a powerful tool.

1. Quick Display of Repeated Lines

Let’s look at how the uniq command works using the example below. David Bowie’s “Space Oddity” repeats some lines of its lyrics. We use the -D option to display every instance of each duplicated line. -D differs from -d in that it prints all copies of a duplicated line, while -d prints only one copy per group.

[Image: uniq command in Linux]
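For readers without the screenshot, the difference between -D and -d can be sketched like this. The sample file is a stand-in: only the repeated line comes from this guide, and the other two lines are placeholders. Also note that -D is available in GNU coreutils; other uniq implementations may not support it.

```shell
# Stand-in lyrics file: placeholder lines around one repeated line.
demo_dir=$(mktemp -d)
printf '%s\n' \
  'first verse line' \
  'Can you hear me, Major Tom?' \
  'Can you hear me, Major Tom?' \
  'Can you hear me, Major Tom?' \
  'last verse line' > "$demo_dir/lyrics.txt"

# -D prints every copy of each duplicated line.
all_copies=$(uniq -D "$demo_dir/lyrics.txt")
printf '%s\n' "$all_copies"

# -d prints one copy per group of duplicated lines.
one_copy=$(uniq -d "$demo_dir/lyrics.txt")
printf '%s\n' "$one_copy"
```

With this input, -D prints the repeated line three times and -d prints it once. The placeholder lines appear in neither output because they are not duplicated.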

2. Count Number of Repeated Lines

Building on the example above, you can use the -c option to list every line in the file (including lines that are not duplicated), each prefixed with the number of times it occurs. Using the same lyrics, we can see at a glance that the “Can you hear me, Major Tom?” line occurs three times while all other lines occur once.

[Image: uniq command with -c option]
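A minimal sketch of counting with -c follows, again on a stand-in lyrics file with placeholder lines. The second command is a common idiom rather than something from the original example: it sorts first so that non-adjacent repeats are counted together, then ranks the lines by frequency.

```shell
# Stand-in lyrics file: placeholder lines around one repeated line.
demo_dir=$(mktemp -d)
printf '%s\n' \
  'first verse line' \
  'Can you hear me, Major Tom?' \
  'Can you hear me, Major Tom?' \
  'Can you hear me, Major Tom?' \
  'last verse line' > "$demo_dir/lyrics.txt"

# Prefix each line with the number of adjacent occurrences.
counts=$(uniq -c "$demo_dir/lyrics.txt")
printf '%s\n' "$counts"

# Sort, count, then order by count (highest first).
ranked=$(sort "$demo_dir/lyrics.txt" | uniq -c | sort -rn)
printf '%s\n' "$ranked"
```

The count appears at the start of each line. The amount of padding before the number can differ between uniq implementations, so avoid relying on exact column positions when parsing this output.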

3. Using uniq with sort

sort makes data cleaning and processing easier, especially alongside sed, uniq, and other text-processing commands. The example below shows a simple use. It lists some of the ingredients for a BBQ pork roast. Some items, such as onions and ketchup, appear more than once but not on adjacent lines.

[Image: bbq recipe]

Running uniq on the list above will not give the intended result, because uniq treats a line as a duplicate only when it matches the line directly before it. Sorting the file before running uniq is a simple way to fix this.

In this example, we ran sort and piped the result to uniq. This lets uniq process the sorted lines in a single command. Since no additional options are given, each group of duplicated lines is reduced to one copy.

[Image: uniq command with sort]
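The sort-then-uniq pipeline can be sketched as below. The ingredient list is a made-up stand-in for the one in the screenshot; only onions and ketchup being repeated on non-adjacent lines is taken from this guide.

```shell
# Made-up ingredient list with non-adjacent repeats.
demo_dir=$(mktemp -d)
printf '%s\n' \
  'onions' \
  'ketchup' \
  'pork shoulder' \
  'onions' \
  'brown sugar' \
  'ketchup' > "$demo_dir/ingredients.txt"

# uniq alone changes nothing here: no two adjacent lines match.
plain=$(uniq "$demo_dir/ingredients.txt")
printf '%s\n' "$plain"

# Sorting first groups the repeats so uniq can collapse them.
cleaned=$(sort "$demo_dir/ingredients.txt" | uniq)
printf '%s\n' "$cleaned"
```

The first command prints all six lines unchanged, while the pipeline prints four distinct ingredients in sorted order. The original line order is lost in the process, which is the trade-off of this approach.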

4. Skip Characters & Ignore Cases

Lists can become cluttered with extra information that may or may not serve a purpose (such as numbers or codes). Removing this data manually is slow and risks damaging the integrity of your dataset. Often the better approach is to have uniq ignore those words or characters when comparing lines.

You can do that with uniq -s or -f. Let’s use Taylor Swift’s album tracklist for the example below. In this example, every line starts with a track number.

[Image: tracklist]

A simple solution is to use -s to skip the first 2 characters of each line when comparing, so that entries differing only in their number are treated as duplicates and removed. You can then write the result to another file and use sed to strip the numbers. This leaves you with the raw data, free of duplicates and extra numbers.

[Image: uniq with sed]
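Here is a minimal sketch of the -s and sed combination. The track titles and file names are hypothetical, and the numbering format (a single digit followed by a space) is an assumption chosen so that the prefix is exactly 2 characters long.

```shell
# Hypothetical numbered tracklist: one digit plus a space before each title.
demo_dir=$(mktemp -d)
printf '%s\n' \
  '1 Track Alpha' \
  '2 Track Alpha' \
  '3 Track Beta' > "$demo_dir/tracklist.txt"

# Skip the first 2 characters when comparing, and save the result.
uniq -s 2 "$demo_dir/tracklist.txt" > "$demo_dir/deduped.txt"

# Strip the leading number and the space after it.
sed 's/^[0-9][0-9]* //' "$demo_dir/deduped.txt" > "$demo_dir/titles.txt"

cat "$demo_dir/titles.txt"
```

A fixed character count stops working once the list reaches two-digit numbers, because the prefix grows to 3 characters. If the number is always a separate word, skipping one field with -f 1 may be the more robust choice.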

Entries that differ only in letter case also need special handling. As seen below, plain uniq does not treat words with different cases as duplicates. You need the -i option to ignore case completely.

Without the -i option, uniq treats both lines as unique even though they contain the same word (such as the resPeCt and ResPECt below).

[Image: uniq command with different cases]
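The effect of -i can be sketched with the two spellings mentioned above, fed to uniq through a pipe.

```shell
# Same word, different letter case on each line.
case_sensitive=$(printf 'resPeCt\nResPECt\n' | uniq)
printf '%s\n' "$case_sensitive"

# -i ignores case during comparison, so the lines count as duplicates.
case_insensitive=$(printf 'resPeCt\nResPECt\n' | uniq -i)
printf '%s\n' "$case_insensitive"
```

Without -i both lines are printed. With -i only the first spelling is printed, because uniq keeps the first line of each group of matches.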

5. Show Unique Lines Only

Lastly, you can use the -u option to print only the lines that are not repeated. Every copy of a duplicated line is dropped from the output, including the first instance. The input file itself is not modified. This option is very useful if a persistently duplicated line somehow ends up in your datasets.

[Image: uniq command with -u]
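To show how -u differs from the default behavior, here is a short sketch on made-up input.

```shell
# Made-up input: "beta" is duplicated, the other lines are not.
default_result=$(printf 'alpha\nbeta\nbeta\ngamma\n' | uniq)
printf '%s\n' "$default_result"

# -u drops every copy of a duplicated line, including the first.
unique_only=$(printf 'alpha\nbeta\nbeta\ngamma\n' | uniq -u)
printf '%s\n' "$unique_only"
```

The default run keeps one "beta", while the -u run leaves only "alpha" and "gamma". As with the other options, the duplicates must be adjacent for uniq to notice them.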

And that’s all for the uniq command in Linux. It doesn’t offer the complex, customizable options of a text processor such as sed. But its simplicity makes it easy to use, especially in automated pipelines for massive datasets.

If you want to know more about other related commands such as sed and sort, you can check out our other Linux Command guides. You can also check our various installation guides within the How-To section. Thank you for reading, and wishing you good luck!
