How to remove duplicate lines from a file?
Jul 15, 2025, 01:25 AM
When removing duplicate lines from a file, pay attention to key points such as preserving order and handling large files. 1. The combination of sort and uniq deduplicates quickly, but it changes the original order; 2. If the original order must be preserved, the awk command can do it; 3. For large files, you can use chunked processing, a database import, or a memory-optimized script; 4. Python scripts suit medium-sized files and allow more customization; 5. Back up the file and check for hidden characters before deduplicating. Choose the method that fits your specific needs.
Removing duplicate lines from a file is not difficult, but there are a few key points to watch. The most direct way is to use command-line tools, such as the combination of sort and uniq on Linux or macOS, or to write a simple script. If you just want to quickly remove exact duplicate lines, it doesn't need to be complicated; but if you need to preserve order, handle large files, or match lines partially, you have to choose the right method.
Quickly deduplicate with the command line
This is the most common and fastest way, and it suits most text files. The basic idea is to sort first and then merge duplicates:
sort filename.txt | uniq > output.txt
- sort sorts the lines alphabetically so that identical lines end up next to each other.
- uniq merges adjacent duplicate lines into a single line.
- The result is saved to output.txt, and the original file remains unchanged.
Note: This method changes the original order of the lines. If you need to preserve the original order, you can't use it directly.
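To see why the order changes, here is a minimal Python sketch of the same sort-then-merge idea. The function name sort_uniq and the sample lines are made up for illustration; it only mimics the approach and does not reproduce the exact behavior of the real tools (for instance, the real sort orders lines according to the locale).

```python
from itertools import groupby


def sort_uniq(lines):
    """Sort the lines, then keep one line from each run of adjacent duplicates."""
    # groupby only groups *adjacent* equal items, which is why sorting comes first
    return [line for line, _ in groupby(sorted(lines))]


sample = ["pear", "apple", "pear", "banana", "apple"]
print(sort_uniq(sample))  # ['apple', 'banana', 'pear']
```

The input started with "pear", but the output starts with "apple": the duplicates are gone, and so is the original order.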
Deduplicate while keeping the original order
If you want to keep the first occurrence of each line and remove the later repeats, you can use awk:
awk '!seen[$0]++' filename.txt > output.txt
The meaning of this command is:
- Each line that is read is looked up in the array seen, using the whole line ($0) as the key (seen[$0]).
- If the line has not appeared before (!seen[$0] is true), it is printed, and the ++ then increases its count by one.
- Later repeats of the line have a non-zero count, so they are not printed.
This method is very practical, especially for logs and lists whose order must be kept.
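The check-then-count logic described above can be traced step by step in Python. This is only an illustrative sketch: the function name awk_style_dedupe and the sample data are hypothetical, and the dictionary stands in for the awk array.

```python
from collections import defaultdict


def awk_style_dedupe(lines):
    """Keep a line only if its counter was 0 before this occurrence was counted."""
    seen = defaultdict(int)  # like an awk array: missing keys start at 0
    kept = []
    for line in lines:
        count_before = seen[line]  # the value that the check looks at
        seen[line] += 1            # count this occurrence
        if count_before == 0:      # first time this line shows up
            kept.append(line)
    return kept


log = ["start", "warn", "start", "error", "warn"]
print(awk_style_dedupe(log))  # ['start', 'warn', 'error']
```

Because every distinct line is stored as a key, memory use grows with the number of unique lines, not with the total file size.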
How to handle large files more efficiently?
If the file is particularly large (for example, several hundred MB or more than a GB), loading it into memory all at once may be unrealistic. In that case, consider the following:
- Chunked processing: Split the file into several smaller files, deduplicate each one separately, and then merge the results, removing the duplicates that span different chunks during the merge.
- Use a database: Import the lines into a lightweight database such as SQLite and use DISTINCT to deduplicate them.
- Memory-optimized script: Read the file line by line in Python (a file object is iterated lazily) to avoid loading the whole content at once.
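The chunked approach could look roughly like the sketch below. It is one possible way to do it, not the only one: each chunk is sorted and deduplicated in memory, written to a temporary file, and the sorted chunks are then merged so that duplicates from different chunks end up adjacent and can be dropped. The function name dedupe_large_file and the chunk_lines parameter are assumptions made for this example, and the output is sorted rather than in the original order.

```python
import heapq
import os
import tempfile
from itertools import groupby, islice


def dedupe_large_file(src_path, dst_path, chunk_lines=100_000):
    """Deduplicate src_path into dst_path via sorted temporary chunks (sorted output)."""
    chunk_paths = []
    try:
        with open(src_path, "r", encoding="utf-8") as src:
            while True:
                chunk = list(islice(src, chunk_lines))  # read at most chunk_lines lines
                if not chunk:
                    break
                # Give a final line without a newline one, so equal lines compare equal
                chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                    tmp.writelines(sorted(set(chunk)))  # dedupe and sort this chunk
                chunk_paths.append(path)

        files = [open(p, "r", encoding="utf-8") for p in chunk_paths]
        try:
            with open(dst_path, "w", encoding="utf-8") as dst:
                # heapq.merge streams the sorted chunks in order; groupby then
                # collapses duplicates that came from different chunks
                for line, _ in groupby(heapq.merge(*files)):
                    dst.write(line)
        finally:
            for f in files:
                f.close()
    finally:
        for p in chunk_paths:
            os.remove(p)  # clean up the temporary chunk files
```

A call such as dedupe_large_file('filename.txt', 'output.txt') keeps only about one chunk in memory at a time. All chunk files are open at once during the merge, so a very small chunk_lines on a very large file could hit the operating system's open-file limit.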
Python example (for medium-sized files):
seen = set()  # lines that have already been written
with open('filename.txt', 'r') as in_file, open('output.txt', 'w') as out_file:
    for line in in_file:  # read one line at a time
        if line not in seen:
            seen.add(line)
            out_file.write(line)
Although this method is a little slower, it gives you control over more details, such as ignoring blank lines or comparing lines case-insensitively.
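As a rough illustration of that kind of customization, the sketch below adds two optional rules to the same first-occurrence logic. The function name dedupe_lines and its parameters ignore_case and skip_blank are hypothetical names chosen for this example.

```python
def dedupe_lines(lines, ignore_case=False, skip_blank=False):
    """Yield lines in original order, dropping repeats under the chosen rules."""
    seen = set()
    for line in lines:
        if skip_blank and not line.strip():
            continue  # drop blank or whitespace-only lines entirely
        # Compare without the line ending, so a final line lacking "\n" still matches
        key = line.rstrip("\r\n")
        if ignore_case:
            key = key.casefold()  # case-insensitive comparison key
        if key not in seen:
            seen.add(key)
            yield line  # the first occurrence is written unchanged


sample = ["Apple\n", "\n", "apple\n", "Banana\n", "\n", "Apple\n"]
print(list(dedupe_lines(sample, ignore_case=True, skip_blank=True)))
# ['Apple\n', 'Banana\n']
```

Since it is a generator, it can be fed an open file object directly and its results written out line by line.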
A few tips
- It is best to back up the original file before deduplicating.
- Make sure no hidden characters affect the comparison, such as trailing spaces and line-ending differences.
- If you are not sure whether lines were really duplicates, you can use diff to compare the original file with the deduplicated file.
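One way to check for hidden characters before deduplicating is to look for lines that look the same but differ in trailing whitespace or line endings. The following is a small sketch under that assumption; the function name find_hidden_differences and the sample data are made up for illustration.

```python
from collections import defaultdict


def find_hidden_differences(lines):
    """Group lines that differ only in trailing whitespace or line endings."""
    groups = defaultdict(set)
    for line in lines:
        # rstrip() removes trailing spaces, tabs, "\r" and "\n"
        groups[line.rstrip()].add(line)
    # Keep only the visible texts that have more than one raw variant
    return {text: sorted(variants) for text, variants in groups.items() if len(variants) > 1}


sample = ["alpha\n", "alpha \n", "beta\r\n", "beta\n", "gamma\n"]
for text, variants in find_hidden_differences(sample).items():
    # repr() makes the invisible characters visible
    print(text, [repr(v) for v in variants])
```

When reading from a real file, open it with newline='' so that Python does not translate "\r\n" to "\n" before the comparison.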
That's basically it. The methods are not complicated, but the details are easy to overlook. Just choose the right method for your specific needs.
