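1. Introduction
I once needed to split a large archive for transfer and then restore it, and found a useful Python program for the job.
The package compresses and splits large files. Starting from a root directory, it walks the subdirectories and scans every file. Any file whose size exceeds the threshold size is compressed and split into multiple archive files, each no larger than the partition size. Compression/splitting works for any file extension.
Example:
Given the directory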
$ tree --du -h ~/MyFolder
├── [415M] My Datasets
│   ├── [6.3K] Readme.txt
│   └── [415M] Data on Leaf-Tailed Gecko
│       ├── [ 35M] DatasetA.zip
│       ├── [ 90M] DatasetB.zip
│       ├── [130M] DatasetC.zip
│       └── [160M] Books
│           ├── [ 15M] RegularBook.pdf
│           └── [145M] BookWithPictures.pdf
└── [818M] Video Conference Meetings
    ├── [817M] Discussion_on_Fermi_Paradox.mp4
    └── [1.1M] Notes_on_Discussion.pdf
running
$ python3 src/main.py --root_dir ~/MyFolder
turns the directory into
$ tree --du -h ~/MyFolder
├── [371M] My Datasets
│   ├── [6.3K] Readme.txt
│   └── [371M] Data on Leaf-Tailed Gecko
│       ├── [ 35M] DatasetA.zip
│       ├── [ 90M] DatasetB.zip
│       ├── [ 95M] DatasetC.zip.7z.001
│       ├── [ 18M] DatasetC.zip.7z.002
│       └── [133M] Books
│           ├── [ 15M] RegularBook.pdf
│           ├── [ 95M] BookWithPictures.pdf.7z.001
│           └── [ 23M] BookWithPictures.pdf.7z.002
└── [794M] Video Conference Meetings
    ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.001
    ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.002
    ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.003
    ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.004
    ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.005
    ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.006
    ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.007
    ├── [ 95M] Discussion_on_Fermi_Paradox.mp4.7z.008
    ├── [ 33M] Discussion_on_Fermi_Paradox.mp4.7z.009
    └── [1.1M] Notes_on_Discussion.pdf
Running
$ python3 src/reverse.py --root_dir ~/MyFolder
then restores the original files.
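2. Environment setup
2.1 python3
Python 3.x.x is already installed locally.
2.2 Download and install 7z
Both scripts call the 7z command-line program, so 7z must be installed and available on PATH.
Although src/main.py traverses the directories serially, compressing/splitting each file with 7z runs in parallel by default.
Reversing with src/reverse.py is entirely serial.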
3. Splitting
The code for splitting large files, main.py, is as follows:
import sys  # used to exit the program
import os  # file and directory operations
import shutil  # used to locate the 7z executable on PATH
import subprocess  # used to run the 7z command
import argparse  # command-line argument parsing


def str2bool(value):
    # argparse's type=bool treats any non-empty string (even "False") as True,
    # so the text is parsed explicitly instead
    if value.lower() in ("true", "t", "yes", "y", "1"):
        return True
    if value.lower() in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("Boolean value expected, got: " + value)


def parse_arguments():
    # Parse the command-line arguments
    parser = argparse.ArgumentParser(description='GitHub-ForceLargeFiles')
    parser.add_argument('--root_dir', type=str, default=os.getcwd(),
                        help="Root directory to start traversing. Defaults to current working directory.")
    parser.add_argument('--delete_original', type=str2bool, default=True,
                        help="Do you want to delete the original (large) file after compressing to archives?")
    parser.add_argument('--partition_ext', type=str, default="7z", choices=["7z", "xz", "bzip2", "gzip", "tar", "zip", "wim"],
                        help="Extension of the partitions. Recommended: 7z due to compression ratio and inter-OS compatibility.")
    parser.add_argument('--cmds_into_7z', type=str, default="a",
                        help="Commands to pass in to 7z.")
    parser.add_argument('--threshold_size', type=int, default=100,
                        help="Max threshold of the original file size to split into archive. I.e. files with sizes below this arg are ignored.")
    parser.add_argument('--threshold_size_unit', type=str, default='m', choices=['b', 'k', 'm', 'g'],
                        help="Unit of the threshold size specified (bytes, kilobytes, megabytes, gigabytes).")
    parser.add_argument('--partition_size', type=int, default=95,
                        help="Max size of an individual archive. May result in actual partition size to be higher than this value due to disk formatting. In that case, reduce this arg value.")
    parser.add_argument('--partition_size_unit', type=str, default='m', choices=['b', 'k', 'm', 'g'],
                        help="Unit of the partition size specified (bytes, kilobytes, megabytes, gigabytes).")
    args = parser.parse_args()
    return args


def check_7z_install():
    # Check that 7z is installed; exit the program if it is not
    if shutil.which("7z"):
        return True
    else:
        sys.exit("ABORTED. You do not have 7z properly installed at this time. Make sure it is added to PATH.")
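

def is_over_threshold(f_full_dir, args):
    # Decide whether the file size reaches the threshold:
    # the size in bytes is converted to the chosen unit before comparing
    size_dict = {
        "b": 1e-0,
        "k": 1e-3,
        "m": 1e-6,
        "g": 1e-9
    }
    return os.stat(f_full_dir).st_size * size_dict[args.threshold_size_unit] >= args.threshold_size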


def traverse_root_dir(args):
    # Walk the files under the given directory and compress the large ones
    for root, _, files in os.walk(args.root_dir):
        for f in files:
            f_full_dir = os.path.join(root, f)
            if is_over_threshold(f_full_dir, args):
                # Archive name is the full original name plus the partition extension,
                # e.g. DatasetC.zip -> DatasetC.zip.7z (7z then appends .001, .002, ...)
                archive_path = f_full_dir + "." + args.partition_ext
                # Compress with 7z; -v sets the volume (partition) size
                prc = subprocess.run(["7z", "-v" + str(args.partition_size) + args.partition_size_unit,
                                      args.cmds_into_7z, archive_path, f_full_dir])
                # Delete the original only if 7z reported success
                if args.delete_original and prc.returncode == 0:
                    os.remove(f_full_dir)


if __name__ == '__main__':
    check_7z_install()  # make sure 7z is installed
    traverse_root_dir(parse_arguments())  # compress and split the files
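This code walks every subdirectory starting from root_dir and compresses each file of roughly 100 MB or more into smaller archive files of at most about 95 MB each. By default the original (large) file is deleted after compression, but this option can be turned off.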
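A note on turning such an option off: argparse's `type=bool` simply calls `bool()` on the raw string, so any non-empty value, including "False", parses as True. The following is a minimal sketch of that pitfall and of one possible converter. The names `str2bool` and `build_parser` and the accepted spellings are my own illustrative choices, not part of the upstream project.

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert common textual spellings of a boolean into a real bool."""
    text = value.strip().lower()
    if text in ("true", "t", "yes", "y", "1"):
        return True
    if text in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"Boolean value expected, got: {value!r}")


def build_parser(bool_type) -> argparse.ArgumentParser:
    """Build a one-option parser (hypothetical helper) using the given converter."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--delete_original", type=bool_type, default=True)
    return parser


# With type=bool, the string "False" is non-empty and therefore truthy.
naive = build_parser(bool).parse_args(["--delete_original", "False"])
# With the explicit converter, the same command line disables deletion.
strict = build_parser(str2bool).parse_args(["--delete_original", "False"])

print(naive.delete_original)   # True
print(strict.delete_original)  # False
```

With a plain `type=bool` option, the only value that parses as False is the empty string, e.g. `--delete_original ""`.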
Execution log
D:\tmp\git_di>python main.py --root_dir "D:\tmp\git_di"
7-Zip 23.01 (x64) : Copyright (c) 1999-2023 Igor Pavlov : 2023-06-20
Scanning the drive:
1 file, 3329165073 bytes (3175 MiB)
Creating archive: D:\tmp\git_di\testfile.zip.7z
Add new data to archive: 1 file, 3329165073 bytes (3175 MiB)
Files read from disk: 1
Archive size: 3304152719 bytes (3152 MiB)
Volumes: 34
Everything is Ok
You can see that multiple archive chunks (testfile.zip.7z.001, testfile.zip.7z.002, ...) were generated in the current directory.
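As a rough check on the numbers in this log: 7z reports a full volume as 99,614,720 bytes (95 MiB) in the restore log of section 4, which is 95 × 1024², so the `m` passed to `-v` appears to be binary. `is_over_threshold`, by contrast, multiplies the byte count by 1e-6, so the 100m threshold is decimal, about 100,000,000 bytes. The sketch below reproduces the 34 volumes from the archive size in the log. The helper names and tables are mine, for illustration only.

```python
# Unit multipliers; hypothetical tables for illustration.
DECIMAL = {"b": 1, "k": 10**3, "m": 10**6, "g": 10**9}  # implied by is_over_threshold
BINARY = {"b": 1, "k": 2**10, "m": 2**20, "g": 2**30}   # matches the volume size 7z logged


def threshold_bytes(size: int, unit: str) -> int:
    """Approximate byte count at which main.py starts splitting a file."""
    return size * DECIMAL[unit]


def volume_bytes(size: int, unit: str) -> int:
    """Byte size of one full volume for a given -v value."""
    return size * BINARY[unit]


def volume_count(archive_bytes: int, size: int, unit: str) -> int:
    """Number of volumes needed, using integer ceiling division."""
    per_volume = volume_bytes(size, unit)
    return (archive_bytes + per_volume - 1) // per_volume


archive_bytes = 3304152719  # "Archive size" from the log above

print(threshold_bytes(100, "m"))              # 100000000
print(volume_bytes(95, "m"))                  # 99614720
print(volume_count(archive_bytes, 95, "m"))   # 34
```

The last volume holds the remainder, 3304152719 - 33 × 99614720 = 16866959 bytes.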
4. Restoring
The code for restoring large files, reverse.py, is as follows:
import sys  # used to exit the program
import os  # file and directory operations
import shutil  # used to locate the 7z executable on PATH
import subprocess  # used to run the 7z command
import argparse  # command-line argument parsing
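

def str2bool(value):
    # argparse's type=bool treats any non-empty string (even "False") as True,
    # so the text is parsed explicitly instead
    if value.lower() in ("true", "t", "yes", "y", "1"):
        return True
    if value.lower() in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("Boolean value expected, got: " + value)


def parse_arguments():
    # Parse the command-line arguments
    parser = argparse.ArgumentParser(description='GitHub-ForceLargeFiles_reverse')
    parser.add_argument('--root_dir', type=str, default=os.getcwd(),
                        help="Root directory to start traversing. Defaults to current working directory.")
    parser.add_argument('--delete_partitions', type=str2bool, default=True,
                        help="Do you want to delete the partition archives after extracting the original files?")
    args = parser.parse_args()
    return args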


def check_7z_install():
    # Check that 7z is installed; exit the program if it is not
    if shutil.which("7z"):
        return True
    else:
        sys.exit("ABORTED. You do not have 7z properly installed at this time. Make sure it is added to PATH.")


def is_partition(f_full_dir):
    # Decide whether the file is the first volume of a split archive
    return any(f_full_dir.endswith(ext) for ext in
               [".7z.001", ".xz.001", ".bzip2.001", ".gzip.001", ".tar.001", ".zip.001", ".wim.001"])
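

def reverse_root_dir(args):
    # Walk the files under the given directory and extract each split archive
    for root, _, files in os.walk(args.root_dir):
        for f in files:
            f_full_dir = os.path.join(root, f)
            if is_partition(f_full_dir):
                # Extract with 7z into the directory that holds the volumes
                prc = subprocess.run(["7z", "e", f_full_dir, "-o" + root])
                if args.delete_partitions and prc.returncode == 0:
                    # Strip the ".001" suffix, e.g. "testfile.zip.7z"
                    f_noext, _ = os.path.splitext(f)
                    # Delete every volume of this archive with os.remove.
                    # The original shelled out to `rm` after os.chdir(root),
                    # which fails on Windows and breaks relative root_dir walks.
                    for name in files:
                        if name.startswith(f_noext + "."):
                            os.remove(os.path.join(root, name))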


if __name__ == '__main__':
    check_7z_install()  # make sure 7z is installed
    reverse_root_dir(parse_arguments())  # extract the split archives
Test
Place the archive chunks (testfile.zip.7z.001, testfile.zip.7z.002, ...) in the directory D:\tmp\git_di, with reverse.py in the same directory.
Execution log
D:\tmp\git_di>python reverse.py --root_dir "D:\tmp\git_di"
7-Zip 23.01 (x64) : Copyright (c) 1999-2023 Igor Pavlov : 2023-06-20
Scanning the drive for archives:
1 file, 99614720 bytes (95 MiB)
Extracting archive: D:\tmp\git_di\testfile.zip.7z.001
--
Path = D:\tmp\git_di\testfile.zip.7z.001
Type = Split
Physical Size = 99614720
Volumes = 34
Total Physical Size = 3304152719
----
Path = testfile.zip.7z
Size = 3304152719
--
Path = testfile.zip.7z
Type = 7z
Physical Size = 3304152719
Headers Size = 162
Method = LZMA2:24
Solid = -
Blocks = 1
Everything is Ok
Size: 3329165073
Compressed: 3304152719
'rm' is not recognized as an internal or external command,
operable program or batch file.
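You can see that the file testfile.zip has been newly generated. The 'rm' message at the end of the log appears because this run deleted the volumes by shelling out to rm, which does not exist in the Windows command prompt. Extraction itself succeeded, but the volume files were left in place.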
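Since the chunks are usually transferred separately before this step, it can help to check that none went missing on the way. The sketch below is one possible check, assuming the `.001`, `.002`, ... naming seen above. The function name `find_missing_volumes` and the regular expression are my own and are not part of reverse.py. It can only detect gaps below the highest volume number present. A missing trailing volume would have to be caught by comparing against the "Volumes" count that 7z prints.

```python
import re
from collections import defaultdict

# Matches names such as "testfile.zip.7z.001": an archive stem plus a numeric volume suffix.
VOLUME_RE = re.compile(r"^(?P<stem>.+)\.(?P<index>\d{3,})$")


def find_missing_volumes(names):
    """Return {stem: [missing indices]} for gaps between 1 and the highest volume seen."""
    seen = defaultdict(set)
    for name in names:
        match = VOLUME_RE.match(name)
        if match:
            seen[match.group("stem")].add(int(match.group("index")))
    missing = {}
    for stem, indices in seen.items():
        gaps = sorted(set(range(1, max(indices) + 1)) - indices)
        if gaps:
            missing[stem] = gaps
    return missing


# Hypothetical directory listing in which volume 003 was lost in transfer.
names = ["testfile.zip.7z.001", "testfile.zip.7z.002", "testfile.zip.7z.004", "notes.txt"]
print(find_missing_volumes(names))  # {'testfile.zip.7z': [3]}
```

In practice the list of names could come from `os.listdir` on the directory that holds the chunks.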
5. Closing
Reference GitHub link:
https://github.com/sisl/GitHub-ForceLargeFiles
over.