File name encoding problem in File Compression Method
- Problem: file name encoding differs between compression tools such as Zip, Gz, and Bz2, which shows up especially when archives move across platforms.
- Solution 1: use tar to create the archive
- Solution 2 (Linux, if your unzip build supports the -O option): unzip -O CP936 non_english_name.zip (CP936 selects the GBK Chinese encoding, from the GB2312/GBK/GB18030 family)
- Solution 3 (Java's jar tool): jar xvf non_english_name.zip
- Problem: when zip or WinRAR extracts an archive whose file names use a non-English encoding, it sometimes requires changing the system locale language and rebooting before the names are extracted correctly.
- Solution 1 (Windows): use the ready-to-use prebuilt zip and unzip tools from the DotNetZip library, which support options for the encoding used to encode and decode names
Unzip.exe -cp 936 chinese_name_content.zip
- download: the tools are in the library's tool folder: https://dotnetzip.codeplex.com/
- windows code page: https://en.wikipedia.org/wiki/Windows_code_page
- 936 and 1386 for GBK
- 932 and 943 for Shift JIS
- Solution 2 (cross-platform): use Python (sometimes works, sometimes fails with an encoding error):
python xZip.py non_english.zip decode_language
(for example gbk for Chinese; for the decode_language codec names refer to https://docs.python.org/2/library/codecs.html )
- Python code for xZip.py:
# Python 2 script (relies on str.decode and the unicode built-in).
# Full list of codecs: https://docs.python.org/2/library/codecs.html
# Notes:
# - input from the command line uses the command line's system default locale encoding
# - paths inside the zip file are read as unicode using the given decode method
# - printing those unicode paths in a Windows command window may raise an error
#   when the system default locale codec cannot represent the characters
import zipfile
import os.path
import os
import sys


class ZFile(object):
    def __init__(self, filename, mode='r', basedir=''):
        self.filename = filename
        self.mode = mode
        if self.mode in ('w', 'a'):
            self.zfile = zipfile.ZipFile(filename, self.mode, compression=zipfile.ZIP_DEFLATED)
        else:
            self.zfile = zipfile.ZipFile(filename, self.mode)
        self.basedir = basedir
        if not self.basedir:
            self.basedir = os.path.dirname(filename)

    def addfile(self, path, arcname=None):
        path = path.replace('//', '/')
        if not arcname:
            if path.startswith(self.basedir):
                arcname = path[len(self.basedir):]
            else:
                arcname = ''
        self.zfile.write(path, arcname)

    def addfiles(self, paths):
        for path in paths:
            if isinstance(path, tuple):
                self.addfile(*path)
            else:
                self.addfile(path)

    def close(self):
        self.zfile.close()

    def extract_to(self, path, decode):
        for p in self.zfile.namelist():
            self.extract(p, path, decode)

    def extract(self, filename, path, decode):
        # Skip directory entries; directories are created on demand below.
        if not filename.endswith('/'):
            # decode is a codec name such as gbk, gb18030, GB2312, utf-8
            f = os.path.join(path, filename.decode(decode))
            dirname = os.path.dirname(f)
            if not os.path.exists(dirname):
                os.makedirs(dirname)
            with open(f, 'wb') as out:
                out.write(self.zfile.read(filename))


def create(zfile, files):
    z = ZFile(zfile, 'w')
    z.addfiles(files)
    z.close()


def extract(zfile, path, decode):
    z = ZFile(zfile)
    z.extract_to(path, decode)
    z.close()


if __name__ == "__main__":
    # usage: python xZip.py <archive.zip> <codec name>
    extract(unicode(sys.argv[1]), u'.', sys.argv[2])
- Alternative solution: extract normally, accepting the wrongly decoded names, then fix those names with Python by encoding them back to bytes and decoding with the correct codec
- Side notes:
- in the Windows command prompt, chcp changes the active display code page, which affects how file names are shown; see the Windows code page list linked above
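The decode-and-encode fix mentioned above can be sketched for Python 3, where the xZip.py script does not run as written. This is a minimal sketch, not a definitive implementation. It assumes that Python 3's zipfile module decodes entry names as cp437 when the archive does not set the UTF-8 name flag, so a garbled name can be turned back into its raw bytes and decoded again with the real codec. The names fix_name and extract_with_encoding are hypothetical helpers made up for this example.

```python
import os
import shutil
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: entry name is stored as UTF-8


def fix_name(garbled, real_encoding, assumed_encoding="cp437"):
    """Undo a wrong decode: back to the raw bytes, then decode properly."""
    return garbled.encode(assumed_encoding).decode(real_encoding)


def extract_with_encoding(zip_path, dest, real_encoding):
    """Extract zip_path into dest, reading legacy entry names as real_encoding.

    Returns the list of file paths that were written.
    """
    dest_root = os.path.realpath(dest)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            name = info.filename
            if not info.flag_bits & UTF8_FLAG:
                # zipfile decoded these bytes as cp437; redo it with the real codec.
                name = fix_name(name, real_encoding)
            target = os.path.realpath(os.path.join(dest_root, name))
            # Refuse entries such as "../x" that would land outside dest.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError("unsafe entry name: %r" % name)
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with archive.open(info) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            written.append(target)
    return written
```

A call such as extract_with_encoding("non_english.zip", ".", "gbk") would mirror the xZip.py command line. The same fix_name helper can also be applied to names that were already extracted with the wrong encoding, provided the wrong codec (cp437 here, which is an assumption) is the one the extracting tool actually used. On Python 3.11 and later, zipfile.ZipFile also accepts a metadata_encoding argument that may make the manual re-decode unnecessary.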
Common Problems with Compressed Files and Solutions
- Problem: after a recent WinRAR version update, an older WinRAR cannot open some archives created with the newer version.
- Solution: extract the archive with the latest 7-Zip, which handles it fine. http://www.7-zip.org/
WinRAR
- extract from the command line
unrar x Pack.rar