====== File name encoding problems in file compression ======

  * Problem: especially across platforms, file names are encoded differently by different compression tools, such as Zip, Gz and Bz2.
    * Solution 1: use tar to compress.
    * Solution 2 (Linux): unzip -O CP936 non_english_name.zip (CP936 means the names are decoded as GBK / GB18030 Chinese)
    * Solution 3 (Java): jar xvf non_english_name.zip
    * ref: http://www.111cn.net/sys/linux/72590.htm
  * Problem: when Zip or WinRAR extracts an archive whose file names are in a non-English encoding, you sometimes have to change the system locale language and reboot before the names come out right.
    * Solution 1 (Windows): use the ready-built zip and unzip tools from the DotNetZip library, which support an encoding option: Unzip.exe -cp 936 chinese_name_content.zip
      * download the library; the tools are under its tool folder: https://dotnetzip.codeplex.com/
      * ref: http://www.chengxuyuans.com/Ruby/41584.html
      * Windows code pages: https://en.wikipedia.org/wiki/Windows_code_page
        * 936 and 1386 for GBK
        * 932 and 943 for Shift JIS
    * Solution 2 (cross-platform) using Python (sometimes it works, sometimes it fails with an encoding error): python xZip.py non_english.zip decode_language (such as gbk for Chinese; for the decode_language codes see https://docs.python.org/2/library/codecs.html )
      * here is the Python code for xZip.py

<code python>
# Python 2 script.
# Full list of codecs: https://docs.python.org/2/library/codecs.html
# Notes:
# - command-line input uses the console's default locale encoding
# - entry names in the zip are turned into unicode with the given codec
# - printing those unicode paths in a Windows command window may fail
#   when the system default codec cannot represent the characters
import os
import os.path
import sys
import zipfile


class ZFile(object):
    def __init__(self, filename, mode='r', basedir=''):
        self.filename = filename
        self.mode = mode
        if self.mode in ('w', 'a'):
            self.zfile = zipfile.ZipFile(filename, self.mode,
                                         compression=zipfile.ZIP_DEFLATED)
        else:
            self.zfile = zipfile.ZipFile(filename, self.mode)
        self.basedir = basedir
        if not self.basedir:
            self.basedir = os.path.dirname(filename)

    def addfile(self, path, arcname=None):
        path = path.replace('//', '/')
        if not arcname:
            if path.startswith(self.basedir):
                arcname = path[len(self.basedir):]
            else:
                arcname = None  # let zipfile derive the entry name from path
        self.zfile.write(path, arcname)

    def addfiles(self, paths):
        for path in paths:
            if isinstance(path, tuple):
                self.addfile(*path)
            else:
                self.addfile(path)

    def close(self):
        self.zfile.close()

    def extract_to(self, path, decode):
        for p in self.zfile.namelist():
            self.extract(p, path, decode)

    def extract(self, filename, path, decode):
        if filename.endswith('/'):
            return  # directory entry; directories are created on demand below
        # zipfile already returns unicode for entries flagged as UTF-8;
        # only raw byte names need decoding (gbk, gb18030, gb2312, utf-8, ...)
        if isinstance(filename, str):
            name = filename.decode(decode)
        else:
            name = filename
        target = os.path.join(path, name)
        target_dir = os.path.dirname(target)
        if not os.path.exists(target_dir):
            os.makedirs(target_dir)
        with open(target, 'wb') as out:
            out.write(self.zfile.read(filename))


def create(zfile, files):
    z = ZFile(zfile, 'w')
    z.addfiles(files)
    z.close()


def extract(zfile, path, decode):
    z = ZFile(zfile)
    z.extract_to(path, decode)
    z.close()


if __name__ == "__main__":
    # decode the archive path with the locale encoding instead of ASCII
    zip_path = sys.argv[1].decode(sys.getfilesystemencoding())
    extract(zip_path, u'.', sys.argv[2])
</code>

    * Alternative solution: extract normally, accepting the wrongly encoded names, then fix those names with Python decode and encode.
  * Side notes:
    * in the Windows command prompt, chcp changes the active code page, which controls how file names are displayed: [[https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396|code page list]]
  * additional reading:
    * http://www.docin.com/p-739332424.html
    * http://www.cnblogs.com/qq78292959/archive/2013/03/27/2985310.html
    * https://allencch.wordpress.com/2010/12/06/how-to-extract-zip-file-which-contains-filenames-with-shift_jis-encoding-in-ubuntu/
    * https://www.mkssoftware.com/docs/man1/unzip.1.asp

====== Common problems with compressed files and solutions ======

  * Problem: WinRAR was updated recently, and an older WinRAR cannot open some archives made with the new version.
    * Solution: install the latest 7-Zip and extract the archive with it.
    * download: http://www.7-zip.org/

====== Winrar ======

  * extract from the command line: unrar x Pack.rar
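
====== Renaming garbled file names after extraction (sketch) ======

This goes back to the alternative solution in the file name encoding section: extract normally, then repair the names with decode and encode. Below is a minimal Python 3 sketch of that idea, not a tested tool. The function names fix_name and rename_tree are made up for illustration. It assumes the extractor decoded the names as CP437 (which is what Python 3's zipfile does when the archive does not flag its names as UTF-8); if your tool used a different wrong encoding, pass it as assumed_encoding.

```python
import os


def fix_name(garbled, actual_encoding, assumed_encoding="cp437"):
    """Undo a wrong decode: turn the name back into bytes, then decode it correctly."""
    return garbled.encode(assumed_encoding).decode(actual_encoding)


def rename_tree(root, actual_encoding, assumed_encoding="cp437"):
    """Rename every file and directory under root whose name can be repaired.

    Returns a list of (old_name, new_name) pairs for the entries it renamed.
    """
    renamed = []
    # Walk bottom-up so children are renamed before their parent directory.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            try:
                fixed = fix_name(name, actual_encoding, assumed_encoding)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue  # the name was not garbled this way; leave it alone
            if fixed != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, fixed))
                renamed.append((name, fixed))
    return renamed
```

For example, rename_tree("extracted_folder", "gbk") would try to repair Chinese names under a hypothetical extracted_folder. Plain ASCII names come back unchanged and are skipped, and names that are already correct Chinese fail the CP437 encode step and are skipped too. Run it on a copy first, because a wrong guess for either encoding can produce a different kind of garbage.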