解压乱码，判断 zip 压缩包编码是 GBK，还是 UTF-8

问题
如何鉴别文件编码
鉴别编码失败，怎么办？
整体代码
更新

问题

java 程序解压 zip 类型的压缩包，使用 WinRAR 压缩的解压出来就会乱码。
我使用的是 zip4j 解压的，根本原因还是解压时候编码问题。

如何鉴别文件编码

网络上提供的几种都以失败告终：

1、判断文件首字节
2、使用 cpdetector

鉴别编码失败，怎么办？

有个有趣的现象：如果自己写解压程序的话，编码不对会报异常，而不是乱码了。
因此我们可以使用异常来判断是否可行。

自己写解压

public static void doUnArchiver(File srcfile, String destpath) throws IOException {
    byte[] buf = new byte[1024];
    FileInputStream fis = new FileInputStream(srcfile);
    BufferedInputStream bis = new BufferedInputStream(fis);
    ZipInputStream zis = new ZipInputStream(bis);
    ZipEntry zn = null;
    while ((zn = zis.getNextEntry()) != null) {
        File f = new File(destpath + "/" + zn.getName());
        if (zn.isDirectory()) {
            f.mkdirs();
        } else {
            /*
             * 父目录不存在则创建
             */
            File parent = f.getParentFile();
            if (!parent.exists()) {
                parent.mkdirs();
            }
            FileOutputStream fos = new FileOutputStream(f);
            BufferedOutputStream bos = new BufferedOutputStream(fos);
            int len;
            while ((len = zis.read(buf)) != -1) {
                bos.write(buf, 0, len);
            }
            bos.flush();
            bos.close();
        }
        zis.closeEntry();
    }
    zis.close();
}

对其改造，进行判断

private boolean testEncoding(String filepath, Charset charset) throws FileNotFoundException {
    FileInputStream fis = new FileInputStream(new File(filepath));
    BufferedInputStream bis = new BufferedInputStream(fis);
    ZipInputStream zis = new ZipInputStream(bis, charset);
    ZipEntry zn = null;
    try {
        while ((zn = zis.getNextEntry()) != null) {
            // do nothing
        }
    } catch (Exception e) {
        return false;
    }finally {
        try {
            zis.close();
            bis.close();
            fis.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return true;
}

测试结果如下：

System.out.println(testEncoding("d:/UTF8的文件.zip", Charset.forName("UTF-8"))); // true
System.out.println(testEncoding("d:/UTF8的文件.zip", Charset.forName("GBK"))); //true
System.out.println(testEncoding("d:/GBK的文件.zip", Charset.forName("UTF-8"))); //false
System.out.println(testEncoding("d:/GBK的文件.zip", Charset.forName("GBK"))); //true

注意：
对于 UTF-8 的文件，使用 GBK 返回的居然是 true
所以我们在使用 testEncoding 方法的时候，只能是测试 “UTF-8” 返回的是不是 false
如果是 false，则使用 GBK 编码

if (testEncoding(file, Charset.forName("UTF-8")) == false){
    zipFile.setCharset(Charset.forName("GBK"));
}

整体代码

pom.xml 引入 zip4j

<dependency>
    <groupId>net.lingala.zip4j</groupId>
    <artifactId>zip4j</artifactId>
    <version>2.6.1</version>
</dependency>

解压代码

// ZipFile 默认是使用 UTF-8 编码处理文件
ZipFile zipFile = new ZipFile(file);
gl>if (testEncoding(file, Charset.forName("UTF-8")) == false){
    gl>zipFile.setCharset(Charset.forName("GBK"));
gl>}
zipFile.extractAll(baseDir);
private boolean testEncoding(String filepath, Charset charset) throws FileNotFoundException {
    FileInputStream fis = new FileInputStream(new File(filepath));
    BufferedInputStream bis = new BufferedInputStream(fis);
    ZipInputStream zis = new ZipInputStream(bis, charset);
    ZipEntry zn = null;
    try {
        while ((zn = zis.getNextEntry()) != null) {
            // do nothing
        }
    } catch (Exception e) {
        return false;
    }finally {
        try {
            zis.close();
            bis.close();
            fis.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return true;
}

更新

在使用上面的解压程序时，又出现了新的问题。

有一压缩包 shuju.zip ，里面的内容为：

├── shuju
———└── 新建文件夹
——————-└── 李贽与晚明文学思想.pdf
———└── China-Africa Relations， Political Conditions，and Ngũgĩ Wa Thiong’o’s Wizard of the Crow Peter Leman.PDF
———└── 一手数据.xlsx

shuju.zip 文件下载地址：链接：https://pan.baidu.com/s/12ugoP2Vx8Ui7chKVGaq-1w 提取码：w6qq

解压之后，中文文件没有乱码，但其中一个英文文件乱码了。如下图：

备用：http://www.sqber.com:8007/weibo/pic?url=https://img-blog.csdnimg.cn/20210419103212615.png

解决办法：

改用 apache 下的一款解压缩工具： Apache Commons Compress

https://commons.apache.org/proper/commons-compress/examples.html
https://github.com/apache/commons-compress

具体使用如下：

pom.xml 文件引入

<!-- 解压缩 https://mvnrepository.com/artifact/org.apache.commons/commons-compress -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.18</version>
</dependency>

解压代码：

public static void decompressor(String zipFile, String targetDir) throws IOException, ArchiveException {
        File archiveFile = new File(zipFile);
        // 文件不存在，跳过
        if (!archiveFile.exists())
            return;
        ArchiveInputStream input = new ArchiveStreamFactory().createArchiveInputStream(new BufferedInputStream(new FileInputStream(zipFile)));
        ArchiveEntry entry = null;
        while ((entry = input.getNextEntry()) != null) {
            if (!input.canReadEntryData(entry)) {
                // log something?
                continue;
            }
            String name = Paths.get(targetDir, entry.getName()).toString();
            File f = new File(name);
            if (entry.isDirectory()) {
                if (!f.isDirectory() && !f.mkdirs()) {
                    throw new IOException("failed to create directory " + f);
                }
            } else {
                File parent = f.getParentFile();
                if (!parent.isDirectory() && !parent.mkdirs()) {
                    throw new IOException("failed to create directory " + parent);
                }
                try (OutputStream o = Files.newOutputStream(f.toPath())) {
                    IOUtils.copy(input, o);
                }
            }
        }
        input.close();
    }