[File Format Exploration] EP.1: A First Look at the ePub File Format

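This is episode 1 of the "File Format Exploration" series: a first look at the ePub file format. The series walks readers through the author's process of exploring various file formats, and specifically how a file's contents end up being presented. As a rule, we assume that we only know what each format is used for, not the details of how it is implemented (where the author already knows some of those details, he will pretend he does not). Along the way we will try various methods to gradually build up a rough picture of the format.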

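About the file format

According to the mainland Simplified Chinese edition of Wikipedia:

EPub is a free and open standard whose content is "reflowable"; that is, the text can adapt to the characteristics of the reading device and be displayed in the way best suited to reading.

The quote stops here because anything further would be a spoiler. In short, ePub is a "document-type" file format in the same vein as PDF, commonly used to distribute e-books and similar reading material.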

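Exploration

Environment

The author has an ePub file on hand for testing, located at ~/Downloads/咖啡馆推理事件簿系列(全四本).epub (a shameless plug, but it really is to my taste). Everything that follows is based on this file. The operating system is Manjaro 21.1.0 on amd64 and the shell is GNU bash 5.1.8(1)-release. For convenience, let's rename the file first (so why give the original name at all?!):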

[littleye233@lymjrolt Downloads]$ cd ~
[littleye233@lymjrolt ~]$ cd Downloads
[littleye233@lymjrolt Downloads]$ mv '咖啡馆推理事件簿系列(全四本).epub' test.epub
[littleye233@lymjrolt Downloads]$ ll test.epub
-rw-r--r-- 1 littleye233 littleye233 1253964 Aug 22 23:24 test.epub

Round I. File type

First, let's test the waters with the file command, which ships with most Linux systems, and see what it prints. Type file test.epub and run it:

[littleye233@lymjrolt Downloads]$ file test.epub
test.epub: EPUB document EPUB document

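What a pity! file gives us almost nothing useful here. Its man page states that the command determines a file's type, but it can often tell us much more than that. Run it on an image file, for example, and you may get something like this: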

[littleye233@lymjrolt Downloads]$ file ~/.local/share/osu/screenshots/osu_2021-08-21_23-40-03.png
/home/littleye233/.local/share/osu/screenshots/osu_2021-08-21_23-40-03.png: PNG image data, 1920 x 961, 8-bit/color RGBA, non-interlaced

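With details like these from file, we can follow the trail: look for their traces in the file's raw bytes and infer what the corresponding "sites" mean (some file formats require particular positions to carry particular meanings). Annotation-like information would be even better.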
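To make the idea of fixed "sites" concrete, here is a minimal Python sketch (Python is chosen only for illustration; the post itself uses shell commands). It relies on the PNG layout, in which an 8-byte signature is followed by the IHDR chunk, whose data starts with the width and height as big-endian 32-bit integers. The helper names make_png_header and read_png_size are hypothetical, and the header is built in memory instead of being read from the screenshot above.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_png_header(width, height, bit_depth=8, color_type=6):
    """Build only the signature and IHDR chunk of a PNG (no pixel data)."""
    ihdr = struct.pack(">IIBBBBB", width, height, bit_depth, color_type, 0, 0, 0)
    crc = zlib.crc32(b"IHDR" + ihdr)
    return (PNG_SIGNATURE + struct.pack(">I", len(ihdr)) + b"IHDR"
            + ihdr + struct.pack(">I", crc))


def read_png_size(data):
    """Read (width, height) from their fixed offsets inside the IHDR chunk."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    # Bytes 16..24 hold width and height as two big-endian unsigned ints.
    return struct.unpack(">II", data[16:24])


if __name__ == "__main__":
    print(read_png_size(make_png_header(1920, 961)))
```

The point is only that the numbers file reported sit at known byte offsets, so reading them back requires no decoding of the image itself.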

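Round II. File structure

Now back to the ePub file. Let's see whether we can read its contents directly; the aim is to guess the file structure from the printable characters near the start of the file. Run nano test.epub to preview it directly, or use head --bytes=120 test.epub to view the first 120 bytes: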

[littleye233@lymjrolt Downloads]$ head --bytes=120 test.epub
PK!oa�mimetypeapplication/epub+zipPU�N�;�ʯ�META-INF/container.xml]�A
�0E�=

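Sure enough, some interesting strings show up: "mimetypeapplication/epub+zip". Going by experience, this looks like a header identifying the ePub format, and the "zip" in it (together with the leading "PK", the signature that ZIP archives start with) suggests that an ePub file may really be a compressed archive at heart.

In fact, many file formats (Word's "*.docx", for example) are essentially a compressed archive with various resource and configuration files packed inside; as long as suitable software reads and processes them, the user sees the intended result.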
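The guess can be checked byte by byte. The Python sketch below parses the 30-byte local file header that a ZIP archive begins with and returns the first entry's name, compression method and payload. It runs against a tiny stand-in archive built in memory, not the real test.epub, and the names first_entry and make_sample_archive are hypothetical.

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a ZIP local file header (little-endian fields).
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def first_entry(data):
    """Parse the first local file header; return (name, method, payload)."""
    (sig, _version, _flags, method, _mtime, _mdate,
     _crc, csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(data, 0)
    if sig != b"PK\x03\x04":
        raise ValueError("no ZIP local file header at offset 0")
    name_start = LOCAL_HEADER.size
    name = data[name_start:name_start + name_len].decode("utf-8")
    body_start = name_start + name_len + extra_len
    return name, method, data[body_start:body_start + csize]


def make_sample_archive():
    """Build a small stand-in archive in memory (an assumption, not test.epub)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Method 0 (stored) keeps the payload readable in the raw bytes.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", "<container/>")
    return buf.getvalue()


if __name__ == "__main__":
    print(first_entry(make_sample_archive()))
```

Because the first entry is stored without compression, its name and its content end up next to each other in the raw bytes, which is why head printed them as one run of text.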

Round III. Directory tree

Now we can use an unzip program to get at the contents of the ePub file. Run unzip -l test.epub in a terminal:

[littleye233@lymjrolt Downloads]$ unzip -l test.epub
Archive: test.epub
Length Date Time Name
--------- ---------- ----- ----
20 1980-01-01 00:00 mimetype
251 2019-06-27 10:40 META-INF/container.xml
12307 2019-06-27 10:40 OEBPS/content.opf
112368 2019-06-27 10:40 OEBPS/Images/cover00464.jpeg
128680 2019-06-27 10:40 OEBPS/Images/image00456.jpeg
120936 2019-06-27 10:40 OEBPS/Images/image00457.jpeg
1392 2019-06-27 10:40 OEBPS/Images/image00458.jpeg
101948 2019-06-27 10:40 OEBPS/Images/image00459.jpeg
119124 2019-06-27 10:40 OEBPS/Images/image00460.jpeg
1268 2019-06-27 10:40 OEBPS/Images/image00461.jpeg
42944 2019-06-27 10:40 OEBPS/Images/image00462.jpeg
121284 2019-06-27 10:40 OEBPS/Images/image00463.jpeg
2251 2019-06-27 10:40 OEBPS/Styles/style0001.css
9816 2019-06-27 10:40 OEBPS/Styles/style0002.css
2251 2019-06-27 10:40 OEBPS/Styles/style0003.css
9789 2019-06-27 10:40 OEBPS/Styles/style0004.css
2251 2019-06-27 10:40 OEBPS/Styles/style0005.css
29245 2019-06-27 10:40 OEBPS/Styles/style0006.css
2235 2019-06-27 10:40 OEBPS/Styles/style0007.css
29914 2019-06-27 10:40 OEBPS/Styles/style0008.css
2251 2019-06-27 10:40 OEBPS/Styles/style0009.css
624 2019-06-27 10:40 OEBPS/Text/cover_page.xhtml
851 2019-06-27 10:40 OEBPS/Text/part0000.xhtml
561 2019-06-27 10:40 OEBPS/Text/part0001.xhtml
428 2019-06-27 10:40 OEBPS/Text/part0002.xhtml
1518 2019-06-27 10:40 OEBPS/Text/part0003.xhtml
661 2019-06-27 10:40 OEBPS/Text/part0004.xhtml
2311 2019-06-27 10:40 OEBPS/Text/part0005.xhtml
55157 2019-06-27 10:40 OEBPS/Text/part0006.xhtml
58266 2019-06-27 10:40 OEBPS/Text/part0007.xhtml
59953 2019-06-27 10:40 OEBPS/Text/part0008.xhtml
49789 2019-06-27 10:40 OEBPS/Text/part0009.xhtml
66870 2019-06-27 10:40 OEBPS/Text/part0010.xhtml
57342 2019-06-27 10:40 OEBPS/Text/part0011.xhtml
67449 2019-06-27 10:40 OEBPS/Text/part0012.xhtml
16183 2019-06-27 10:40 OEBPS/Text/part0013.xhtml
561 2019-06-27 10:40 OEBPS/Text/part0014.xhtml
428 2019-06-27 10:40 OEBPS/Text/part0015.xhtml
1575 2019-06-27 10:40 OEBPS/Text/part0016.xhtml
496 2019-06-27 10:40 OEBPS/Text/part0017.xhtml
1446 2019-06-27 10:40 OEBPS/Text/part0018.xhtml
52358 2019-06-27 10:40 OEBPS/Text/part0019.xhtml
75746 2019-06-27 10:40 OEBPS/Text/part0020.xhtml
63420 2019-06-27 10:40 OEBPS/Text/part0021.xhtml
57399 2019-06-27 10:40 OEBPS/Text/part0022.xhtml
58590 2019-06-27 10:40 OEBPS/Text/part0023.xhtml
40263 2019-06-27 10:40 OEBPS/Text/part0024.xhtml
66099 2019-06-27 10:40 OEBPS/Text/part0025.xhtml
15143 2019-06-27 10:40 OEBPS/Text/part0026.xhtml
561 2019-06-27 10:40 OEBPS/Text/part0027.xhtml
612 2019-06-27 10:40 OEBPS/Text/part0028.xhtml
1344 2019-06-27 10:40 OEBPS/Text/part0029.xhtml
640 2019-06-27 10:40 OEBPS/Text/part0030.xhtml
6144 2019-06-27 10:40 OEBPS/Text/part0031.xhtml
25197 2019-06-27 10:40 OEBPS/Text/part0032.xhtml
54594 2019-06-27 10:40 OEBPS/Text/part0033.xhtml
87394 2019-06-27 10:40 OEBPS/Text/part0034.xhtml
97557 2019-06-27 10:40 OEBPS/Text/part0035.xhtml
109901 2019-06-27 10:40 OEBPS/Text/part0036.xhtml
17181 2019-06-27 10:40 OEBPS/Text/part0037.xhtml
5238 2019-06-27 10:40 OEBPS/Text/part0038.xhtml
561 2019-06-27 10:40 OEBPS/Text/part0039.xhtml
644 2019-06-27 10:40 OEBPS/Text/part0040.xhtml
1163 2019-06-27 10:40 OEBPS/Text/part0041.xhtml
1473 2019-06-27 10:40 OEBPS/Text/part0042.xhtml
38427 2019-06-27 10:40 OEBPS/Text/part0043.xhtml
90589 2019-06-27 10:40 OEBPS/Text/part0044.xhtml
51278 2019-06-27 10:40 OEBPS/Text/part0045.xhtml
58321 2019-06-27 10:40 OEBPS/Text/part0046.xhtml
29670 2019-06-27 10:40 OEBPS/Text/part0047.xhtml
12903 2019-06-27 10:40 OEBPS/Text/part0048.xhtml
7364 2019-06-27 10:40 OEBPS/toc.ncx
--------- -------
2422768 72 files
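The same listing can be produced programmatically. This is a minimal sketch using Python's standard zipfile module on a small in-memory stand-in archive (three entries chosen for illustration, not the 72 files above); list_entries and make_sample_archive are hypothetical helper names.

```python
import io
import zipfile


def list_entries(archive):
    """Return (uncompressed size, name) pairs in archive order."""
    with zipfile.ZipFile(archive) as zf:
        return [(info.file_size, info.filename) for info in zf.infolist()]


def make_sample_archive():
    """Build a stand-in archive in memory (an assumption, not test.epub)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", "<container/>")
        zf.writestr("OEBPS/content.opf", "<package/>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    entries = list_entries(make_sample_archive())
    for size, name in entries:
        print(f"{size:>9}  {name}")
    print(f"{sum(size for size, _ in entries):>9}  {len(entries)} files")
```

To try it on a real file, pass a path such as "test.epub" to list_entries instead of the in-memory buffer.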

We can also extract it directly:

[littleye233@lymjrolt Downloads]$ unzip test.epub -d test_epub
Archive: test.epub
extracting: test_epub/mimetype
inflating: test_epub/META-INF/container.xml
inflating: test_epub/OEBPS/content.opf
inflating: test_epub/OEBPS/Images/cover00464.jpeg
inflating: test_epub/OEBPS/Images/image00456.jpeg
inflating: test_epub/OEBPS/Images/image00457.jpeg
inflating: test_epub/OEBPS/Images/image00458.jpeg
inflating: test_epub/OEBPS/Images/image00459.jpeg
inflating: test_epub/OEBPS/Images/image00460.jpeg
inflating: test_epub/OEBPS/Images/image00461.jpeg
inflating: test_epub/OEBPS/Images/image00462.jpeg
inflating: test_epub/OEBPS/Images/image00463.jpeg
inflating: test_epub/OEBPS/Styles/style0001.css
inflating: test_epub/OEBPS/Styles/style0002.css
inflating: test_epub/OEBPS/Styles/style0003.css
inflating: test_epub/OEBPS/Styles/style0004.css
inflating: test_epub/OEBPS/Styles/style0005.css
inflating: test_epub/OEBPS/Styles/style0006.css
inflating: test_epub/OEBPS/Styles/style0007.css
inflating: test_epub/OEBPS/Styles/style0008.css
inflating: test_epub/OEBPS/Styles/style0009.css
inflating: test_epub/OEBPS/Text/cover_page.xhtml
inflating: test_epub/OEBPS/Text/part0000.xhtml
inflating: test_epub/OEBPS/Text/part0001.xhtml
inflating: test_epub/OEBPS/Text/part0002.xhtml
inflating: test_epub/OEBPS/Text/part0003.xhtml
inflating: test_epub/OEBPS/Text/part0004.xhtml
inflating: test_epub/OEBPS/Text/part0005.xhtml
inflating: test_epub/OEBPS/Text/part0006.xhtml
inflating: test_epub/OEBPS/Text/part0007.xhtml
inflating: test_epub/OEBPS/Text/part0008.xhtml
inflating: test_epub/OEBPS/Text/part0009.xhtml
inflating: test_epub/OEBPS/Text/part0010.xhtml
inflating: test_epub/OEBPS/Text/part0011.xhtml
inflating: test_epub/OEBPS/Text/part0012.xhtml
inflating: test_epub/OEBPS/Text/part0013.xhtml
inflating: test_epub/OEBPS/Text/part0014.xhtml
inflating: test_epub/OEBPS/Text/part0015.xhtml
inflating: test_epub/OEBPS/Text/part0016.xhtml
inflating: test_epub/OEBPS/Text/part0017.xhtml
inflating: test_epub/OEBPS/Text/part0018.xhtml
inflating: test_epub/OEBPS/Text/part0019.xhtml
inflating: test_epub/OEBPS/Text/part0020.xhtml
inflating: test_epub/OEBPS/Text/part0021.xhtml
inflating: test_epub/OEBPS/Text/part0022.xhtml
inflating: test_epub/OEBPS/Text/part0023.xhtml
inflating: test_epub/OEBPS/Text/part0024.xhtml
inflating: test_epub/OEBPS/Text/part0025.xhtml
inflating: test_epub/OEBPS/Text/part0026.xhtml
inflating: test_epub/OEBPS/Text/part0027.xhtml
inflating: test_epub/OEBPS/Text/part0028.xhtml
inflating: test_epub/OEBPS/Text/part0029.xhtml
inflating: test_epub/OEBPS/Text/part0030.xhtml
inflating: test_epub/OEBPS/Text/part0031.xhtml
inflating: test_epub/OEBPS/Text/part0032.xhtml
inflating: test_epub/OEBPS/Text/part0033.xhtml
inflating: test_epub/OEBPS/Text/part0034.xhtml
inflating: test_epub/OEBPS/Text/part0035.xhtml
inflating: test_epub/OEBPS/Text/part0036.xhtml
inflating: test_epub/OEBPS/Text/part0037.xhtml
inflating: test_epub/OEBPS/Text/part0038.xhtml
inflating: test_epub/OEBPS/Text/part0039.xhtml
inflating: test_epub/OEBPS/Text/part0040.xhtml
inflating: test_epub/OEBPS/Text/part0041.xhtml
inflating: test_epub/OEBPS/Text/part0042.xhtml
inflating: test_epub/OEBPS/Text/part0043.xhtml
inflating: test_epub/OEBPS/Text/part0044.xhtml
inflating: test_epub/OEBPS/Text/part0045.xhtml
inflating: test_epub/OEBPS/Text/part0046.xhtml
inflating: test_epub/OEBPS/Text/part0047.xhtml
inflating: test_epub/OEBPS/Text/part0048.xhtml
inflating: test_epub/OEBPS/toc.ncx

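To show the file tree more clearly, we can also use the tree command from inside the extracted test_epub directory (the command is built into Windows; on Linux you need to install the tree package, either through the package manager or by compiling it yourself):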

[littleye233@lymjrolt test_epub]$ tree
.
├── META-INF
│   └── container.xml
├── mimetype
└── OEBPS
    ├── content.opf
    ├── Images
    │   ├── cover00464.jpeg
    │   ├── image00456.jpeg
    │   ├── image00457.jpeg
    │   ├── image00458.jpeg
    │   ├── image00459.jpeg
    │   ├── image00460.jpeg
    │   ├── image00461.jpeg
    │   ├── image00462.jpeg
    │   └── image00463.jpeg
    ├── Styles
    │   ├── style0001.css
    │   ├── style0002.css
    │   ├── style0003.css
    │   ├── style0004.css
    │   ├── style0005.css
    │   ├── style0006.css
    │   ├── style0007.css
    │   ├── style0008.css
    │   └── style0009.css
    ├── Text
    │   ├── cover_page.xhtml
    │   ├── part0000.xhtml
    │   ├── part0001.xhtml
    │   ├── part0002.xhtml
    │   ├── part0003.xhtml
    │   ├── part0004.xhtml
    │   ├── part0005.xhtml
    │   ├── part0006.xhtml
    │   ├── part0007.xhtml
    │   ├── part0008.xhtml
    │   ├── part0009.xhtml
    │   ├── part0010.xhtml
    │   ├── part0011.xhtml
    │   ├── part0012.xhtml
    │   ├── part0013.xhtml
    │   ├── part0014.xhtml
    │   ├── part0015.xhtml
    │   ├── part0016.xhtml
    │   ├── part0017.xhtml
    │   ├── part0018.xhtml
    │   ├── part0019.xhtml
    │   ├── part0020.xhtml
    │   ├── part0021.xhtml
    │   ├── part0022.xhtml
    │   ├── part0023.xhtml
    │   ├── part0024.xhtml
    │   ├── part0025.xhtml
    │   ├── part0026.xhtml
    │   ├── part0027.xhtml
    │   ├── part0028.xhtml
    │   ├── part0029.xhtml
    │   ├── part0030.xhtml
    │   ├── part0031.xhtml
    │   ├── part0032.xhtml
    │   ├── part0033.xhtml
    │   ├── part0034.xhtml
    │   ├── part0035.xhtml
    │   ├── part0036.xhtml
    │   ├── part0037.xhtml
    │   ├── part0038.xhtml
    │   ├── part0039.xhtml
    │   ├── part0040.xhtml
    │   ├── part0041.xhtml
    │   ├── part0042.xhtml
    │   ├── part0043.xhtml
    │   ├── part0044.xhtml
    │   ├── part0045.xhtml
    │   ├── part0046.xhtml
    │   ├── part0047.xhtml
    │   └── part0048.xhtml
    └── toc.ncx

5 directories, 72 files

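Round IV. Internal files

At this point we can make some educated guesses:

  • META-INF folder: presumably holds configuration files for the "container" (that is, the ePub file itself);
  • mimetype file: declares the type of this file as "ePub" ("MIME" is short for "Multipurpose Internet Mail Extensions"; a MIME type is a standard label for the kind of content a file holds);
  • OEBPS folder: its exact meaning is unclear for now, but it should hold the ePub's text, images and other presentation data;
    • content.opf file: presumably holds table-of-contents information, or defines the "order" of the various files;
    • Images, Styles and Text folders: evidently hold the images, cascading style sheets and text respectively;
    • toc.ncx file: possibly the real table of contents ("toc" is short for "table of contents").

We will go through them one by one.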

Round IV.I. The container

First, META-INF/container.xml:

[littleye233@lymjrolt test_epub]$ file META-INF/container.xml
META-INF/container.xml: XML 1.0 document, ASCII text

Its contents:

<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
<rootfiles>
<rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/> </rootfiles>
</container>

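This is clearly a standard XML file. Note that /container/rootfiles/rootfile/@full-path[^1] points to the file we earlier took to be the table of contents. Since this part can be standardized, the file is probably much the same in most ePub archives.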
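As a rough illustration of how a reader might follow that pointer, the sketch below uses Python's standard xml.etree.ElementTree to pull full-path and media-type out of the rootfile element. The XML is the container.xml shown above (re-indented); find_rootfiles is a hypothetical helper name. Note that the elements live in a namespace, so the lookup has to use it.

```python
import xml.etree.ElementTree as ET

CONTAINER_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# Prefix "c" is arbitrary; only the namespace URI has to match the document.
NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_rootfiles(container_xml):
    """Return (full-path, media-type) for every rootfile element."""
    root = ET.fromstring(container_xml)
    return [(rf.get("full-path"), rf.get("media-type"))
            for rf in root.findall("c:rootfiles/c:rootfile", NS)]


if __name__ == "__main__":
    for path, media_type in find_rootfiles(CONTAINER_XML):
        print(path, media_type)
```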

Round IV.II. Identifying the file type

Next, the mimetype file:

[littleye233@lymjrolt test_epub]$ cat mimetype
application/epub+zip

This one is self-explanatory, so we will not dwell on it.

Round IV.III. Table of contents?

Next, OEBPS/content.opf:

[littleye233@lymjrolt test_epub]$ file OEBPS/content.opf
OEBPS/content.opf: XML 1.0 document, Unicode text, UTF-8 text, with very long lines (504)

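This is an XML file too. Surprisingly, file can even tell that the file contains very long lines, apparently with the longest one running to 504 characters, which is frankly a little scary.

Full contents of `OEBPS/content.opf` (formatted):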
<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="uid">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
<dc:title opf:file-as="kafeiguantuilishijianbuxilie(quansiben)">咖啡馆推理事件簿系列(全四本)</dc:title>
<dc:language>zh</dc:language>
<dc:identifier id="uid">3899198450</dc:identifier>
<dc:creator opf:file-as="(ri)gangqizuomo">(日)冈崎琢磨</dc:creator>
<dc:date opf:event="publication">2018-03-15</dc:date>
<!-- Extra MetaData from RESC
<dc:coverage/>
-->
<meta name="cover" content="x_cover-image"/>
<meta name="output encoding" content="utf-8"/>
<meta name="primary-writing-mode" content="horizontal-lr"/>
<!-- BEGIN INFORMATION ONLY
<meta name="Cover ThumbNail Image" content="Images/image00466.jpeg" />
<meta name="Drm Ebookbase Book Id" content="0006008690412" />
<meta name="ASIN" content="B07BFTVX98" />
<meta name="Creator-Software" content="201" />
<meta name="Author-Pronunciation" content="(ri)gangqizuomo" />
<meta name="Embedded-Record-Count" content="11" />
<meta name="Unknown_(403)_(hex)" content="00" />
<meta name="HasFakeCover" content="0" />
<meta name="Creator-Major-Version" content="2" />
<meta name="cdeType" content="EBOK" />
<meta name="override-kindle-fonts" content="false" />
<meta name="CDEContentKey" content="B07BFTVX98" />
<meta name="Compression-Upgraded" content="Source-Target:c1-c2 KT_Version:2.9 Build:0805-4a0c57c" />
<meta name="HD-Media-Containers-Info" content="2400x3840:0-11|" />
<meta name="548 (hex)" content="496e4d656d6f7279" />
<meta name="Unknown_(407)_(hex)" content="0000000000000000" />
<meta name="Amazon_Creator_Info" content="kjw" />
<meta name="Clipping-Limit" content="100" />
<meta name="Tamper-Proof-Keys_(hex)" content="01000000d000000001940000000191000000019500000001960000000197" />
<meta name="Title-Pronunciation" content="kafeiguantuilishijianbuxilie(quansiben)" />
<meta name="Creator-Minor-Version" content="9" />
<meta name="MetadataResourceURI" content="kindle:embed:000A" />
<meta name="Updated_Title" content="咖啡馆推理事件簿系列(全四本)" />
<meta name="Ownership-Type_(hex)" content="00" />
<meta name="547 (hex)" content="496e4d656d6f7279" />
<meta name="Content-Language-Tag" content="zh" />
<meta name="sample" content="0" />
<meta name="Metadata-Record-Offset" content="4294967295" />
<meta name="Creator-Build-Tag" content="0721-dedaf5" />
<meta name="Publisher-Pronunciation" content="xiandaichubanshe" />
<meta name="StartOffset" content="4294967295" />
<meta name="Watermark_(hex)" content="6174763a6b696e3a323a49396e41307a4239625565766a514961583462736b66476a394535335a51616a696368585638364447746b65544379504d504c4d75445a35524f39676b584d35515a6433694f424b5531546643766f5a62507763705a6b49486f6f366a6639785944327a4158494263536c495879676b6a38616b566e4763327a2b2b50434c454c464b2b4e30495a4556437a6331516f656f4451546b3865374a6f61696251526d6f682b7574586b3661466a554477704a3165636c68665367414a35664745413a68614f61636b496839662b61786c457733397665774b32554a57453d" />
<meta name="Text-to-Speech-Disabled" content="0" />
<meta name="Font-Signature_(hex)" content="0300000000480f08100000000000008000200000000000000000000000000000bef4edec01b701d7409440984099409c409d40a64a804c9c608160826080608b60e8618b60cd60c661aa61f361d661e9629c73df0213c9021fd90173cd01429f037e9a037e8c037e81037e9f" />
<meta name="Rental-Expiration-Time" content="0000000000000000" />
<meta name="Container_Id" content="ZkM0" />
<meta name="Mobi8-Boundary-Section" content="420" />
<meta name="Creator-Build-Number" content="0" />
END INFORMATION ONLY -->
</metadata>
<manifest>
<item id="x_cover" media-type="application/xhtml+xml" href="Text/cover_page.xhtml"/>
<item id="x_TableOfContents" media-type="application/xhtml+xml" href="Text/part0000.xhtml"/>
<item id="x_a1cover.html" media-type="application/xhtml+xml" href="Text/part0001.xhtml"/>
<item id="x_a1bookname" media-type="application/xhtml+xml" href="Text/part0002.xhtml"/>
<item id="x_a1TableOfContents" media-type="application/xhtml+xml" href="Text/part0003.xhtml"/>
<item id="x_a1Chapter001" media-type="application/xhtml+xml" href="Text/part0004.xhtml"/>
<item id="x_a1Chapter002" media-type="application/xhtml+xml" href="Text/part0005.xhtml"/>
<item id="x_a1Chapter003" media-type="application/xhtml+xml" href="Text/part0006.xhtml"/>
<item id="x_a1Chapter004" media-type="application/xhtml+xml" href="Text/part0007.xhtml"/>
<item id="x_a1Chapter005" media-type="application/xhtml+xml" href="Text/part0008.xhtml"/>
<item id="x_a1Chapter006" media-type="application/xhtml+xml" href="Text/part0009.xhtml"/>
<item id="x_a1Chapter007" media-type="application/xhtml+xml" href="Text/part0010.xhtml"/>
<item id="x_a1Chapter008" media-type="application/xhtml+xml" href="Text/part0011.xhtml"/>
<item id="x_a1Chapter009" media-type="application/xhtml+xml" href="Text/part0012.xhtml"/>
<item id="x_a1Chapter010" media-type="application/xhtml+xml" href="Text/part0013.xhtml"/>
<item id="x_a2cover.html" media-type="application/xhtml+xml" href="Text/part0014.xhtml"/>
<item id="x_a2bookname" media-type="application/xhtml+xml" href="Text/part0015.xhtml"/>
<item id="x_a2TableOfContents" media-type="application/xhtml+xml" href="Text/part0016.xhtml"/>
<item id="x_a2Chapter001" media-type="application/xhtml+xml" href="Text/part0017.xhtml"/>
<item id="x_a2Chapter002" media-type="application/xhtml+xml" href="Text/part0018.xhtml"/>
<item id="x_a2Chapter003" media-type="application/xhtml+xml" href="Text/part0019.xhtml"/>
<item id="x_a2Chapter004" media-type="application/xhtml+xml" href="Text/part0020.xhtml"/>
<item id="x_a2Chapter005" media-type="application/xhtml+xml" href="Text/part0021.xhtml"/>
<item id="x_a2Chapter006" media-type="application/xhtml+xml" href="Text/part0022.xhtml"/>
<item id="x_a2Chapter007" media-type="application/xhtml+xml" href="Text/part0023.xhtml"/>
<item id="x_a2Chapter008" media-type="application/xhtml+xml" href="Text/part0024.xhtml"/>
<item id="x_a2Chapter009" media-type="application/xhtml+xml" href="Text/part0025.xhtml"/>
<item id="x_a2Chapter010" media-type="application/xhtml+xml" href="Text/part0026.xhtml"/>
<item id="x_a3cover.html" media-type="application/xhtml+xml" href="Text/part0027.xhtml"/>
<item id="x_a3bookname" media-type="application/xhtml+xml" href="Text/part0028.xhtml"/>
<item id="x_a3TableOfContents" media-type="application/xhtml+xml" href="Text/part0029.xhtml"/>
<item id="x_a3Chapter001" media-type="application/xhtml+xml" href="Text/part0030.xhtml"/>
<item id="x_a3Chapter002" media-type="application/xhtml+xml" href="Text/part0031.xhtml"/>
<item id="x_a3Chapter003" media-type="application/xhtml+xml" href="Text/part0032.xhtml"/>
<item id="x_a3Chapter004" media-type="application/xhtml+xml" href="Text/part0033.xhtml"/>
<item id="x_a3Chapter005" media-type="application/xhtml+xml" href="Text/part0034.xhtml"/>
<item id="x_a3Chapter006" media-type="application/xhtml+xml" href="Text/part0035.xhtml"/>
<item id="x_a3Chapter007" media-type="application/xhtml+xml" href="Text/part0036.xhtml"/>
<item id="x_a3Chapter008" media-type="application/xhtml+xml" href="Text/part0037.xhtml"/>
<item id="x_a3Chapter009" media-type="application/xhtml+xml" href="Text/part0038.xhtml"/>
<item id="x_a4cover.html" media-type="application/xhtml+xml" href="Text/part0039.xhtml"/>
<item id="x_a4bookname" media-type="application/xhtml+xml" href="Text/part0040.xhtml"/>
<item id="x_a4TableOfContents" media-type="application/xhtml+xml" href="Text/part0041.xhtml"/>
<item id="x_a4Chapter001" media-type="application/xhtml+xml" href="Text/part0042.xhtml"/>
<item id="x_a4Chapter002" media-type="application/xhtml+xml" href="Text/part0043.xhtml"/>
<item id="x_a4Chapter003" media-type="application/xhtml+xml" href="Text/part0044.xhtml"/>
<item id="x_a4Chapter004" media-type="application/xhtml+xml" href="Text/part0045.xhtml"/>
<item id="x_a4Chapter005" media-type="application/xhtml+xml" href="Text/part0046.xhtml"/>
<item id="x_a4Chapter006" media-type="application/xhtml+xml" href="Text/part0047.xhtml"/>
<item id="x_a4Chapter007" media-type="application/xhtml+xml" href="Text/part0048.xhtml"/>
<item id="item50" media-type="text/css" href="Styles/style0001.css"/>
<item id="item51" media-type="text/css" href="Styles/style0002.css"/>
<item id="item52" media-type="text/css" href="Styles/style0003.css"/>
<item id="item53" media-type="text/css" href="Styles/style0004.css"/>
<item id="item54" media-type="text/css" href="Styles/style0005.css"/>
<item id="item55" media-type="text/css" href="Styles/style0006.css"/>
<item id="item56" media-type="text/css" href="Styles/style0007.css"/>
<item id="item57" media-type="text/css" href="Styles/style0008.css"/>
<item id="item58" media-type="text/css" href="Styles/style0009.css"/>
<item id="item59" media-type="image/jpeg" href="Images/image00456.jpeg"/>
<item id="item60" media-type="image/jpeg" href="Images/image00457.jpeg"/>
<item id="item61" media-type="image/jpeg" href="Images/image00458.jpeg"/>
<item id="item62" media-type="image/jpeg" href="Images/image00459.jpeg"/>
<item id="item63" media-type="image/jpeg" href="Images/image00460.jpeg"/>
<item id="item64" media-type="image/jpeg" href="Images/image00461.jpeg"/>
<item id="item65" media-type="image/jpeg" href="Images/image00462.jpeg"/>
<item id="item66" media-type="image/jpeg" href="Images/image00463.jpeg"/>
<item id="x_cover-image" media-type="image/jpeg" href="Images/cover00464.jpeg"/>
<item id="ncx" media-type="application/x-dtbncx+xml" href="toc.ncx"/>
</manifest>
<spine toc="ncx">
<itemref idref="x_cover" linear="no"/>
<itemref idref="x_TableOfContents" linear="yes"/>
<itemref idref="x_a1cover.html" linear="yes"/>
<itemref idref="x_a1bookname" linear="yes"/>
<itemref idref="x_a1TableOfContents" linear="yes"/>
<itemref idref="x_a1Chapter001" linear="yes"/>
<itemref idref="x_a1Chapter002" linear="yes"/>
<itemref idref="x_a1Chapter003" linear="yes"/>
<itemref idref="x_a1Chapter004" linear="yes"/>
<itemref idref="x_a1Chapter005" linear="yes"/>
<itemref idref="x_a1Chapter006" linear="yes"/>
<itemref idref="x_a1Chapter007" linear="yes"/>
<itemref idref="x_a1Chapter008" linear="yes"/>
<itemref idref="x_a1Chapter009" linear="yes"/>
<itemref idref="x_a1Chapter010" linear="yes"/>
<itemref idref="x_a2cover.html" linear="yes"/>
<itemref idref="x_a2bookname" linear="yes"/>
<itemref idref="x_a2TableOfContents" linear="yes"/>
<itemref idref="x_a2Chapter001" linear="yes"/>
<itemref idref="x_a2Chapter002" linear="yes"/>
<itemref idref="x_a2Chapter003" linear="yes"/>
<itemref idref="x_a2Chapter004" linear="yes"/>
<itemref idref="x_a2Chapter005" linear="yes"/>
<itemref idref="x_a2Chapter006" linear="yes"/>
<itemref idref="x_a2Chapter007" linear="yes"/>
<itemref idref="x_a2Chapter008" linear="yes"/>
<itemref idref="x_a2Chapter009" linear="yes"/>
<itemref idref="x_a2Chapter010" linear="yes"/>
<itemref idref="x_a3cover.html" linear="yes"/>
<itemref idref="x_a3bookname" linear="yes"/>
<itemref idref="x_a3TableOfContents" linear="yes"/>
<itemref idref="x_a3Chapter001" linear="yes"/>
<itemref idref="x_a3Chapter002" linear="yes"/>
<itemref idref="x_a3Chapter003" linear="yes"/>
<itemref idref="x_a3Chapter004" linear="yes"/>
<itemref idref="x_a3Chapter005" linear="yes"/>
<itemref idref="x_a3Chapter006" linear="yes"/>
<itemref idref="x_a3Chapter007" linear="yes"/>
<itemref idref="x_a3Chapter008" linear="yes"/>
<itemref idref="x_a3Chapter009" linear="yes"/>
<itemref idref="x_a4cover.html" linear="yes"/>
<itemref idref="x_a4bookname" linear="yes"/>
<itemref idref="x_a4TableOfContents" linear="yes"/>
<itemref idref="x_a4Chapter001" linear="yes"/>
<itemref idref="x_a4Chapter002" linear="yes"/>
<itemref idref="x_a4Chapter003" linear="yes"/>
<itemref idref="x_a4Chapter004" linear="yes"/>
<itemref idref="x_a4Chapter005" linear="yes"/>
<itemref idref="x_a4Chapter006" linear="yes"/>
<itemref idref="x_a4Chapter007" linear="yes"/>
</spine>
<tours>
</tours>
<guide>
<reference type="text" title="Start" href="Text/part0004.xhtml"/>
<reference type="toc" title="Table of Contents" href="Text/part0000.xhtml"/>
<reference type="cover" title="Cover" href="Text/cover_page.xhtml"/>
</guide>
</package>

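So my earlier guess was not wrong: this file holds something that goes beyond a "table of contents", namely an "order" or, going one step further, an "index". It plays a role similar to index.* in other file formats or directory trees: it assigns an identifier to each piece of data in the ePub, and it also defines metadata such as the title, language, author and publication date. As for the very long line we saw earlier, it appears to be a hexadecimal watermark, perhaps intended to deter piracy.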
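Several of the commented-out meta values are labelled "(hex)", and some of them turn out to be plain ASCII once decoded. The Python sketch below tries that on three values copied from the listing above (for the watermark, only its first ten bytes). decode_hex_meta is a hypothetical helper, and what the decoded strings actually mean is not something this sketch can tell us.

```python
def decode_hex_meta(value):
    """Decode a hex string to text if it is printable ASCII, else return None."""
    raw = bytes.fromhex(value)
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        return None
    return text if text.isprintable() else None


# Values copied from the metadata comment block in content.opf.
SAMPLES = {
    "548 (hex)": "496e4d656d6f7279",
    "Watermark_(hex), first 10 bytes": "6174763a6b696e3a323a",
    "Unknown_(407)_(hex)": "0000000000000000",
}

if __name__ == "__main__":
    for name, value in SAMPLES.items():
        print(f"{name}: {decode_hex_meta(value)!r}")
```

Values that are not text, such as the all-zero field, simply come back as None.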

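Within it, /package/manifest/item defines every index entry together with the type of the corresponding file. The further purpose of /package/spine/itemref is not yet clear, but each idref matches a manifest id, and the element can evidently declare whether an entry is "linear". /package/guide/reference defines entries such as the ePub's cover, which file managers and ePub readers can use (to show a preview page).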
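The link between the two sections can be sketched in a few lines of Python: build a lookup from manifest id to href, then walk the spine and resolve each idref. The XML below is a shortened excerpt of the content.opf above (the metadata and most items are left out for brevity), and spine_files is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OPF_SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="uid">
  <manifest>
    <item id="x_cover" media-type="application/xhtml+xml" href="Text/cover_page.xhtml"/>
    <item id="x_TableOfContents" media-type="application/xhtml+xml" href="Text/part0000.xhtml"/>
    <item id="x_a1cover.html" media-type="application/xhtml+xml" href="Text/part0001.xhtml"/>
    <item id="item50" media-type="text/css" href="Styles/style0001.css"/>
    <item id="ncx" media-type="application/x-dtbncx+xml" href="toc.ncx"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="x_cover" linear="no"/>
    <itemref idref="x_TableOfContents" linear="yes"/>
    <itemref idref="x_a1cover.html" linear="yes"/>
  </spine>
</package>
"""

NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_files(opf_xml):
    """Resolve each spine itemref to (href, linear) through the manifest."""
    root = ET.fromstring(opf_xml)
    manifest = {item.get("id"): item.get("href")
                for item in root.findall("opf:manifest/opf:item", NS)}
    return [(manifest[ref.get("idref")], ref.get("linear", "yes"))
            for ref in root.findall("opf:spine/opf:itemref", NS)]


if __name__ == "__main__":
    for href, linear in spine_files(OPF_SAMPLE):
        print(f"{href}  linear={linear}")
```

Stylesheets, images and the ncx file appear in the manifest but not in the spine, so they never show up in the result.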

Round IV.IV. Table of contents!

Next, OEBPS/toc.ncx:

[littleye233@lymjrolt test_epub]$ file OEBPS/toc.ncx
OEBPS/toc.ncx: XML 1.0 document, Unicode text, UTF-8 text

Discussing the file type hardly seems to matter any more. Let's look at the contents again:

Full contents of `OEBPS/toc.ncx` (formatted):
<?xml version="1.0" encoding="utf-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="zh">
<head>
<meta content="3899198450" name="dtb:uid"/>
<meta content="2" name="dtb:depth"/>
<meta content="mobiunpack.py" name="dtb:generator"/>
<meta content="0" name="dtb:totalPageCount"/>
<meta content="0" name="dtb:maxPageNumber"/>
</head>
<docTitle>
<text>咖啡馆推理事件簿系列(全四本)</text>
</docTitle>
<navMap>
<navPoint id="np_1" playOrder="1">
<navLabel>
<text>总目录</text>
</navLabel>
<content src="Text/part0000.xhtml"/>
</navPoint>
<navPoint id="np_2" playOrder="2">
<navLabel>
<text>咖啡馆推理事件簿:下次见面时,请让我品尝你煮的咖啡</text>
</navLabel>
<content src="Text/part0001.xhtml"/>
<navPoint id="np_3" playOrder="3">
<navLabel>
<text>序章</text>
</navLabel>
<content src="Text/part0005.xhtml"/>
</navPoint>
<navPoint id="np_4" playOrder="4">
<navLabel>
<text>一 事件始于第二次光顾</text>
</navLabel>
<content src="Text/part0006.xhtml"/>
</navPoint>
<navPoint id="np_5" playOrder="5">
<navLabel>
<text>二 Bittersweet Black</text>
</navLabel>
<content src="Text/part0007.xhtml"/>
</navPoint>
<navPoint id="np_6" playOrder="6">
<navLabel>
<text>三 隐藏在乳白色中的心</text>
</navLabel>
<content src="Text/part0008.xhtml"/>
</navPoint>
<navPoint id="np_7" playOrder="7">
<navLabel>
<text>四 棋盘上的狩猎</text>
</navLabel>
<content src="Text/part0009.xhtml"/>
</navPoint>
<navPoint id="np_8" playOrder="8">
<navLabel>
<text>五 past,present,f******?</text>
</navLabel>
<content src="Text/part0010.xhtml"/>
</navPoint>
<navPoint id="np_9" playOrder="9">
<navLabel>
<text>六 Animals in the closed room</text>
</navLabel>
<content src="Text/part0011.xhtml"/>
</navPoint>
<navPoint id="np_10" playOrder="10">
<navLabel>
<text>七 下次见面时,请让我品尝你煮的咖啡</text>
</navLabel>
<content src="Text/part0012.xhtml"/>
</navPoint>
<navPoint id="np_11" playOrder="11">
<navLabel>
<text>终章</text>
</navLabel>
<content src="Text/part0013.xhtml"/>
</navPoint>
</navPoint>
<navPoint id="np_12" playOrder="12">
<navLabel>
<text>咖啡馆推理事件簿2:她梦到了欧蕾咖啡</text>
</navLabel>
<content src="Text/part0014.xhtml"/>
<navPoint id="np_13" playOrder="13">
<navLabel>
<text>序曲 她的梦</text>
</navLabel>
<content src="Text/part0018.xhtml"/>
</navPoint>
<navPoint id="np_14" playOrder="14">
<navLabel>
<text>第一章 敬启致未来的你</text>
</navLabel>
<content src="Text/part0019.xhtml"/>
</navPoint>
<navPoint id="np_15" playOrder="15">
<navLabel>
<text>第二章 狐狸的迷惑</text>
</navLabel>
<content src="Text/part0020.xhtml"/>
</navPoint>
<navPoint id="np_16" playOrder="16">
<navLabel>
<text>第三章 打碎乳白色的心</text>
</navLabel>
<content src="Text/part0021.xhtml"/>
</navPoint>
<navPoint id="np_17" playOrder="17">
<navLabel>
<text>第四章 咖啡侦探蕾拉事件簿</text>
</navLabel>
<content src="Text/part0022.xhtml"/>
</navPoint>
<navPoint id="np_18" playOrder="18">
<navLabel>
<text>第五章 (She Wanted To Be)WANTED</text>
</navLabel>
<content src="Text/part0023.xhtml"/>
</navPoint>
<navPoint id="np_19" playOrder="19">
<navLabel>
<text>第六章 the Sky Occluded in the Sun</text>
</navLabel>
<content src="Text/part0024.xhtml"/>
</navPoint>
<navPoint id="np_20" playOrder="20">
<navLabel>
<text>第七章 在星空之下同命相连</text>
</navLabel>
<content src="Text/part0025.xhtml"/>
</navPoint>
<navPoint id="np_21" playOrder="21">
<navLabel>
<text>终章 她梦到了欧蕾咖啡</text>
</navLabel>
<content src="Text/part0026.xhtml"/>
</navPoint>
</navPoint>
<navPoint id="np_22" playOrder="22">
<navLabel>
<text>咖啡馆推理事件簿3:扰人心神的咖啡</text>
</navLabel>
<content src="Text/part0027.xhtml"/>
<navPoint id="np_23" playOrder="23">
<navLabel>
<text>序曲 五年前</text>
</navLabel>
<content src="Text/part0031.xhtml"/>
</navPoint>
<navPoint id="np_24" playOrder="24">
<navLabel>
<text>第一章 参加大赛</text>
</navLabel>
<content src="Text/part0032.xhtml"/>
</navPoint>
<navPoint id="np_25" playOrder="25">
<navLabel>
<text>第二章 前夜</text>
</navLabel>
<content src="Text/part0033.xhtml"/>
</navPoint>
<navPoint id="np_26" playOrder="26">
<navLabel>
<text>第三章 第一天</text>
</navLabel>
<content src="Text/part0034.xhtml"/>
</navPoint>
<navPoint id="np_27" playOrder="27">
<navLabel>
<text>第四章 第二天</text>
</navLabel>
<content src="Text/part0035.xhtml"/>
</navPoint>
<navPoint id="np_28" playOrder="28">
<navLabel>
<text>第五章 真相</text>
</navLabel>
<content src="Text/part0036.xhtml"/>
</navPoint>
<navPoint id="np_29" playOrder="29">
<navLabel>
<text>第六章 日后</text>
</navLabel>
<content src="Text/part0037.xhtml"/>
</navPoint>
<navPoint id="np_30" playOrder="30">
<navLabel>
<text>尾声 五年前</text>
</navLabel>
<content src="Text/part0038.xhtml"/>
</navPoint>
</navPoint>
<navPoint id="np_31" playOrder="31">
<navLabel>
<text>咖啡馆推理事件簿4:休闲时光的五种风味</text>
</navLabel>
<content src="Text/part0039.xhtml"/>
<navPoint id="np_32" playOrder="32">
<navLabel>
<text>午后三点前的无聊风景</text>
</navLabel>
<content src="Text/part0043.xhtml"/>
</navPoint>
<navPoint id="np_33" playOrder="33">
<navLabel>
<text>帕列塔之恋</text>
</navLabel>
<content src="Text/part0044.xhtml"/>
</navPoint>
<navPoint id="np_34" playOrder="34">
<navLabel>
<text>消失的礼物飞镖</text>
</navLabel>
<content src="Text/part0045.xhtml"/>
</navPoint>
<navPoint id="np_35" playOrder="35">
<navLabel>
<text>可视化的原生艺术</text>
</navLabel>
<content src="Text/part0046.xhtml"/>
</navPoint>
<navPoint id="np_36" playOrder="36">
<navLabel>
<text>在塔列兰咖啡馆的庭院里</text>
</navLabel>
<content src="Text/part0047.xhtml"/>
</navPoint>
<navPoint id="np_37" playOrder="37">
<navLabel>
<text>特别篇 如释重负</text>
</navLabel>
<content src="Text/part0048.xhtml"/>
</navPoint>
</navPoint>
</navMap>
</ncx>

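Let's turn our attention to the more important part, the definition of the "table of contents". To make it easier to see, the author will be a little lazy and use the reader that comes with the desktop environment:

Figure 1. The table of contents of `test.epub` as shown in the reader

We can see that the table of contents has two levels, which matches the definition in OEBPS/toc.ncx exactly (the file itself declares dtb:depth as 2). The key attributes are all self-explanatory, so we will not study them further here.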
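The two-level structure can also be recovered without a reader. The Python sketch below walks the nested navPoint elements recursively and records each entry's depth, label and target. The XML is a shortened excerpt of the toc.ncx above (four of the 37 navPoints), and walk and outline are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

NCX_SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="zh">
  <navMap>
    <navPoint id="np_1" playOrder="1">
      <navLabel><text>总目录</text></navLabel>
      <content src="Text/part0000.xhtml"/>
    </navPoint>
    <navPoint id="np_2" playOrder="2">
      <navLabel><text>咖啡馆推理事件簿:下次见面时,请让我品尝你煮的咖啡</text></navLabel>
      <content src="Text/part0001.xhtml"/>
      <navPoint id="np_3" playOrder="3">
        <navLabel><text>序章</text></navLabel>
        <content src="Text/part0005.xhtml"/>
      </navPoint>
      <navPoint id="np_4" playOrder="4">
        <navLabel><text>一 事件始于第二次光顾</text></navLabel>
        <content src="Text/part0006.xhtml"/>
      </navPoint>
    </navPoint>
  </navMap>
</ncx>
""".encode("utf-8")

NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def walk(nav_point, depth=1):
    """Yield (depth, label, src) for a navPoint and everything nested in it."""
    label = nav_point.findtext("ncx:navLabel/ncx:text", default="", namespaces=NS)
    src = nav_point.find("ncx:content", NS).get("src")
    yield depth, label, src
    for child in nav_point.findall("ncx:navPoint", NS):
        yield from walk(child, depth + 1)


def outline(ncx_xml):
    """Flatten the navMap into a list of (depth, label, src) tuples."""
    root = ET.fromstring(ncx_xml)
    entries = []
    for top in root.findall("ncx:navMap/ncx:navPoint", NS):
        entries.extend(walk(top))
    return entries


if __name__ == "__main__":
    for depth, label, src in outline(NCX_SAMPLE):
        print("  " * (depth - 1) + f"{label} -> {src}")
```

The deepest level reached in this excerpt is 2, in line with the dtb:depth value in the file's head.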

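Round IV.V. Everything else

What remains is the images, the text and the cascading style sheets. Although this part takes up the largest share of the ePub file and is arguably its most important part, its content is so straightforward that going any further would turn into a crash course in HTML and CSS, so we skip it as well.

Summary

Based on the brief exploration above, ePub is a file format that is essentially a ZIP archive: it uses XML for its configuration files and bundles data such as images and text. Consulting the relevant references afterwards confirms that the real structure is close to what we worked out above.

Through this analysis we have also had a first taste of the patterns and techniques for analyzing an unfamiliar file format, which can be reused when exploring more complex formats later on.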






Finally, don't forget to rename the ePub file back XD:

[littleye233@lymjrolt Downloads]$ mv test.epub '咖啡馆推理事件簿系列(全四本).epub'

[The End]

Footnotes

[^1]: This is XPath syntax, used to describe the location of elements in XML-like documents; later occurrences will not be noted again.

https://blog.tamako.work/techdev/format/epub/

Posted on 2021-08-22

Updated on 2022-01-22
