Documentation

2024-11-12 01:50:52 +01:00 · 2022-06-05 18:16:51 +02:00 · 2022-06-05 18:16:51 +02:00 · 10859be356
commit 10859be356
parent 187711fa29
7 changed files with 102 additions and 45 deletions
--- a/4
+++ b/4
@@ -1,3 +1,7 @@
+mkdocs-material-8.3.2+insiders-4.17.2 (2022-06-05)
+
+  * Added support for custom jieba dictionaries (Chinese search)
+
 mkdocs-material-8.3.2+insiders-4.17.1 (2022-06-05)

  * Added support for cookie consent reject button
--- a/docs/blog/2021/search-better-faster-smaller.md
+++ b/docs/blog/2021/search-better-faster-smaller.md
@@ -197,8 +197,8 @@ the following steps are taken:
    remain. Linking is necessary, as search results are grouped by page.

 2.  __Tokenization__: The `title` and `text` values of each section are split
-    into tokens by using the [separator] as configured in `mkdocs.yml`.
-    Tokenization itself is carried out by
+    into tokens by using the [`separator`][separator] as configured in
+    `mkdocs.yml`. Tokenization itself is carried out by
    [lunr's default tokenizer][default tokenizer], which doesn't allow for
    lookahead or separators spanning multiple characters.

@@ -216,7 +216,7 @@ more magic involved, e.g., search results are [post-processed] and [rescored] to
 account for some shortcomings of [lunr], but in general, this is how data gets
 into and out of the index.

-  [separator]: ../../setup/setting-up-site-search.md#separator
+  [separator]: ../../setup/setting-up-site-search.md#search-separator
  [default tokenizer]: https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L413-L456
  [post-processed]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L249-L272
  [rescored]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L274-L275
@@ -421,9 +421,9 @@ On to the next step in the process: __tokenization__.
 ### Tokenizer lookahead

 The [default tokenizer] of [lunr] uses a regular expression to split a given
-string by matching each character against the [separator] as defined in
-`mkdocs.yml`. This doesn't allow for more complex separators based on
-lookahead or multiple characters.
+string by matching each character against the [`separator`][separator] as
+defined in `mkdocs.yml`. This doesn't allow for more complex separators based
+on lookahead or multiple characters.

 Fortunately, __our new search implementation provides an advanced tokenizer__
 that doesn't have these shortcomings and supports more complex regular
@@ -439,14 +439,14 @@ characters at which the string should be split, the following three sections
 explain the remainder of the regular expression.[^4]

  [^4]:
-    As a fun fact: the [separator default value] of the search plugin being
-    `[\s\-]+` always has been kind of irritating, as it suggests that multiple
-    characters can be considered being a separator. However, the `+` is
-    completely irrelevant, as regular expression groups involving multiple
-    characters were never supported by
+    As a fun fact: the [`separator`][separator] [default value] of the search
+    plugin being `[\s\-]+` has always been somewhat irritating, as it suggests
+    that multiple characters can be considered a separator. However, the `+`
+    is completely irrelevant, as regular expression groups involving
+    multiple characters were never supported by
    [lunr's default tokenizer][default tokenizer].

-  [separator default value]: https://www.mkdocs.org/user-guide/configuration/#separator
+  [default value]: https://www.mkdocs.org/user-guide/configuration/#separator
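
To illustrate the footnote's point, here is a minimal Python sketch of a per-character tokenizer. It is a simplified model written for this note, not lunr's actual code, and the function name `tokenize_per_char` is hypothetical. Because each character is tested against the separator on its own, a quantifier such as `+` changes nothing, and a separator spanning several characters can never match.

```python
import re


def tokenize_per_char(text: str, separator: str) -> list[str]:
    """Split text by testing one character at a time against the separator.

    Simplified model of a per-character tokenizer (an assumption for
    illustration, not lunr's implementation).
    """
    pattern = re.compile(separator)
    tokens = []
    start = 0
    for index, char in enumerate(text):
        # Only a single character is ever visible to the pattern.
        if pattern.fullmatch(char):
            if index > start:
                tokens.append(text[start:index])
            start = index + 1
    # Flush the trailing token, if any.
    if start < len(text):
        tokens.append(text[start:])
    return tokens


print(tokenize_per_char("better - faster", r"[\s\-]+"))
print(tokenize_per_char("better - faster", r"[\s\-]"))
print(tokenize_per_char("a::b", r"::"))
```

In this model, the first two calls yield the same tokens with and without the `+`, while the two-character separator `::` leaves `a::b` unsplit.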

 #### Case changes

--- a/docs/blog/2022/chinese-search-support.md
+++ b/docs/blog/2022/chinese-search-support.md
@@ -32,7 +32,7 @@ number of Chinese users.__
 ---

 After the United States and Germany, the third-largest country of origin of
-Material for MkDocs users is China. For a long time, the built-in search plugin
+Material for MkDocs users is China. For a long time, the [built-in search plugin]
 didn't allow for proper segmentation of Chinese characters, mainly due to 
 missing support in [lunr-languages] which is used for search tokenization and
 stemming. The latest Insiders release adds long-awaited Chinese language support
@@ -58,10 +58,11 @@ through the segmenter. You can install [jieba] with:
 pip install jieba
 ```

-The next step is only required if you specified the [separator] configuration
-in `mkdocs.yml`. Text is segmented with [zero-width whitespace] characters, so
-it renders exactly the same in the search modal. Adjust `mkdocs.yml` so that
-the [separator] includes the `\u200b` character:
+The next step is only required if you specified the [`separator`][separator] 
+configuration in `mkdocs.yml`. Text is segmented with [zero-width whitespace] 
+characters, so it renders exactly the same in the search modal. Adjust
+`mkdocs.yml` so that the [`separator`][separator] includes the `\u200b`
+character:

 ``` yaml
 plugins:
--- a/docs/blog/index.md
+++ b/docs/blog/index.md
@@ -33,11 +33,12 @@ number of Chinese users.__
 ---

 After the United States and Germany, the third-largest country of origin of
-Material for MkDocs users is China. For a long time, the built-in search plugin
+Material for MkDocs users is China. For a long time, the [built-in search plugin]
 didn't allow for proper segmentation of Chinese characters, mainly due to 
-missing support in [lunr-languages] which is used for search tokenization and 
-stemming. The latest Insiders release adds long-awaited Chinese language support 
-for the built-in search plugin, something that has been requested by many users.
+missing support in [`lunr-languages`][lunr-languages] which is used for search 
+tokenization and stemming. The latest Insiders release adds long-awaited Chinese 
+language support for the built-in search plugin, something that has been
+requested by many users.

  [:octicons-arrow-right-24: Continue reading][Chinese search support – 中文搜索支持]

--- a/docs/insiders/changelog.md
+++ b/docs/insiders/changelog.md
@@ -6,6 +6,10 @@ template: overrides/main.html

 ## Material for MkDocs Insiders

+### 4.17.2 <small>_ June 5, 2022</small> { id="4.17.2" }
+
+- Added support for custom jieba dictionaries (Chinese search)
+
 ### 4.17.1 <small>_ June 5, 2022</small> { id="4.17.1" }

 - Added support for cookie consent reject button
--- a/docs/setup/ensuring-data-privacy.md
+++ b/docs/setup/ensuring-data-privacy.md
@@ -104,15 +104,15 @@ The following properties are available:

 :   [:octicons-tag-24: insiders-4.17.1][Insiders] · :octicons-milestone-24: 
    Default: `[accept, manage]` – This property defines which buttons are shown
-    and in which order, e.g. to allow the user to manage settings and accept
-    the cookie:
+    and in which order, e.g. to allow the user to accept cookies and manage
+    settings:

    ``` yaml
    extra:
      consent:
        actions:
-          - manage
          - accept
+          - manage
    ```

    The cookie consent form includes three types of buttons:
--- a/docs/setup/setting-up-site-search.md
+++ b/docs/setup/setting-up-site-search.md
@@ -92,12 +92,6 @@ The following configuration options are supported:
    part of this list by automatically falling back to the stemmer yielding the
    best result.

-    !!! tip "Chinese search support – 中文搜索支持"
-
-        Material for MkDocs recently added __experimental language support for 
-        Chinese__ as part of [Insiders]. [Read the blog article][chinese search]
-        to learn how to set up search for Chinese in a matter of minutes.
-
 `separator`{ #search-separator }

 :   :octicons-milestone-24: Default: _automatically set_ – The separator for
@@ -112,10 +106,9 @@ The following configuration options are supported:
    ```

    1.  Tokenization itself is carried out by [lunr's default tokenizer], which 
-        doesn't allow for lookahead or separators spanning multiple characters.
-
-        For more finegrained control over the tokenization process, see the
-        section on [tokenizer lookahead].
+        doesn't allow for lookahead or multi-character separators. For more
+        fine-grained control over the tokenization process, see the section
+        on [tokenizer lookahead].

 <div class="mdx-deprecated" markdown>

@@ -142,14 +135,9 @@ The following configuration options are supported:

 </div>

-The other configuration options of this plugin are not officially supported
-by Material for MkDocs, which is why they may yield unexpected results. Use
-them at your own risk.
-
  [search support]: https://github.com/squidfunk/mkdocs-material/releases/tag/0.1.0
  [lunr]: https://lunrjs.com
  [lunr-languages]: https://github.com/MihaiValentin/lunr-languages
-  [chinese search]: ../blog/2022/chinese-search-support.md
  [lunr's default tokenizer]: https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L413-L456
  [site language]: changing-the-language.md#site-language
  [tokenizer lookahead]: #tokenizer-lookahead
@@ -157,13 +145,72 @@ them at your own risk.
  [prebuilt index]: https://www.mkdocs.org/user-guide/configuration/#prebuild_index
  [50% smaller]: ../blog/2021/search-better-faster-smaller.md#benchmarks

+#### Chinese language support
+
+[:octicons-heart-fill-24:{ .mdx-heart } Sponsors only][Insiders]{ .mdx-insiders } ·
+[:octicons-tag-24: insiders-4.14.0][Insiders] ·
+:octicons-beaker-24: Experimental
+
+[Insiders] adds search support for the Chinese language (see our [blog article]
+[chinese search] from May 2022) by integrating with the text segmentation
+library [jieba], which can be installed with `pip`.
+
+``` sh
+pip install jieba
+```
+
+If [jieba] is installed, the [built-in search plugin] automatically detects
+Chinese characters and runs them through the segmenter. The following
+configuration options are available:
+
+`jieba_dict`{ #jieba-dict }
+
+:   [:octicons-tag-24: insiders-4.17.2][Insiders] · :octicons-milestone-24:
+    Default: _none_ – This option allows for specifying a [custom dictionary]
+    to be used by [jieba] for segmenting text, replacing the default dictionary:
+
+    ``` yaml
+    plugins:
+      - search:
+          jieba_dict: dict.txt # (1)!
+    ```
+
+    1.  The following alternative dictionaries are provided by [jieba]:
+
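+        - [dict.txt.small] – 占用内存较小的词典文件 (dictionary file with a smaller memory footprint)
+        - [dict.txt.big] – 支持繁体分词更好的词典文件 (dictionary file with better support for Traditional Chinese segmentation)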
+
+`jieba_dict_user`{ #jieba-dict-user }
+
+:   [:octicons-tag-24: insiders-4.17.2][Insiders] · :octicons-milestone-24:
+    Default: _none_ – This option allows for specifying an additional
+    [user dictionary] to be used by [jieba] for segmenting text, augmenting the
+    default dictionary:
+
+    ``` yaml
+    plugins:
+      - search:
+          jieba_dict_user: user_dict.txt
+    ```
+
+    User dictionaries can be used for tuning the segmenter to preserve
+    technical terms.
+
+  [chinese search]: ../blog/2022/chinese-search-support.md
+  [jieba]: https://pypi.org/project/jieba/
+  [built-in search plugin]: #built-in-search-plugin
+  [custom dictionary]: https://github.com/fxsjy/jieba#%E5%85%B6%E4%BB%96%E8%AF%8D%E5%85%B8
+  [dict.txt.small]: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
+  [dict.txt.big]: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
+  [user dictionary]: https://github.com/fxsjy/jieba#%E8%BD%BD%E5%85%A5%E8%AF%8D%E5%85%B8
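
As a rough illustration of what `user_dict.txt` could contain: jieba's README describes user dictionaries as UTF-8 text files with one entry per line, consisting of the word, an optional frequency, and an optional part-of-speech tag, separated by spaces and in that order. The sketch below writes such a file. The helper name `format_user_dict` and the sample entries are hypothetical, and how the search plugin hands the file to jieba is not shown in this commit.

```python
from pathlib import Path
import tempfile


def format_user_dict(entries):
    """Render (word, frequency, tag) tuples as user dictionary lines.

    Frequency and tag may be None; the order is word, frequency, tag.
    """
    lines = []
    for word, frequency, tag in entries:
        parts = [word]
        if frequency is not None:
            parts.append(str(frequency))
        if tag is not None:
            parts.append(tag)
        lines.append(" ".join(parts))
    return "\n".join(lines) + "\n"


# Hypothetical technical terms that should stay in one piece.
entries = [
    ("云计算", 5, None),             # "cloud computing" with a frequency hint
    ("机器学习", None, "n"),         # "machine learning" tagged as a noun
    ("静态站点生成器", None, None),  # "static site generator", word only
]

# Written to the temp directory here; a real file would live in the project.
path = Path(tempfile.gettempdir()) / "user_dict.txt"
path.write_text(format_user_dict(entries), encoding="utf-8")
print(path.read_text(encoding="utf-8"))
```

A file like this could then be referenced through the `jieba_dict_user` option shown above, assuming the path is resolved relative to where the build runs.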
+
 ### Rich search previews

 [:octicons-heart-fill-24:{ .mdx-heart } Sponsors only][Insiders]{ .mdx-insiders } ·
 [:octicons-tag-24: insiders-3.0.0][Insiders] ·
 :octicons-beaker-24: Experimental

-Insiders ships rich search previews as part of the [new search plugin], which
+[Insiders] ships rich search previews as part of the [new search plugin], which
 will render code blocks directly in the search result, and highlight all
 occurrences inside those blocks:

@@ -186,7 +233,7 @@ occurrences inside those blocks:
 [:octicons-tag-24: insiders-3.0.0][Insiders] ·
 :octicons-beaker-24: Experimental

-Insiders allows for more complex configurations of the [`separator`][separator] 
+[Insiders] allows for more complex configurations of the [`separator`][separator] 
 setting as part of the [new search plugin], yielding more influence on the way 
 documents are tokenized: