diff --git a/CHANGELOG b/CHANGELOG
index 589278135..d17a2042d 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,7 @@
+mkdocs-material-8.3.2+insiders-4.17.2 (2022-06-05)
+
+  * Added support for custom jieba dictionaries (Chinese search)
+
 mkdocs-material-8.3.2+insiders-4.17.1 (2022-06-05)
 
   * Added support for cookie consent reject button
diff --git a/docs/blog/2021/search-better-faster-smaller.md b/docs/blog/2021/search-better-faster-smaller.md
index b5e4b430d..be2315e21 100644
--- a/docs/blog/2021/search-better-faster-smaller.md
+++ b/docs/blog/2021/search-better-faster-smaller.md
@@ -197,8 +197,8 @@ the following steps are taken:
    remain. Linking is necessary, as search results are grouped by page.
 
 2. __Tokenization__: The `title` and `text` values of each section are split
-   into tokens by using the [separator] as configured in `mkdocs.yml`.
-   Tokenization itself is carried out by
+   into tokens by using the [`separator`][separator] as configured in
+   `mkdocs.yml`. Tokenization itself is carried out by
    [lunr's default tokenizer][default tokenizer], which doesn't allow for
    lookahead or separators spanning multiple characters.
@@ -216,7 +216,7 @@ more magic involved, e.g., search results are [post-processed] and [rescored]
 to account for some shortcomings of [lunr], but in general, this is how data
 gets into and out of the index.
 
-  [separator]: ../../setup/setting-up-site-search.md#separator
+  [separator]: ../../setup/setting-up-site-search.md#search-separator
   [default tokenizer]: https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L413-L456
   [post-processed]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L249-L272
   [rescored]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L274-L275
@@ -421,9 +421,9 @@ On to the next step in the process: __tokenization__.
 ### Tokenizer lookahead
 
 The [default tokenizer] of [lunr] uses a regular expression to split a given
-string by matching each character against the [separator] as defined in
-`mkdocs.yml`. This doesn't allow for more complex separators based on
-lookahead or multiple characters.
+string by matching each character against the [`separator`][separator] as
+defined in `mkdocs.yml`. This doesn't allow for more complex separators based
+on lookahead or multiple characters.
 
 Fortunately, __our new search implementation provides an advanced tokenizer__
 that doesn't have these shortcomings and supports more complex regular
@@ -439,14 +439,14 @@ characters at which the string should be split, the following three sections
 explain the remainder of the regular expression.[^4]
 
 [^4]:
-    As a fun fact: the [separator default value] of the search plugin being
-    `[\s\-]+` always has been kind of irritating, as it suggests that multiple
-    characters can be considered being a separator. However, the `+` is
-    completely irrelevant, as regular expression groups involving multiple
-    characters were never supported by
+    As a fun fact: the [`separator`][separator] [default value] of the search
+    plugin being `[\s\-]+` always has been kind of irritating, as it suggests
+    that multiple characters can be considered being a separator. However, the
+    `+` is completely irrelevant, as regular expression groups involving
+    multiple characters were never supported by
     [lunr's default tokenizer][default tokenizer].
 
-    [separator default value]: https://www.mkdocs.org/user-guide/configuration/#separator
+    [default value]: https://www.mkdocs.org/user-guide/configuration/#separator
 
 #### Case changes
diff --git a/docs/blog/2022/chinese-search-support.md b/docs/blog/2022/chinese-search-support.md
index 657d98abb..243d09209 100644
--- a/docs/blog/2022/chinese-search-support.md
+++ b/docs/blog/2022/chinese-search-support.md
@@ -32,10 +32,10 @@ number of Chinese users.__
 ---
 
 After the United States and Germany, the third-largest country of origin of
-Material for MkDocs users is China. For a long time, the built-in search plugin
+Material for MkDocs users is China. For a long time, the [built-in search plugin]
 didn't allow for proper segmentation of Chinese characters, mainly due to
-missing support in [lunr-languages] which is used for search tokenization and
-stemming. The latest Insiders release adds long-awaited Chinese language support
+missing support in [lunr-languages] which is used for search tokenization and
+stemming. The latest Insiders release adds long-awaited Chinese language support
 for the built-in search plugin, something that has been requested by many users.
 
 _Material for MkDocs終於​支持​中文​了!文本​被​正確​分割​並且​更​容易​找到。_
@@ -50,18 +50,19 @@ search plugin in a few minutes._
 ## Configuration
 
 Chinese language support for Material for MkDocs is provided by [jieba], an
-excellent Chinese text segmentation library. If [jieba] is installed, the
-built-in search plugin automatically detects Chinese characters and runs them
+excellent Chinese text segmentation library. If [jieba] is installed, the
+built-in search plugin automatically detects Chinese characters and runs them
 through the segmenter. You can install [jieba] with:
 
 ```
 pip install jieba
 ```
 
-The next step is only required if you specified the [separator] configuration
-in `mkdocs.yml`. Text is segmented with [zero-width whitespace] characters, so
-it renders exactly the same in the search modal. Adjust `mkdocs.yml` so that
-the [separator] includes the `\u200b` character:
+The next step is only required if you specified the [`separator`][separator]
+configuration in `mkdocs.yml`. Text is segmented with [zero-width whitespace]
+characters, so it renders exactly the same in the search modal. Adjust
+`mkdocs.yml` so that the [`separator`][separator] includes the `\u200b`
+character:
 
 ``` yaml
 plugins:
diff --git a/docs/blog/index.md b/docs/blog/index.md
index af2d39a5c..7330c027f 100644
--- a/docs/blog/index.md
+++ b/docs/blog/index.md
@@ -33,11 +33,12 @@ number of Chinese users.__
 ---
 
 After the United States and Germany, the third-largest country of origin of
-Material for MkDocs users is China. For a long time, the built-in search plugin
+Material for MkDocs users is China. For a long time, the [built-in search plugin]
 didn't allow for proper segmentation of Chinese characters, mainly due to
-missing support in [lunr-languages] which is used for search tokenization and
-stemming. The latest Insiders release adds long-awaited Chinese language support
-for the built-in search plugin, something that has been requested by many users.
+missing support in [`lunr-languages`][lunr-languages] which is used for search
+tokenization and stemming. The latest Insiders release adds long-awaited Chinese
+language support for the built-in search plugin, something that has been
+requested by many users.
 
 [:octicons-arrow-right-24: Continue reading][Chinese search support – 中文搜索​支持]
diff --git a/docs/insiders/changelog.md b/docs/insiders/changelog.md
index 0e514d50a..a03132747 100644
--- a/docs/insiders/changelog.md
+++ b/docs/insiders/changelog.md
@@ -6,6 +6,10 @@ template: overrides/main.html
 
 ## Material for MkDocs Insiders
 
+### 4.17.2 _ June 5, 2022 { id="4.17.2" }
+
+- Added support for custom jieba dictionaries (Chinese search)
+
 ### 4.17.1 _ June 5, 2022 { id="4.17.1" }
 
 - Added support for cookie consent reject button
diff --git a/docs/setup/ensuring-data-privacy.md b/docs/setup/ensuring-data-privacy.md
index 049ac3e83..a044b337c 100644
--- a/docs/setup/ensuring-data-privacy.md
+++ b/docs/setup/ensuring-data-privacy.md
@@ -104,15 +104,15 @@ The following properties are available:
 :   [:octicons-tag-24: insiders-4.17.1][Insiders] · :octicons-milestone-24:
     Default: `[accept, manage]` – This property defines which buttons are shown
-    and in which order, e.g. to allow the user to manage settings and accept
-    the cookie:
+    and in which order, e.g. to allow the user to accept cookies and manage
+    settings:
 
     ``` yaml
     extra:
       consent:
         actions:
-          - manage
           - accept
+          - manage
     ```
 
 The cookie consent form includes three types of buttons:
diff --git a/docs/setup/setting-up-site-search.md b/docs/setup/setting-up-site-search.md
index aeadf8479..28a6b2ab1 100644
--- a/docs/setup/setting-up-site-search.md
+++ b/docs/setup/setting-up-site-search.md
@@ -92,12 +92,6 @@ The following configuration options are supported:
     part of this list by automatically falling back to the stemmer yielding
     the best result.
 
-    !!! tip "Chinese search support – 中文搜索​支持"
-
-        Material for MkDocs recently added __experimental language support for
-        Chinese__ as part of [Insiders]. [Read the blog article][chinese search]
-        to learn how to set up search for Chinese in a matter of minutes.
-
 `separator`{ #search-separator }
 
 :   :octicons-milestone-24: Default: _automatically set_ – The separator for
@@ -112,10 +106,9 @@ The following configuration options are supported:
     ```
 
     1.  Tokenization itself is carried out by [lunr's default tokenizer], which
-        doesn't allow for lookahead or separators spanning multiple characters.
-
-        For more finegrained control over the tokenization process, see the
-        section on [tokenizer lookahead].
+        doesn't allow for lookahead or multi-character separators. For more
+        finegrained control over the tokenization process, see the section on
+        [tokenizer lookahead].
@@ -142,14 +135,9 @@ The following configuration options are supported:
 
-The other configuration options of this plugin are not officially supported
-by Material for MkDocs, which is why they may yield unexpected results. Use
-them at your own risk.
-
   [search support]: https://github.com/squidfunk/mkdocs-material/releases/tag/0.1.0
   [lunr]: https://lunrjs.com
   [lunr-languages]: https://github.com/MihaiValentin/lunr-languages
-  [chinese search]: ../blog/2022/chinese-search-support.md
   [lunr's default tokenizer]: https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L413-L456
   [site language]: changing-the-language.md#site-language
   [tokenizer lookahead]: #tokenizer-lookahead
@@ -157,13 +145,72 @@ them at your own risk.
   [prebuilt index]: https://www.mkdocs.org/user-guide/configuration/#prebuild_index
   [50% smaller]: ../blog/2021/search-better-faster-smaller.md#benchmarks
 
+#### Chinese language support
+
+[:octicons-heart-fill-24:{ .mdx-heart } Sponsors only][Insiders]{ .mdx-insiders } ·
+[:octicons-tag-24: insiders-4.14.0][Insiders] ·
+:octicons-beaker-24: Experimental
+
+[Insiders] adds search support for the Chinese language (see our [blog article]
+[chinese search] from May 2022) by integrating with the text segmentation
+library [jieba], which can be installed with `pip`.
+
+``` sh
+pip install jieba
+```
+
+If [jieba] is installed, the [built-in search plugin] automatically detects
+Chinese characters and runs them through the segmenter. The following
+configuration options are available:
+
+`jieba_dict`{ #jieba-dict }
+
+:   [:octicons-tag-24: insiders-4.17.2][Insiders] · :octicons-milestone-24:
+    Default: _none_ – This option allows for specifying a [custom dictionary]
+    to be used by [jieba] for segmenting text, replacing the default dictionary:
+
+    ``` yaml
+    plugins:
+      - search:
+          jieba_dict: dict.txt # (1)!
+    ```
+
+    1.  The following alternative dictionaries are provided by [jieba]:
+
+        - [dict.txt.small] – a smaller dictionary file with a reduced memory footprint
+        - [dict.txt.big] – a dictionary file with better support for Traditional Chinese segmentation
+
+`jieba_dict_user`{ #jieba-dict-user }
+
+:   [:octicons-tag-24: insiders-4.17.2][Insiders] · :octicons-milestone-24:
+    Default: _none_ – This option allows for specifying an additional
+    [user dictionary] to be used by [jieba] for segmenting text, augmenting the
+    default dictionary:
+
+    ``` yaml
+    plugins:
+      - search:
+          jieba_dict_user: user_dict.txt
+    ```
+
+    User dictionaries can be used for tuning the segmenter to preserve
+    technical terms.
+
+  [chinese search]: ../blog/2022/chinese-search-support.md
+  [jieba]: https://pypi.org/project/jieba/
+  [built-in search plugin]: #built-in-search-plugin
+  [custom dictionary]: https://github.com/fxsjy/jieba#%E5%85%B6%E4%BB%96%E8%AF%8D%E5%85%B8
+  [dict.txt.small]: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
+  [dict.txt.big]: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
+  [user dictionary]: https://github.com/fxsjy/jieba#%E8%BD%BD%E5%85%A5%E8%AF%8D%E5%85%B8
+
 ### Rich search previews
 
 [:octicons-heart-fill-24:{ .mdx-heart } Sponsors only][Insiders]{ .mdx-insiders } ·
 [:octicons-tag-24: insiders-3.0.0][Insiders] ·
 :octicons-beaker-24: Experimental
 
-Insiders ships rich search previews as part of the [new search plugin], which
+[Insiders] ships rich search previews as part of the [new search plugin], which
 will render code blocks directly in the search result, and highlight all
 occurrences inside those blocks:
@@ -186,7 +233,7 @@ occurrences inside those blocks:
 [:octicons-tag-24: insiders-3.0.0][Insiders] ·
 :octicons-beaker-24: Experimental
 
-Insiders allows for more complex configurations of the [`separator`][separator]
+[Insiders] allows for more complex configurations of the [`separator`][separator]
 setting as part of the [new search plugin], yielding more influence on the way
 documents are tokenized:
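The separator changes in the patch above all hinge on one mechanic: jieba inserts zero-width spaces (`\u200b`) between Chinese tokens, which render invisibly but give the search `separator` explicit boundaries to split on. A minimal sketch of that mechanic in plain Python, assuming a separator character class of the `[\s\u200b\-]` shape implied by the docs (the sample text is made up for illustration):

```python
import re

# Assumed separator pattern: the plugin's default class `[\s\-]`
# extended with the zero-width space, as the blog article suggests
# for mkdocs.yml. This is an illustration, not the plugin's code.
separator = re.compile(r"[\s\u200b\-]")

# Segmented text looks identical to the unsegmented original when
# rendered, because \u200b has no visible width, yet it now carries
# explicit token boundaries for Chinese text.
text = "Material\u200bfor\u200bMkDocs 支持\u200b中文"

# Split at separator characters and drop empty strings, mimicking
# how a tokenizer derives index tokens from the segmented text.
tokens = [token for token in separator.split(text) if token]
print(tokens)  # → ['Material', 'for', 'MkDocs', '支持', '中文']
```

Because the zero-width space is stripped during splitting, the index ends up with clean per-word tokens, while the text shown in the search modal remains visually unchanged.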