---
date: 2021-09-13
authors: [squidfunk]
readtime: 15
description: >
  How we rebuilt client-side search, delivering a better user experience while
  making it faster and smaller at the same time
categories:
  - Search
  - Performance
links:
  - plugins/search.md
  - insiders/how-to-sponsor.md
---

# Search: better, faster, smaller

__This is the story of how we managed to completely rebuild client-side search,
delivering a significantly better user experience while making it faster and
smaller at the same time.__

The [search] of Material for MkDocs is by far one of its best and most-loved
assets: [multilingual], [offline-capable], and most importantly: _all
client-side_. It provides a solution to empower the users of your documentation
to find what they're searching for instantly without the headache of managing
additional servers. However, even though several iterations have been made,
there's still some room for improvement, which is why we rebuilt the search
plugin and integration from the ground up. This article shines some light on the
internals of the new search, why it's much more powerful than the previous
version, and what's about to come.

<!-- more -->

_The next section discusses the architecture and issues of the current search
implementation. If you immediately want to learn what's new, skip to the
[section just after that][what's new]._

  [search]: ../../setup/setting-up-site-search.md
  [multilingual]: ../../plugins/search.md#config.lang
  [offline-capable]: ../../setup/building-for-offline-usage.md
  [what's new]: #whats-new

## Architecture

Material for MkDocs uses [lunr] together with [lunr-languages] to implement
its client-side search capabilities. When a documentation page is loaded and
JavaScript is available, the search index as generated by the
[built-in search plugin] during the build process is requested from the
server:

``` ts
const index$ = document.forms.namedItem("search")
  ? __search?.index || requestJSON<SearchIndex>(
    new URL("search/search_index.json", config.base)
  )
  : NEVER
```

  [lunr]: https://lunrjs.com
  [lunr-languages]: https://github.com/MihaiValentin/lunr-languages
  [built-in search plugin]: ../../plugins/search.md

### Search index

The search index includes a stripped-down version of all pages. Let's take a
look at an example to understand precisely what the search index contains from
the original Markdown file:

??? example "Expand to inspect example"

    === ":octicons-file-code-16: `docs/page.md`"

        ```` markdown
        # Example

        ## Text

        It's very easy to make some words **bold** and other words *italic*
        with Markdown. You can even add [links](#), or even `code`:

        ```
        if (isAwesome) {
          return true
        }
        ```

        ## Lists

        Sometimes you want numbered lists:

        1. One
        2. Two
        3. Three

        Sometimes you want bullet points:

        * Start a line with a star
        * Profit!
        ````

    === ":octicons-codescan-16: `search_index.json`"

        ``` json
        {
          "config": {
            "indexing": "full",
            "lang": [
              "en"
            ],
            "min_search_length": 3,
            "prebuild_index": false,
            "separator": "[\\s\\-]+"
          },
          "docs": [
            {
              "location": "page/",
              "title": "Example",
              "text": "Example Text It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true } Lists Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            },
            {
              "location": "page/#example",
              "title": "Example",
              "text": ""
            },
            {
              "location": "page/#text",
              "title": "Text",
              "text": "It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true }"
            },
            {
              "location": "page/#lists",
              "title": "Lists",
              "text": "Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            }
          ]
        }
        ```

If we inspect the search index, we immediately see several problems:

  1.  __All content is included twice__: the search index contains one entry
      with the entire contents of the page, and one entry for each section of
      the page, i.e., each block preceded by a headline or subheadline. This
      significantly contributes to the size of the search index.

  2.  __All structure is lost__: when the search index is built, all structural
      information like HTML tags and attributes are stripped from the content.
      While this approach works well for paragraphs and inline formatting, it
      might be problematic for lists and code blocks. An excerpt:

    ```
    … links , or even code : if (isAwesome) { … } Lists Sometimes you want …
    ```

    - __Context__: for an untrained eye, the result can look like gibberish, as
      it's not immediately apparent what classifies as text and what as code.
      Furthermore, it's not clear that `Lists` is a headline as it's merged
      with the code block before and the paragraph after it.

    - __Punctuation__: inline elements like links that are immediately followed
      by punctuation are separated by whitespace (see `,` and `:` in the
      excerpt). This is because all extracted text is joined with a whitespace
      character during the construction of the search index.
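
The punctuation artifact is easy to reproduce. The following is a minimal sketch, not the plugin's actual extraction code: it assumes a hypothetical list of text fragments, as an HTML parser might emit them for the sentence from the example, and joins them with a single space.

```ts
// Hypothetical text fragments for the markup
// `You can even add <a href="#">links</a>, or even <code>code</code>:`
const fragments: string[] = [
  "You can even add",
  "links",
  ", or even",
  "code",
  ":"
]

// Joining every fragment with a space separates inline elements from the
// punctuation that immediately follows them
function joinFragments(parts: string[]): string {
  return parts.map(part => part.trim()).join(" ")
}

const joinedText = joinFragments(fragments)
console.log(joinedText)
```

The joined string contains `links ,` and `code :`, which is the same artifact visible in the excerpt.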

It's not difficult to see that it can be quite challenging to implement a good
search experience for theme authors, which is why Material for MkDocs (up to
now) did some [monkey patching] to be able to render slightly more
meaningful search previews.

  [monkey patching]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/document/index.ts#L68-L71

### Search worker

The actual search functionality is implemented as part of a web worker[^1],
which creates and manages the [lunr] search index. When search is initialized,
the following steps are taken:

  [^1]:
    Prior to <!-- md:version 5.0.0 -->, search was carried out in the main
    thread, which locked up the browser, rendering it unusable. This problem was
    first reported in #904 and, after some back and forth, fixed and released in
    <!-- md:version 5.0.0 -->.

1.  __Linking sections with pages__: The search index is parsed, and each
    section is linked to its parent page. The parent page itself is _not
    indexed_, as it would lead to duplicate results, so only the sections
    remain. Linking is necessary, as search results are grouped by page.

2.  __Tokenization__: The `title` and `text` values of each section are split
    into tokens by using the [`separator`][separator] as configured in
    `mkdocs.yml`. Tokenization itself is carried out by
    [lunr's default tokenizer][default tokenizer], which doesn't allow for
    lookahead or separators spanning multiple characters.

    > Why is this important and a big deal? We will see later how much more we
    > can achieve with a tokenizer that is capable of separating strings with
    > lookahead.

3.  __Indexing__: As a final step, each section is indexed. When querying the
    index, if a search query includes one of the tokens as returned by step 2,
    the section is considered to be part of the search result and passed to the
    main thread.

Now, that's basically how the search worker operates. Sure, there's a little
more magic involved, e.g., search results are [post-processed] and [rescored] to
account for some shortcomings of [lunr], but in general, this is how data gets
into and out of the index.

  [separator]: ../../plugins/search.md#config.separator
  [default tokenizer]: https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L413-L456
  [post-processed]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L249-L272
  [rescored]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L274-L275
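
The three steps can be sketched in a few lines. This is a toy model under stated assumptions, not the worker's actual code: `isSection`, `linkSections`, and `buildIndex` are hypothetical helpers, the inverted index is a plain `Map` rather than a lunr index, and the sample texts are abbreviated from the example above.

```ts
// Shape of an entry in search_index.json, as shown in the example above
interface SearchDocument {
  location: string
  title: string
  text: string
}

// Assumption: sections carry an anchor in their location, pages don't
function isSection(doc: SearchDocument): boolean {
  return doc.location.includes("#")
}

// Step 1: group sections by parent page, dropping the page entries
function linkSections(docs: SearchDocument[]): Map<string, SearchDocument[]> {
  const pages = new Map<string, SearchDocument[]>()
  for (const doc of docs) {
    if (!isSection(doc)) continue
    const page = doc.location.slice(0, doc.location.indexOf("#"))
    const sections = pages.get(page) ?? []
    sections.push(doc)
    pages.set(page, sections)
  }
  return pages
}

// Steps 2 and 3: split title and text into lowercase tokens and record
// the sections in which each token occurs
function buildIndex(
  docs: SearchDocument[], separator: RegExp
): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>()
  for (const doc of docs.filter(isSection)) {
    const tokens = `${doc.title} ${doc.text}`.toLowerCase().split(separator)
    for (const token of tokens) {
      if (token === "") continue
      const locations = index.get(token) ?? new Set<string>()
      locations.add(doc.location)
      index.set(token, locations)
    }
  }
  return index
}

const sampleDocs: SearchDocument[] = [
  { location: "page/", title: "Example", text: "" },
  { location: "page/#text", title: "Text", text: "It's very easy" },
  { location: "page/#lists", title: "Lists", text: "Sometimes you want" }
]

const linkedPages = linkSections(sampleDocs)
const tokenIndex = buildIndex(sampleDocs, /[\s\-]+/)
```

Looking up a token in `tokenIndex` returns the locations of all matching sections, which can then be grouped by page using `linkedPages`.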

### Search previews

Users should be able to quickly scan and evaluate the relevance of a search
result in the given context, which is why a concise summary with highlighted
occurrences of the search terms found is an essential part of a great search
experience.

This is where the current search preview generation falls short: some search
previews appear not to include any occurrence of the search terms, because
previews are [truncated after a maximum of 320 characters][truncated], as can
be seen here:

<figure markdown>

![search preview]

  <figcaption markdown>

The first two results look like they're not relevant, as they don't seem to
include the query string the user just searched for. Yet, they are.

  </figcaption>
</figure>

A better solution to this problem has been on the roadmap for a very, very long
time, but in order to solve this once and for all, several factors need to be
carefully considered:

1. __Word boundaries__: some themes[^2] for static site generators generate
   search previews by expanding the text left and right next to an occurrence,
   stopping at a whitespace character when enough words have been consumed. A
   preview might look like this:

    ```
    … channels, e.g., or which can be configured via mkdocs.yml …
    ```

    While this may work for languages that use whitespace as a separator
    between words, it breaks down for languages like Japanese or Chinese[^3],
    as they have non-whitespace word boundaries and use dedicated segmenters to
    split strings into tokens.

  [^2]:
    At the time of writing, [Just the Docs] and [Docusaurus] use this method
    for generating search previews. Note that the latter also integrates with
    Algolia, which is a fully managed server-based solution.

  [^3]:
    China and Japan are both within the top 5 countries of origin of users of
    Material for MkDocs.

  [truncated]: https://github.com/squidfunk/mkdocs-material/blob/master/src/templates/assets/javascripts/templates/search/index.tsx#L90
  [search preview]: search-better-faster-smaller/search-preview.png
  [Just the Docs]: https://pmarsceill.github.io/just-the-docs/
  [Docusaurus]: https://github.com/lelouch77/docusaurus-lunr-search

2.  __Context-awareness__: Although whitespace doesn't work for all languages,
    one could argue that it could be a good enough solution. Unfortunately, this
    is not necessarily true for code blocks, as the removal of whitespace might
    change meaning in some languages.

3.  __Structure__: Preserving structural information is not a must, but
    clearly beneficial for building more meaningful search previews, which
    allow for a quick evaluation of relevance. If a word occurrence is part of
    a code block, it should be rendered as a code block.
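
To make the word boundary problem concrete, here is a minimal sketch of the whitespace-based approach. It is not taken from any of the mentioned themes; `whitespacePreview` and its `radius` parameter are hypothetical, and the Japanese sentence is just an illustrative input.

```ts
// Build a preview by taking `radius` whitespace-delimited words to the left
// and right of the first word that contains the search term
function whitespacePreview(
  text: string, term: string, radius = 3
): string | undefined {
  const words = text.split(/\s+/).filter(word => word.length > 0)
  const needle = term.toLowerCase()
  const at = words.findIndex(word => word.toLowerCase().includes(needle))
  if (at === -1) return undefined
  const start = Math.max(0, at - radius)
  const end = Math.min(words.length, at + radius + 1)
  return `… ${words.slice(start, end).join(" ")} …`
}

// Works for whitespace-separated languages
const english = whitespacePreview(
  "Sometimes you want numbered lists: One Two Three", "lists", 2
)

// Without whitespace, the whole string counts as a single word, so the
// preview cannot be shortened at all
const japanese = whitespacePreview("検索プレビューは重要です", "重要", 2)
```

For the English input, the preview is trimmed to two words on each side of the match, while the Japanese input is returned in full.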

## What's new?

Now that we have built a solid understanding of the problem space, and before
we dive into the internals of our new search implementation to see which of the
problems it already solves, here's a quick overview of the features and
improvements it brings:

- __Better__: support for [rich search previews], preserving the structural
  information of code blocks, inline code, and lists, so they are rendered
  as-is, as well as [lookahead tokenization], [more accurate highlighting], and
  improved stability of typeahead. Also, a [slightly better UX].
- __Faster__ and __smaller__: significant decrease in search index size of up
  to 48% due to improved extraction and construction techniques, resulting in a
  search experience that is up to 95% faster, which is particularly helpful for
  large documentation projects.

  [rich search previews]: #rich-search-previews
  [lookahead tokenization]: #tokenizer-lookahead
  [more accurate highlighting]: #accurate-highlighting
  [slightly better UX]: #user-interface

### Rich search previews

As we rebuilt the search plugin from scratch, we reworked the construction of
the search index to preserve the structural information of code blocks, inline
code, as well as unordered and ordered lists. Using the example from the
[search index] section, here's how it looks:

=== "Now"

    ![search preview now]

=== "Before"

    ![search preview before]

Now, __code blocks are first-class citizens of search previews__, and even
inline code formatting is preserved. Let's take a look at the new structure of
the search index to understand why:

??? example "Expand to inspect search index"

    === "Now"

        ``` json
        {
          ...
          "docs": [
            {
              "location": "page/",
              "title": "Example",
              "text": ""
            },
            {
              "location": "page/#text",
              "title": "Text",
              "text": "<p>It's very easy to make some words bold and other words italic with Markdown. You can even add links, or even <code>code</code>:</p> <pre><code>if (isAwesome){\n  return true\n}\n</code></pre>"
            },
            {
              "location": "page/#lists",
              "title": "Lists",
              "text": "<p>Sometimes you want numbered lists:</p> <ol> <li>One</li> <li>Two</li> <li>Three</li> </ol> <p>Sometimes you want bullet points:</p> <ul> <li>Start a line with a star</li> <li>Profit!</li> </ul>"
            }
          ]
        }
        ```

    === "Before"

        ``` json
        {
          ...
          "docs": [
            {
              "location": "page/",
              "title": "Example",
              "text": "Example Text It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true } Lists Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            },
            {
              "location": "page/#example",
              "title": "Example",
              "text": ""
            },
            {
              "location": "page/#text",
              "title": "Text",
              "text": "It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true }"
            },
            {
              "location": "page/#lists",
              "title": "Lists",
              "text": "Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            }
          ]
        }
        ```

If we inspect the search index again, we can see how the situation improved:

1.  __Content is included only once__: the search index does not include the
    content of the page twice, as only the sections of a page are part of the
    search index. This leads to a significant reduction in size and fewer
    bytes to transfer.

2.  __Some structure is preserved__: each section of the search index includes
    a small subset of HTML to provide the necessary structure to allow for more
    sophisticated search previews. Revisiting our example from before, let's
    look at an excerpt:

    === "Now"

        ``` html
        … links, or even <code>code</code>:</p> <pre><code>if (isAwesome){ … }\n</code></pre>
        ```

    === "Before"

        ```
        … links , or even code : if (isAwesome) { … }
        ```

    The punctuation issue is gone, as no additional whitespace is inserted, and
    the preserved markup yields additional context to make scanning search
    results more effective.

On to the next step in the process: __tokenization__.

  [search index]: #search-index
  [search preview now]: search-better-faster-smaller/search-preview-now.png
  [search preview before]: search-better-faster-smaller/search-preview-before.png

### Tokenizer lookahead

The [default tokenizer] of [lunr] uses a regular expression to split a given
string by matching each character against the [`separator`][separator] as
defined in `mkdocs.yml`. This doesn't allow for more complex separators based
on lookahead or multiple characters.
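
To see why per-character matching rules out such separators, consider the following simplified model. It is a sketch loosely based on the behavior described above, not a copy of lunr's code; `tokenizePerCharacter` is a hypothetical name, and the separator is expected to be a non-global regular expression.

```ts
// Simplified model of a character-by-character tokenizer: every single
// character is tested against the separator in isolation, so the separator
// can never match more than one character or look ahead
function tokenizePerCharacter(input: string, separator: RegExp): string[] {
  const tokens: string[] = []
  let start = 0
  for (let end = 0; end <= input.length; end++) {
    const char = input.charAt(end)
    if (end === input.length || separator.test(char)) {
      if (end > start) tokens.push(input.slice(start, end))
      start = end + 1
    }
  }
  return tokens
}

// Single-character separators work as expected
const simple = tokenizePerCharacter("search-highlight feature", /[\s\-]+/)

// A lookahead never matches a lone character, so nothing is split
const pascal = tokenizePerCharacter("PascalCase", /(?!\b)(?=[A-Z][a-z])/)

// The same is true for a separator spanning multiple characters
const encoded = tokenizePerCharacter("&lt;div&gt;", /&[lg]t;/)
```

In this model, `simple` contains three tokens, while `pascal` and `encoded` each come back as a single, unsplit token.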

Fortunately, __our new search implementation provides an advanced tokenizer__
that doesn't have these shortcomings and supports more complex regular
expressions. As a result, Material for MkDocs just changed its own separator
configuration to the following value:

```
[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;
```

While the first part up to the first `|` contains a list of single control
characters at which the string should be split, the following three sections
explain the remainder of the regular expression.[^4]

  [^4]:
    As a fun fact: the [`separator`][separator] [default value] of the search
    plugin being `[\s\-]+` has always been somewhat misleading, as it suggests
    that multiple characters can be considered to be a separator. However, the
    `+` is completely irrelevant, as regular expression groups involving
    multiple characters were never supported by
    [lunr's default tokenizer][default tokenizer].

  [default value]: https://www.mkdocs.org/user-guide/configuration/#separator
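
As a rough illustration of what this separator does, the following sketch applies it with JavaScript's built-in `String.prototype.split`. This is a stand-in for illustration and makes no claim about how the new tokenizer works internally; `tokenize` is a hypothetical helper.

```ts
// The separator from above, written as a regular expression literal (the
// forward slash inside the character class is escaped for the literal)
const separator =
  /[\s\-,:!=\[\]()"\/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;/

// Split on the separator and drop empty tokens, which occur when the input
// starts or ends with a separator
function tokenize(input: string): string[] {
  return input.split(separator).filter(token => token.length > 0)
}

for (const input of ["PascalCase", "search.highlight", "7.2.6", "&lt;div&gt;"])
  console.log(input, tokenize(input))
```

With this sketch, `tokenize("PascalCase")` yields `Pascal` and `Case`, while `tokenize("7.2.6")` stays in one piece.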

#### Case changes

Many programming languages use `PascalCase` or `camelCase` naming conventions.
When a user searches for the term `case`, it's quite natural to expect
`PascalCase` and `camelCase` to show up. By adding the following match group to
the separator, this can now be achieved with ease:

```
(?!\b)(?=[A-Z][a-z])
```

This regular expression is a combination of a negative lookahead (`\b`, i.e.,
not a word boundary) and a positive lookahead (`[A-Z][a-z]`, i.e., an uppercase
character followed by a lowercase character), and has the following behavior:

- `PascalCase` :octicons-arrow-right-24: `Pascal`, `Case`
- `camelCase` :octicons-arrow-right-24: `camel`, `Case`
- `UPPERCASE` :octicons-arrow-right-24: `UPPERCASE`

Searching for [:octicons-search-24: searchHighlight][q=searchHighlight]
now brings up the section discussing the `search.highlight` feature flag, which
also demonstrates that this now even works properly for search queries.[^5]

  [^5]:
    Previously, the search query was not correctly tokenized due to the way
    [lunr] treats wildcards, as it disables the pipeline for search terms that
    contain wildcards. In order to provide a good typeahead experience,
    Material for MkDocs adds wildcards to the end of each search term not
    explicitly preceded with `+` or `-`, effectively disabling tokenization.

  [q=searchHighlight]: ?q=searchHighlight

#### Version numbers

Indexing version numbers is another problem that can be solved with a small
lookahead. Usually, `.` should be considered a separator to split words like
`search.highlight`. However, splitting version numbers at `.` will make them
undiscoverable. Thus, the following expression:

```
\.(?!\d)
```

This regular expression matches a `.` only if not immediately followed by a
digit `\d`, which leaves version numbers discoverable. Searching for
[:octicons-search-24: 7.2.6][q=7.2.6] brings up the [7.2.6] release notes.

  [q=7.2.6]: ?q=7.2.6
  [7.2.6]: ../../changelog/index.md#7.2.6

#### HTML/XML tags

If your documentation includes HTML/XML code examples, you may want to allow
users to find specific tag names. Unfortunately, the `<` and `>` control
characters are encoded in code blocks as `&lt;` and `&gt;`. Now, adding the
following expression to the separator allows for just that:

```
&[lg]t;
```

---

_We've only just begun to scratch the surface of the new possibilities
tokenizer lookahead brings. If you found other useful expressions, you're
invited to share them in the comment section._

### Accurate highlighting

Highlighting is the last step in the search process and marks all occurrences
of the search terms in a given search result. For a long time, highlighting was
implemented through dynamically generated [regular expressions].[^6]

This approach has some problems with non-whitespace languages like Japanese or
Chinese[^3] since it only works if the highlighted term is at a word boundary.
However, Asian languages are tokenized using a [dedicated segmenter], which
cannot be modeled with regular expressions.

  [^6]:
    Using the separator as defined in `mkdocs.yml`, a regular expression was
    constructed that was trying to mimic the tokenizer. As an example, the
    search query `search highlight` was transformed into the rather cumbersome
    regular expression `(^|<separator>)(search|highlight)`, which only matches
    at word boundaries.
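
The limitation of the regular expression approach can be reproduced with a simplified sketch. It follows the pattern from the footnote, but it is not the original highlighter; `escapeRegExp` and `createRegExpHighlighter` are hypothetical helpers, and HTML escaping is omitted for brevity.

```ts
// Escape characters with a special meaning in regular expressions
function escapeRegExp(value: string): string {
  return value.replace(/[|\\{}()[\]^$+*?.-]/g, "\\$&")
}

// Build a highlighter from the pattern `(^|<separator>)(term1|term2|…)`
function createRegExpHighlighter(separator: string, query: string) {
  const terms = query.trim().split(/\s+/).map(escapeRegExp).join("|")
  const pattern = new RegExp(`(^|${separator})(${terms})`, "gi")
  return (text: string): string =>
    text.replace(
      pattern,
      (_match: string, prefix: string, term: string) =>
        `${prefix}<mark>${term}</mark>`
    )
}

const highlightOld = createRegExpHighlighter("[\\s\\-]+", "search highlight")

// Both terms are preceded by the start of the string or a separator
const separated = highlightOld("search highlight")

// "Highlight" is not preceded by a separator, so it's not highlighted
const merged = highlightOld("searchHighlight")
```

Applied to `searchHighlight`, this sketch only marks the leading `search`, as `Highlight` is not preceded by a separator. The same happens to terms inside text that has no whitespace between words.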

Now, as a direct result of the [new tokenization approach], __our new search
implementation uses token positions for highlighting__, making it exactly as
powerful as tokenization:

1.  __Word boundaries__: as the new highlighter uses token positions, word
    boundaries are equal to token boundaries. This means that more complex cases
    of tokenization (e.g., [case changes], [version numbers], [HTML/XML tags]),
    are now all highlighted accurately.

2.  __Context-awareness__: as the new search index preserves some of the
    structural information of the original document, the content of a section
    is now divided into separate content blocks – paragraphs, code blocks, and
    lists.

    Now, only the content blocks that actually contain occurrences of one of
    the search terms are considered for inclusion into the search preview. If a
    term only occurs in a code block, it's the code block that gets rendered,
    see, for example, the results of
    [:octicons-search-24: twitter][q=twitter].

  [regular expressions]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/highlighter/index.ts#L61-L91
  [dedicated segmenter]: http://chasen.org/~taku/software/TinySegmenter/
  [new tokenization approach]: #tokenizer-lookahead
  [case changes]: #case-changes
  [version numbers]: #version-numbers
  [HTML/XML tags]: #htmlxml-tags
  [q=twitter]: ?q=twitter
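
Highlighting by token position can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: `Token`, `tokenizeWithPositions`, and `highlightTokens` are hypothetical names, the separator is a shortened version of the one shown earlier, and HTML escaping is omitted.

```ts
// A token together with its offset in the original string
interface Token {
  value: string
  start: number
}

// Tokenize while remembering where each token starts
function tokenizeWithPositions(input: string, separator: RegExp): Token[] {
  const tokens: Token[] = []
  const flags = separator.flags.replace("g", "") + "g"
  const pattern = new RegExp(separator.source, flags)
  let start = 0
  let match: RegExpExecArray | null
  while ((match = pattern.exec(input)) !== null) {
    if (match.index > start)
      tokens.push({ value: input.slice(start, match.index), start })
    start = match.index + match[0].length
    // Zero-width matches don't advance lastIndex, so do it manually
    if (match[0].length === 0) pattern.lastIndex++
  }
  if (start < input.length)
    tokens.push({ value: input.slice(start), start })
  return tokens
}

// Wrap every token that equals one of the terms, using its position
function highlightTokens(
  input: string, tokens: Token[], terms: string[]
): string {
  const wanted = new Set(terms.map(term => term.toLowerCase()))
  let result = ""
  let cursor = 0
  for (const token of tokens) {
    if (!wanted.has(token.value.toLowerCase())) continue
    result += input.slice(cursor, token.start)
    result += `<mark>${token.value}</mark>`
    cursor = token.start + token.value.length
  }
  return result + input.slice(cursor)
}

const source = "searchHighlight"
const positions = tokenizeWithPositions(
  source, /[\s\-]+|(?!\b)(?=[A-Z][a-z])/
)
const marked = highlightTokens(source, positions, ["highlight"])
```

Because the highlighter reuses the tokenizer's positions, `Highlight` is marked inside `searchHighlight` even though it's not preceded by whitespace.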

### Benchmarks

We conducted two benchmarks – one with the documentation of Material for MkDocs
itself, and one with a very large corpus of Markdown files with more than
800,000 words – a size most documentation projects will likely never
reach:

<figure markdown>

|                         |   Before |            Now |     Relative |
| ----------------------- | -------: | -------------: | -----------: |
| __Material for MkDocs__ |          |                |              |
| Index size              |   573 kB |     __335 kB__ |     __–42%__ |
| Index size (`gzip`)     |   105 kB |      __78 kB__ |     __–27%__ |
| Indexing time[^7]       |   265 ms |     __177 ms__ |     __–34%__ |
| __KJV Markdown[^8]__    |          |                |              |
| Index size              |   8.2 MB |     __4.4 MB__ |     __–47%__ |
| Index size (`gzip`)     |   2.3 MB |     __1.2 MB__ |     __–48%__ |
| Indexing time           | 2,700 ms |   __1,390 ms__ |     __–48%__ |

<figcaption>
  <p>Benchmark results</p>
</figcaption>

</figure>

  [^7]:
    Smallest value of ten distinct runs.

  [^8]:
    We agnostically use [KJV Markdown] as a tool for testing to learn how
    Material for MkDocs behaves on large corpora, as it's a very large set of
    Markdown files with over 800k words.

The results show that indexing time, which is the time that it takes to set up
the search when the page is loaded, has dropped by up to 48%, which means __the
new search is up to 95% faster__. This is a significant improvement,
particularly relevant for large documentation projects.
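
The relation between the two percentages may not be obvious: halving the time roughly doubles the speed. The following sketch recomputes both figures from the rounded KJV Markdown indexing times in the table; `relativeChange` and `percentFaster` are hypothetical helpers, and since the inputs are rounded, the results are only close to the figures quoted in the article.

```ts
// Relative change in time, in percent (negative means less time)
function relativeChange(before: number, now: number): number {
  return (now / before - 1) * 100
}

// Speed-up in percent: how much more work fits into the same time span
function percentFaster(before: number, now: number): number {
  return (before / now - 1) * 100
}

// Indexing times for KJV Markdown in milliseconds, taken from the table
const indexingChange = relativeChange(2700, 1390)
const speedUp = percentFaster(2700, 1390)

console.log(indexingChange.toFixed(1), speedUp.toFixed(1))
```

With these inputs, the time drops by about 48.5%, which corresponds to a speed-up of about 94%.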

While 1.3 s may still sound like a long time, using the new client-side search
together with [instant loading] only creates the search index on the initial
page load. When navigating, the search index is preserved across pages, so the
cost only has to be paid once.

  [KJV Markdown]: https://github.com/arleym/kjv-markdown
  [instant loading]: ../../setup/setting-up-navigation.md#instant-loading

### User interface

Additionally, some small improvements have been made, most prominently the
__more results on this page__ button, which now sticks to the top of the search
result list when open. This enables the user to jump out of the list more
quickly.

## What's next?

Our new search implementation is a big improvement to Material for MkDocs. It
solves some long-standing issues which needed to be tackled for years. Yet,
it's only the start of a search experience that is going to get better and
better. Next up:

- __Context-aware search summarization__: currently, the first two matching
  content blocks are rendered as a search preview. With the new tokenization
  technique, we laid the groundwork for more sophisticated shortening and
  summarization methods, which we're tackling next.

- __User interface improvements__: now that we have full control over the
  search plugin, we can add meaningful metadata to provide more context and
  a better experience. We'll explore some of those paths in the future.
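
For reference, the current behavior mentioned in the first item, rendering the first two matching content blocks, can be sketched like this. `ContentBlock` and `selectPreviewBlocks` are hypothetical names, and matching by substring is a simplification of matching by token.

```ts
// Hypothetical content block: a paragraph, code block, or list of a section
interface ContentBlock {
  kind: "paragraph" | "code" | "list"
  text: string
}

// Keep only blocks that contain at least one of the search terms, and
// return the first `limit` of them
function selectPreviewBlocks(
  blocks: ContentBlock[], terms: string[], limit = 2
): ContentBlock[] {
  const needles = terms.map(term => term.toLowerCase())
  return blocks
    .filter(block => {
      const haystack = block.text.toLowerCase()
      return needles.some(needle => haystack.includes(needle))
    })
    .slice(0, limit)
}

const blocks: ContentBlock[] = [
  { kind: "paragraph", text: "You can even add links, or even code:" },
  { kind: "code", text: "if (isAwesome) {\n  return true\n}" },
  { kind: "paragraph", text: "Sometimes you want numbered lists:" },
  { kind: "list", text: "One Two Three" }
]

const selected = selectPreviewBlocks(blocks, ["isAwesome"])
```

Here, only the code block contains the term, so it's the only block selected for the preview.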

If you've made it this far, thank you for your time and interest in Material
for MkDocs! This is the first blog article that I decided to write after a
short [X survey] encouraged me to. You're invited to leave a comment
to share your experiences with the new search implementation.

  [X survey]: https://x.com/squidfunk/status/1434477478823743488
-												Set up blog and started article about new search

											
										
										
											2021-09-12 16:41:19 +02:00
+								---
-												Updated documentation

											
										
										
											2022-09-11 19:25:40 +02:00
+								date: 2021-09-13
 								authors: [squidfunk]
 								readtime: 15
-												Improved social card preview for latest blog article

											
										
										
											2021-09-13 18:38:16 +02:00
+								description: >
-												Updated blog article

											
										
										
											2021-09-13 19:06:33 +02:00
+								  How we rebuilt client-side search, delivering a better user experience while
-												Improved social card preview for latest blog article

											
										
										
											2021-09-13 18:38:16 +02:00
+								  making it faster and smaller at the same time
-												Updated documentation

											
										
										
											2022-09-11 19:25:40 +02:00
+								categories:
 								  - Search
 								  - Performance
 								links:
-												Documentation

											
										
										
											2023-09-15 09:25:50 +02:00
+								  - plugins/search.md
-												Documentation

											
										
										
											2024-10-10 10:29:50 +02:00
+								  - insiders/how-to-sponsor.md
-												Set up blog and started article about new search

											
										
										
											2021-09-12 16:41:19 +02:00
+								---
 								# Search: better, faster, smaller
-												Updated documentation

											
										
										
											2021-09-13 18:14:28 +02:00
+								__This is the story of how we managed to completely rebuild client-side search,
-												Updated blog article

											
										
										
											2021-09-13 19:06:33 +02:00
+								delivering a significantly better user experience while making it faster and
-												Updated documentation

											
										
										
											2021-09-13 18:14:28 +02:00
+								smaller at the same time.__
-												Set up blog and started article about new search

											
										
										
											2021-09-12 16:41:19 +02:00
-												Updated blog articles to use named links

											
										
										
											2021-10-11 17:16:48 +02:00
+								The [search] of Material for MkDocs is by far one of its best and most-loved
 								assets: [multilingual], [offline-capable], and most importantly: _all
-												Updated documentation

											
										
										
											2021-09-13 18:14:28 +02:00
+								client-side_. It provides a solution to empower the users of your documentation
 								to find what they're searching for instantly without the headache of managing
 								additional servers. However, even though several iterations have been made,
 								there's still some room for improvement, which is why we rebuilt the search
 								plugin and integration from the ground up. This article shines some light on the
 								internals of the new search, why it's much more powerful than the previous
-												Updated blog article

											
										
										
											2021-09-13 19:06:33 +02:00
+								version, and what's about to come.
-												Set up blog and started article about new search

											
										
										
											2021-09-12 16:41:19 +02:00
-												Updated documentation

											
										
										
											2022-09-11 19:25:40 +02:00
+								<!-- more -->
-												Updated documentation

											
										
										
											2021-09-13 18:14:28 +02:00
+								_The next section discusses the architecture and issues of the current search
 								implementation. If you immediately want to learn what's new, skip to the
-												Updated blog articles to use named links

											
										
										
											2021-10-11 17:16:48 +02:00
+								[section just after that][what's new]._
-												Set up blog and started article about new search

											
										
										
											2021-09-12 16:41:19 +02:00
-												Updated blog articles to use named links

											
										
										
											2021-10-11 17:16:48 +02:00
+								  [search]: ../../setup/setting-up-site-search.md
-												Fixed all anchors after turning on validation

											
										
										
											2024-04-25 05:51:05 +02:00
+								  [multilingual]: ../../plugins/search.md#config.lang
-												Documentation

											
										
										
											2022-02-27 17:07:10 +01:00
+								  [offline-capable]: ../../setup/building-for-offline-usage.md
-												Updated blog articles to use named links

											
										
										
											2021-10-11 17:16:48 +02:00
+								  [what's new]: #whats-new
-												Set up blog and started article about new search

											
										
										
											2021-09-12 16:41:19 +02:00
 								## Architecture
-												Updated blog articles to use named links

											
										
										
											2021-10-11 17:16:48 +02:00
+								Material for MkDocs uses [lunr] together with [lunr-languages] to implement
 								its client-side search capabilities. When a documentation page is loaded and
 								JavaScript is available, the search index as generated by the
 								[built-in search plugin] during the build process is requested from the
-												Updated documentation

											
										
										
											2021-09-13 18:14:28 +02:00
+								server:

``` ts
const index$ = document.forms.namedItem("search")
  ? __search?.index || requestJSON<SearchIndex>(
    new URL("search/search_index.json", config.base)
  )
  : NEVER
```

  [lunr]: https://lunrjs.com
  [lunr-languages]: https://github.com/MihaiValentin/lunr-languages
  [built-in search plugin]: ../../plugins/search.md

### Search index

The search index includes a stripped-down version of all pages. Let's take a
look at an example to understand precisely what the search index contains from
the original Markdown file:

??? example "Expand to inspect example"

    === ":octicons-file-code-16: `docs/page.md`"

        ```` markdown
        # Example

        ## Text

        It's very easy to make some words **bold** and other words *italic*
        with Markdown. You can even add [links](#), or even `code`:

        ```
        if (isAwesome) {
          return true
        }
        ```

        ## Lists

        Sometimes you want numbered lists:

        1. One
        2. Two
        3. Three

        Sometimes you want bullet points:

        * Start a line with a star
        * Profit!
        ````

    === ":octicons-codescan-16: `search_index.json`"

        ``` json
        {
          "config": {
            "indexing": "full",
            "lang": [
              "en"
            ],
            "min_search_length": 3,
            "prebuild_index": false,
            "separator": "[\\s\\-]+"
          },
          "docs": [
            {
              "location": "page/",
              "title": "Example",
              "text": "Example Text It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true } Lists Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            },
            {
              "location": "page/#example",
              "title": "Example",
              "text": ""
            },
            {
              "location": "page/#text",
              "title": "Text",
              "text": "It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true }"
            },
            {
              "location": "page/#lists",
              "title": "Lists",
              "text": "Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            }
          ]
        }
        ```

If we inspect the search index, we immediately see several problems:

1.  __All content is included twice__: the search index contains one entry
    with the entire contents of the page, and one entry for each section of
    the page, i.e., each block preceded by a headline or subheadline. This
    significantly contributes to the size of the search index.

2.  __All structure is lost__: when the search index is built, all structural
    information like HTML tags and attributes is stripped from the content.
    While this approach works well for paragraphs and inline formatting, it
    might be problematic for lists and code blocks. An excerpt:

    ```
    … links , or even code : if (isAwesome) { … } Lists Sometimes you want …
    ```

    - __Context__: for an untrained eye, the result can look like gibberish, as
      it's not immediately apparent what classifies as text and what as code.
      Furthermore, it's not clear that `Lists` is a headline as it's merged
      with the code block before and the paragraph after it.

    - __Punctuation__: inline elements like links that are immediately followed
      by punctuation are separated by whitespace (see `,` and `:` in the
      excerpt). This is because all extracted text is joined with a whitespace
      character during the construction of the search index.
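
The punctuation issue is easy to reproduce. The following is a minimal sketch,
assuming the text of a section is extracted as a flat list of fragments; the
fragment boundaries are hypothetical and chosen to mirror the excerpt above,
they are not taken from the MkDocs source:

```ts
// Hypothetical fragments, one per text node of the rendered "Text" section.
// The link, the inline code and the code block each end up as a fragment.
const fragments: string[] = [
  "You can even add",
  "links",
  ", or even",
  "code",
  ":",
  "if (isAwesome) { return true }"
]

// Joining all fragments with a whitespace character discards the element
// boundaries and puts a space in front of the punctuation
const joined = fragments.join(" ")

console.log(joined)
// You can even add links , or even code : if (isAwesome) { return true }
```

Once the fragments are joined, there's no way to tell which parts were inline
code, which were part of a code block, and which were plain text.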

It's not difficult to see that it can be quite challenging for theme authors
to implement a good search experience, which is why Material for MkDocs (up to
now) did some [monkey patching] to be able to render slightly more
meaningful search previews.

  [monkey patching]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/document/index.ts#L68-L71

### Search worker

The actual search functionality is implemented as part of a web worker[^1],
which creates and manages the [lunr] search index. When search is initialized,
the following steps are taken:

  [^1]:
    Prior to <!-- md:version 5.0.0 -->, search was carried out in the main
    thread, which locked up the browser, rendering it unusable. This problem was
    first reported in #904 and, after some back and forth, fixed and released in
    <!-- md:version 5.0.0 -->.

1.  __Linking sections with pages__: The search index is parsed, and each
    section is linked to its parent page. The parent page itself is _not
    indexed_, as it would lead to duplicate results, so only the sections
    remain. Linking is necessary, as search results are grouped by page.

2.  __Tokenization__: The `title` and `text` values of each section are split
    into tokens by using the [`separator`][separator] as configured in
    `mkdocs.yml`. Tokenization itself is carried out by
    [lunr's default tokenizer][default tokenizer], which doesn't allow for
    lookahead or separators spanning multiple characters.

    > Why is this important and a big deal? We will see later how much more we
    > can achieve with a tokenizer that is capable of separating strings with
    > lookahead.

3.  __Indexing__: As a final step, each section is indexed. When querying the
    index, if a search query includes one of the tokens as returned by step 2.,
    the section is considered to be part of the search result and passed to the
    main thread.
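
The three steps above can be sketched without any dependencies. Note that this
is a simplified stand-in for illustration and not how [lunr] or the search
worker are actually implemented: the plain inverted index, the shortened
example documents, and the names `SearchDocument`, `parentOf`, `tokenize` and
`querySections` are all assumptions made for this sketch:

```ts
interface SearchDocument {
  location: string                     // "page/" or "page/#anchor"
  title: string
  text: string
}

// Shortened entries modeled after the example search index
const docs: SearchDocument[] = [
  { location: "page/", title: "Example", text: "Example Text Lists" },
  { location: "page/#example", title: "Example", text: "" },
  { location: "page/#text", title: "Text", text: "You can even add links" },
  { location: "page/#lists", title: "Lists", text: "Start a line with a star" }
]

// Step 1: only sections remain, each linked to its parent page by location
const sections = docs.filter(doc => doc.location.indexOf("#") !== -1)
const parentOf = (location: string): string => location.split("#")[0]

// Step 2: split values into lowercase tokens with the configured separator
const separator = /[\s\-]+/
const tokenize = (value: string): string[] =>
  value.toLowerCase().split(separator).filter(token => token.length > 0)

// Step 3: index each section, mapping every token to section locations
const invertedIndex = new Map<string, Set<string>>()
sections.forEach(section => {
  tokenize(`${section.title} ${section.text}`).forEach(token => {
    if (!invertedIndex.has(token))
      invertedIndex.set(token, new Set<string>())
    invertedIndex.get(token)!.add(section.location)
  })
})

// Query the index and group all matching sections by their parent page
function querySections(query: string): Map<string, string[]> {
  const results = new Map<string, string[]>()
  tokenize(query).forEach(token => {
    Array.from(invertedIndex.get(token) ?? []).forEach(location => {
      const group = results.get(parentOf(location)) ?? []
      if (group.indexOf(location) === -1)
        group.push(location)
      results.set(parentOf(location), group)
    })
  })
  return results
}

console.log(querySections("star links"))
// Both sections are grouped under their parent page "page/"
```

As the page entry itself is skipped in step 1, a query for `example` only
yields the `page/#example` section, not a duplicate result for the whole page.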

Now, that's basically how the search worker operates. Sure, there's a little
more magic involved, e.g., search results are [post-processed] and [rescored] to
account for some shortcomings of [lunr], but in general, this is how data gets
into and out of the index.

  [separator]: ../../plugins/search.md#config.separator
  [default tokenizer]: https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L413-L456
  [post-processed]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L249-L272
  [rescored]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L274-L275

### Search previews

Users should be able to quickly scan and evaluate the relevance of a search
result in the given context, which is why a concise summary with highlighted
occurrences of the search terms found is an essential part of a great search
experience.

This is where the current search preview generation falls short, as some of the
search previews appear not to include any occurrence of any of the search
terms. This is because search previews are [truncated after a
maximum of 320 characters][truncated], as can be seen here:

<figure markdown>

![search preview]

  <figcaption markdown>

The first two results look like they're not relevant, as they don't seem to
include the query string the user just searched for. Yet, they are.

  </figcaption>
</figure>
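
To illustrate why truncation hides matches, here's a minimal sketch. The
`truncate` helper and the sample text are hypothetical and only approximate
the behavior described above; the sketch is not the code behind the link:

```ts
// Hypothetical section text in which the search term only occurs late
const sectionText = "lorem ipsum ".repeat(30) + "search term"

// Simplified truncation: cut the value after at most `limit` characters
function truncate(value: string, limit: number): string {
  return value.length > limit ? `${value.slice(0, limit)}…` : value
}

const shortened = truncate(sectionText, 320)

console.log(sectionText.indexOf("search term") !== -1) // true
console.log(shortened.indexOf("search term") !== -1)   // false
```

The section is a valid search result, but the occurrence lies beyond the cut,
so the preview gives no hint as to why the result is relevant.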

A better solution to this problem has been on the roadmap for a very, very long
time, but in order to solve this once and for all, several factors need to be
carefully considered:

1.  __Word boundaries__: some themes[^2] for static site generators generate
    search previews by expanding the text left and right next to an occurrence,
    stopping at a whitespace character when enough words have been consumed. A
    preview might look like this:

    ```
    … channels, e.g., or which can be configured via mkdocs.yml …
    ```

    While this may work for languages that use whitespace as a separator
    between words, it breaks down for languages like Japanese or Chinese[^3],
    as they have non-whitespace word boundaries and use dedicated segmenters to
    split strings into tokens.

  [^2]:
    At the time of writing, [Just the Docs] and [Docusaurus] use this method
    for generating search previews. Note that the latter also integrates with
    Algolia, which is a fully managed server-based solution.

  [^3]:
    China and Japan are both within the top 5 countries of origin of users of
    Material for MkDocs.

  [truncated]: https://github.com/squidfunk/mkdocs-material/blob/master/src/templates/assets/javascripts/templates/search/index.tsx#L90
  [search preview]: search-better-faster-smaller/search-preview.png
  [Just the Docs]: https://pmarsceill.github.io/just-the-docs/
  [Docusaurus]: https://github.com/lelouch77/docusaurus-lunr-search

2.  __Context-awareness__: Although whitespace doesn't work for all languages,
    one could argue that it could be a good enough solution. Unfortunately, this
    is not necessarily true for code blocks, as the removal of whitespace might
    change meaning in some languages.

3.  __Structure__: Preserving structural information is not a must, but
    clearly beneficial for building more meaningful search previews that allow
    for a quick evaluation of relevance. If a word occurrence is part of a code
    block, it should be rendered as a code block.

## What's new?

Now that we have a solid understanding of the problem space, and before we dive
into the internals of our new search implementation to see which of the
problems it already solves, here's a quick overview of the features and
improvements it brings:

- __Better__: support for [rich search previews], preserving the structural
  information of code blocks, inline code, and lists, so they are rendered
  as-is, as well as [lookahead tokenization], [more accurate highlighting], and
  improved stability of typeahead. Also, a [slightly better UX].
- __Faster__ and __smaller__: significant decrease in search index size of up
  to 48% due to improved extraction and construction techniques, resulting in a
  search experience that is up to 95% faster, which is particularly helpful for
  large documentation projects.

  [rich search previews]: #rich-search-previews
  [lookahead tokenization]: #tokenizer-lookahead
  [more accurate highlighting]: #accurate-highlighting
  [slightly better UX]: #user-interface

### Rich search previews

As we rebuilt the search plugin from scratch, we reworked the construction of
the search index to preserve the structural information of code blocks, inline
code, as well as unordered and ordered lists. Using the example from the
[search index] section, here's how it looks:

=== "Now"

    ![search preview now]

=== "Before"

    ![search preview before]

Now, __code blocks are first-class citizens of search previews__, and even
inline code formatting is preserved. Let's take a look at the new structure of
the search index to understand why:

??? example "Expand to inspect search index"

    === "Now"

        ``` json
        {
          ...
          "docs": [
            {
              "location": "page/",
              "title": "Example",
              "text": ""
            },
            {
              "location": "page/#text",
              "title": "Text",
              "text": "<p>It's very easy to make some words bold and other words italic with Markdown. You can even add links, or even <code>code</code>:</p> <pre><code>if (isAwesome){\n  return true\n}\n</code></pre>"
            },
            {
              "location": "page/#lists",
              "title": "Lists",
              "text": "<p>Sometimes you want numbered lists:</p> <ol> <li>One</li> <li>Two</li> <li>Three</li> </ol> <p>Sometimes you want bullet points:</p> <ul> <li>Start a line with a star</li> <li>Profit!</li> </ul>"
            }
          ]
        }
        ```

    === "Before"

        ``` json
        {
          ...
          "docs": [
            {
              "location": "page/",
              "title": "Example",
              "text": "Example Text It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true } Lists Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            },
            {
              "location": "page/#example",
              "title": "Example",
              "text": ""
            },
            {
              "location": "page/#text",
              "title": "Text",
              "text": "It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true }"
            },
            {
              "location": "page/#lists",
              "title": "Lists",
              "text": "Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            }
          ]
        }
        ```

If we inspect the search index again, we can see how the situation improved:

1.  __Content is included only once__: the search index does not include the
    content of the page twice, as only the sections of a page are part of the
    search index. This leads to a significantly smaller search index and fewer
    bytes to transfer.

2.  __Some structure is preserved__: each section of the search index includes
    a small subset of HTML to provide the necessary structure to allow for more
    sophisticated search previews. Revisiting our example from before, let's
    look at an excerpt:

    === "Now"

        ``` html
        … links, or even <code>code</code>:</p> <pre><code>if (isAwesome){ … }\n</code></pre>
        ```

    === "Before"

        ```
        … links , or even code : if (isAwesome) { … }
        ```

    The punctuation issue is gone, as no additional whitespace is inserted, and
    the preserved markup yields additional context to make scanning search
    results more effective.

On to the next step in the process: __tokenization__.

  [search index]: #search-index
  [search preview now]: search-better-faster-smaller/search-preview-now.png
  [search preview before]: search-better-faster-smaller/search-preview-before.png
-												Added section on rich search results

											
										
										
											2021-09-12 18:59:36 +02:00
### Tokenizer lookahead

The [default tokenizer] of [lunr] uses a regular expression to split a given
string by matching each character against the [`separator`][separator] as
defined in `mkdocs.yml`. This doesn't allow for more complex separators based
on lookahead or multiple characters.

Fortunately, __our new search implementation provides an advanced tokenizer__
that doesn't have these shortcomings and supports more complex regular
expressions. As a result, Material for MkDocs just changed its own separator
configuration to the following value:

```
[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;
```

While the first part up to the first `|` contains a list of single control
characters at which the string should be split, the following three sections
explain the remainder of the regular expression.[^4]

  [^4]:
    As a fun fact: the [`separator`][separator] [default value] of the search
    plugin, `[\s\-]+`, has always been somewhat confusing, as it suggests
    that multiple characters can be considered a separator. However, the
    `+` is completely irrelevant, as regular expression groups involving
    multiple characters were never supported by
    [lunr's default tokenizer][default tokenizer].

  [default value]: https://www.mkdocs.org/user-guide/configuration/#separator

#### Case changes

Many programming languages use `PascalCase` or `camelCase` naming conventions.
When a user searches for the term `case`, it's quite natural to expect
`PascalCase` and `camelCase` to show up. By adding the following match group to
the separator, this can now be achieved with ease:

```
(?!\b)(?=[A-Z][a-z])
```

This regular expression is a combination of a negative lookahead (`\b`, i.e.,
not a word boundary) and a positive lookahead (`[A-Z][a-z]`, i.e., an uppercase
character followed by a lowercase character), and has the following behavior:

- `PascalCase` :octicons-arrow-right-24: `Pascal`, `Case`
- `camelCase` :octicons-arrow-right-24: `camel`, `Case`
- `UPPERCASE` :octicons-arrow-right-24: `UPPERCASE`

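The behavior listed above can be reproduced with JavaScript's built-in
`String.prototype.split`. This is a minimal sketch for experimenting with the
expression, not the tokenizer that ships with Material for MkDocs; the names
`caseChange` and `splitCase` are hypothetical:

``` ts
// Zero-width separator: not at a word boundary, and immediately followed by
// an uppercase character and a lowercase character
const caseChange = /(?!\b)(?=[A-Z][a-z])/

// Hypothetical helper that splits a single term at case changes
function splitCase(term: string): string[] {
  return term.split(caseChange)
}

console.log(splitCase("PascalCase")) // ["Pascal", "Case"]
console.log(splitCase("camelCase"))  // ["camel", "Case"]
console.log(splitCase("UPPERCASE"))  // ["UPPERCASE"]
```
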
Searching for [:octicons-search-24: searchHighlight][q=searchHighlight]
now brings up the section discussing the `search.highlight` feature flag, which
also demonstrates that this now even works properly for search queries.[^5]

  [^5]:
    Previously, the search query was not correctly tokenized due to the way
    [lunr] treats wildcards, as it disables the pipeline for search terms that
    contain wildcards. In order to provide a good typeahead experience,
    Material for MkDocs adds wildcards to the end of each search term not
    explicitly preceded with `+` or `-`, effectively disabling tokenization.

  [q=searchHighlight]: ?q=searchHighlight

#### Version numbers

Indexing version numbers is another problem that can be solved with a small
lookahead. Usually, `.` should be considered a separator to split words like
`search.highlight`. However, splitting version numbers at `.` will make them
undiscoverable. Thus, we use the following expression:

```
\.(?!\d)
```

This regular expression matches a `.` only if not immediately followed by a
digit `\d`, which leaves version numbers discoverable. Searching for
[:octicons-search-24: 7.2.6][q=7.2.6] brings up the [7.2.6] release notes.

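To see the difference between a dotted identifier and a version number, the
expression can again be tried with `String.prototype.split`. This is only a
sketch; the names `dot` and `splitDots` are hypothetical:

``` ts
// A dot counts as a separator only if it's not followed by a digit
const dot = /\.(?!\d)/

// Hypothetical helper that splits a single term at separating dots
function splitDots(term: string): string[] {
  return term.split(dot)
}

console.log(splitDots("search.highlight")) // ["search", "highlight"]
console.log(splitDots("7.2.6"))            // ["7.2.6"]
```
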
  [q=7.2.6]: ?q=7.2.6
  [7.2.6]: ../../changelog/index.md#7.2.6

#### HTML/XML tags

If your documentation includes HTML/XML code examples, you may want to allow
users to find specific tag names. Unfortunately, the `<` and `>` control
characters are encoded in code blocks as `&lt;` and `&gt;`. Now, adding the
following expression to the separator allows for just that:

```
&[lg]t;
```
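
As a small sketch of the effect, the following splits encoded markup at
`&lt;` and `&gt;`. Dropping the empty strings that `split` produces at the
start and end is an assumption made for readability here, and the names
`tagDelimiter` and `splitTags` are hypothetical:

``` ts
// Encoded angle brackets act as separators
const tagDelimiter = /&[lg]t;/

// Hypothetical helper: split encoded markup and drop empty tokens
function splitTags(encoded: string): string[] {
  return encoded.split(tagDelimiter).filter(token => token.length > 0)
}

console.log(splitTags("&lt;figure&gt;"))            // ["figure"]
console.log(splitTags("&lt;figure&gt;&lt;img&gt;")) // ["figure", "img"]
```
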

---

_We've only just begun to scratch the surface of the new possibilities
tokenizer lookahead brings. If you found other useful expressions, you're
invited to share them in the comment section._

### Accurate highlighting

Highlighting is the last step in the search process and marks all search term
occurrences in a given search result. For a long time, highlighting was
implemented through dynamically generated [regular expressions].[^6]

This approach has some problems with non-whitespace languages like Japanese or
Chinese[^3] since it only works if the highlighted term is at a word boundary.
However, Asian languages are tokenized using a [dedicated segmenter], which
cannot be modeled with regular expressions.

  [^6]:
    Using the separator as defined in `mkdocs.yml`, a regular expression was
    constructed to mimic the tokenizer. As an example, the search query
    `search highlight` was transformed into the rather cumbersome regular
    expression `(^|<separator>)(search|highlight)`, which only matches at word
    boundaries.

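To illustrate the limitation, here is a hypothetical reconstruction of such a
pattern. It assumes the default separator `[\s\-]+` and case-insensitive
matching, and it doesn't escape the search terms, so it's a sketch of the idea
rather than the previous implementation; `legacyMatches` is a made-up name:

``` ts
// Hypothetical reconstruction: a term only matches at the start of the text
// or directly after a separator
function legacyMatches(text: string, terms: string[]): string[] {
  const pattern = new RegExp(`(^|[\\s\\-]+)(${terms.join("|")})`, "gi")
  const found: string[] = []
  let match: RegExpExecArray | null
  while ((match = pattern.exec(text)) !== null)
    found.push(match[2])
  return found
}

// Both terms are found when they're separated by whitespace ...
console.log(legacyMatches("search highlight", ["search", "highlight"]))
// ["search", "highlight"]

// ... but the second one is missed after a case change
console.log(legacyMatches("searchHighlight", ["search", "highlight"]))
// ["search"]
```
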
Now, as a direct result of the [new tokenization approach], __our new search
implementation uses token positions for highlighting__, making it exactly as
powerful as tokenization:

1.  __Word boundaries__: as the new highlighter uses token positions, word
    boundaries are equal to token boundaries. This means that more complex cases
    of tokenization (e.g., [case changes], [version numbers], [HTML/XML tags])
    are now all highlighted accurately.

2.  __Context-awareness__: as the new search index preserves some of the
    structural information of the original document, the content of a section
    is now divided into separate content blocks – paragraphs, code blocks, and
    lists.

    Now, only the content blocks that actually contain occurrences of one of
    the search terms are considered for inclusion into the search preview. If a
    term only occurs in a code block, it's the code block that gets rendered;
    see, for example, the results of
    [:octicons-search-24: twitter][q=twitter].

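The idea of highlighting by token position can be sketched as follows. This is
not the code used by Material for MkDocs: the `Token` shape, the `tokenize` and
`highlight` helpers, the prefix matching, and the `<mark>` output are all
assumptions made for illustration, and no HTML escaping is performed:

``` ts
// Separator from the section on tokenizer lookahead (the `/` inside the
// character class is escaped for use in a regular expression literal)
const separator = /[\s\-,:!=\[\]()"\/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;/

// Hypothetical token shape: the value plus its position in the input
interface Token {
  value: string
  start: number
  end: number
}

// Split the text at the separator, remembering where each token starts and
// ends (flags of the given pattern other than `g` are ignored in this sketch)
function tokenize(text: string, pattern: RegExp): Token[] {
  const matcher = new RegExp(pattern.source, "g")
  const tokens: Token[] = []
  let start = 0
  let match: RegExpExecArray | null
  while ((match = matcher.exec(text)) !== null) {
    if (match.index > start)
      tokens.push({ value: text.slice(start, match.index), start, end: match.index })
    start = match.index + match[0].length
    // Zero-width matches (e.g., case changes) don't advance on their own
    if (match[0].length === 0)
      matcher.lastIndex++
  }
  if (start < text.length)
    tokens.push({ value: text.slice(start), start, end: text.length })
  return tokens
}

// Wrap every token that starts with one of the terms in <mark> tags
function highlight(text: string, terms: string[]): string {
  const needles = terms.map(term => term.toLowerCase())
  let result = ""
  let cursor = 0
  for (const token of tokenize(text, separator)) {
    const value = token.value.toLowerCase()
    if (needles.some(needle => value.indexOf(needle) === 0)) {
      result += text.slice(cursor, token.start) + `<mark>${token.value}</mark>`
      cursor = token.end
    }
  }
  return result + text.slice(cursor)
}

console.log(highlight("searchHighlight in 7.2.6", ["highlight"]))
// search<mark>Highlight</mark> in 7.2.6
```
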
  [regular expressions]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/highlighter/index.ts#L61-L91
  [dedicated segmenter]: http://chasen.org/~taku/software/TinySegmenter/
  [new tokenization approach]: #tokenizer-lookahead
  [case changes]: #case-changes
  [version numbers]: #version-numbers
  [HTML/XML tags]: #htmlxml-tags
  [q=twitter]: ?q=twitter

### Benchmarks

We conducted two benchmarks – one with the documentation of Material for MkDocs
itself, and one with a massive corpus of Markdown files with more than
800,000 words – a size most documentation projects will likely never
reach:

<figure markdown>

|                         |   Before |            Now |     Relative |
| ----------------------- | -------: | -------------: | -----------: |
| __Material for MkDocs__ |          |                |              |
| Index size              |   573 kB |     __335 kB__ |     __–42%__ |
| Index size (`gzip`)     |   105 kB |      __78 kB__ |     __–27%__ |
| Indexing time[^7]       |   265 ms |     __177 ms__ |     __–34%__ |
| __KJV Markdown[^8]__    |          |                |              |
| Index size              |   8.2 MB |     __4.4 MB__ |     __–47%__ |
| Index size (`gzip`)     |   2.3 MB |     __1.2 MB__ |     __–48%__ |
| Indexing time           | 2,700 ms |   __1,390 ms__ |     __–48%__ |

<figcaption>
  <p>Benchmark results</p>
</figcaption>
</figure>

  [^7]:
    Smallest value of ten distinct runs.

  [^8]:
    We agnostically use [KJV Markdown] as a tool for testing to learn how
    Material for MkDocs behaves on large corpora, as it's a very large set of
    Markdown files with over 800k words.

The results show that indexing time, which is the time that it takes to set up
the search when the page is loaded, has dropped by up to 48%, which means __the
new search is up to 95% faster__. This is a significant improvement,
particularly relevant for large documentation projects.

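The two percentages measure different things: the first is the reduction in
indexing time, the second the gain in speed. A small sketch of the arithmetic,
using the rounded KJV Markdown timings from the table, so the results may
deviate slightly from the published figures; the function names are
hypothetical:

``` ts
// Relative change of a measurement, e.g., -0.25 for a reduction by 25%
function relativeChange(before: number, now: number): number {
  return (now - before) / before
}

// Speedup factor, e.g., 2 if the same work now takes half the time
function speedup(before: number, now: number): number {
  return before / now
}

// Indexing time for KJV Markdown in milliseconds, as listed in the table
console.log(relativeChange(2700, 1390)) // ≈ -0.485, i.e., almost half the time
console.log(speedup(2700, 1390))        // ≈ 1.94, i.e., almost twice as fast
```
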
While 1,390 ms may still sound like a long time, using the new client-side
search together with [instant loading] only creates the search index on the
initial page load. When navigating, the search index is preserved across pages,
so the cost only has to be paid once.

  [KJV Markdown]: https://github.com/arleym/kjv-markdown
  [instant loading]: ../../setup/setting-up-navigation.md#instant-loading

### User interface

Additionally, some small improvements have been made, most prominently the
__more results on this page__ button, which now sticks to the top of the search
result list when open. This enables the user to jump out of the list more
quickly.

## What's next?

Our new search implementation is a big improvement to Material for MkDocs. It
solves some long-standing issues that had needed tackling for years. Yet,
it's only the start of a search experience that is going to get better and
better. Next up:

- __Context-aware search summarization__: currently, the first two matching
  content blocks are rendered as a search preview. With the new tokenization
  technique, we laid the groundwork for more sophisticated shortening and
  summarization methods, which we're tackling next.

- __User interface improvements__: as we have now gained full control over the
  search plugin, we can add meaningful metadata to provide more context and
  a better experience. We'll explore some of those paths in the future.

If you've made it this far, thank you for your time and interest in Material
for MkDocs! This is the first blog article that I decided to write after a
short [X survey] convinced me to. You're invited to leave a comment
to share your experiences with the new search implementation.

  [X survey]: https://x.com/squidfunk/status/1434477478823743488