
Designing and Implementing a Rich Text Editor from Scratch

A rich text editor allows users to apply various formats and styles while entering and editing text content, such as mixing text and images, with a "what-you-see-is-what-you-get" capability. Unlike simple plain text editing components like <input>, a rich text editor provides more features and flexibility, enabling users to create richer and more structured content. Modern rich text editors now encompass not only text and images but also videos, tables, code blocks, attachments, formulas, and other complex modules.

Open Source Repository: https://github.com/WindRunnerMax/BlockKit
Online Editor: https://windrunnermax.github.io/BlockKit/
Project Notes: https://github.com/WindRunnerMax/BlockKit/blob/master/NOTE.md

Why?

So why should we design a new rich text editor from scratch? Editors are notoriously challenging, and there are already many excellent implementations available. Examples include Quill, with its expressive data structure design; Draft, which integrates with the React view layer; the pure editor engine Slate; the highly modular ProseMirror; the ready-to-use TinyMCE/TipTap; and the collaborative solution EtherPad, among others.

I have also been following various implementations of rich text editors, and I often read editor-related articles on different sites. However, I noticed that there are very few discussions of the underlying design of these editors. Most of the content covers the application level, such as how to use an editor engine to implement certain features. While application-level implementations have their own complexities, the foundational design is a topic more worthy of exploration.

Moreover, I see rich text editors as akin to low-code designs or, more precisely, as an implementation of No Code. Both low-code and rich text rely on a DSL to describe and manipulate DOM structures; rich text primarily uses keyboard input for DOM manipulation, while no-code uses drag-and-drop. I believe there's a shared design philosophy here.

Recently, I have been concentrating on the application layer of editor implementation. Throughout this process, I've encountered numerous challenges and documented related articles. However, during this journey, I found many aspects that I felt could be optimized, especially at the data structure level, and I hope to apply some of my ideas. Specifically, I have several reasons for this:

Editor Column

Learning from books often feels insufficient; true understanding comes from practical engagement.

I started writing my blog in 2020, documenting a wide range of topics, essentially noting down whatever came to mind as part of my learning journey. Then, in 2024, I wrote quite a few articles on rich text editors, primarily consolidating the issues I encountered and their solutions, with a focus on design at the application layer.

Additionally, I recently studied the implementation of the slate rich text editor and contributed some PRs to the slate repository. I also wrote several articles about slate and developed a document editor based on it, again focusing on application-level implementations.

After implementing many application-level features, I discovered numerous areas within the entire editor that could benefit from deeper research. Many implementations seem straightforward, but upon closer examination, there are many details worth exploring, such as the common zero-width characters in DOM structures, the rendering of Mention nodes, etc. These topics could be documented independently, which is actually the most compelling reason for my desire to implement an editor from scratch.

In 2024, I began writing extensively about various business-related topics, but by 2025, I found myself somewhat out of fresh ideas. Currently, I don't have any other areas of expertise, making it a good choice to focus on editor-related content, simplifying the topic selection for my articles. Although my goal is to delve deeper into editor-related themes, I still document challenges as they arise. For example, I recently identified a potential implementation for state management based on immer combined with OT-JSON.

As for the specific implementation of the editor, my current target is to create a usable editor rather than an editor with extensive compatibility and complete functionality. The reason is that there are already many excellent editor implementations available, along with numerous ecosystem plugins to support them, capable of meeting most requirements. My goal for the editor is primarily to ensure compatibility with Chrome, and I won't be considering mobile compatibility at this stage. However, if the editor turns out well, adapting for compatibility is certainly an option I would pursue.

However, at this stage, I am tentatively designing and implementing the editor. During this process, I will inevitably encounter many issues, which will become the main content of this column. Initially, I planned to refine the editor before starting to write articles, but later I realized that the historical solutions from the design process were equally valuable. Thus, I decided to document the design journey as well. If in the future the editor can truly be applied in a production environment, these articles will help trace back to why certain design choices were made, which would be fantastic. Overall, we cannot expect to achieve everything at once, but we can make progress step by step.

Diving Deeper into the Editor

This part reminds me of the saying: with a rich text editor, if you don't write one, you won't understand it.

The editor is a very detail-oriented project and often requires in-depth study of browser APIs, such as the caretPositionFromPoint method on document, which returns the caret position at a given point and is usually used to position the caret in text after drag-and-drop. Additionally, there are many selection-related APIs, like Selection and Range, which are foundational to the implementation of the editor.
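
As a rough sketch of how such a lookup might be wrapped, the helper below prefers caretPositionFromPoint and falls back to the older caretRangeFromPoint. The function name getCaretAtPoint and the returned { node, offset } shape are assumptions made for illustration, and the document object is passed in as a parameter so the logic can be exercised outside a browser.

```javascript
// Hypothetical helper: resolve the caret under a viewport coordinate.
// `doc` is normally `document`; it is injected so a stub can be used in tests.
function getCaretAtPoint(doc, x, y) {
  if (typeof doc.caretPositionFromPoint === "function") {
    // Standard API: returns a CaretPosition with offsetNode/offset, or null.
    const position = doc.caretPositionFromPoint(x, y);
    return position ? { node: position.offsetNode, offset: position.offset } : null;
  }
  if (typeof doc.caretRangeFromPoint === "function") {
    // Older WebKit/Blink API: returns a collapsed Range, or null.
    const range = doc.caretRangeFromPoint(x, y);
    return range ? { node: range.startContainer, offset: range.startOffset } : null;
  }
  return null;
}
```

In a drop handler this might be called as getCaretAtPoint(document, event.clientX, event.clientY), after which the result still has to be mapped onto the editor's own selection model.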

Thus, delving into the underlying workings of the editor is very meaningful. It forces us to interact with the browser frequently, and what we learn also carries over to daily business development. Here, I want to use zero-width characters as an example of the editor's intricate design details. They are a fascinating topic, and such matters often go unnoticed unless studied.

As the name suggests, a zero-width character has no width, so it is not displayed visually. Such characters can therefore serve as invisible placeholders to achieve special effects. For example, they can be used for information hiding, watermarking, and encrypted information sharing. Some novel sites use this method alongside glyph substitution to trace piracy back to its source.

In a rich text editor, if we inspect elements in the developer tools, we may find characters such as &ZeroWidthSpace;, i.e., U+200B, which is a common zero-width character. For instance, in the Feishu document editor, we can find the existing zero-width characters by running $$("[data-enter]") in the console.

<!-- document.querySelectorAll("[data-enter]") -->
<span data-string="true" data-enter="true" data-leaf="true">\u200B</span>
<span data-string="true" data-enter="true" data-leaf="true">&ZeroWidthSpace;</span>

Although this character is not displayed, it is quite important in the editor. Simply put, we need it for cursor placement and for additional display effects. Note that we are referring to editors implemented with ContentEditable; editors that draw their own selection may not need this design.

First, let's discuss the additional display effects. For instance, when selecting text content in Feishu documents, if we reach the end of the text, we notice an extra effect resembling xxx|. If we don’t focus on it, we might think this is a default behavior of the editor, but in reality, this effect does not exist in either slate or quill.

In fact, this effect is achieved by inserting a zero-width character at the end of the line content, which makes the selection visible at the end of the line. The effect is more commonly seen in Word, where an extra line-break mark is rendered.

<div contenteditable="true">
  <div><span>End Zero-Width Character Line 1</span><span>&#8203;</span></div>
  <div><span>End Zero-Width Character Line 2</span><span>&#8203;</span></div>
  <div><span>End Plain Text Line 1</span></div>
  <div><span>End Plain Text Line 2</span></div>
</div>

If the zero-width character only produced a rendering effect, it might seem unnecessary. In terms of interaction, however, this effect is very useful. For example, suppose we have 3 lines of text, select from the end of the 1st line into the 2nd line, and then press the Tab key; both lines will be indented.

Without this display effect, a user who performs the indentation might think they had only selected the 2nd line when they had actually selected lines 1 and 2. This could lead the user to believe there is a bug, and we have indeed received feedback about this interaction.

123|
4|x56

I also conducted a brief survey on various online document implementations: in editors based on contenteditable, Feishu documents and early EtherPad have this interaction; in self-drawn selection editors, DingTalk documents have this implementation; and in editors developed with the Canvas engine, Tencent documents and Google Docs also exhibit this feature.

In terms of rendering, zero-width characters serve another crucial purpose: propping up empty lines. When a line has no content, its DOM structure is effectively empty, so the height of the line collapses to 0 and the cursor cannot be placed in it. To resolve this, we can insert a zero-width character into the line content, which lets the line keep its height and enables cursor placement. Of course, we could also use <br> to maintain the line height; both solutions have their pros and cons, and their compatibility varies.

<div data-line-node></div>
<div data-line-node><br></div>
<div data-line-node><span>&#8203;</span></div>
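
The round trip between an empty line in the model and its placeholder in the DOM might be sketched as follows. The helper names toRenderedText and toModelText are hypothetical, and the sketch assumes that every U+200B read back from the DOM is a placeholder rather than user content.

```javascript
const ZERO_WIDTH_SPACE = "\u200B";

// Text to render for a line: an empty line gets a placeholder so it keeps its height.
function toRenderedText(modelText) {
  return modelText.length === 0 ? ZERO_WIDTH_SPACE : modelText;
}

// Text to store in the model: placeholders read back from the DOM are dropped.
// Assumption: U+200B never appears as genuine user content.
function toModelText(domText) {
  return domText.split(ZERO_WIDTH_SPACE).join("");
}
```

Keeping both directions in one place makes it harder for a placeholder to leak into the stored data, for example after an input event on a previously empty line.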

In editors with a block structure like Notion, there is another important interactive effect: the independent selection of block-level structures. This means that we can directly select an entire code block rather than just the text within it. This effect is rarely implemented in current open-source editors; usually, it requires redesigning the selection area within the block structure.

Typically, this interaction can also be achieved using zero-width characters. Since our selection area usually needs to be placed on text nodes, one straightforward approach is to place a zero-width character at the end of the line where the block structure is located. When the selection area is on the zero-width character, the whole block will be selected. The advantage of using a zero-width character instead of a <br> is that the zero-width character has no width itself and won't cause any unwanted line breaks.

<div>
  <pre><code>
    xxx
  </code></pre>
  <span data-zero-block>&#8203;</span>
</div>

Structurally, zero-width characters play a crucial role as well. In an editor, nodes with contenteditable=false exhibit special behavior, particularly in nodes similar to inline-block, such as Mention nodes. When there is no content before or after these nodes, we need to insert zero-width characters on either side to enable cursor placement.

In the example below, line-1 does not allow the cursor to be placed after the content @xxx, even though it can be placed before it. In that case, however, the cursor sits on the line node rather than on a text node, which is not what we expect. Therefore, we must insert a zero-width character after it. Lines 2 and 3 show the desired cursor placement. The 0.1px margin here is a compatibility "magic" value for cursor placement. Without this hack, the cursor cannot be placed after the inline-block node when the zero-width characters are in non-sibling nodes.

<div contenteditable style="outline: none">
  <div data-line-node="1">
    <span data-leaf><span contenteditable="false" style="margin: 0 0.1px;">@xxx</span></span>
  </div>
  <div data-line-node="2">
    <span data-leaf>&#8203;</span>
    <span data-leaf><span contenteditable="false" style="margin: 0 0.1px;">@xxx</span></span>
    <span data-leaf>&#8203;</span>
  </div>
  <div data-line-node="3">
    <span data-leaf>&#8203;<span contenteditable="false">@xxx</span>&#8203;</span>
  </div>
</div>

Moreover, an editor naturally deals with characters, and because JavaScript strings are sequences of UTF-16 code units, emoji are among the most common and troublesome cases. Besides the fact that a single emoji such as 🎨 has a length of 2, combined emoji are composed using the zero-width joiner \u200d.

"🎨".length
// 2
"🧑" + "\u200d" + "🎨"
// 🧑‍🎨
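
One possible way to count what the user perceives as a single character is Intl.Segmenter, which is available in current browsers and Node.js. The helper name graphemeLength is an assumption for illustration; whether an editor should move the caret by grapheme cluster is a design decision this sketch does not settle.

```javascript
// Count user-perceived characters (grapheme clusters) instead of UTF-16 code units.
function graphemeLength(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

// person + zero-width joiner + palette, the combined emoji shown above
const artist = "\u{1F9D1}\u200D\u{1F3A8}";

console.log(artist.length); // UTF-16 code units: 5
console.log([...artist].length); // code points: 3
console.log(graphemeLength(artist)); // grapheme clusters: 1
```

The three numbers differ because each emoji outside the Basic Multilingual Plane takes two code units, while the joiner glues the two emoji into one cluster.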

Data Structure Design

The design of the editor's data structure has a broad impact, whether in maintaining the textual content of the editor, nesting block structures, serializing and deserializing, or in application-layer aspects such as diff algorithms, find and replace, collaborative algorithms, as well as data transformations for backend services, exporting to md/word/pdf, and data storage. All of these involve the design of the editor’s data structure.

Generally speaking, a nested data structure based on JSON is commonly used to represent the editor's Model, as seen in editors like Slate, ProseMirror, and Lexical. Taking the Slate editor as an example, both the data structure and selection design lean as much as possible towards HTML, allowing for extensive levels of nested nodes.

[
  {
    type: "paragraph",
    children: [{ text: "editable" }],
  },
  {
    type: "ul",
    children: [
      {
        type: "li",
        children: [{ text: "list" }],
      },
    ],
  },
];

It is also a common solution to express document content using a linear flat structure, as seen in editors such as Quill, EtherPad, and Google Docs. For instance, in the Quill editor, the data structure representation does not involve nesting, although it is essentially still a JSON structure, with a more streamlined representation for selections.

[
  { insert: "editable\n" },
  { insert: "list\n", attributes: { list: "bullet" } },
];

Of course, there are many unique data structure designs, such as the piece table data structure used in vscode/monaco. Code editors can also be considered a type of rich text editor, especially since they support syntax highlighting. However, I have yet to delve deeply into structures like the piece table.

Here, I would like to express the entire rich text structure using a linear data structure. Although a nested structure can convey the content of the document more intuitively, it also complicates content manipulation, especially when nested content is involved. Taking slate as an example, the API design in versions before 0.50 was quite complex, requiring significant effort to understand. However, it has been considerably simplified since then:

// https://github.com/ianstormtaylor/slate/blob/6aace0/packages/slate/src/interfaces/operation.ts
export type NodeOperation =
  | InsertNodeOperation
  | MergeNodeOperation
  | MoveNodeOperation
  | RemoveNodeOperation
  | SetNodeOperation
  | SplitNodeOperation;
export type TextOperation = InsertTextOperation | RemoveTextOperation;

From this, we can see that slate needs 9 operation types (Op) to fully manipulate document content: the 8 node and text operations listed above plus set_selection for the selection. In contrast, if we use a linear structure, we only need three types of operations to express every change to the document. Of course, operations like Move then require additional range calculations, which effectively shifts the computational cost to the application layer.

// https://github.com/WindRunnerMax/BlockKit/blob/c24b9e/packages/delta/src/delta/interface.ts
export interface Op {
  // Only one property out of {insert, delete, retain} will be present
  insert?: string;
  delete?: number;
  retain?: number;

  attributes?: AttributeMap;
}
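
To make the three operation types concrete, here is a minimal sketch of applying a list of ops to a flat document. It is not the BlockKit or Quill implementation: the function names are hypothetical, the document is assumed to be an array of insert ops, lengths are counted in UTF-16 code units, and a null attribute value on retain is assumed to remove that attribute, following the Quill convention.

```javascript
// Expand insert ops into one entry per code unit so positions are easy to address.
function toChars(doc) {
  const chars = [];
  for (const op of doc) {
    for (let i = 0; i < op.insert.length; i++) {
      chars.push({ ch: op.insert[i], attributes: op.attributes || {} });
    }
  }
  return chars;
}

function sameAttributes(a, b) {
  const keysA = Object.keys(a);
  return keysA.length === Object.keys(b).length && keysA.every((key) => a[key] === b[key]);
}

// Merge neighbouring characters with equal attributes back into insert ops.
function toDoc(chars) {
  const doc = [];
  for (const { ch, attributes } of chars) {
    const last = doc[doc.length - 1];
    if (last && sameAttributes(last.attributes, attributes)) last.insert += ch;
    else doc.push({ insert: ch, attributes });
  }
  return doc;
}

// Apply insert/delete/retain ops to a flat document and return the new document.
function applyOps(doc, ops) {
  const chars = toChars(doc);
  const result = [];
  let index = 0;
  for (const op of ops) {
    if (typeof op.insert === "string") {
      result.push(...toChars([op]));
    } else if (typeof op.delete === "number") {
      index += op.delete; // skip the deleted characters
    } else if (typeof op.retain === "number") {
      for (const item of chars.slice(index, index + op.retain)) {
        const attributes = { ...item.attributes, ...(op.attributes || {}) };
        for (const key of Object.keys(attributes)) {
          if (attributes[key] === null) delete attributes[key];
        }
        result.push({ ch: item.ch, attributes });
      }
      index += op.retain;
    }
  }
  result.push(...chars.slice(index)); // the untouched tail is implicitly retained
  return toDoc(result);
}
```

For example, applying [{ retain: 6 }, { delete: 5 }, { insert: "editor", attributes: { bold: true } }] to [{ insert: "hello world" }] keeps "hello ", drops "world", and appends a bold "editor". Expanding to one entry per character is chosen for readability, not performance.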

Additionally, normalizing nested structures can become very complex, and the time complexity of applying changes increases significantly, particularly because of the dirty-path marking algorithm and the follow-up processing that each of the aforementioned Ops requires. Furthermore, user actions can produce nesting levels that are hard to control, so a normalization pass is needed to standardize the data. Otherwise, pasting HTML, for instance, may produce deeply nested data.

[{
  children: [{
    children: [{
      children: [{
        children: [{
          // ...
          text: "content"
        }]
      }]
    }]
  }]
}]
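
One very small normalization rule for data like this could be to collapse wrappers that carry no type and have a single child. The sketch below is only an illustration of that one rule under those assumptions; it is not Slate's normalizer, and the function name unwrap is hypothetical.

```javascript
// Collapse untyped wrappers with exactly one child; keep typed nodes and text leaves.
function unwrap(node) {
  if (node.text !== undefined) return node; // text leaf
  const children = node.children.map(unwrap);
  if (node.type === undefined && children.length === 1) return children[0];
  return { ...node, children };
}

const pasted = { children: [{ children: [{ children: [{ text: "content" }] }] }] };
console.log(JSON.stringify(unwrap(pasted))); // {"text":"content"}
```

A real editor needs many such rules and has to decide in which order they run, which is where much of the complexity mentioned above comes from.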

To give a more practical example, consider nested formats such as quote and list. If our document's data structure is nested, manipulating the content could result in two outcomes: ul > quote or quote > ul. Typically, we would need to design normalization rules to settle on one of them. With a flat structure, however, all formats live in the attributes map, so the resulting data is the same no matter which operation is applied first, and reapplying a format changes nothing (the operations are idempotent).

// slate
[{
  type: "quote",
  children: [{
    type: "ul",
    children: [{ text: "text" }]
  }],
}, {
  type: "ul",
  children: [{
    type: "quote",
    children: [{ text: "text" }]
  }],
}]

// quill
[{
  insert: "text",
  attributes: { blockquote: true, list: "bullet" }
}]
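
The order independence of the flat form can be checked with a few lines. The helper names applyFormat and sameAttributes are hypothetical and the op shape follows the simplified quill example above.

```javascript
// Hypothetical helper: apply one line format to a flat op by merging it into `attributes`.
function applyFormat(op, key, value) {
  return { ...op, attributes: { ...op.attributes, [key]: value } };
}

// Compare two attribute maps without depending on key order.
function sameAttributes(a, b) {
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  return keysA.length === keysB.length && keysA.every((key) => a[key] === b[key]);
}

const base = { insert: "text" };
const quoteThenList = applyFormat(applyFormat(base, "blockquote", true), "list", "bullet");
const listThenQuote = applyFormat(applyFormat(base, "list", "bullet"), "blockquote", true);
const appliedTwice = applyFormat(quoteThenList, "list", "bullet");

console.log(sameAttributes(quoteThenList.attributes, listThenQuote.attributes));
console.log(sameAttributes(quoteThenList.attributes, appliedTwice.attributes));
```

Both comparisons log true: the two application orders end in the same attribute map, and applying the list format a second time leaves it unchanged.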

The flat data structure has advantages in terms of data handling, though it may struggle to express structured data at the view layer, especially for expressing nested constructs like code blocks or tables. However, this is not an unfeasible task; for instance, the complex nested tables in Google Docs are actually based on a completely linear structure, which involves some clever design that I won't elaborate on here.

Moreover, if we need to implement an online document editor, our management workflow may require diffs, meaning we need to determine the additions, deletions, and modifications between two versions of the data. Here, a flat data structure handles text content more effectively, while nested JSON structures lead to significant complications. Other peripheral applications that process the data would also become more complex.
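
As a rough illustration of why flat text is convenient to diff, the sketch below turns two strings into retain/delete/insert ops by trimming their common prefix and suffix. This is a deliberately naive stand-in for a real diff algorithm such as Myers, it ignores attributes, and the function name diffText is hypothetical.

```javascript
// Produce retain/delete/insert ops that turn `prev` into `next`.
// Only one changed region is detected: whatever lies between the common prefix and suffix.
function diffText(prev, next) {
  const max = Math.min(prev.length, next.length);
  let start = 0;
  while (start < max && prev[start] === next[start]) start++;
  let end = 0;
  while (
    end < max - start &&
    prev[prev.length - 1 - end] === next[next.length - 1 - end]
  ) {
    end++;
  }
  const ops = [];
  if (start > 0) ops.push({ retain: start });
  const removed = prev.length - start - end;
  if (removed > 0) ops.push({ delete: removed });
  const inserted = next.slice(start, next.length - end);
  if (inserted.length > 0) ops.push({ insert: inserted });
  return ops;
}

console.log(JSON.stringify(diffText("hello world", "hello editor")));
// [{"retain":6},{"delete":5},{"insert":"editor"}]
```

Because the comparison works on UTF-16 code units, a production version would also need to avoid splitting surrogate pairs.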

Lastly, there are implementations related to collaboration. Collaborative algorithms are optional modules of rich text editors. Whether based on OT or Op-Based CRDT collaboration algorithms, they require the transmission of the aforementioned op types and data. Clearly, the complexity of having 9 operations in op types is greater than just 3 operation types.

Therefore, I want to implement the entire editor with a linear data structure, which makes quill's delta a very suitable choice. However, quill ships its own view layer, which cannot easily be combined with other view layers such as react. The advantage of combining with such a view layer is that we can build the editor's UI with existing component libraries instead of implementing the styles of every component ourselves. Thus, I plan to build the core of a rich text editor from scratch on top of quill's data structure and to integrate it with common view layers, as slate does.

Solution Selection

There is an interesting question here: why is it possible to achieve some capabilities similar to those of an Office Word editor with a codebase of less than 1MB? This is because browsers have already handled many tasks for us and provided APIs for developers, including input method processing, font parsing, layout engines, view rendering, and more.

Therefore, we need to decide how our editor will interact with the browser. The classic description of rich text editors divides them into three levels:

L0: Implements rich text editing based on the ContentEditable feature provided by the browser and uses document.execCommand for command operations. This serves as an early lightweight editor that can be developed relatively quickly, but it offers very limited customization options.
L1: Also implements rich text editing based on ContentEditable, but allows for a customizable data model and command execution via data-driven approaches. Common examples include Yuque and Feishu Documents, which meet the vast majority of use cases but cannot extend beyond the browser's built-in layout effects.
L2: Implements a layout engine based on Canvas, relying on a minimal set of browser APIs. Common implementations include Google Docs and Tencent Docs, requiring complete control over the layout, akin to using a drawing board rather than the DOM to render rich text, which presents a significant technical challenge.

In fact, in the current landscape of open source products, all three types of editors are represented, with the majority being of the L1 type. Among these, there is also a subdivision that operates without relying on ContentEditable, yet is not a fully hand-drawn engine, but rather relies on the DOM for content display with a custom selection rendering implementation, effectively categorizing it as L1.5.

With a focus on learning, it's crucial to choose an implementation with abundant open source products so that we can better reference and analyze related content when encountering issues. As such, I intend to select an editor based on ContentEditable that implements a data-driven standard MVC model, interacting with the browser to achieve basic rich text editing capabilities. Before we proceed, let’s first understand the fundamentals of editor implementation:

ExecCommand

If we only need basic inline styles like bold, italic, and underline, as in some simple input boxes, execCommand may be sufficient. The advantage of using execCommand directly is its very small footprint; for instance, pell is implemented in only 3.54KB of code, and react-contenteditable is another implementation of this kind.

We can also build a minimal demo to achieve bold text. execCommand operates on the current selection inside a contenteditable element. The document.execCommand method accepts three parameters: the command name, a boolean indicating whether a user interface should be shown, and a command argument. The user interface flag is typically set to false, as it's not implemented in Mozilla, and the command argument is optional; for example, the hyperlink command requires a link address.

<div>
  <button id="$1">Bold</button>
  <div style="border: 1px solid #eee; outline: none" contenteditable>
    123123
  </div>
</div>
<script>
  $1.onclick = () => {
    document.execCommand("bold");
  };
</script>

Of course, this example is quite simple. We could also compute the bold state of the button whenever the selection changes, so that it reflects the current selection. However, that state has to stay in sync with whatever execCommand actually did, and execCommand is very hard to control. Therefore, we need to iterate through all the selected nodes using document.createTreeWalker to determine the current selection state.

It's also essential to note that the behavior of the execCommand command can vary across different browsers. This is another instance of browser compatibility issues we've discussed previously, and sadly, we have no way to control this as they are default behaviors:

In an empty contenteditable editor, pressing the enter key results in Chrome inserting <div><br></div>, while Firefox (<60) inserts <br>, and IE inserts <p><br></p>.
In an editor with text, inserting an enter character in the middle of the text, such as 123|123, causes Chrome to produce 123<div>123</div>, while Firefox produces <div>123</div><div>123</div>.
Similarly, in an editor with text, if you insert an enter character and then delete it, such as 123|123->123123, Chrome will revert to 123123, while Firefox will change it to <div>123123</div>.
If two lines of text are selected simultaneously and we execute the command ("formatBlock", false, "P"), Chrome will wrap both lines in a single <p>, while Firefox will wrap each line in a separate <p> tag.
...

Additionally, when implementing bold, we cannot control whether <b></b> or <strong></strong> is used. There are also browser compatibility issues: for instance, IE implements bold with <strong></strong>, while Chrome uses <b></b>. Furthermore, IE and Safari do not support the heading command for headings, among other limitations. More complex features, such as images and code blocks, are also hard to implement well.

However, the default behavior is not entirely useless. In some cases, we may want a pure HTML editor. An editor built on the MVC pattern has to deal with content that cannot be represented in its Model, and discarding that content means losing part of the original HTML. In such scenarios, relying on the browser's default behavior may be the most efficient approach, and our main concern would then likely be handling XSS.

ContentEditable

ContentEditable is an HTML5 attribute that makes an element editable. Paired with the built-in execCommand, it is the basis of the simple demo discussed earlier. To create the most basic text editor, simply enter the following in the address bar:

data:text/html,<div contenteditable style="border: 1px solid black"></div>

Although using document.execCommand to modify HTML is straightforward, it is poorly controllable, as discussed above. Besides the aforementioned compatibility issues with execCommand, many DOM behaviors also need compatibility handling, such as the following sentence with a simple bold format:

123**456**789

There are numerous ways to express such content, and the editor may consider them visually equivalent. In that case, we might need to treat these DOM structures equivalently:

<span>123<b>456</b>789</span>
<span>123<strong>456</strong>789</span>
<span>123<span style="font-weight: bold;">456</span>789</span>

However, these are merely visually equivalent. Mapping all of them onto the Model is naturally a cumbersome task. Moreover, expressing the selection is also complex. For example, consider the following DOM structure:

<span>123</span><b><em>456</em></b><span>789</span>

If we want to express the selection collapsed to the left of character 4, there will be multiple ways to achieve this position, which will heavily rely on the browser's default behavior:

{ node: 123, offset: 3 }
{ node: <em></em>, offset: 0 }
{ node: <b></b>, offset: 0 }

In order to achieve stronger extensibility and controllability, as well as to address the issue of data not corresponding with the view, L1's rich text editor employs the concept of a custom data model. This involves extracting a data structure from the DOM tree; having the same data structure ensures that the rendered HTML will also be consistent. By utilizing custom commands to directly manipulate the data model, we ultimately ensure the consistency of the rendered HTML document. To express the selection, we need to continuously normalize the selection model based on the DOM selection.
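
The normalization step can be illustrated with plain objects standing in for DOM nodes. In the sketch below, the three equivalent positions to the left of the character 4 all map to the same linear offset. The node shapes and the function name toModelOffset are assumptions for illustration; a real implementation would walk actual DOM nodes and also handle zero-width placeholders.

```javascript
// Minimal stand-ins for DOM nodes: text nodes carry `text`, elements carry `children`.
const text = (value) => ({ text: value });
const element = (tag, children) => ({ tag, children });

function textLength(node) {
  if (node.text !== undefined) return node.text.length;
  return node.children.reduce((sum, child) => sum + textLength(child), 0);
}

// Linear offset of the point (target, offset) inside `root`, or -1 if target is not in root.
function toModelOffset(root, target, offset) {
  if (root === target) {
    if (root.text !== undefined) return offset; // offset counts characters
    // For elements, offset counts child nodes, as in the DOM Range model.
    return root.children.slice(0, offset).reduce((sum, child) => sum + textLength(child), 0);
  }
  if (root.text !== undefined) return -1;
  let before = 0;
  for (const child of root.children) {
    const found = toModelOffset(child, target, offset);
    if (found !== -1) return before + found;
    before += textLength(child);
  }
  return -1;
}

// <span>123</span><b><em>456</em></b><span>789</span>
const t123 = text("123");
const em = element("em", [text("456")]);
const b = element("b", [em]);
const line = element("div", [element("span", [t123]), b, element("span", [text("789")])]);

console.log(toModelOffset(line, t123, 3)); // 3
console.log(toModelOffset(line, em, 0)); // 3
console.log(toModelOffset(line, b, 0)); // 3
```

Once every DOM point collapses to one number, the model selection no longer depends on which of the equivalent positions the browser happened to report.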

This essentially aligns with the common MVC model, where executing commands modifies the current model, which in turn reflects changes in the view's rendering. Simply put, it involves constructing a data model that describes the document's structure and content, while utilizing custom execCommand to modify this descriptive model. In this phase of the rich text editor, abstracting the data model resolves issues related to dirty data and challenges in implementing complex features. We can briefly outline the process:

<script>
  const editor = {
    // Model selection
    selection: {},
    execCommand: (command, value) => {
      // Execute specific commands, such as bold
      // After executing the command, update the model and call DOM render
    },
  }
  const model = [
    // Data model
    { type: "bold", text: "123" },
    { type: "span", text: "123123" },
  ];
  const render = () => {
    // Render specific DOM based on type
  };
  document.addEventListener("selectionchange", () => {
    // On selection change
    // Update the model selection based on the DOM selection
  });
</script>
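
A slightly more concrete version of this outline, still only a sketch, is shown below. It renders to an HTML string instead of touching the DOM, supports a single bold command, and skips HTML escaping and the merging of adjacent leaves; the createEditor name and the leaf shape { text, bold } are assumptions for illustration.

```javascript
// Minimal data-driven editor: commands change the model, and the view is derived from it.
function createEditor(initialText) {
  const editor = {
    model: [{ text: initialText, bold: false }],
    selection: { start: 0, end: 0 }, // linear offsets into the model text
    render() {
      // The same model always yields the same HTML (escaping omitted for brevity).
      return editor.model
        .map((leaf) => (leaf.bold ? `<strong>${leaf.text}</strong>` : `<span>${leaf.text}</span>`))
        .join("");
    },
    execCommand(command) {
      if (command !== "bold") return;
      const { start, end } = editor.selection;
      const next = [];
      let index = 0; // offset of the current leaf within the whole text
      for (const leaf of editor.model) {
        const from = Math.max(start - index, 0);
        const to = Math.min(end - index, leaf.text.length);
        if (from >= to) {
          next.push(leaf); // leaf is outside the selection
        } else {
          const parts = [
            { text: leaf.text.slice(0, from), bold: leaf.bold },
            { text: leaf.text.slice(from, to), bold: true },
            { text: leaf.text.slice(to), bold: leaf.bold },
          ];
          next.push(...parts.filter((part) => part.text.length > 0));
        }
        index += leaf.text.length;
      }
      editor.model = next;
    },
  };
  return editor;
}

const editor = createEditor("123456789");
editor.selection = { start: 3, end: 6 };
editor.execCommand("bold");
console.log(editor.render());
// <span>123</span><strong>456</strong><span>789</span>
```

Because the command only edits the model and the renderer always emits strong for bold leaves, the choice between b and strong is no longer left to the browser.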

Similar solutions, whether in quill or slate, follow this kind of scheduling. However, the implementation akin to slate requires more complex compatibility handling after integrating an adapter with React. When a ContentEditable is added to a React node, a warning like the one below may arise:

<div
  contentEditable
  suppressContentEditableWarning
></div>
// A component is `contentEditable` and contains `children` managed by React. It is now your responsibility to guarantee that none of those nodes are unexpectedly modified or duplicated. This is probably not intentional.

This warning indicates that React cannot guarantee the children inside a ContentEditable element will not be unexpectedly modified or duplicated. In other words, React expects to be the only party performing DOM operations on its nodes, but once ContentEditable is enabled, the browser also modifies that DOM directly, and this behavior is outside React's control. Naturally, the same issue will show up in our editor.

Additionally, there are other behaviors to consider. For instance, in the example below, we cannot select continuously from the text 123 to 456. The selection would have to cross separate ContentEditable nodes, so the browser's default behavior is unavailable. This is reasonable, but it does complicate the implementation of a blocks editor.

<div contenteditable="false">
  <div contenteditable="true">123</div>
  <div contenteditable="true">456</div>
</div>

We can actually avoid using ContentEditable. Imagine that even if we don't implement an editor, text content on the page can still be selected, which is essentially our standard selection implementation. If we utilize the native selection and then create a controller layer on top of it, we could achieve a fully controlled editor.

However, this presents a significant challenge: content input. Since disabling ContentEditable means the cursor won't appear, it's obviously impossible to enter any text. If we wish to enable content input, particularly to activate the IME (Input Method Editor), the conventional API provided by browsers is to use <input>. Thus, we must implement a hidden <input> element for text input. In fact, many code editors, such as CodeMirror, adopt a similar approach.

However, using hidden <input> elements can lead to other issues. When the focus is on the input, users cannot select the text in the browser. This is because within the same page, focus can only exist at one position at a time, which means we need to implement custom selection rendering in such cases. For instance, DingTalk documents and Youdao Cloud Notes use custom selection rendering, while the open-source Monaco Editor also implements this approach. On the other hand, TextBus handles cursor rendering, while selection is managed by the browser.

To summarize, when using ContentEditable, we have to handle many peculiar DOM behaviors, but we do not need to worry much about activating text input. If we choose not to use ContentEditable and instead rely on the DOM only to present rich text content, we must add a hidden input node to receive input. Because focus then sits on that input, we cannot use the browser's selection behavior, which again leads us to implement custom selection rendering.

Canvas

Drawing the desired content using Canvas gives off a somewhat Renaissance vibe. This method is entirely independent of the DOM, allowing for full control over the layout engine. "Renaissance" here refers to the fact that everything in the ecosystem that depends on the DOM stops working, including accessibility, SEO, developer tools support, and so on.

So why abandon the existing DOM ecosystem in favor of Canvas for rendering rich text content? The cost is certainly high. Rich text can be quite complex, as it involves not only text but also images and various structured formats, such as tables. Rendering all of this requires custom implementations, which effectively means partially reinventing Skia.

Currently, editors based on Canvas include Tencent Documents and Google Docs, with open-source implementations like Canvas Editor. Beyond document editors, online spreadsheets are primarily implemented using Canvas, such as Tencent Document Sheets and Feishu's multidimensional tables, with open-source implementations like LuckySheet.

In a blog post released by Google Docs, two main reasons were cited for using Canvas to render documents:

Consistency in documents: Here, consistency means behaving the same way across browsers. For example, in Chrome, double-clicking on a block of text automatically highlights the entire word, whereas earlier versions of Firefox would select a whole sentence instead. Such differences lead to an inconsistent user experience. Rendering documents with Canvas makes it possible to implement and maintain this consistency independently of the browser.
Efficient rendering performance: Rendering documents via Canvas allows for better control over when to draw, eliminating the need to wait for reflows and repaints and avoiding the performance cost of the DOM's complex compatibility considerations. Furthermore, replacing heavy DOM operations with Canvas drawing enables frame-by-frame rendering and hardware acceleration, which improves the responsiveness of user interactions and the overall user experience.
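Controlling when to draw can be illustrated with a small scheduler that coalesces many invalidations into a single draw per frame. This is only a sketch of the general idea, not how Google Docs does it: `createRenderScheduler`, `schedule`, and `draw` are hypothetical names, and the frame source is injected so the logic stays independent of the browser.

```typescript
// Coalesce repeated invalidations into one draw call per scheduled frame.
function createRenderScheduler(
  schedule: (callback: () => void) => void,
  draw: (blockIds: string[]) => void
) {
  const dirty = new Set<string>();
  let pending = false;
  return {
    // Mark a block as needing a redraw; only the first call per frame schedules work.
    invalidate(blockId: string): void {
      dirty.add(blockId);
      if (pending) return;
      pending = true;
      schedule(() => {
        const ids = Array.from(dirty);
        dirty.clear();
        pending = false;
        draw(ids);
      });
    },
  };
}
```

In a browser, `schedule` could be a thin wrapper around `requestAnimationFrame`, so that any number of edits within one frame results in a single pass over the dirty blocks.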

Moreover, a custom layout engine gives us control over the typesetting of documents and makes various rich text requirements feasible. A common question is why our products cannot reproduce typesetting similar to Microsoft Word. For example, when a line is exactly filled with text, adding a period should not cause the preceding text to wrap; the period should instead fit at the end of the line. Only when another character is added does that character move to the next line. Browser rendering does not behave this way. Hence, to surpass the limitations of browser typesetting, we have to implement our own typesetting capabilities.

<!-- word -->
Text text text text text text text text text text text text text text text.
<!-- browser -->
Text text text text text text text text text text text text text text text
.

In other words, a period typically does not appear at the beginning of a line in Word, whereas this can happen in browsers, especially with pure ASCII characters. If we want to avoid such typesetting discrepancies, we must implement our own layout engine to control the typesetting effects of documents.
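The rule that a period may not start a line can be sketched as a small line breaker. This is a deliberately simplified illustration under loud assumptions: every character is treated as having the same width, lines break at character level, and `NO_LINE_START` and `breakLines` are hypothetical names. A real layout engine would measure glyph widths and apply far more complete rules.

```typescript
// Characters that are not allowed to begin a line (an illustrative subset).
const NO_LINE_START = new Set([".", ",", "!", "?", ";", ":"]);

// Break text into lines of at most maxChars characters.
// With hangPunctuation enabled, forbidden characters hang past the limit
// instead of wrapping to the start of the next line.
function breakLines(text: string, maxChars: number, hangPunctuation = true): string[] {
  const lines: string[] = [];
  let current = "";
  let count = 0;
  for (const ch of Array.from(text)) {
    const lineIsFull = count >= maxChars;
    const mayHang = hangPunctuation && NO_LINE_START.has(ch);
    if (lineIsFull && !mayHang) {
      lines.push(current);
      current = "";
      count = 0;
    }
    current += ch;
    count += 1;
  }
  if (count > 0) lines.push(current);
  return lines;
}
```

With a limit of six characters, `breakLines("abcdef.g", 6)` keeps the period on the first line and only wraps the following character, while `breakLines("abcdef.g", 6, false)` moves the period to the start of the second line, matching the two behaviors shown above.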

Additionally, there are various other features to consider, such as controlled RTL layouts, pagination, page numbering, headers, footnotes, font glyph control, and so on. Pagination capability, for instance, is essential in certain contexts where printing is required, but the DOM implementation cannot determine height before rendering, making it challenging to achieve effective pagination. Moreover, managing the rendering of large tables and their pagination also becomes difficult to control.
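Once a layout engine knows block heights before drawing, pagination becomes a straightforward pass over those heights. The following sketch assumes the heights are already available and that blocks are never split across pages; `paginate` is a hypothetical name, and splitting large tables or paragraphs is left out.

```typescript
// Assign blocks to pages by index, given each block's height and the page height.
// A block taller than a page is placed alone on its own page rather than split.
function paginate(heights: number[], pageHeight: number): number[][] {
  const pages: number[][] = [];
  let current: number[] = [];
  let used = 0;
  heights.forEach((height, index) => {
    // Start a new page when the block does not fit in the remaining space.
    if (current.length > 0 && used + height > pageHeight) {
      pages.push(current);
      current = [];
      used = 0;
    }
    current.push(index);
    used += height;
  });
  if (current.length > 0) pages.push(current);
  return pages;
}
```

With the DOM, the `heights` input is exactly what is missing until after rendering, which is why the same pass is hard to run ahead of time there.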

Therefore, if we wish to align our implementation with that of Word, we must completely rebuild it using Canvas. Beyond these additional features, there are also fundamental capabilities that browsers provide for free on top of the DOM, such as input method support, copy and paste, and drag and drop. A bare Canvas supports none of these. Input method editor (IME) support and text selection in particular require complex interactions and carry a significant implementation cost.
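As one small example of what has to be rebuilt, placing the caret on click requires mapping a pointer coordinate back to a text offset. The sketch below assumes a single line whose character widths are already known (on a real Canvas they could come from `CanvasRenderingContext2D.measureText`); `offsetAtX` is a hypothetical name.

```typescript
// Map an x coordinate (relative to the line start) to the nearest caret offset.
// Clicking in the left half of a character places the caret before it,
// clicking in the right half places the caret after it.
function offsetAtX(widths: number[], x: number): number {
  let left = 0;
  for (let i = 0; i < widths.length; i++) {
    const width = widths[i] ?? 0;
    if (x < left + width / 2) return i;
    left += width;
  }
  return widths.length;
}
```

Selection by dragging, double-click word selection, and keyboard navigation all need similar logic, which is part of why the cost adds up.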

Conclusion

In this article, we have discussed many aspects of implementing basic capabilities in rich text editors, focusing in particular on how the DOM structure relates to the data structure design. We also covered the characteristics of the main browser interaction solutions, namely ExecCommand, ContentEditable, and Canvas, with a brief overview of mature products and open-source editors and the advantages and disadvantages of each approach.

We will implement a basic rich text editor based on ContentEditable later on, starting with an overview of the overall architectural design and operations of the data structure. Following that, we will implement specific modules one by one, such as the input module, clipboard module, selection module, and so on. Building an editor is never a simple task; in addition to the foundational framework design at the core level, there will be many compatibility issues to address at the application layer. Hence, this will be a substantial project that will require gradual accumulation of efforts.

Daily Challenge

https://github.com/WindRunnerMax/EveryDay
