Previously, we implemented bidirectional synchronization between the editor selection and the model selection, which gives us controlled selection operations, a fundamental capability of the editor. Next, building on the selection module, we need to implement a semi-controlled input mode based on the browser's composition events, handle the complex default behaviors of the DOM structure, and stay compatible with the various input scenarios of IMEs.
The `Input` module handles input in the editor, which is one of its core operations. We need to deal with several input sources, such as input method editors (IMEs), the keyboard, and the mouse. Interaction with IMEs requires extensive compatibility handling, covering candidate words, predictive text, shortcut input, accents, and more. Mobile IMEs are even harder to support; the `draft.js` documentation lists mobile IME compatibility issues separately.
Like the selection module, the editor input module has to build on the browser's DOM and manage its default behavior. Once an IME is activated, additional modules must cooperate, which calls for complex compatibility adaptations. Input modes fall into three types: uncontrolled input, semi-controlled input, and controlled input, each with its own use cases and implementation methods.
The uncontrolled method relies entirely on the browser's default behavior to handle input operations without intervention or modification. Changes in the DOM structure need to be monitored and applied to the editor after collection. While this method maximizes the use of native browser capabilities like selections and cursors, its main drawback is the lack of control over input, inability to prevent default behaviors, and instability.
For example, a common problem today is that `ContentEditable` cannot completely block IME input, so Chinese input cannot be fully controlled. Even when an editable element is set up so that typing English letters and digits has no effect, Chinese characters can still be entered normally. This limitation is one reason why many editors opt for custom selection rendering and controlled input, as seen in applications like VSCode and DingTalk.
With uncontrolled input, we use `MutationObserver` to identify the characters that were just entered, then parse the DOM structure to extract the latest `Text Model`. That result is diffed against the original `Text Model` to determine the changes and generate `ops`, which are then applied to the current `Model` for further processing.
Even within uncontrolled input there are several implementation approaches. One is to run a text `diff` after the `Input` event fires to derive the `ops`, then combine attributes according to the schema. Another is to rely entirely on `MutationObserver` to capture fragment-level node changes and then `diff` them. The well-known `Quill` editor takes the latter approach.
Handling input in `Quill` is not overly complex, despite involving a lot of event communication and special-case handling, and the core logic remains relatively clear. One difficulty, however, is that the view layer encapsulated by `parchment` is not in the core package. Although some methods are inherited and overridden, debugging can still be cumbersome, especially as elements like `Text` are directly exported from `Quill`.
Overall, `Quill` handles uncontrolled input in two ways. For regular `ASCII` input, it directly compares the `MutationRecord`'s `oldValue` with the latest `newText` to obtain the change `ops`. Input through an `IME`, such as Chinese input, may produce multiple `Mutations`, in which case it falls back to a full `delta` comparison to identify the changes.
The key question here is how `textBlot` obtains the latest value. Whether the path goes through `MutationRecord` or `getDelta`, the text is read with `textBlot.value()`. `getDelta` iterates over all `Blots` to collect the latest `value`, and this part is cached per line to avoid performance problems.
`TextBlot` is implemented in `parchment`, which makes it harder to debug. Here we only need to follow how plain text content is brought up to date. A `Blot` has an update method that is triggered when the `DOM` changes. Note that the updated text is obtained through a static method, not from the instance's `.value`.
The timing of updates also matters: the content of the `Blot` must be updated first so that the latest text is available, and only then is the `scroll` update scheduled to update the editor model. We focus mainly on input changes here, but `format` operations also change the `DOM` structure, and those `MutationRecord` entries are handled by the `optimize` method.
There is an interesting detail in how the `diff` method takes a `cursor` parameter. Consider text changing from `xxx` to `xx`: there are many possible explanations. A character could have been deleted at any position, a character could have been deleted forward at the cursor position, or two `x` could even have been deleted and one `x` inserted.
Therefore, to identify the change `ops` precisely, the cursor position is passed into the `diff` method. The string can then be split into three segments: an identical prefix, an identical suffix, and a middle part that holds the difference. Because this is a high-frequency operation, the shortcut avoids the cost of a full `diff` and handles text changes faster.
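The three-segment split can be sketched as follows. This is a minimal illustration and not Quill's actual code: the `Op` shape and the `diffWithCursor` name are assumptions made for the example, and the cursor is taken to be the caret offset in the new text after the edit.

```typescript
// Hypothetical op shape, loosely modeled on Delta-style retain/insert/delete.
type Op = { retain: number } | { insert: string } | { delete: number };

/**
 * Diff two strings using the caret position in the new text as a hint.
 * The common suffix is matched first but never extends left of the caret;
 * the common prefix then takes whatever remains.
 */
function diffWithCursor(oldText: string, newText: string, cursor: number): Op[] {
  const oldLen = oldText.length;
  const newLen = newText.length;
  const caret = Math.max(0, Math.min(cursor, newLen));

  // Suffix: text after the caret is assumed to be untouched by the edit.
  const maxSuffix = Math.min(oldLen, newLen - caret);
  let suffix = 0;
  while (suffix < maxSuffix && oldText[oldLen - 1 - suffix] === newText[newLen - 1 - suffix]) {
    suffix++;
  }

  // Prefix: must not overlap the suffix in either string.
  const maxPrefix = Math.min(oldLen - suffix, newLen - suffix);
  let prefix = 0;
  while (prefix < maxPrefix && oldText[prefix] === newText[prefix]) {
    prefix++;
  }

  // The middle segment is the only part that differs.
  const ops: Op[] = [];
  if (prefix > 0) ops.push({ retain: prefix });
  const inserted = newText.slice(prefix, newLen - suffix);
  if (inserted.length > 0) ops.push({ insert: inserted });
  const deleted = oldLen - prefix - suffix;
  if (deleted > 0) ops.push({ delete: deleted });
  return ops;
}
```

For the `xxx` to `xx` case, the caret decides which of the equally valid answers is produced: a caret at offset 1 yields a retain of 1 followed by a delete of 1, while a caret at offset 0 yields a delete at the very start.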
The semi-controlled method uses `BeforeInputEvent` to handle English input and content deletion and `CompositionEvent` to handle `IME` input, with additional `KeyDown` and `Input` events assisting where needed. This approach lets us intercept user input and construct the changes that are applied to the current content model.
`CompositionEvent` still needs extra handling because, as mentioned earlier, `IME` input cannot be fully controlled; this is why semi-controlled input is the mainstream approach. Browser support also varies, so `BeforeInputEvent` usually needs a compatibility layer, for example React's synthetic events or `onKeyDown`.
The `slate` editor implements its input mode in a semi-controlled manner, relying mainly on the `beforeinput` event and the `composition` events to handle input and deletion. When `slate` was first implemented, the `beforeinput` event was not widely supported; today most modern browsers support it, and the `composition` events have long been widely supported.
In the controlled part, control specifically means preventing the default input behavior so that we can update the editor model ourselves based on the relevant events. In this scenario we mainly care about the `inputType` values related to `insert`. Even so, input comes in many patterns, and `slate` contains extensive compatibility logic to work around differences between browser implementations.
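As a rough illustration of the controlled part, the sketch below dispatches on `inputType` and prevents the default behavior for the types it takes over. It is not Slate's implementation. `InputLike` is a structural stand-in for the browser's `InputEvent` so the code runs without a DOM, and `EditorLike` with its method names is hypothetical. Leaving unknown input types to the browser is a choice made for this example only.

```typescript
// Minimal stand-in for the browser InputEvent (assumption for this sketch).
interface InputLike {
  inputType: string;
  data: string | null;
  preventDefault(): void;
}

// Hypothetical editor surface; the method names are illustrative only.
interface EditorLike {
  insertText(text: string): void;
  deleteBackward(): void;
  deleteForward(): void;
  insertBreak(): void;
}

/** Returns true when the event was taken over and applied to the model. */
function handleBeforeInput(event: InputLike, editor: EditorLike, isComposing: boolean): boolean {
  // While an IME composition is active the browser owns the DOM; do not intercept.
  if (isComposing || event.inputType === "insertCompositionText") return false;
  switch (event.inputType) {
    case "insertText":
      event.preventDefault();
      if (event.data) editor.insertText(event.data);
      return true;
    case "deleteContentBackward":
      event.preventDefault();
      editor.deleteBackward();
      return true;
    case "deleteContentForward":
      event.preventDefault();
      editor.deleteForward();
      return true;
    case "insertParagraph":
    case "insertLineBreak":
      event.preventDefault();
      editor.insertBreak();
      return true;
    default:
      // Unknown types are left to the browser in this sketch.
      return false;
  }
}
```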
Next, consider the uncontrolled part of Slate. It exists because IME input cannot be fully controlled, which creates compatibility issues that must be handled. This handling is fairly complex in Slate, since behavior is inconsistent across browsers; for example, Safari emits an `insertFromComposition` input type that has to be corrected in such scenarios to keep the editor model consistent.
Apart from the inability to prevent default behaviors, the uncontrolled aspect is also evident in modifications to the DOM structure. This part can be considered the most challenging to handle because once the IME is activated, it inevitably modifies the DOM. This means that this portion of the DOM is in an unknown state, and any unforeseen changes in the DOM content could disrupt the synchronization of the editor model, requiring additional compatibility efforts.
A fully controlled approach records characters as they are entered, deletes the original content when input ends, and constructs a new `Model`. Full control usually requires a hidden input box, or even an `iframe`. Since a browser page can hold only one focus at a time, this method also requires a custom selection implementation.
Many details have to be handled, such as drawing the in-progress content during `CompositionEvent` without triggering collaboration. The input experience also has to match the browser's. For example, the pinyin composition shown when an IME is active is not just for display: pressing the left and right keys changes the candidate selection, and this also has to be simulated in a fully controlled mode.
Editors built on controlled input can be grouped into three categories by how heavily they depend on browser `APIs`, from high to low, which also reflects differing levels of implementation difficulty: those that depend on `iframe` focus magic together with `Editable`, those that do not depend on `Editable` but use the `DOM` to implement a custom selection, and those drawn entirely on `Canvas`.
Typical editors exist for each of the three types: `TextBus` relies on `iframe` magic, documents like DingTalk and Zoom rely on custom selection, and documents like Tencent Docs and Google Docs are drawn entirely on `Canvas`. In practice, there are relatively few open-source editors with controlled input, because it is complex to implement and requires a great deal of compatibility handling.
Let's look at these three types in turn. The `iframe` magic approach inevitably involves browser focus. In a browser, focus stays with the text being selected, and clicking another input field moves the focus away, which can be observed through `document.activeElement`.
There are specifications for which elements can receive focus, such as editable elements, elements with the `tabindex` attribute, `a` tags, and so on, which we will not go into here. The problem is that if we use a separate `input` to receive input instead of relying directly on `Editable`, the browser's selection moves away with the focus, making it impossible to select text.
Therefore, when an additional `input` handles input, we typically have to draw the selection highlight ourselves (the blue highlight shown while dragging). With an `iframe`, however, browsers do not strictly enforce a single selection, which is what we call magic here, as in the very unusual implementation of `TextBus` mentioned earlier.
`TextBus` uses neither the common `ContentEditable` solution nor a custom selection like `CodeMirror` or `Monaco`. Judging from the `DOM` nodes of its `Playground`, it maintains a hidden `iframe`, and inside this `iframe` a `textarea` processes `IME` input.
Consider a simple case in which an `iframe` and a content selection compete for focus: under continuous `iframe` competition, text selections cannot be dragged. Note also that we cannot call `focus` directly inside the `onblur` handler, because the browser blocks it; the call has to be triggered asynchronously in a macrotask.
The key point is that this "magic" behavior can be triggered across desktop browsers. Mobile browsers may behave differently because their key input events are not standardized consistently, as noted for mobile devices under "Not Fully Supported" in the `draft.js` documentation.
As for rich-text editors that implement a fully custom text selection, I have not come across an open-source implementation yet. The complexity lies in simulating the browser's entire interaction behavior. Browsers handle many intricate details of text selection, such as continuing to extend the selection when the drag leaves the text area, and the threshold within a character at which dragging selects it.
I have not followed rich-text editors closely in this respect, but code editors such as `CodeMirror` and `VSCode (Monaco)` implement custom text selections. Commercial online document products such as DingTalk Docs, Zoom Docs, and Youdao Cloud Notes also have custom text selections. Because the selection `DOM` typically does not respond to mouse events, it has to be examined by manipulating the `DOM` directly when debugging.
Implementations such as DingTalk Docs, which wrap the editor in a `web-component`, take a bit more effort to explore. In addition, the custom text selection method mentioned previously uses the `caretRangeFromPoint` and `caretPositionFromPoint` APIs to calculate selection positions; see the article on the core interaction strategies of browser selection models.
Lastly, editors drawn entirely on `Canvas` present a different challenge: both text and selections have to be drawn manually. Because the browser provides only basic `APIs` for `Canvas`, it is effectively a blank sheet, and every feature and event handler has to be implemented by hand, which is laborious.
Typical examples are Google Docs and Tencent Docs, both commercial document editors drawn entirely on `Canvas`. Google Docs pioneered the `Canvas` approach and has articles comparing its old and new versions, particularly the changes to the editing interface and layout engine; a link can be found in the reference section at the end.
Interestingly, unlike editors with controlled `DOM` input, open-source rich-text editors based on `Canvas` drawing do exist, such as `canvas-editor`, a well-developed open-source rich-text editor drawn on `Canvas`. However, unless there is a clear need for features like word processing, pagination, layout, and printing, relying solely on `Canvas` remains a costly endeavor.
As demonstrated above, the behavior in a browser is different. Thus, to surpass the browser's formatting limitations, one has to develop their own layout capabilities. Rendering the position of each character, deciding line breaks, and other layout strategies all need to be implemented manually, leading to a myriad of boundary conditions to consider. In my previous work with an editor based on Canvas
, a significant amount of time was dedicated to the rich text drawing and layout part.
Second, there is selection rendering. As discussed earlier, the `caretRangeFromPoint` and `caretPositionFromPoint` APIs calculate the selection position using the browser's own capabilities. Text drawn on `Canvas` has no DOM behind it, so all of these calculations have to be done manually. Since details such as character widths are already stored, however, the task is less complex than it sounds.
While `Canvas` can break free from the browser's formatting constraints and avoid the performance problems caused by a complex `DOM`, the complexity of `Canvas` itself is a challenge. Moreover, abandoning the `DOM` means giving up the surrounding ecosystem, including SEO and accessibility, which can no longer be used directly. Unless it is absolutely necessary, approach this transition with caution.
Coming back to input: since the browser offers no direct way to interact with the IME, we can only handle the events the browser triggers, which means user input still has to go through an `input` element, just as in the `DOM`-based custom selection mode. Implementing rich text on `Canvas` is akin to building a browser's layout engine and demands considerable effort.
Additionally, concerning collaborative input: as mentioned earlier, pressing the left or right key changes the candidate selection, and this should not affect other clients in a collaborative session. However, because the content length changes, collaboration cannot simply filter these changes out by a `local` attribute. Fully mimicking the browser's layout and `DOM` interaction means accounting for these details rather than detaching entirely from the existing rendering framework.
Hence, collaborative input needs extra handling. The straightforward approach is to suspend collaborative processing temporarily, merge the changes into the final state, and then share the consolidated state. Another method is extending the `Z` on the `AXY` scheduling model to implement a local queue. Since the queue content is applied locally, we also need ways to move an `op` forward and backward within the queue. We can dive deeper into this temporary collaboration state later.
Based on the aforementioned input mode overview, our focus now shifts to the implementation of semi-controlled input mode. This mode stands as the predominant approach for most rich text editors today. Generally, in the semi-controlled mode, while ensuring a streamlined user input experience, it offers a relatively good degree of control and flexibility. Drawing from our prior discussions on input design and abstraction, we can relatively straightforwardly design the entire process:
- Map the browser selection to our self-maintained `Range Model`, which involves transforming the selection from `DOMRange` to the `Model`. This step requires a lot of lookup and iteration, and uses the `WeakMap` object discussed earlier to find the `Model` for position calculations.
- Use `BeforeInputEvent` and `CompositionEvent` to handle input/deletion and IME input respectively. Construct a `Delta Change` from the input, apply it to the state structure, and trigger `ContentChange`, which prompts a view layer update.
- After the view updates, synchronize the selection between the `DOM` and our maintained `Model`. This selection update has to simulate browser behavior: the `Model` selection is converted to a `DOMRange` and then applied to the browser's `Selection` object, which involves many boundary conditions.

At one point, I considered doing all of this with self-drawn selections and cursors. I found controlling input via `Editable` quite challenging, especially with the `IME`, which can easily disrupt the current `DOM` structure, so dirty data checks and forced refreshes of the `DOM` structure may be needed. However, a brief investigation showed that self-drawing has challenges of its own, so I settled on the widely used `Editable`.
Even `Editable` poses numerous challenges, with a myriad of details that are hard to cover comprehensively, for instance how to detect that the `DOM` has been damaged and needs a forced refresh. As more edge cases are handled, the code grows more complex, which can lead to performance problems, particularly with large documents.
Regarding performance, apart from the `WeakMap` optimization mentioned above, several areas can be improved further. Because of the `Delta` data structure, we need to convert between `Range` and `RawRange` selections in both directions. Since we record the `start` and `size` of each `LineState`, and the lines are ordered by `start`, a binary search becomes viable. Furthermore, because we know precisely what content each update touches, we can reuse the original state objects by computation instead of rebuilding every object on each update, which keeps the approach immutable and easier to maintain.
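The binary search over line states might look like the sketch below. `LineStateLike` keeps only the `start` and `size` fields named above; the function names and the `{ line, offset }` position shape are assumptions for illustration, and lines are assumed to be contiguous and sorted by `start`.

```typescript
// Only the two fields mentioned in the text; everything else is omitted.
interface LineStateLike {
  start: number; // raw offset of the first character in the line
  size: number;  // number of characters in the line, including its line break
}

// Hypothetical line-relative position used for the conversion example.
interface LinePosition {
  line: number;
  offset: number;
}

/** Find the index of the line containing a raw offset, or -1 if out of range. */
function findLineIndex(lines: readonly LineStateLike[], raw: number): number {
  let low = 0;
  let high = lines.length - 1;
  while (low <= high) {
    const mid = (low + high) >> 1;
    const line = lines[mid];
    if (raw < line.start) {
      high = mid - 1;
    } else if (raw >= line.start + line.size) {
      low = mid + 1;
    } else {
      return mid;
    }
  }
  return -1;
}

/** Raw offset -> line-relative position, in O(log n) lookups. */
function toLinePosition(lines: readonly LineStateLike[], raw: number): LinePosition | null {
  const index = findLineIndex(lines, raw);
  if (index < 0) return null;
  return { line: index, offset: raw - lines[index].start };
}

/** Line-relative position -> raw offset, in O(1). */
function toRawOffset(lines: readonly LineStateLike[], pos: LinePosition): number {
  return lines[pos.line].start + pos.offset;
}
```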
One more thing: a flattened data structure is better suited to large documents. Flattening implies simplicity, and the current `Delta` is such a flattened structure, although random access may be somewhat slower. If performance problems arise, it may become necessary to consider a storage structure such as `PieceTable`, although that seems a bit far off for now.
Here, controlled input mode covers the input that does not need to activate the `IME`, typically English letters, digits, and so on. Building on the design above, the implementation is straightforward: prevent all default behaviors, then reproduce the original behaviors in a controlled manner. Taking text insertion (`insertText`) and deletion (`deleteContentBackward`) as examples, we can implement input and deletion of content.
The concrete changes are encapsulated in the `perform` class. When inserting text, first get the state node at the current selection; if it is a `void` node, the input should be ignored. Then take the pending attributes of a collapsed selection, or the attributes at the tail of a non-collapsed selection, and finally construct the `delta` change to apply to the editor.
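The insertion steps can be sketched as a pure function. This is not the actual `perform` class: the `InsertContext` fields, the `Attrs` type, and the `buildInsertOps` name are assumptions introduced for the example, and the op shape only loosely follows Delta conventions.

```typescript
type Attrs = Record<string, string>;
type Op =
  | { retain: number }
  | { delete: number }
  | { insert: string; attributes?: Attrs };

// Hypothetical description of the selection state at insertion time.
interface InsertContext {
  start: number;        // raw offset where the selection starts
  len: number;          // selected length; 0 means a collapsed selection
  isVoid: boolean;      // whether the node under the selection is a void node
  pendingAttrs?: Attrs; // attributes waiting at a collapsed caret
  tailAttrs?: Attrs;    // attributes at the tail of a non-collapsed selection
}

/** Build the ops for inserting text, or null when the input must be ignored. */
function buildInsertOps(text: string, ctx: InsertContext): Op[] | null {
  // Void nodes do not accept text input.
  if (ctx.isVoid || text.length === 0) return null;
  const attrs = ctx.len === 0 ? ctx.pendingAttrs : ctx.tailAttrs;
  const ops: Op[] = [];
  if (ctx.start > 0) ops.push({ retain: ctx.start });
  // A non-collapsed selection is replaced by the inserted text.
  if (ctx.len > 0) ops.push({ delete: ctx.len });
  if (attrs && Object.keys(attrs).length > 0) {
    ops.push({ insert: text, attributes: attrs });
  } else {
    ops.push({ insert: text });
  }
  return ops;
}
```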
Deleting content is more complex, because we need to consider line breaks during deletion. The main issue is that our line attributes are stored on the line break, which can be counterintuitive. `EtherPad` instead places line formatting at the beginning of the line, which ties in with many rendering behaviors, since line formats such as lists and quotes are rendered at the start of a line.
The interaction strategy here can become quite intricate. For instance, if the previous line is a heading, the current line is a quote, and the cursor is at the beginning of the current line, deleting directly may remove the heading format and merge the quote format into the previous line. This behavior matches the `Quill` editor and is mainly a consequence of the data structure. To get a more intuitive result, either the data structure or the content change has to be adjusted.
Modifying the data structure directly would entail complex compatibility work, such as basic normalization requiring the absence of continuous text before `\n` tokens, consideration for block structures, and the need to evaluate the presence of attributes at the start of lines. Therefore, we try to keep the document's data structure as it is and handle the problem through the content changes instead.
Basic deletion is relatively straightforward, since it only needs to delete content of length `1`. Of course, for content such as `Emoji` the length is often greater than `1`, and `alt+delete` deletes a whole word at a time; we will address both later.
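To show why the length is not always `1`, here is a minimal sketch that computes how many UTF-16 code units a backward delete should remove. The function name is hypothetical, and the check only covers surrogate pairs; emoji built from several code points (for example sequences joined with a zero-width joiner) span more units, and a grapheme-aware approach such as `Intl.Segmenter` would be needed for those.

```typescript
/**
 * Number of UTF-16 code units to delete before `offset`.
 * Handles a single surrogate pair only; longer grapheme clusters are out of scope.
 */
function backwardDeleteLength(text: string, offset: number): number {
  if (offset <= 0 || offset > text.length) return 0;
  const last = text.charCodeAt(offset - 1);
  const isLowSurrogate = last >= 0xdc00 && last <= 0xdfff;
  if (isLowSurrogate && offset >= 2) {
    const prev = text.charCodeAt(offset - 2);
    const isHighSurrogate = prev >= 0xd800 && prev <= 0xdbff;
    // A high + low surrogate pair is one character to the user: delete both units.
    if (isHighSurrogate) return 2;
  }
  return 1;
}
```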
The input part looks simple at first glance, but it is not. For example, after mapping the selection to our self-maintained `Range Model`, suppose we perform an input operation starting with two `span` elements, where the current DOM structure is `<span>DOM1</span><span>DO|M2</span>` and `|` denotes the cursor position.
If we insert `x` between the `DO` and `M2` characters of the second `span`, whether through a programmatic `apply` or through user input, the `apply` triggers a `ContentChange` on the `DOM2` span and its DOM node is refreshed: the second span is no longer the original node but a new object.
Because the DOM changed, the browser can no longer find the original `DOM2` span for the cursor, so the cursor ends up as `<span>DOM1|</span><span>DOxM2</span>`. We might expect the selection to follow the input automatically, but in practice it does not, because the DOM nodes are no longer the same.
What is missing, then, is updating the DOM Range from our Range Model as soon as the DOM structure has been finalized. This has to happen in `useLayoutEffect` rather than `useEffect`, similar to `DidUpdate` in class components, so that the DOM Range is updated proactively.
Here, uncontrolled input mode covers the input that needs to activate the IME, typically Chinese, Japanese, and so on. Because this mode is uncontrolled, problems arise easily: the IME modifies the DOM directly in the browser, and we cannot prevent it. We can only make corrections after the DOM changes, which is commonly called dirty DOM checking.
For example, suppose the current DOM structure is `<s>DOM1</s><b>DOM2</b>`, and we enter Chinese characters at the end of the two DOM elements, which activates the IME. When we type the words "try out" without applying any additional style, similar to `inline-code`, the DOM structure changes to `<s>DOM1</s><b>DOM2 try out</b><s>try out</s>`.
The text inside the `<b>` tag is clearly wrong. Our Delta data structure is still correct at this point, because the schema we defined does not add any style to the new text. The discrepancy is the problem: from our side the state and delta of the `<b>` tag have not changed, but the IME has changed its DOM.
When the Model we maintain is mapped to React's Fiber, the Model has not changed, so React concludes from the VDOM diff that nothing changed and reuses the existing DOM structure. In reality, that DOM structure has been disrupted by the IME input, which we cannot control, and this leads to problems.
English input does not have this problem, because we prevent the default behavior and the original DOM structure stays unchanged. For IME input we need dirty data checks that correct any inconsistency so that the final data is accurate. The current approach handles the most basic Text components: in the ref callback, we check whether the current content matches `op.insert`; if not, we remove all nodes except the first one and reset the first node's content to the original text.
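The check described above can be sketched without a real DOM. `TextNodeLike` and `LeafHostLike` are minimal stand-ins invented for this example and mirror only the few DOM members the check needs; `repairDirtyLeaf` is a hypothetical name, and `expected` plays the role of `op.insert`.

```typescript
// Minimal stand-ins for DOM text nodes and their host element (assumptions).
interface TextNodeLike {
  nodeValue: string | null;
}
interface LeafHostLike {
  childNodes: ArrayLike<TextNodeLike>;
  removeChild(node: TextNodeLike): void;
}

/**
 * Compare the rendered content with the expected text and repair it if dirty.
 * Returns true when a correction was made.
 */
function repairDirtyLeaf(host: LeafHostLike, expected: string): boolean {
  const nodes = host.childNodes;
  if (nodes.length === 0) return false;

  // Concatenate what is currently rendered inside the leaf.
  let content = "";
  for (let i = 0; i < nodes.length; i++) {
    content += nodes[i].nodeValue ?? "";
  }
  if (content === expected) return false;

  // Remove every node except the first, iterating backwards so indices stay valid.
  for (let i = nodes.length - 1; i >= 1; i--) {
    host.removeChild(nodes[i]);
  }
  // Reset the first node to the text the model expects.
  nodes[0].nodeValue = expected;
  return true;
}
```

In a React ref callback the same routine would run against the leaf's real element after a composition ends; how the element and the expected text are obtained depends on the editor and is left out here.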
Chinese input needs attention in two respects. First, while the IME is active, we must avoid triggering editor events such as selection changes and input events. Second, we need to watch for the event that marks the end of the composition, namely `compositionend`. Once the composition ends, we can insert the content and run the dirty DOM check described above.
The composition sequence consists of three events, `compositionstart`, `compositionupdate`, and `compositionend`, which correspond to the IME starting, updating, and ending. This matters even outside editors: for instance, if an Enter key handler does not check whether the IME is active, pressing Enter may trigger unintended actions.
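A small state tracker makes the event sequence concrete. This is a sketch under assumptions: `CompositionTracker` and its method names are hypothetical, events are passed in as plain strings so the code runs without a DOM, and committing only on `compositionend` reflects the approach described above rather than any particular editor's code.

```typescript
type CompositionEventName = "compositionstart" | "compositionupdate" | "compositionend";

class CompositionTracker {
  private composing = false;
  private readonly commit: (text: string) => void;

  constructor(commit: (text: string) => void) {
    this.commit = commit;
  }

  get isComposing(): boolean {
    return this.composing;
  }

  /** Feed composition events in the order the browser fires them. */
  onComposition(type: CompositionEventName, data: string): void {
    if (type === "compositionstart") {
      this.composing = true;
    } else if (type === "compositionend") {
      this.composing = false;
      // Only the final text is written to the model; intermediate updates are ignored.
      if (data.length > 0) this.commit(data);
    }
  }

  /** Enter should be handled by the editor only when no composition is active. */
  shouldHandleEnter(): boolean {
    return !this.composing;
  }
}
```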
Previously, we implemented the editor's selection module and achieved controlled selection synchronization, one of the core state synchronization modes in the `MVC` layered architecture. Here, building on the selection module, we used the browser's composition events to implement a semi-controlled input mode. This is another important form of state synchronization, and it is the mainstream input approach in most rich text editors.
Next, we will focus on handling the default behaviors of complex `DOM` structures in browsers, as well as compatibility across the various `IME` input scenarios. In essence, we will address IME and browser compatibility case by case: for example, the length of `Emoji` characters, IME operations on the `DOM` structure, more complex dirty `DOM` checks, and more.