When implementing an online document system, it's essential to consider version control and auditing capabilities, as these are crucial aspects of the entire document management process. In this process, document differencing or diff capability is typically required, allowing us to track document changes like the variances between a document draft and the live document, or the disparities between private versions A and B. This article is based on the Quill rich text editor engine and delves into the implementation of document diff algorithms.
Quill is a modern rich text editor known for its excellent compatibility, robust extensibility, and some out-of-the-box features. Open-sourced in 2012, Quill has brought many innovations to the rich text editor scene, making it one of the most widely used editors in the open-source community. The ecosystem around Quill is rich and diverse, with numerous examples available on platforms like GitHub, including comprehensive collaboration implementations.
Firstly, let's consider a scenario - when describing a piece of plain text, simple input suffices. For instance, the underlying data structure of this article is plain text, and the content format is compiled by the compiler through lexical and syntax analysis, akin to serialization and deserialization. However, for a rich text editor like Quill, frequent serialization and deserialization during editing are performance-intensive. Hence, the data structure needs to be easily readable and writable, such as a JSON object. Describing rich text in JSON can take various forms, but fundamentally, it involves attaching additional attributes to specific text segments. For instance, if A is bold and B is italic, bold property is attached to A, and italic property to B. This data structure can effectively represent rich text content.
For Quill, the data structure is described by quill-delta. This data structure's design is excellent, and quill-delta also serves as an implementation of rich text Operational Transformation (OT) collaboration algorithm. However, for the diff capability we are focusing on here, it pertains more to the data structure level than collaboration. In quill-delta, the data structure for a piece of text is illustrated as follows. Of course, quill-delta expressions can be quite elaborate, enabling complete document content description through operations like retain, insert, and delete. We will delve into these Op when implementing the diff view feature later on.
Upon seeing this data structure, one might think it's just a standard JSON. Can't we directly use JSON diff since there are numerous algorithms available? This method theoretically works for purely insert text content, but the granularity may lack precision, failing to pinpoint specific word modifications. Thus, constructing a delta based on the diff results as per the quill-delta structure from A to B is not straightforward. For instance, in the following JSON, the result of our diff is deleting insert: 1 and adding "insert": "12", "attributes": { "bold": true }. In reality, what we've done is added bold style to 1 and appended 2 with bold. To apply this diff result to generate B from A, two actions are required: precise content modifications and transforming the diff result into delta. Hence, a better diff algorithm design is needed to simplify the entire process.
Our goal here is to achieve a finer granularity in diff, construct delta directly, and apply it, meaning A.apply(diff(a, b)) = B. Quill-delta already has a pre-implemented diff algorithm, but we have streamlined it for better comprehension, especially concerning non-insert operations. It's crucial to note that the diff we discuss here is for non-collaborative modes. For document editors that have implemented OT, relevant version Op can be extracted from the history for compose + invert, and performing a full document diff algorithm isn't always necessary.
Complete DEMO can be directly opened in the console at https://codesandbox.io/p/devbox/z9l5sl to view it. As mentioned earlier, after using JSON for diff, we still need to process the data in two additional steps, especially for dealing with granularity seems more challenging. So, why not change our perspective on this granularity issue? We are working on processing rich text, and rich text is text with attributes. Therefore, can we adopt a text diff algorithm, then perform additional processing on the attribute values? This way, we can handle granularity very finely. In theory, this approach seems feasible, so we can continue exploring along this line of thinking.
First, let's consider the pure text diff algorithm. So, let's briefly understand the diff algorithm used by diff-match-patch. This algorithm is generally considered the best general-purpose diff algorithm, designed by Eugene W. Myers here. Due to the capability of match and patch in diff-match-patch, but the algorithm we actually need only requires the diff capability, so we can just use fast-diff. It removes matching, patching, and all extra difference options, retaining only the basic diff capability, with the diff result being a two-dimensional array [FLAG, CONTENT][].
Next, we need to construct the string. As mentioned above, the data format of quill-delta has been mentioned, so it is simple to construct. We need to first create a Delta object to carry the diff result of our delta.
The Iterator here is the iterator we will use to iterate over block structures. We can imagine that since the diff result consists of N characters, and the insert block in our Delta also has N characters, after diff, we need to process substrings of these two strings. Therefore, we need to iterate over N characters of the entire Delta for processing, and this part of data processing methods is encapsulated in the Iterator object. Let's take an overall look at the iterator's processing method.
In this case, the handling of the next method is a bit more complex. In the next method, our main goal is to extract the content of the insert part. It's important to note that each time we call insert, it won't span across ops. In other words, each next call can only extract up to the length of the insert stored in the current op. If the content exceeds the length of a single op, the corresponding attributes will not be consistent, making direct merging impossible. In such cases, we need to consider scenarios where the diff result is longer than the insert, requiring compatible handling of the attributes, essentially breaking down the diff result into blocks for processing.
Now, with the Iterator, we can extract the 'insert' part of 'op' in a more granular manner. Next, let's return to handling the diff. First, let’s look at the diff of the attributes. Fundamentally, assuming the current data structure is Record<string, string>, we can directly compare two attributes. The essence of diff is transforming a to become b through certain calculations, where a + diff = b. Therefore, iterating through all keys, if the values of the two attrs differ, we can assign the value of b to the target attrs based on value comparison.
Because we have already broken it down quite finely upfront, the final step won't be too complicated. Our goal is to construct the a + diff = b equation, focusing on the diff part. Therefore, when constructing diff, the target to consider is a. We need to keep this purpose in mind throughout the process, or else it may be hard to understand the operations on delta. In the overall process of diff, there are three main parts to handle: iter1, iter2, and text diff. Depending on the type of diff produced, we need to process accordingly. The guiding principle is to take the shorter length as the block to process. In the diff.INSERT section, we insert delta from iter2 inserts. In the diff.DELETE section, we apply the length of delete from iter1 to delta. In the diff.EQUAL section, we extract op from both iter1 and iter2 to handle the diff of attributes or op replacement.
Here we can illustrate the effect of 'diff' with an example. To see the specific effect, you can open the console in https://codesandbox.io/p/devbox/z9l5sl in the src/index.ts file. It mainly demonstrates the three types of 'diff' - DELETE, EQUAL, INSERT, and the resulting delta, where in this case ops1 + result = ops2.
Now that our document diff algorithm is ready, it's time to get down to business and think about how to apply it to specific documents. Let's start with a simple approach. Imagine we have performed a diff between documents A and B, resulting in a patch. We can modify the diff directly, structure it as desired, and apply it to A to obtain a comparison view. Of course, we can also apply deleted content from A and added content from B to generate the view, which we'll discuss later. For now, let's focus on getting the comparison view directly from A.
Essentially, a comparison view highlights deletions in red and additions in green. Since rich text comes with its own formatting, we can leverage this to implement highlighting directly using rich text capabilities.
The core algorithm based on this approach is quite simple. Here, we handle formatting modifications by changing DELETE content to RETAIN with red attributes, adding green attributes to INSERT types, and then assembling this modified patch onto A's delta. Applying the complete delta to the new comparison view is all that's needed. For a complete DEMO, refer to https://codepen.io/percipient24/pen/eEBOjG.
In essence, the core functionality in the code is just a few lines. It's great to solve complex requirements with a simple approach. In less complicated scenarios, you can achieve a comparison within the same document area or use separate views to apply deletion and addition deltas. However, as scenarios become more complex, such as needing to allow real-time editing and displaying diff results in the view representing additions on the right side, directly applying diff-delta to the document may pose some issues. Apart from potential performance concerns from continuously applying delta to rich text, in collaborative environments, managing local Ops and History is necessary, or in non-collaborative settings, filtering relevant keys to prevent storing diff results.
If the scenarios mentioned above just touch the advanced capabilities in basic functionalities, then in search scenarios, directly applying highlights to rich text content doesn't seem feasible. Imagine applying search results directly from the data layer to rich text for highlighting. This would involve all the issues mentioned earlier, and the performance cost due to frequent content changes is unacceptable. In slate, the concept of decorate exists, allowing the construction of Range to consume attributes without changing the document content, which aligns with our needs. Hence, we need a way to highlight content without modifying the rich text content, but it's not easy to achieve this capability at the editor-level rendering like slate. Alternatively, we can take a different approach by simply adding a semi-transparent highlight overlay to relevant positions, making it much simpler. Here, we refer to it as a virtual layer.
On a theoretical level, implementing a virtual layer is rather simple, just adding an extra layer of DOM, but there are many details to consider in this process. Firstly, let's address an issue: if we place a masking layer directly above a rich text editor, with a higher z-index than the text editor's, clicking on the masking layer would cause the rich text editor to lose focus. While we can use event.preventDefault to prevent the default behavior of focus shifting, other actions like click events would still pose similar problems. For instance, if a button's click action in the rich text editor is user-defined, covering the button with the masking layer would redirect the click event to the layer instead of triggering the button's action as expected. This scenario does not align with our intentions.
To tackle this, let's shift our perspective. Lowering the z-index below that of the rich text editor can resolve the event issues. However, it triggers another problem; if the rich text content itself has a background color, adding a masking layer would conceal the layer's color beneath the original background color. Since our rich text capabilities are typically plugin-based, we can't control the user's background color plugins. These plugins must contain some level of transparency, and our masking layer needs to offer a universal capability. As a result, this solution also has its limitations.
The solution to this issue is actually quite simple. In CSS, there's a property named pointer-events, which, when set to none, ensures an element never becomes the target of mouse events. This can address the problems caused by the earlier approach. By using this, we can achieve our basic virtual layer styles and event handling. Additionally, using this property may lead to an interesting observation: right-clicking on the masking layer won't allow you to directly inspect the node in the console; you must use the Elements panel to select the DOM node rather than the reverse.
After determining the method for drawing the masking layer, the next step is to confirm the position information for drawing shapes. Since the DOM nodes created by our rich text editor do not have separate nodes for each character, but rather nodes with the same attributes grouped together, a new issue arises. When dealing with a diff result, it's highly likely that we won't have a complete node for a particular DOM, but just a few characters. In such cases, we can't directly use Element.getBoundingClientRect to get the position information. Instead, we must rely on document.createRange to construct a range. It's worth noting that we are handling Text nodes, and only Text nodes and similar ones can set offsets. Both the start and end nodes in a Text node range can be directly constructed without needing to match exactly. For instance, in Quill, editor.getBounds offers location information retrieval. This method essentially uses editor.scroll to obtain the actual DOM and wraps it with document.createRange, handling various edge cases.
Furthermore, we need to discuss another issue. With diff, we cannot guarantee the length of the current result. As we've established previously, we are implementing diff for plain text. Therefore, the diff result may be lengthy, which can potentially lead to problems. While using editor.getBounds(index, length) directly provides a rect (rectangle) covering range, issues may arise when dealing with longer diff results. Two key considerations arise in such cases: first, single-line content extending beyond the editor's visible area, causing line breaks; and second, content that spans multiple lines with \n characters included in the diff results.
Assuming the content above is the diff result, we are not focusing on whether it is of type INSERT/DELETE/RETAIN for now. Our current goal is to implement highlighting. In these two scenarios, if we directly highlight the rectangular area fetched through getBounds, it is obvious that a large amount of non-text or blank spaces will be highlighted. In this case, our approach would be to take the maximum range for highlighting coverage. Actually, we can accept it if only blank spaces are covered, but consider a scenario where only some content is changed. For example, content is inserted in the entire line N, and at the beginning of line N+1, a letter is also inserted. Due to the width of line N affecting line N+1, the highlight ends up covering the entire line. This inaccurate highlighting in our diff result is not acceptable, whether it involves line breaks or spanning across lines.
Next, we need to address these two issues. For the calculation of positions spanning multiple lines, we can adopt a relatively simple approach. We just need to know exactly where the line breaks occur and split the diff result accordingly to change our granularity from document-level to line-level processing. However, Quill doesn't directly provide operations at the line Range level. Therefore, we need to maintain our own line-level index-length. We can easily split the index-length comprehensively by using delta insert and calculate using editor.scroll.lines. When there are changes in the document content, we can also maintain index values based on delta-changes. Moreover, if we manage Blocks through multiple Quill instances, managing at the Line level becomes more straightforward, and maintaining index becomes simpler. However, during diff, we will need a Block tree-level implementation. It's relatively easier to implement diff when dealing with Blocks with the same id, but when there is a need for diff across Blocks, the implementation might become more complex.
Once we have split the line index-length index, the next step is to break down the original complete diff-index-length into line-level content. Here, it is crucial to handle the line-identifying node, namely \n, attributes, with special consideration. Because all modifications to this node are directly applied to the entire line. For instance, when a line changes from a second-level heading to a first-level heading, the entire line needs to be highlighted as a style change. Of course, there may also be content additions or deletions within the heading, and these highlights can be displayed in different colors. This is one of the reasons we need to maintain line-level Range.
Next, we need to address the issue of line wrapping due to long single-line content. Since we have already segmented the diff results into lines, we can focus on how to render the highlights. As mentioned earlier, we cannot directly draw the rect obtained from calling getBounds onto the text. Therefore, we can consider splitting a paragraph into three parts: the first line head, content body, and last line tail. Only the beginning and end of each line will have partial highlighted areas, which require separate calculation of rect. The body part will always have a complete rect which can be rendered directly. Therefore, using three rects to represent the highlight of single-line content will suffice. The data returned from getBounds is sufficient to support the processing of single-line content into three sections. We need to obtain the rect of the first head and last tail, and calculate the rect of the body part based on these two rects. Depending on the number of actual line breaks, different scenarios need to be considered: for single-line content, only the head is sufficient; for two lines, both head and tail are needed for rendering with body acting as a placeholder for positioning; for multiple lines, head, body, and tail need to render their respective contents to ensure the integrity of the layers.
After solving the above two issues, we can apply delta to the diff algorithm to obtain results, and divide them line by line to construct a new Range. Here we want the left view to reflect DELETE content and the right view to reflect INSERT + RETAIN content. By storing the constructed Range in different arrays based on the different types of diff, and finally obtaining position information with editor.getBounds based on the Range, constructing new layer DOM at relevant positions to achieve highlighting is all we need to do.
To summarize the overall process, in order to implement a diff based on virtual layers, we need a diff algorithm, construct Range, calculate Rect, and render DOM. In fact, to achieve this skill well is quite complex, especially with many edge cases to handle. For example, some text may have different fonts or styles applied, resulting in different rendering heights compared to regular text. If the diff falls on these edges, it may cause issues in calculating our rect, leading to problems in rendering the styles of layer nodes. Here we have not addressed such issues, only established the entire process without special focus on edge cases. For a complete demo, you can directly visit https://codesandbox.io/p/sandbox/quill-diff-view-369jt6.