Exploring Rich Text: Online Document Delivery

When developing an online document system, document export deserves early attention, especially for complex ToB (business-facing) products that are deployed privately, where delivering a private copy of each document matters a great deal. Mature online document systems also involve many other complex scenarios that depend on export. In this article, we will walk through a plugin-based design for exporting documents to MarkDown, Word, and PDF, built on the Quill rich text editor engine.

Description

Recently, a friend shared an interesting story with me. One of their major business (ToB) clients asked for the online document system to support connecting to a remote printer and printing documents directly. The reason was quite valid: the company's top management prefers reading physical copies of documents to staring at computer screens. To retain this major client, the feature had to be prioritized. Indeed, this capability belongs in a complete online document SaaS system.

Although our online document primarily serves as a SaaS, we can also provide services as a PaaS platform. The scenario is quite clear; for instance, the data structure stored in our document system is usually custom-made. When users wish to initialize document content through locally generated MarkDown templates, we need to offer import capabilities. Similarly, if users want to convert documents back to MarkDown templates, exporting capabilities are again required. For cross-platform data migration or collaboration, it is usually necessary to provide various data conversion capabilities through OpenAPI, fundamentally based on our data structure design.

Regarding data transformation, we can pick a generic data structure as the benchmark and derive every other format from it. In a document system, the cheapest generic structure is usually HTML, which has numerous open-source converters available as references. This approach is inexpensive to implement, although it may be less efficient. Therefore, this article instead converts and exports directly from our own benchmark data structure: the DSL (Domain Specific Language) that describes the document, here the flat rich text description of quill-delta. The conversion model should be plugin-based, because we cannot guarantee that the document system will never add new block types. Each conversion design discussed below has a related example at https://github.com/WindrunnerMax/QuillBlocks/tree/master/examples.

MarkDown

In our work, we may encounter scenarios where users wish to embed online documents into their product's website, serving as API documentation or help center documents. Due to cost considerations, most help centers are built on MarkDown since maintaining a rich text product is relatively expensive. As a PaaS product, we need to provide data conversion capabilities. While offering an SDK to directly render our data structure is a product capability, in many cases, investing in manpower for document rendering migration is challenging. Therefore, direct data conversion is the most cost-effective approach.

The trend is gradually shifting from MarkDown to rich text for various product documentation. As developers, using MarkDown to write documents is common. Thus, initially using MD renderers for product deployment is reasonable. However, as products iterate and user bases grow, operational teams and professional technical writing teams come into play, especially for products maintained both domestically and internationally. In such cases, operational and technical writing teams play crucial roles. While we, as developers, may only complete the initial content writing, operational teams are required for maintenance and updates. Operational teams typically do not use MD for document writing, especially if the document repository is managed using Git, which makes it challenging to accept. Hence, WYSIWYG online document products become essential in such cases. Maintaining an online document product is costly, so most teams may opt for integrating a document middleware. The capabilities mentioned above become vital in such scenarios.

As an online document PaaS, it is essential not only to provide the capability of converting data to MD but also the ability to import from MD. Common scenarios include users using MD to write document templates and import them into the document system, along with products already online that do not yet have operational teams configured and use MD for document writing. These products utilize our document SDK renderer, requiring all document content updates to go through our PaaS platform, making data conversion to our DSL crucial in such cases. If positioning ourselves as a PaaS product, continuous compatibility with various scenarios and systems is necessary, aligning with the concept of middleware. However, this article primarily focuses on outbound data conversion solutions rather than data import capabilities.

Now we can start converting data to MD. The first issue to consider is that different MD parsers support different syntax. Take the most basic case, line breaks: some parsers interpret a single line break as a paragraph, while others require two spaces followed by a line break, or two line breaks, before they treat it as a paragraph. Handling such compatibility issues is exactly what the plugin-based design is for. The second issue is that MD is a lightweight format description while our DSL is a complex one. We have a wide variety of block structures, so we also need HTML to help express complex formats. That raises a question: why not convert directly to HTML instead of mixing it with MD? This is again for the sake of compatibility. The user's MD renderer may involve different plugins, and a pure HTML output may not match their styles. Handling mixed styles is cumbersome, especially when it requires something like the mixin-react approach of MDX. Therefore, we choose MD as the base and HTML as an aid.

Earlier, we mentioned that our blocks are quite complex and actually involve many nested structures. In HTML, this is akin to nesting code blocks within a table structure. However, the data structure of quill-delta is flat, so we need to convert it into a nested structure for easy handling. Complete conversion to a tree structure would lead to increased complexity. Thus, we opt for a compromise by wrapping an external Map structure to dynamically construct nested structures via key when obtaining data.

// Data representation for rendering alignment
// Flatten the data structure for ease of handling nested relationships
class DeltaSet {
  private deltas: Record<string, Line[]> = {};

  get(zoneId: string) {
    return this.deltas[zoneId] || null;
  }

  push(id: string, line: Line) {
    if (!this.deltas[id]) this.deltas[id] = [];
    this.deltas[id].push(line);
  }
}
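To make the idea of resolving nesting through keys concrete, here is a minimal sketch of how a consumer might follow zone references. The `Op` and `Line` shapes are simplified stand-ins, and `collectText` and the zone id `zone-1` are hypothetical names used only for illustration.

```typescript
// Simplified stand-ins for the article's types (assumed shapes, not imported from quill-delta).
type Op = { insert: string; attributes?: Record<string, unknown> };
type Line = { attrs: Record<string, boolean | string | number>; ops: Op[] };

class DeltaSet {
  private deltas: Record<string, Line[]> = {};
  get(zoneId: string): Line[] | null {
    return this.deltas[zoneId] || null;
  }
  push(id: string, line: Line): void {
    if (!this.deltas[id]) this.deltas[id] = [];
    this.deltas[id].push(line);
  }
}

// Hypothetical helper: collect the text of a zone, following `zoneId` references depth-first.
const collectText = (set: DeltaSet, zoneId: string): string[] => {
  const out: string[] = [];
  for (const line of set.get(zoneId) || []) {
    for (const op of line.ops) {
      const ref = op.attributes && op.attributes.zoneId;
      if (typeof ref === "string") out.push(...collectText(set, ref));
      else out.push(op.insert);
    }
  }
  return out;
};

const demoSet = new DeltaSet();
demoSet.push("ROOT", { attrs: {}, ops: [{ insert: "intro" }] });
// This placeholder line in ROOT points at the nested zone "zone-1".
demoSet.push("ROOT", { attrs: {}, ops: [{ insert: " ", attributes: { zoneId: "zone-1" } }] });
demoSet.push("zone-1", { attrs: { "code-block": true }, ops: [{ insert: "const a = 1;" }] });

const flatText = collectText(demoSet, "ROOT");
console.log(flatText); // ["intro", "const a = 1;"]
```

The storage stays flat, and the tree only appears while walking it, which is the compromise described above.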

Moreover, we need to select a basis for processing the data. Our document is essentially composed of paragraph (line) formats and inline formats, so processing can be split into those two parts. Mapped to delta, a Line nests Ops and carries its own line format, such as heading or alignment. Together with DeltaSet, this gives three levels that describe the data we aim to convert: DeltaSet, Line, and Op.

const ROOT_ZONE = "ROOT";
const CODE_BLOCK_KEY = "code-block";
type Line = {
  attrs: Record<string, boolean | string | number>;
  ops: Op[];
};
const opsToDeltaSet = (ops: Op[]) => {
  // Construct the `Delta` instance
  const delta = new Delta(ops);
  // Convert `Delta` to a data representation of `Line`
  const group: Line[] = [];
  delta.eachLine((line, attributes) => {
    group.push({ attrs: attributes || {}, ops: line.ops });
  });
  // ...
}
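For readers without quill-delta at hand, the line grouping can be approximated without the library. This is a rough sketch of the splitting idea, not the library's implementation: it assumes that each "\n" closes a line and that the attributes on the op containing that "\n" are the line format. `splitLines` is a hypothetical name.

```typescript
// Simplified op shape: string inserts or embeds (assumed, not imported from quill-delta).
type Op = { insert: string | Record<string, unknown>; attributes?: Record<string, unknown> };
type Line = { attrs: Record<string, unknown>; ops: Op[] };

const splitLines = (ops: Op[]): Line[] => {
  const group: Line[] = [];
  let current: Op[] = [];
  for (const op of ops) {
    if (typeof op.insert !== "string") {
      current.push(op); // embeds such as images never contain a line break
      continue;
    }
    const parts = op.insert.split("\n");
    parts.forEach((text, index) => {
      if (text) {
        current.push(op.attributes ? { insert: text, attributes: op.attributes } : { insert: text });
      }
      // Every "\n" closes the current line; its attributes become the line format.
      if (index < parts.length - 1) {
        group.push({ attrs: op.attributes || {}, ops: current });
        current = [];
      }
    });
  }
  // Trailing content without a final "\n" still forms a line.
  if (current.length) group.push({ attrs: {}, ops: current });
  return group;
};

const sampleLines = splitLines([
  { insert: "Title" },
  { insert: "\n", attributes: { header: 1 } },
  { insert: "Hello " },
  { insert: "world", attributes: { bold: true } },
  { insert: "\n" },
]);
console.log(sampleLines.length); // 2
```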

For DeltaSet, we need to define the entrance Zone, marked as "ROOT" in the case of delta structure. In the following DEMO, we only defined the nesting structure of the CodeBlock block level. Therefore, in the example below, we are handling the data nesting expression of the code block. Since the original data structure is flat, we need to handle certain boundary conditions; that is, the start and end of the code block structure. When encountering a code block structure, we point the current processing Zone to a new delta block and establish a reference in the original structure by specifying a zoneId identifier in the op. Upon completion, we restore the pointer to the previous target Zone. Typically, when dealing with multi-level nested blocks, we need to use a stack. However, we will not delve into that here.

const deltaSet = new DeltaSet();
// Mark the current `ZoneId` being processed
// In reality, there might be multiple levels of nesting, in such cases, a `stack` should be used
let currentZone: string = ROOT_ZONE;
// Mark the current type being processed, useful in case of multiple types
let currentMode: "NORMAL" | "CODEBLOCK" = "NORMAL";
// Check if the current `Line` is a `CodeBlock`
const isCodeBlockLine = (line?: Line) => !!line && !!line.attrs[CODE_BLOCK_KEY];
// Traverse the data of `Line` to construct `DeltaSet`
for (let i = 0; i < group.length; ++i) {
  const prev = group[i - 1];
  const current = group[i];
  const next = group[i + 1];
  // Start of a code block structure
  if (!isCodeBlockLine(prev) && isCodeBlockLine(current)) {
    const newZoneId = getUniqueId();
    // If there is a nested relationship, construct a new index
    const codeBlockLine: Line = {
      attrs: {},
      ops: [{ insert: " ", attributes: { [CODE_BLOCK_KEY]: "true", zoneId: newZoneId } }],
    };
    // Add a reference to the new `Zone` in the current `Zone`
    deltaSet.push(currentZone, codeBlockLine);
    currentZone = newZoneId;
    currentMode = "CODEBLOCK";
  }
  // Add the `Line` to the current zone being processed
  deltaSet.push(currentZone, group[i]);
  // End of the code block structure
  if (currentMode === "CODEBLOCK" && isCodeBlockLine(current) && !isCodeBlockLine(next)) {
    currentZone = ROOT_ZONE;
    currentMode = "NORMAL";
  }
}
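To see what the loop produces, here is a self-contained sketch of the same boundary handling applied to four sample lines. It stores zones in a plain record instead of `DeltaSet`, and `nextZoneId` is a hypothetical replacement for `getUniqueId`, whose implementation is not shown in this article.

```typescript
type Op = { insert: string; attributes?: Record<string, unknown> };
type Line = { attrs: Record<string, boolean | string | number>; ops: Op[] };
const ROOT_ZONE = "ROOT";
const CODE_BLOCK_KEY = "code-block";

// Hypothetical id generator standing in for `getUniqueId`.
let zoneCounter = 0;
const nextZoneId = () => `zone-${++zoneCounter}`;

const isCodeBlockLine = (line?: Line) => !!line && !!line.attrs[CODE_BLOCK_KEY];

const buildZones = (group: Line[]): Record<string, Line[]> => {
  const zones: Record<string, Line[]> = { [ROOT_ZONE]: [] };
  let currentZone = ROOT_ZONE;
  group.forEach((current, i) => {
    // Start of a code block: leave a placeholder in the current zone, then switch zones.
    if (!isCodeBlockLine(group[i - 1]) && isCodeBlockLine(current)) {
      const zoneId = nextZoneId();
      const placeholder: Line = {
        attrs: {},
        ops: [{ insert: " ", attributes: { [CODE_BLOCK_KEY]: "true", zoneId } }],
      };
      zones[currentZone].push(placeholder);
      zones[zoneId] = [];
      currentZone = zoneId;
    }
    zones[currentZone].push(current);
    // End of a code block: point back at the root zone (single nesting level only).
    if (isCodeBlockLine(current) && !isCodeBlockLine(group[i + 1])) currentZone = ROOT_ZONE;
  });
  return zones;
};

const zoneResult = buildZones([
  { attrs: {}, ops: [{ insert: "before" }] },
  { attrs: { [CODE_BLOCK_KEY]: true }, ops: [{ insert: "const a = 1;" }] },
  { attrs: { [CODE_BLOCK_KEY]: true }, ops: [{ insert: "const b = 2;" }] },
  { attrs: {}, ops: [{ insert: "after" }] },
]);
console.log(Object.keys(zoneResult)); // ["ROOT", "zone-1"]
```

The root zone ends up with three lines (text, placeholder, text), and the two code lines live in their own zone.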

Now that the data is ready, it's time to design the conversion system itself. As mentioned earlier, there are two kinds of formats to convert, line formats and inline formats, so the plugin system is also divided into two parts. For MD, conversion is essentially string concatenation, so the main output of a plugin is a string. An important point is that the same Op may carry several formats at once. For example, a piece of text can be both bold and italic, which is handled by two different plugins. Therefore, plugins should not output the final result directly but should contribute a prefix and a suffix that are concatenated around the content. This is especially important for line formats, where HTML tags are needed. Additionally, when a node is certain to have no nested content, as with images, a last flag can mark it as the final node and skip further checks.

type Output = {
  prefix?: string;
  suffix?: string;
  last?: boolean;
};
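As a small illustration of why prefix and suffix are kept separate, the sketch below combines the outputs of two plugins around one piece of text. `wrapWithOutputs` is a hypothetical helper; the point is only that suffixes are applied in reverse order so that the markers close inside-out.

```typescript
type Output = { prefix?: string; suffix?: string; last?: boolean };

// Hypothetical helper: wrap `text` with plugin outputs, the first output being outermost.
const wrapWithOutputs = (text: string, outputs: Output[]): string => {
  const prefixes: string[] = [];
  const suffixes: string[] = [];
  for (const output of outputs) {
    if (output.prefix) prefixes.push(output.prefix);
    // `unshift` mirrors the prefix order, so the innermost marker closes first.
    if (output.suffix) suffixes.unshift(output.suffix);
  }
  return prefixes.join("") + text + suffixes.join("");
};

const mdWrapped = wrapWithOutputs("text", [
  { prefix: "**", suffix: "**" },
  { prefix: "_", suffix: "_" },
]);
const htmlWrapped = wrapWithOutputs("text", [
  { prefix: "<strong>", suffix: "</strong>" },
  { prefix: "<em>", suffix: "</em>" },
]);
console.log(mdWrapped); // **_text_**
console.log(htmlWrapped); // <strong><em>text</em></strong>
```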

Since there are nodes that require HTML formatting and our iteration process is similar to recursive string concatenation, we need a flag to indicate when to parse into HTML instead of MD markup. For example, if a line node is center-aligned, all nodes within that line need to be parsed into HTML tags. It's essential to reset this flag at the beginning of each line iteration to prevent interference from previous content affecting subsequent content.

type Tag = {
  isHTML?: boolean;
  isInZone?: boolean;
};

For the plugin types, it is useful to pass the adjacent descriptions (prev and next) along during iteration. List formats, for example, need extra blank lines before and after the list, and merging adjacent inline formats avoids emitting several tags for what is really one run of text. Each plugin also needs a unique key so that it can be overridden, which supports the multi-scenario compatibility mentioned earlier. Finally, plugins are processed in order, so that order must be controllable through a priority. When the quote (reference) and list line formats are stacked, for instance, the quote format needs to be parsed before the list for the result to display correctly.


type LineOptions = {
  prev: Line | null;
  current: Line;
  next: Line | null;
  tag: Tag;
};
type LinePlugin = {
  key: string; // Plugin key override
  priority?: number; // Plugin priority
  match: (line: Line) => boolean; // Match rule for `Line`
  processor: (options: LineOptions) => Promise<Omit<Output, "last"> | null>; // Processing function
};
type LeafOptions = {
  prev: Op | null;
  current: Op;
  next: Op | null;
  tag: Tag;
};
type LeafPlugin = {
  key: string; // Plugin key override
  priority?: number; // Plugin priority
  match: (op: Op) => boolean; // Match rule for `Op`
  processor: (options: LeafOptions) => Promise<Output | null>; // Processing function
};
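How `key` and `priority` might be used when assembling the plugin list can be sketched as follows. The rules here are assumptions made for illustration: a later plugin with the same key overrides the earlier one, a higher priority runs first, and the default priority is 0. `registerPlugins` and the `label` field are hypothetical.

```typescript
type PluginLike = { key: string; priority?: number };
type DemoPlugin = PluginLike & { label: string };

// Hypothetical registry: dedupe by key (later wins), then sort by descending priority.
const registerPlugins = <T extends PluginLike>(plugins: T[]): T[] => {
  const unique: T[] = [];
  for (const plugin of plugins) {
    const index = unique.findIndex(item => item.key === plugin.key);
    if (index >= 0) unique[index] = plugin; // same key: override in place
    else unique.push(plugin);
  }
  // Assumed convention: missing priority counts as 0, higher values run earlier.
  return unique.sort((a, b) => (b.priority ?? 0) - (a.priority ?? 0));
};

const demoPlugins: DemoPlugin[] = [
  { key: "LIST", label: "default list" },
  { key: "QUOTE", priority: 10, label: "quote" },
  { key: "LIST", label: "custom list" },
];
const orderedPlugins = registerPlugins(demoPlugins);
console.log(orderedPlugins.map(plugin => plugin.label)); // ["quote", "custom list"]
```

With this ordering the quote plugin runs before the list plugin, and the custom list plugin replaces the default one without touching the dispatcher.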

Now let's move on to the entry processing function. Firstly, we need to handle line formats because different line formats may result in different outcomes in inline formats. For example, a center-aligned line format may cause inline formats to be parsed into HTML tags, which are implemented through the variable `tag` object. It is possible that our line format may match multiple plugins, and all results should be saved. The same goes for inline formats. At the end of the processing function, we simply concatenate the results into a string.

const parseZoneContent = async (
  zoneId: string,
  options: { defaultZoneTag?: Tag; wrap?: string }
): Promise<string | null> => {
  const { defaultZoneTag = {}, wrap: cut = "\n\n" } = options;
  const lines = deltaSet.get(zoneId);
  if (!lines) return null;
  const result: string[] = [];
  for (let i = 0; i < lines.length; ++i) {
    // ... Get line data
    const prefixLineGroup: string[] = [];
    const suffixLineGroup: string[] = [];
    // Should not affect the 'Tag' passed from outside
    const tag: Tag = { ...defaultZoneTag };
    // Process the line format first, since it may change how inline formats are rendered
    for (const linePlugin of LINE_PLUGINS) {
      if (!linePlugin.match(currentLine)) continue;
      // ... Execute plugin
      if (!result) continue;
      result.prefix && prefixLineGroup.push(result.prefix);
      // Use `unshift` so that stacked line wrappers close in reverse order of opening
      result.suffix && suffixLineGroup.unshift(result.suffix);
    }
    const ops = currentLine.ops;
    // Process node content
    for (let k = 0; k < ops.length; ++k) {
      // ... Get node data
      const prefixOpGroup: string[] = [];
      const suffixOpGroup: string[] = [];
      let last = false;
      for (const leafPlugin of LEAF_PLUGINS) {
        if (!leafPlugin.match(currentOp)) continue;
        // ... Execute plugin
        if (!result) continue;
        result.prefix && prefixOpGroup.push(result.prefix);
        result.suffix && suffixOpGroup.unshift(result.suffix);
        if (result.last) {
          last = true;
          break;
        }
      }
      // If 'last' is not matched, need to default to add node content
      if (!last && currentOp.insert && isString(currentOp.insert)) {
        prefixOpGroup.push(currentOp.insert);
      }
      prefixLineGroup.push(prefixOpGroup.join("") + suffixOpGroup.join(""));
    }
    result.push(prefixLineGroup.join("") + suffixLineGroup.join(""));
  }
  return result.join(cut);
};
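Because the function above elides the plugin calls, a reduced synchronous version may help show the whole flow. This is a sketch under simplifying assumptions (no zones, no `last` flag, no prev/next), with hypothetical plugins `AlignPlugin`, `TitlePlugin`, and `StrongPlugin`. Here line suffixes are added with `unshift` so that wrappers close inside-out, which is a choice made in this sketch.

```typescript
type Op = { insert: string; attributes?: Record<string, unknown> };
type Line = { attrs: Record<string, unknown>; ops: Op[] };
type Tag = { isHTML?: boolean };
type Output = { prefix?: string; suffix?: string };
type LinePlugin = { key: string; match: (line: Line) => boolean; processor: (line: Line, tag: Tag) => Output | null };
type LeafPlugin = { key: string; match: (op: Op) => boolean; processor: (op: Op, tag: Tag) => Output | null };

const AlignPlugin: LinePlugin = {
  key: "ALIGN",
  match: line => line.attrs.align === "center",
  processor: (_line, tag) => {
    tag.isHTML = true; // everything else on this line must now emit HTML
    return { prefix: '<div align="center">', suffix: "</div>" };
  },
};
const TitlePlugin: LinePlugin = {
  key: "HEADING",
  match: line => !!line.attrs.header,
  processor: (line, tag) => {
    const level = Number(line.attrs.header);
    if (tag.isHTML) return { prefix: `<h${level}>`, suffix: `</h${level}>` };
    return { prefix: "#".repeat(level) + " " };
  },
};
const StrongPlugin: LeafPlugin = {
  key: "BOLD",
  match: op => !!(op.attributes && op.attributes.bold),
  processor: (_op, tag) =>
    tag.isHTML ? { prefix: "<strong>", suffix: "</strong>" } : { prefix: "**", suffix: "**" },
};

const toMarkdown = (lines: Line[], linePlugins: LinePlugin[], leafPlugins: LeafPlugin[]): string => {
  const rendered = lines.map(line => {
    const tag: Tag = {}; // reset for every line
    const head: string[] = [];
    const tail: string[] = [];
    for (const plugin of linePlugins) {
      if (!plugin.match(line)) continue;
      const output = plugin.processor(line, tag);
      if (!output) continue;
      if (output.prefix) head.push(output.prefix);
      if (output.suffix) tail.unshift(output.suffix);
    }
    for (const op of line.ops) {
      const prefixes: string[] = [];
      const suffixes: string[] = [];
      for (const plugin of leafPlugins) {
        if (!plugin.match(op)) continue;
        const output = plugin.processor(op, tag);
        if (!output) continue;
        if (output.prefix) prefixes.push(output.prefix);
        if (output.suffix) suffixes.unshift(output.suffix);
      }
      head.push(prefixes.join("") + op.insert + suffixes.join(""));
    }
    return head.join("") + tail.join("");
  });
  return rendered.join("\n\n");
};

const markdownResult = toMarkdown(
  [
    { attrs: { header: 2 }, ops: [{ insert: "Title" }] },
    { attrs: {}, ops: [{ insert: "plain " }, { insert: "bold", attributes: { bold: true } }] },
    { attrs: { align: "center", header: 1 }, ops: [{ insert: "Centered", attributes: { bold: true } }] },
  ],
  [AlignPlugin, TitlePlugin],
  [StrongPlugin]
);
console.log(markdownResult);
```

The first two lines come out as plain MD, while the centered line falls back to HTML for the wrapper, the heading, and the bold text.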

With the scheduler in place, our focus shifts to implementing the plugins. Let's take the heading plugin as an example of the transformation logic. This part is very simple: it only needs to read the line attributes to determine the return value.

const HeadingPlugin: LinePlugin = {
  key: "HEADING",
  match: line => !!line.attrs.header,
  processor: async options => {
    if (options.tag.isHTML) {
      options.tag.isHTML = true;
      return {
        prefix: `<h${options.current.attrs.header}>`,
        suffix: `</h${options.current.attrs.header}>`,
      };
    } else {
      const repeat = Number(options.current.attrs.header);
      return { prefix: "#".repeat(repeat) + " " };
    }
  },
};

For inline plugins, the logic is similar. Here, we take the bold plugin as an example to implement the conversion logic. Similarly, it only needs to check OpAttributes to determine the return value.

const BoldPlugin: LeafPlugin = {
  key: "BOLD",
  match: op => !!(op.attributes && op.attributes.bold),
  processor: async options => {
    if (options.tag.isHTML) {
      options.tag.isHTML = true;
      return { prefix: "<strong>", suffix: "</strong>" };
    } else {
      return { prefix: "**", suffix: "**" };
    }
  },
};
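The prev and next fields passed to each plugin make it possible to merge adjacent runs that share a format. Below is a hedged sketch of a bold plugin variant that only opens the marker at the start of a bold run and only closes it at the end. `mergedBold` and `renderOps` are hypothetical, the sketch is synchronous, and it ignores the `tag` handling shown above.

```typescript
type Op = { insert: string; attributes?: Record<string, unknown> };
type Output = { prefix?: string; suffix?: string };
type LeafOptions = { prev: Op | null; current: Op; next: Op | null };

const isBold = (op: Op | null): boolean => !!(op && op.attributes && op.attributes.bold);

// Hypothetical variant: emit "**" only at the boundaries of a run of bold ops.
const mergedBold = (options: LeafOptions): Output | null => {
  if (!isBold(options.current)) return null;
  return {
    prefix: isBold(options.prev) ? "" : "**",
    suffix: isBold(options.next) ? "" : "**",
  };
};

// Render the ops of a single line, passing neighbours along as the dispatcher would.
const renderOps = (ops: Op[]): string =>
  ops
    .map((current, i) => {
      const output = mergedBold({ prev: ops[i - 1] || null, current, next: ops[i + 1] || null });
      return (output?.prefix ?? "") + current.insert + (output?.suffix ?? "");
    })
    .join("");

const mergedText = renderOps([
  { insert: "Hello ", attributes: { bold: true } },
  { insert: "world", attributes: { bold: true, color: "#f00" } },
  { insert: "!" },
]);
console.log(mergedText); // **Hello world**!
```

Without the neighbour check, the two bold ops would each be wrapped separately and produce `**Hello ****world**!`, which many parsers do not render as intended.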

In https://github.com/WindrunnerMax/QuillBlocks/blob/master/examples/, there is a complete DeltaSet data conversion, delta-set.ts, and a complete MarkDown data conversion, delta-to-md.ts, both of which can be run with ts-node. You may have noticed that this scheduler is not limited to MD: it can also perform a complete HTML conversion. With HTML conversion in place, we have a very common intermediate product from which various files can be generated. Moreover, if the plugins are changed to run synchronously, the same solution can handle the copy behavior of online documents, so it is quite versatile in practice. Finally, the plugins need thorough testing in real use. Test cases should be accumulated during development so that later modifications do not introduce unknown issues, especially in complex business scenarios where multiple plugins are combined and in scenarios of full synchronous updates, where boundary test cases matter most.

Word

Earlier, we discussed the data conversion compatibility of a PaaS platform. For a SaaS platform, the ability to directly generate delivery documents is indispensable, especially when the product is deployed privately and multiple versions of the online documents have to be provided. Word is one of the most common document delivery formats, and it is especially useful when the exported document needs further modification. In this section, let's discuss how to generate delivery documents in Word format.

OOXML, which stands for Office Open XML, is a document format introduced by Microsoft with Office 2007. From Office 2007 on, Word, Excel, and PowerPoint default to the OOXML format, which has also been standardized as ECMA-376. In practice, a current Word document is a zip archive: change the file extension to zip, unzip it, and the encapsulated data becomes visible. Inside, we find the various components of a docx file.

- [Content_Types].xml: Defines the content type of each file, marking whether a file is an image (.jpg) or textual content (.xml), for example.
- _rels: Typically contains .rels files that save the relationships between different Parts, describing the associations between files, such as the connection between text and images.
- docProps: Contains the property information of the entire Word document, such as author, creation time, and tags.
- word: Stores the main content of the document, including text, images, tables, and styles, among others.
  - document.xml: Stores all text and references to the text.
  - styles.xml: Stores all styles used in the document.
  - theme.xml: Saves the theme settings applied to the document.
  - media: Stores all media files used in the document, such as images.

With all these descriptions, one might be perplexed about how to actually assemble a Word file given the complex relationship descriptions. Since it might be challenging to instantly grasp the composition of an entire docx file, we can rely on frameworks to generate docx files. After investigating some frameworks, I found roughly two methods of generation. One involves using a common HTML format for generation, such as html-docx-js, html-to-docx, pandoc, while the other method involves direct code control for generation, effectively skipping the HTML conversion step, like officegen, docx. Noting that many libraries have not been updated for years and aiming for direct docx output without an intermediate HTML step, especially for online document deliveries requiring strict formatting control, I opted for using docx to generate Word files.

docx simplifies the generation of the entire Word file. By constructing a hierarchy of its built-in objects, we can create the final file, and the process runs in both Node and browser environments. The demo for this section therefore comes in a Node version and a browser version. Let's take the Node version as an example of how to generate a Word file. First, we need to define styles. Word has a style pane, which works much like CSS classes: when generating the document we reference a preset style by name rather than defining styles on each node individually.

const PAGE_SIZE = {
  WIDTH: sectionPageSizeDefaults.WIDTH - 1440 * 2,
  HEIGHT: sectionPageSizeDefaults.HEIGHT - 1440 * 2,
};
const DEFAULT_FORMAT_TYPE = {
  H1: "H1",
  H2: "H2",
  CONTENT: "Content",
  IMAGE: "Image",
  HF: "HF",
};
// ... Basic configurations
const PRESET_SCHEME_LIST: IParagraphStyleOptions[] = [
  {
    id: DEFAULT_FORMAT_TYPE.CONTENT,
    name: DEFAULT_FORMAT_TYPE.CONTENT,
    quickFormat: true,
    paragraph: {
      spacing: DEFAULT_LINE_SPACING_FORMAT,
    },
  },
  // ... Preset formats
]

Next, we need to tackle unit conversion. Word commonly measures in PT (points) and in DXA, one twentieth of a point (spelled dax in the code below), whereas the browser usually works in PX. In our demo the conversions are mainly used for image sizes and proportions, so only the conversions we actually use are listed here.

const daxToCM = (dax: number) => (dax / 20 / 72) * 2.54;
const cmToPixel = (cm: number) => cm * 10 * 3.7795275591;
const daxToPixel = (dax: number) => Math.ceil(cmToPixel(daxToCM(dax)));
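
As a worked example of these conversions, the sketch below computes the usable content width in pixels and scales an image down to fit it. The page metrics are assumptions for illustration (an A4 width of 11906 DXA with 1440 DXA, that is 1 inch, margins on both sides), and `fitImage` is a hypothetical helper.

```typescript
const daxToCM = (dax: number) => (dax / 20 / 72) * 2.54;
const cmToPixel = (cm: number) => cm * 10 * 3.7795275591;
const daxToPixel = (dax: number) => Math.ceil(cmToPixel(daxToCM(dax)));

// Assumed page metrics: A4 width of 11906 DXA minus 1440 DXA (1 inch) on each side.
const contentWidthDxa = 11906 - 1440 * 2; // 9026 DXA
// 9026 DXA is about 6.27 inches, roughly 601.7 px at 96 DPI, rounded up to 602.
const contentWidthPx = daxToPixel(contentWidthDxa);

// Hypothetical helper: shrink an image to the content width while keeping its aspect ratio.
const fitImage = (width: number, height: number, maxWidth: number) => {
  if (width <= maxWidth) return { width, height };
  const ratio = maxWidth / width;
  return { width: maxWidth, height: Math.round(height * ratio) };
};

const fittedImage = fitImage(1204, 800, contentWidthPx);
console.log(contentWidthPx, fittedImage); // 602 { width: 602, height: 400 }
```

Note that `Math.ceil` rounds up even when floating point error pushes a value slightly above a whole number, so exact boundaries are better avoided when asserting on the results.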

Similar to the MD conversion, we also need to define the logic for the conversion dispatch. However, one difference is that in MD, the output is a string and offers great flexibility, whereas in docx, there are strict object structure relationships. Therefore, here we need to strictly define the relationships between lines and inline types, and the passed Tag needs to contain more content.

type LineBlock = Table | Paragraph;
type LeafBlock = Run | Table | ExternalHyperlink;
type Tag = {
  width: number;
  fontSize?: number;
  fontColor?: string;
  spacing?: ISpacingProperties;
  paragraphFormat?: string;
  isInZone?: boolean;
  isInCodeBlock?: boolean;
};

The plugin's input design is similar to MD, but the content of the output needs to be more precise. The output of inline element plugins must be inline object types, and the output of line element plugins must be line object types. It is important to note that in line plugins, we pass the leaves parameter, which means that at this point, the scheduling of inline and line elements is handled by line plugins, rather than by the external Zone scheduling module.

type LeafOptions = {
  prev: Op | null;
  current: Op;
  next: Op | null;
  tag: Tag;
};
type LeafPlugin = {
  key: string; // Plugin override
  priority?: number; // Plugin priority
  match: (op: Op) => boolean; // Match `Op` rules
  processor: (options: LeafOptions) => Promise<LeafBlock | null>; // Processing function
};
type LineOptions = {
  prev: Line | null;
  current: Line;
  next: Line | null;
  tag: Tag;
  leaves: LeafBlock[];
};
type LinePlugin = {
  key: string; // Plugin override
  priority?: number; // Plugin priority
  match: (line: Line) => boolean; // Match `Line` rules
  processor: (options: LineOptions) => Promise<LineBlock | null>; // Processing function
};

Moving on to the entry Zone scheduling function. Unlike the MD scheduling, here we handle the leaf nodes (the inline styles) first. The crucial point is that a Paragraph object cannot wrap a Table object. If we need to implement a block structure, the outer element must be a Table rather than a Paragraph. In other words, the content of the inline elements determines the format of the line element: A affects B, so we handle A, the inline elements, first. In addition, only one plugin matches a given node, so any handling shared by several formats needs to be encapsulated into generic functions.


const parseZoneContent = async (
  zoneId: string,
  options: { defaultZoneTag?: Tag }
): Promise<LineBlock[] | null> => {
  const { defaultZoneTag = { width: PAGE_SIZE.WIDTH } } = options;
  const lines = deltaSet.get(zoneId);
  if (!lines) return null;
  const target: LineBlock[] = [];
  for (let i = 0; i < lines.length; ++i) {
    // ... Fetch line data
    // Cannot affect the `Tag` passed from outside
    const tag: Tag = { ...defaultZoneTag };
    // Process node contents
    const ops = currentLine.ops;
    const leaves: LeafBlock[] = [];
    for (let k = 0; k < ops.length; ++k) {
      // ... Fetch node data
      const hit = LEAF_PLUGINS.find(leafPlugin => leafPlugin.match(currentOp));
      if (hit) {
        // ... Execute plugin
        result && leaves.push(result);
      }
    }
    // Process line contents
    const hit = LINE_PLUGINS.find(linePlugin => linePlugin.match(currentLine));
    if (hit) {
      // ... Execute plugin
      result && target.push(result);
    }
  }
  return target;
};

Next, we also need to define plugins. Here we take the text plugin as an example to implement the conversion logic. Since basic text styles are encapsulated in the TextRun object, we only need to handle the properties of the TextRun object. Of course, for other Run type objects such as ImageRun, we still need to define plugins to handle them separately.

const TextPlugin: LeafPlugin = {
  key: "TEXT",
  match: () => true,
  processor: async (options: LeafOptions) => {
    const { current, tag } = options;
    if (!isString(current.insert)) return null;
    const config: WithDefaultOption<IRunOptions> = {};
    config.text = current.insert;
    const attrs = current.attributes || {};
    if (attrs.bold) config.bold = true;
    if (attrs.italic) config.italics = true;
    if (attrs.underline) config.underline = {};
    if (tag.fontSize) config.size = tag.fontSize;
    if (tag.fontColor) config.color = tag.fontColor;
    return new TextRun(config);
  },
};

For line plugins, we take the paragraph plugin as an example of the conversion logic. The paragraph plugin is the fallback that is matched last, when no other paragraph format applies. The constraint mentioned earlier, that a Paragraph object cannot wrap a Table element, also has to be handled here, because our block-level expressions are implemented with the Table object. If none of the leaf nodes is a block element, simply return a paragraph element. If a block element is matched and it is the only element, promote it and return it directly. If a block element is matched alongside other elements, wrap all of them in a new block element before returning. In fact, this logic should be encapsulated and called by all line element plugins to keep parsing compatible. Otherwise, a nesting error will produce a Word document that cannot be opened.


const ParagraphPlugin: LinePlugin = {
  key: "PARAGRAPH",
  match: () => true,
  processor: async (options: LineOptions) => {
    const { leaves, tag } = options;
    const config: WithDefaultOption<IParagraphOptions> = {};
    const isBlockNode = leaves.some(leaf => leaf instanceof Table);
    config.style = tag.paragraphFormat || DEFAULT_FORMAT_TYPE.CONTENT;
    if (!isBlockNode) {
      if (tag.spacing) config.spacing = tag.spacing;
      config.children = leaves;
      return new Paragraph(config);
    } else {
      if (leaves.length === 1 && leaves[0] instanceof Table) {
        // A single 'Zone' does not need wrapping, usually an independent block element
        return leaves[0] as Table;
      } else {
        // Wrapping required for nested 'BlockTable' composition
        return makeZoneBlock({ children: leaves });
      }
    }
  },
};
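The three cases can be isolated into one shared function, as suggested above. The sketch below uses stand-in classes (`FakeRun`, `FakeTable`, `FakeParagraph`) rather than the real docx classes, so it only demonstrates the branching, not the actual document objects. `wrapLeaves` is a hypothetical name.

```typescript
// Stand-in classes for illustration only; the real ones come from the docx package.
class FakeRun {
  text: string;
  constructor(text: string) {
    this.text = text;
  }
}
class FakeTable {
  cells: Array<FakeRun | FakeTable>;
  constructor(cells: Array<FakeRun | FakeTable>) {
    this.cells = cells;
  }
}
class FakeParagraph {
  children: FakeRun[];
  constructor(children: FakeRun[]) {
    this.children = children;
  }
}
type LeafBlock = FakeRun | FakeTable;
type LineBlock = FakeParagraph | FakeTable;

// Hypothetical shared helper that any line plugin could call before returning.
const wrapLeaves = (leaves: LeafBlock[]): LineBlock => {
  const hasBlock = leaves.some(leaf => leaf instanceof FakeTable);
  // Case 1: only inline leaves, so a paragraph is safe.
  if (!hasBlock) {
    return new FakeParagraph(leaves.filter((leaf): leaf is FakeRun => leaf instanceof FakeRun));
  }
  // Case 2: a single block leaf is promoted and returned as-is.
  const first = leaves[0];
  if (leaves.length === 1 && first instanceof FakeTable) return first;
  // Case 3: mixed content is wrapped in a new block, never in a paragraph.
  return new FakeTable(leaves);
};

const innerTable = new FakeTable([new FakeRun("code")]);
const inlineOnly = wrapLeaves([new FakeRun("a"), new FakeRun("b")]);
const promoted = wrapLeaves([innerTable]);
const mixedContent = wrapLeaves([new FakeRun("a"), innerTable]);
console.log(inlineOnly instanceof FakeParagraph, promoted === innerTable, mixedContent instanceof FakeTable);
// true true true
```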

Next, let's discuss headers and footers. In Word, a common header shows the title of the current page in the upper right corner, which is quite an interesting feature. Word implements it through fields, and with the expression available in OOXML and the encapsulation provided by docx we can do the same. The field commonly used for referencing titles is STYLEREF, and assembling the field string is all it takes. A typical footer displays the page number in the lower right corner or in the center. This part needs no hand-assembled field string, so displaying the page number is simple, and the main concern is positioning.

const HeaderSection = new Header({
  children: [
    new Paragraph({
      style: DEFAULT_FORMAT_TYPE.HF,
      tabStops: [{ type: TabStopType.RIGHT, position: TabStopPosition.MAX }],
      // ... format control
      children: [
        new TextRun("Header"),
        new TextRun({
          children: [
            new Tab(),
            new SimpleField(`STYLEREF "${DEFAULT_FORMAT_TYPE.H1}" \\* MERGEFORMAT`),
          ],
        }),
      ],
    }),
  ],
});

const FooterSection = new Footer({
  children: [
    new Paragraph({
      style: DEFAULT_FORMAT_TYPE.HF,
      tabStops: [{ type: TabStopType.RIGHT, position: TabStopPosition.MAX }],
      // ... format control
      children: [
        new TextRun("Footer"),
        new TextRun({
          children: [new Tab(), PageNumber.CURRENT],
        }),
      ],
    }),
  ],
});

Another crucial feature in word is the ability to generate a table of contents. Let's consider a question: Have you noticed throughout our entire document we haven't mentioned the introduction of fonts? If we want to know on which page a certain word or paragraph is rendered in word, we need to know the font size so that we can lay it out and determine the page where the title appears. Since we haven't even imported fonts, it's evident that we don't perform rendering and layout execution when generating the document, but rather when the user opens the document. Thus, after introducing a table of contents, you may receive prompts like whether to update these fields in the document. This is because the table of contents is a field whose content is generated or updated only by word, and we cannot achieve this programmatically.

const TOC = new TableOfContents("Table Of Contents", {
  hyperlink: true,
  headingStyleRange: "1-2",
  stylesWithLevels: [
    new StyleLevel(DEFAULT_FORMAT_TYPE.H1, 1),
    new StyleLevel(DEFAULT_FORMAT_TYPE.H2, 2),
  ],
});

In https://github.com/WindrunnerMax/QuillBlocks/blob/master/examples/, you can find the complete Word data conversion files delta-to-word.ts and delta-to-word.html. Run the former with ts-node, and open the latter in a browser. Converting data into a Word document is quite complex and involves many details, especially for rich text content such as multi-level block nesting, rendering of flowcharts and images, table merging, and dynamic content transformation. A comprehensive Word export capability requires continuous adaptation to edge cases and thorough unit testing to stay stable.

PDF

Apart from Word, PDF is essential to the delivery capabilities of our SaaS platform. For many documents that need to be printed, PDF is in fact the better choice, because it is a fixed-layout format that looks the same on every device. We can think of a PDF as an advanced image: just as an image keeps its layout across devices, a PDF keeps its layout while still carrying rich content such as selectable text. In this section, we will discuss how to generate delivery documents in PDF format.

There are two main ways to generate PDF. The first starts from HTML. A common variant uses libraries such as dom-to-image or html2canvas to render the HTML to an image and then places the image in a PDF, but the text is then not selectable and becomes blurry when zoomed in. Another variant uses Puppeteer, which provides a high-level API to control Chromium over the DevTools protocol and can export pages as PDF files. Printing directly from the frontend with window.print, or with react-to-print and an iframe for partial printing, is also feasible. The second way is to lay out and draw the PDF ourselves. Working with PDF at this level is similar to working with Canvas: everything can be drawn, for example a table can be drawn from rectangles. Popular libraries for this approach include pdfkit, pdf-lib, and pdfmake.

In this discussion, we will focus on generating PDF directly from our delta data. While it's feasible to convert via intermediary formats like MD, HTML, or Word, we will opt for direct output. Given the complexity of fully understanding the standard PDF data format in a short time, we rely on libraries to generate PDF files. We have chosen pdfmake to create PDF files via a JSON configuration, essentially transforming from one JSON to another. For handling Outline/Bookmark, I extensively researched and ultimately chose pdf-lib for managing the outline generation.

OOXML, the descriptive markup behind Word files, contains no drawing instructions and behaves like static markup that the client must lay out. PDF is different: its model derives from PostScript, a page description language (PDL) for describing text, vector graphics, and images, so a PDF file contains the actual drawing instructions. When a PDF file is opened, all of these instructions are already embedded, and the reader renders directly from them without performing any layout on the client side.
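
To make the contrast concrete, the sketch below builds the kind of text-drawing instructions that a PDF content stream contains. The helper name textLineToContentStream and the font resource name /F1 are assumptions for illustration only; libraries such as pdfmake emit these operators for us.

```typescript
// Escape the characters that are special inside a PDF literal string: \ ( )
const escapePdfString = (text: string): string =>
  text.replace(/([\\()])/g, "\\$1");

// Hypothetical helper: emit the operators that draw one line of text.
// BT/ET open and close a text object, Tf selects font and size,
// Td moves the text position, Tj shows the string.
const textLineToContentStream = (
  text: string,
  x: number,
  y: number,
  fontSize: number
): string =>
  [
    "BT",
    `/F1 ${fontSize} Tf`,
    `${x} ${y} Td`,
    `(${escapePdfString(text)}) Tj`,
    "ET",
  ].join("\n");

console.log(textLineToContentStream("Hello (PDF)", 72, 720, 14));
```

Note that the coordinates are absolute: the position of the text is fixed when the file is generated, which is why the layout has to be computed before writing.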

To keep the document identical across platforms, PDF files typically embed their fonts so that they display properly on every device. Therefore, font files need to be imported when creating PDF files. Note that many fonts require commercial licenses, although there are open-source options such as Source Han Serif and Source Han Sans. pdfmake expects four variants per font: normal, bold, italics, and bolditalics. Installing fonts-noto-cjk on the server and referencing it directly is another option. CJK fonts also tend to be large, so subsetting the font before embedding is the preferred practice.

// It might be a good idea to use Source Han Serif and Jiang Cheng as the fonts for reference
// https://github.com/RollDevil/SourceHanSerifSC
const FONT_PATH = "/Users/czy/Library/Fonts/";
const FONTS = {
  JetBrainsMono: {
    normal: FONT_PATH + "JetBrainsMono-Regular.ttf",
    bold: FONT_PATH + "JetBrainsMono-Bold.ttf",
    italics: FONT_PATH + "JetBrainsMono-Italic.ttf",
    bolditalics: FONT_PATH + "JetBrainsMono-BoldItalic.ttf",
  },
};

In pdfmake, preset styles give us something similar to the style panel in Word. Of course, a PDF file is not meant to be edited afterwards, so the preset styles here mainly make it easier to apply different styles during generation.

const FORMAT_TYPE = {
  H1: "H1",
  H2: "H2",
};
const PRESET_FORMAT: StyleDictionary = {
  [FORMAT_TYPE.H1]: { fontSize: 22, bold: true, },
  [FORMAT_TYPE.H2]: { fontSize: 18, bold: true, },
};
const DEFAULT_FORMAT: Style = {
  font: "JetBrainsMono",
  fontSize: 14,
};
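
As a minimal sketch of how these presets are consumed, the object below follows the shape of a pdfmake document definition, where styles holds the named presets and defaultStyle holds the fallback. The local type aliases stand in for the pdfmake types so that the example runs without the library, and resolveStyle is a hypothetical helper that only illustrates the expected precedence.

```typescript
// Local stand-ins for the pdfmake types (assumption: simplified shapes).
type Style = { font?: string; fontSize?: number; bold?: boolean };
type StyleDictionary = Record<string, Style>;
type DocDefinition = {
  content: Array<{ text: string; style?: string }>;
  styles: StyleDictionary;
  defaultStyle: Style;
};

const FORMAT_TYPE = { H1: "H1", H2: "H2" };
const PRESET_FORMAT: StyleDictionary = {
  [FORMAT_TYPE.H1]: { fontSize: 22, bold: true },
  [FORMAT_TYPE.H2]: { fontSize: 18, bold: true },
};
const DEFAULT_FORMAT: Style = { font: "JetBrainsMono", fontSize: 14 };

const doc: DocDefinition = {
  content: [
    { text: "Title", style: FORMAT_TYPE.H1 },
    { text: "Section", style: FORMAT_TYPE.H2 },
    { text: "Body text falls back to defaultStyle." },
  ],
  styles: PRESET_FORMAT,
  defaultStyle: DEFAULT_FORMAT,
};

// Hypothetical helper: a named style overrides the default style.
const resolveStyle = (node: { style?: string }): Style => ({
  ...doc.defaultStyle,
  ...(node.style ? doc.styles[node.style] : {}),
});

console.log(resolveStyle(doc.content[0]));
```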

For the transformation scheduling module, as with the scheduling module for Word, we need to define the relationship between line and inline types, as well as the content passed along in the Tag. The types in pdfmake are quite loose, so nested formats are easy to express, but illegal nesting is only detected at runtime. Ideally, we should move this validation to the type definition stage. For example, ContentText cannot actually have ContentImage as a child element, yet the type definition allows it. We can define such nesting relationships more strictly.

type LineBlock = Content;
type LeafBlock = ContentText | ContentTable | ContentImage;
type Tag = {
  format?: string;
  fontSize?: number;
  isInZone?: boolean;
  isInCodeBlock?: boolean;
};
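
One way the stricter nesting could be expressed is sketched below. The type names InlineText, BlockImage, BlockTable, and Paragraph are hypothetical and not part of pdfmake; the point is only that a paragraph's children are limited to inline text at compile time.

```typescript
// Hypothetical narrowed types; the names are not part of pdfmake.
type InlineText = { text: string; bold?: boolean; italics?: boolean };
type BlockImage = { image: string; width?: number };
type BlockTable = { table: { body: Block[][] }; layout?: string };
// A paragraph may only contain inline text, never an image or a table.
type Paragraph = { text: InlineText[] };
type Block = Paragraph | BlockImage | BlockTable;

// Type guard that separates paragraphs from block-level elements.
const isParagraph = (block: Block): block is Paragraph => "text" in block;

const ok: Block = { text: [{ text: "plain" }, { text: "bold", bold: true }] };
// The next line would be rejected by the compiler, which is the goal:
// const bad: Block = { text: [{ image: "a.png" }] };

console.log(isParagraph(ok));
```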

For the plugin definition, we continue with the types designed earlier. The pattern is the same: the adjacent block structures and the Tag are the inputs, and line plugins additionally receive the leaf node data. Each plugin still has a key for plugin overloading, a priority for plugin priority, a match rule, and a processor function. The outputs remain the two kinds of block structures, which shows that the earlier design carries over well.

type LeafOptions = {
  prev: Op | null;
  current: Op;
  next: Op | null;
  tag: Tag;
};
type LeafPlugin = {
  key: string; // Plugin overloading
  priority?: number; // Plugin priority
  match: (op: Op) => boolean; // Matching rule for `Op`
  processor: (options: LeafOptions) => Promise<LeafBlock | null>; // Processing function
};
type LineOptions = {
  prev: Line | null;
  current: Line;
  next: Line | null;
  tag: Tag;
  leaves: LeafBlock[];
};
type LinePlugin = {
  key: string; // Plugin overloading
  priority?: number; // Plugin priority
  match: (line: Line) => boolean; // Matching rule for `Line`
  processor: (options: LineOptions) => Promise<LineBlock | null>; // Processing function
};
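
The roles of key and priority can be sketched with a small registry. This is an illustration under stated assumptions rather than the implementation in the repository: registerPlugins is a hypothetical helper, a later plugin with the same key is assumed to replace the earlier one, a higher priority is assumed to match first, and a missing priority is treated as 0. The LeafPlugin type is reduced to the fields the registry needs.

```typescript
type Op = { insert: string | Record<string, unknown> };
type LeafPlugin = {
  key: string;
  priority?: number;
  match: (op: Op) => boolean;
};

// Hypothetical registry: same `key` overrides, higher `priority` goes first.
const registerPlugins = (...plugins: LeafPlugin[]): LeafPlugin[] => {
  const byKey = new Map<string, LeafPlugin>();
  plugins.forEach(plugin => byKey.set(plugin.key, plugin));
  return Array.from(byKey.values()).sort(
    (a, b) => (b.priority ?? 0) - (a.priority ?? 0)
  );
};

const TextPlugin: LeafPlugin = { key: "TEXT", match: () => true };
const ImagePlugin: LeafPlugin = {
  key: "IMAGE",
  priority: 100,
  match: op => typeof op.insert !== "string" && "image" in op.insert,
};
// Overrides the built-in text plugin because it shares the key "TEXT".
const CustomTextPlugin: LeafPlugin = {
  key: "TEXT",
  match: op => typeof op.insert === "string",
};

const LEAF_PLUGINS = registerPlugins(TextPlugin, ImagePlugin, CustomTextPlugin);
console.log(LEAF_PLUGINS.map(plugin => plugin.key));
```

With this ordering, a catch-all plugin such as the text plugin naturally ends up last, so a find over the list returns the most specific match.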

The entry Zone scheduling function is also similar to the one for Word. Since individual block structures have no nesting relationship here, all format configuration of one type can be handled by the same plugin, so again each node is matched against a single plugin. Leaf nodes are processed first, because their content determines which nested block format the line element needs.

const parseZoneContent = async (
  zoneId: string,
  options: { defaultZoneTag?: Tag }
): Promise<Content[] | null> => {
  const { defaultZoneTag = {} } = options;
  const lines = deltaSet.get(zoneId);
  if (!lines) return null;
  const target: Content[] = [];
  for (let i = 0; i < lines.length; ++i) {
    // ... Fetch line data
    // Shouldn't affect the `Tag` passed from outside
    const tag: Tag = { ...defaultZoneTag };
    // Process node content
    const ops = currentLine.ops;
    const leaves: LeafBlock[] = [];
    for (let k = 0; k < ops.length; ++k) {
      // ... Fetch node data
      const hit = LEAF_PLUGINS.find(leafPlugin => leafPlugin.match(currentOp));
      if (hit) {
        // ... Execute plugin
        result && leaves.push(result);
      }
    }
    // Process line content
    const hit = LINE_PLUGINS.find(linePlugin => linePlugin.match(currentLine));
    if (hit) {
      // ... Execute plugin
      result && target.push(result);
    }
  }
  return target;
};

Next, we define plugins. Here we take text plugin as an example to implement the conversion logic. Basic text styles are encapsulated in the ContentText object, so we only need to handle the properties of the ContentText object. For other Content type objects such as ContentImage, we still need to define plugins to handle them separately.

const TextPlugin: LeafPlugin = {
  key: "TEXT",
  match: () => true,
  processor: async (options: LeafOptions) => {
    const { current, tag } = options;
    if (!isString(current.insert)) return null;
    const config: ContentText = {
      text: current.insert,
    };
    const attrs = current.attributes || {};
    if (attrs.bold) config.bold = true;
    if (attrs.italic) config.italics = true;
    if (attrs.underline) config.decoration = "underline";
    if (tag.fontSize) config.fontSize = tag.fontSize;
    return config;
  },
};
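
For comparison, a leaf plugin for images might look like the sketch below. Everything here is an assumption for illustration: the types are simplified local copies, loadImage is a stub for whatever step turns the source into a data URL or local path that pdfmake can read, the width attribute on the op is assumed, and the priority value only expresses that this plugin must match before the catch-all text plugin.

```typescript
type Op = {
  insert: string | Record<string, unknown>;
  attributes?: Record<string, unknown>;
};
type Tag = { format?: string; fontSize?: number };
type ContentImage = { image: string; width?: number };
type LeafOptions = { prev: Op | null; current: Op; next: Op | null; tag: Tag };
type ImageLeafPlugin = {
  key: string;
  priority?: number;
  match: (op: Op) => boolean;
  processor: (options: LeafOptions) => Promise<ContentImage | null>;
};

// Hypothetical loader, stubbed here; a real one would fetch or read the file.
const loadImage = async (src: string): Promise<string> => src;

const ImagePlugin: ImageLeafPlugin = {
  key: "IMAGE",
  priority: 100, // assumed: must be tried before the catch-all text plugin
  match: op =>
    typeof op.insert === "object" && typeof op.insert.image === "string",
  processor: async options => {
    const { current } = options;
    if (typeof current.insert === "string") return null;
    const src = current.insert.image;
    if (typeof src !== "string") return null;
    const config: ContentImage = { image: await loadImage(src) };
    // Keep the width only when the op carries a positive number.
    const width = Number(current.attributes?.width);
    if (width > 0) config.width = width;
    return config;
  },
};
```

The asynchronous loading step is one reason the processor signature returns a Promise.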

For line plugins, let's take the paragraph plugin as an example. The paragraph plugin is the fallback that applies when no other line format matches. The nesting relationship of Content objects mentioned earlier also has to be handled here. First, an empty line must output a \n; an empty object or array would produce no line break. Second, a single Zone needs no wrapping, for example a CodeBlock block-level structure can be placed directly in the main document. Third, when a line mixes several structures side by side, such as tables and images, they must be wrapped in a Table/Columns structure to display correctly. Unlike OOXML, incorrect nesting does not make the file fail to open; it only affects how the related content is displayed.

const composeParagraph = (leaves: LeafBlock[]): LeafBlock => {
  if (leaves.length === 0) {
    // Need to handle empty lines
    return { text: "\n" };
  } else if (leaves.length === 1 && !leaves[0].text) {
    // No need to wrap a single `Zone`, usually an independent block element
    return leaves[0];
  } else {
    const isContainBlock = leaves.some(leaf => !leaf.text);
    if (isContainBlock) {
      // Need to wrap and nest `BlockTable`
      // Actual width calculation required to avoid overflow
      return { layout: "noBorders", table: { headerRows: 0, body: [leaves] } };
    } else {
      return { text: leaves };
    }
  }
};
const ParagraphPlugin: LinePlugin = {
  key: "PARAGRAPH",
  match: () => true,
  processor: async (options: LineOptions) => {
    const { leaves } = options;
    return composeParagraph(leaves);
  },
};

Next, let's discuss how to generate the Outline/Bookmark. The Outline is the navigation tree that a PDF reader usually displays in its left sidebar. pdfmake does not directly support generating an Outline, so we need another library for this. After researching for a long time, I found pdf-lib, which can process existing PDF files and write an Outline into them. In this example, the Outline navigates through an id system, with the Dest of each entry pointing at a named destination such as Hash1. Another approach is to use pdfjs-dist to parse the PDF, store the page and position of each heading, and then write the Outline using pdf-lib. Adding an Outline is also very useful when generating PDFs with Puppeteer, because Chromium does not generate an Outline when exporting a PDF, so pdf-lib is a good complement there.

// Generate PDF using `pdfmake`
const printer = new PdfPrinter(FONTS);
const pdfDoc = printer.createPdfKitDocument(doc);
const writableStream = new Stream.Writable();
const slice: Uint8Array[] = [];
writableStream._write = (chunk: Uint8Array, _, next) => {
  slice.push(chunk);
  next();
};
pdfDoc.pipe(writableStream);
const buffer = await new Promise<Buffer>(resolve => {
  writableStream.on("finish", () => {
    const data = Buffer.concat(slice);
    resolve(data);
  });
});
pdfDoc.end();

// Generating outlines using `pdf-lib`
const pdf = await PDFDocument.load(buffer);
const context = pdf.context;
const root = context.nextRef();
const header1 = context.nextRef();
const header11 = context.nextRef();
// ... Create references
const header1Map: DictMap = new Map([]);
// ... Set data
header1Map.set(PDFName.of("Dest"), PDFName.of("Hash1"));
context.assign(header1, PDFDict.fromMapWithContext(header1Map, context));
const header11Map: DictMap = new Map([]);
// ... Set data
header11Map.set(PDFName.of("Dest"), PDFName.of("Hash1.2"));
context.assign(header11, PDFDict.fromMapWithContext(header11Map, context));
// ... Build complete hierarchy
const rootMap: DictMap = new Map([]);
// ... Build reference for root node
context.assign(root, PDFDict.fromMapWithContext(rootMap, context));
pdf.catalog.set(PDFName.of("Outlines"), root);
// Generate and write file
const pdfBytes = await pdf.save();
fs.writeFileSync(__dirname + "/doc-with-outline.pdf", pdfBytes);
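
The hierarchy elided above links the entries to each other. In the PDF outline structure each item refers to its Parent, its Prev and Next siblings, and its First and Last children, and carries a Count of descendants. The sketch below computes those links from a flat heading list using array indices instead of PDF references. The Heading and OutlineNode types and the buildOutline name are assumptions for illustration, and the count assumes every item is shown expanded.

```typescript
type Heading = { level: number; title: string; dest: string };
type OutlineNode = {
  title: string;
  dest: string;
  parent: number; // index of the parent node, -1 for the outline root
  prev: number | null;
  next: number | null;
  first: number | null;
  last: number | null;
  count: number; // number of descendants
};

// Hypothetical helper: turn a flat heading list into linked outline nodes.
const buildOutline = (headings: Heading[]): OutlineNode[] => {
  const nodes: OutlineNode[] = [];
  const stack: { level: number; index: number }[] = []; // open ancestors
  const lastChild = new Map<number, number>(); // parent index -> last child
  headings.forEach((heading, index) => {
    // Close every ancestor that is not shallower than this heading.
    while (stack.length && stack[stack.length - 1].level >= heading.level) {
      stack.pop();
    }
    const parent = stack.length ? stack[stack.length - 1].index : -1;
    const prev = lastChild.get(parent) ?? null;
    nodes.push({
      title: heading.title,
      dest: heading.dest,
      parent,
      prev,
      next: null,
      first: null,
      last: null,
      count: 0,
    });
    if (prev !== null) nodes[prev].next = index;
    if (parent >= 0) {
      if (nodes[parent].first === null) nodes[parent].first = index;
      nodes[parent].last = index;
    }
    // Every open ancestor gains one descendant.
    stack.forEach(item => (nodes[item.index].count += 1));
    lastChild.set(parent, index);
    stack.push({ level: heading.level, index });
  });
  return nodes;
};

const outline = buildOutline([
  { level: 1, title: "Header 1", dest: "Hash1" },
  { level: 2, title: "Header 1.1", dest: "Hash1.1" },
  { level: 2, title: "Header 1.2", dest: "Hash1.2" },
  { level: 1, title: "Header 2", dest: "Hash2" },
]);
console.log(outline);
```

Each computed index would then be replaced by the reference created with context.nextRef() when filling the dictionaries.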

The complete PDF data conversion can be found at https://github.com/WindrunnerMax/QuillBlocks/blob/master/examples/ in delta-to-pdf.ts and delta-to-pdf.html, with pdf-with-outline.ts for adding the Outline. You can run the test with ts-node or by opening the HTML file in a browser. When testing with ts-node, pay attention to the font paths. Converting data into a PDF is a complex task that various open-source projects make far more approachable. However, in a production environment, comprehensive PDF export still requires continuous adaptation to edge cases, and thorough unit testing is essential to keep it stable.

Daily Challenge

https://github.com/WindrunnerMax/EveryDay

References

https://docx.js.org/
https://github.com/parallax/jsPDF
https://github.com/foliojs/pdfkit
https://github.com/Hopding/pdf-lib
https://quilljs.com/playground/snow
https://github.com/puppeteer/puppeteer
https://github.com/lillallol/outline-pdf
https://github.com/bpampuch/pdfmake/tree/0.2
http://officeopenxml.com/WPcontentOverview.php
