[WebODF] OpRemoveText: A well-engineered stick in the eye

20 Sep 2013

Hi folks,

So for today's light reading, I thought I'd take a few moments and describe the
intent and workings of OpRemoveText. Recent activity in iterators and with images
has been bumping into the black-box nature of this, and it seemed like a good
topic to discuss.

History
=======
OpRemoveText started life sometime in November, 2012. Originally written as a direct
complement to OpInsertText, the earliest version was simply able to remove text
data within a text node.

Eventually, OpRemoveText grew to support complete removal of empty text nodes
(early April, 2013, Friedrich), then support for removing text across multiple
text nodes (late April, 2013, Aditya), and finally space element removal,
paragraph merging and list item removal.

The initial implementation was 66 lines long (including comments). By late July,
this operation had grown to 250 lines, and was accruing a significant number of
known limitations.

These include things such as:
- Inability to remove images, tables and other ODF objects
- Erratic removal of ODF character elements (e.g., spaces & tabs)
- Destruction of non-ODF children when removing ODF containers

The number of non-text items that needed to have special handling to be removed
made it somewhat clear that the existing text neighbourhood approach was slowly
being outgrown.

Current
=======
In mid-July 2013, an alternative approach for OpRemoveText was put forward. The
limitations being specifically addressed in this rewrite were:
- Improve the ability to support ODF objects, and make it easy to extend this for
  future object support
- Eliminate the text-oriented design of the existing implementation
- Preserve non-ODF elements during deletion (e.g., cursor nodes) without specialised
  handling

The original OpRemoveText could best be described as having delete-by-default
behaviour. The implementation carried specific logic to handle and clean up editinfo
nodes, and to keep the cursor nodes safely intact. The key issue with this approach
was that foreign elements integrated into an ODF document could not be saved from
deletion without patching OpRemoveText directly.

In contrast, the current design is actually save-by-default. This is achieved
by performing all node removal using the DomUtils.mergeIntoParent function.
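
To make save-by-default a little more concrete, here is a minimal sketch of what a
mergeIntoParent-style helper does. It works on a plain-object tree rather than the
real DOM, and the node() and serialize() helpers are made up for illustration, so
treat it as a sketch of the idea rather than the actual DomUtils code.

```javascript
// Plain-object stand-in for DOM nodes (hypothetical helper, not WebODF code).
function node(name, children = []) {
    const n = { name, children: [], parent: null };
    children.forEach((child) => {
        child.parent = n;
        n.children.push(child);
    });
    return n;
}

// Replace `target` with its own children, in place, and return the parent.
function mergeIntoParent(target) {
    const parent = target.parent;
    const index = parent.children.indexOf(target);
    const orphans = target.children;
    orphans.forEach((child) => { child.parent = parent; });
    parent.children.splice(index, 1, ...orphans);
    target.children = [];
    target.parent = null;
    return parent;
}

// Compact rendering used only to show the result.
function serialize(n) {
    return n.children.length === 0
        ? n.name
        : `${n.name}(${n.children.map(serialize).join(",")})`;
}

const frame = node("draw:frame", [node("c:cursor")]);
const paragraph = node("text:p", [node("#text"), frame, node("#text")]);

mergeIntoParent(frame);
console.log(serialize(paragraph)); // text:p(#text,c:cursor,#text)
```

The frame is gone, but the cursor it contained now sits in the frame's old position.
Nothing in the helper needs to know what a cursor is.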

This new approach landed late July, and thus far has survived with very few major
changes or bug fixes.

How the magic works
===================
Removal of content has a few major responsibilities attached:
- Paragraph merging: Merging two paragraphs together has some special edge cases
  around what the resulting joined paragraph style should be.
- Empty container cleanup: In order to keep the document in a relatively performant
  state, removal needs to do things such as discard unnecessary span elements,
  remove empty frames etc.
- Maintain document structure: Removal of some elements should automatically clean
  up parent containers. For example, removing the last item in a list should
  result in the parent list also being removed.

The basic approach used in OpRemoveText to meet these needs is as follows:
1. Given a position & length, translate this to a DOMRange (via position iterators)
2. Fetch a list of all ODF objects or text elements FULLY contained within this range
    - For each ODF object in this list, eliminate the object using domUtils.mergeIntoParent
      to remove the object whilst preserving all its children
    - For each text node, delete the node entirely
    - Check the parent of each object/text node that was removed. If it is a container
      that should automatically collapse, and it is now empty, remove the parent using
      mergeIntoParent. Repeat this process on its own parent.
3. Fetch a list of all ODF paragraphs that intersect the current range. Merge these
   together as all content within the range will have been removed.
   (Special rules for styling apply, which I won't go into in detail here.)
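
Step 2 above can be sketched roughly as follows. This is an illustration under
stated assumptions, not the real OpRemoveText: the tree is a plain-object model,
the collapse rule is simplified, and el(), text(), removeAll() and the element
name list are all hypothetical.

```javascript
// Plain-object stand-in for the DOM (an assumption for illustration only).
function el(name, children = []) {
    const n = { name, value: "", children: [], parent: null };
    children.forEach((c) => { c.parent = n; n.children.push(c); });
    return n;
}
function text(value) {
    const n = el("#text");
    n.value = value;
    return n;
}

// Replace a node with its children and return the former parent.
function mergeIntoParent(target) {
    const parent = target.parent;
    const index = parent.children.indexOf(target);
    target.children.forEach((c) => { c.parent = parent; });
    parent.children.splice(index, 1, ...target.children);
    target.children = [];
    target.parent = null;
    return parent;
}

// Simplified collapse rule; the names below are illustrative, not WebODF's lists.
const NEVER_COLLAPSE = new Set(["text:p", "office:text", "text:s", "text:tab"]);
function hasContent(n) {
    return n.name === "#text"
        ? n.value.length > 0
        : n.name === "text:s" || n.name === "text:tab" || n.children.some(hasContent);
}
function isCollapsible(n) {
    return n.parent !== null && !NEVER_COLLAPSE.has(n.name) && !hasContent(n);
}

// Remove every collected node, then collapse any ancestors that are now empty.
function removeAll(targets) {
    targets.forEach((target) => {
        let parent = mergeIntoParent(target);
        while (isCollapsible(parent)) {
            parent = mergeIntoParent(parent);
        }
    });
}

function serialize(n) {
    if (n.name === "#text") return JSON.stringify(n.value);
    return n.children.length === 0
        ? n.name
        : `${n.name}(${n.children.map(serialize).join(",")})`;
}

const span = el("text:span", [text("abc"), el("c:cursor"), el("text:s")]);
const p = el("text:p", [text("Hello "), span]);

// The list of things to remove is collected BEFORE any removal happens.
const doomed = [span.children[0], span.children[2]];
removeAll(doomed);
console.log(serialize(p)); // text:p("Hello ",c:cursor)
```

The span only collapses once its last piece of real content is gone, and the cursor
it held ends up in the paragraph without any cursor-specific code.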

An auto-collapsing container is an element that:
- Is not a paragraph element (OpRemoveText, line 107)
- Is not a root element, e.g., office:text (OpRemoveText, line 107)
- Is not a character element (OpRemoveText, line 82)
- Contains no ODF character elements (OpRemoveText, line 90)
- Has no text content (OpRemoveText, line 86)
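
Written as a predicate, the five rules above might look something like this. The
element name sets are my own illustrative guesses, and isAutoCollapsing() is a
hypothetical name; the real checks live in OpRemoveText and the ODF utilities and
may differ in detail.

```javascript
// Illustrative name lists; assumptions, not copied from WebODF.
const PARAGRAPH_ELEMENTS = new Set(["text:p", "text:h"]);
const ROOT_ELEMENTS = new Set(["office:text"]);
const CHARACTER_ELEMENTS = new Set(["text:s", "text:tab", "text:line-break"]);

// Plain-object stand-ins for elements and text nodes.
function el(name, children = []) { return { name, value: "", children }; }
function text(value) { return { name: "#text", value, children: [] }; }

function hasTextContent(n) {
    return n.name === "#text" ? n.value.length > 0 : n.children.some(hasTextContent);
}
function containsCharacterElement(n) {
    return n.children.some(
        (c) => CHARACTER_ELEMENTS.has(c.name) || containsCharacterElement(c)
    );
}

// One clause per rule in the list above.
function isAutoCollapsing(n) {
    return n.name !== "#text"
        && !PARAGRAPH_ELEMENTS.has(n.name)
        && !ROOT_ELEMENTS.has(n.name)
        && !CHARACTER_ELEMENTS.has(n.name)
        && !containsCharacterElement(n)
        && !hasTextContent(n);
}

const cursorOnlySpan = isAutoCollapsing(el("text:span", [el("c:cursor")]));
const spanWithText = isAutoCollapsing(el("text:span", [text("Paragr")]));
const itemWithSpace = isAutoCollapsing(
    el("text:list-item", [el("text:p", [el("text:s")])])
);
const emptyParagraph = isAutoCollapsing(el("text:p"));

console.log(cursorOnlySpan, spanWithText, itemWithSpace, emptyParagraph);
// true false false false
```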

The key concepts here are:
1. Collecting the set of objects/text elements that will be removed BEFORE doing
   any removal prevents counting issues.
2. Deletion of elements is order-agnostic. If the child is processed after the
   parent element, the parent can still collapse automatically.
3. Auto-collapsing containers vs. specifically removable content starts to clarify
   ODF elements that are navigable content vs. containers for content.

Worked Example: Merging two paragraphs
======================================
<text:p id="p1">
    <text:span style="bold">Paragr<c:anchor/>aph 1 t<text:s/>ext</text:span>
</text:p>
<text:p id="p2">
    <text:span style="italic">Paragraph 2<c:cursor/> text</text:span>
</text:p>

Assuming the user has selected the range between the anchor and cursor nodes and
pressed the delete key:

1. Split text nodes at the selection boundaries. This ensures odtDocument.getTextElements
   will never return partially selected text content
2. Fetch all text elements (line 211). For this selection, this will return:
   ["aph 1 t", <text:s/>, "ext", "Paragraph 2"]
3. Fetch all intersecting paragraphs:
   [<text:p id="p1"/>, <text:p id="p2"/>]

4. For each text element, remove it from the DOM using mergeIntoParent (line 216).
   After this step the DOM looks like:
<text:p id="p1">
    <text:span style="bold">Paragr<c:anchor/></text:span>
</text:p>
<text:p id="p2">
    <text:span style="italic"><c:cursor/> text</text:span>
</text:p>

5. For each paragraph intersecting the selection, merge the paragraph contents
   into the first paragraph touching the selection (line 220). After this step,
   the DOM looks like:
<text:p id="p1">
    <text:span style="bold">Paragr<c:anchor/></text:span>
    <text:span style="italic"><c:cursor/> text</text:span>
</text:p>
<text:p id="p2">
</text:p>

6. Finally, remove each paragraph that was merged except the first.
<text:p id="p1">
    <text:span style="bold">Paragr<c:anchor/></text:span>
    <text:span style="italic"><c:cursor/> text</text:span>
</text:p>

7. Fix the cursor positions (i.e., collapse any cursors whose selection is now empty,
   etc.). In this instance, there is no position difference between the start and end
   now (i.e., 0 steps difference), so the cursor is collapsed.
<text:p id="p1">
    <text:span style="bold">Paragr<c:cursor/></text:span>
    <text:span style="italic"> text</text:span>
</text:p>

Note, we never had to explicitly find and save the cursor or anchor nodes. These
migrated up the hierarchy via the mergeIntoParent removal process of the ODF
elements we wanted to delete.
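
Steps 5 and 6 of this example can be sketched as below. It is a simplified
illustration on a plain-object tree: mergeParagraphs() here is a hypothetical
helper that ignores the paragraph styling rules mentioned earlier, and it is not
the code at the referenced lines.

```javascript
// Plain-object stand-in for the DOM (an assumption for illustration only).
function el(name, children = []) {
    const n = { name, value: "", children: [], parent: null };
    children.forEach((c) => { c.parent = n; n.children.push(c); });
    return n;
}
function text(value) {
    const n = el("#text");
    n.value = value;
    return n;
}

// Move the content of every later paragraph into the first one (step 5),
// then drop the emptied paragraphs (step 6).
function mergeParagraphs(paragraphs) {
    const [first, ...rest] = paragraphs;
    rest.forEach((p) => {
        p.children.forEach((c) => { c.parent = first; first.children.push(c); });
        p.children = [];
        const siblings = p.parent.children;
        siblings.splice(siblings.indexOf(p), 1);
        p.parent = null;
    });
    return first;
}

function serialize(n) {
    if (n.name === "#text") return JSON.stringify(n.value);
    return n.children.length === 0
        ? n.name
        : `${n.name}(${n.children.map(serialize).join(",")})`;
}

// The DOM as it looks after step 4 of the example above.
const p1 = el("text:p", [el("text:span", [text("Paragr"), el("c:anchor")])]);
const p2 = el("text:p", [el("text:span", [el("c:cursor"), text(" text")])]);
const body = el("office:text", [p1, p2]);

mergeParagraphs([p1, p2]);
console.log(serialize(body));
// office:text(text:p(text:span("Paragr",c:anchor),text:span(c:cursor," text")))
```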

Worked Example: Removing a list item
====================================
Now, a slightly more complex removal example.

<text:list>
    <text:list-item>
        <text:p id="p1">
            <text:span style="bold">Paragr<c:anchor/>aph 1 t<text:s/>ext</text:span>
        </text:p>
    </text:list-item>
    <text:list-item>
        <text:p id="p2">
            <text:span style="italic">Paragraph 2<c:cursor/> text</text:span>
        </text:p>
    </text:list-item>
</text:list>

Assuming the user has selected the range between the anchor and cursor nodes:

1. Split text nodes at the selection boundaries. This ensures odtDocument.getTextElements
   will never return partially selected text content
2. Fetch all text elements (line 211). For this selection, this will return:
   ["aph 1 t", <text:s/>, "ext", "Paragraph 2"]
3. Fetch all intersecting paragraphs:
   [<text:p id="p1"/>, <text:p id="p2"/>]

4. For each text element, remove it from the DOM using mergeIntoParent (line 216).
   After this step the DOM looks like:
<text:list>
    <text:list-item>
        <text:p id="p1">
            <text:span style="bold">Paragr<c:anchor/></text:span>
        </text:p>
    </text:list-item>
    <text:list-item>
        <text:p id="p2">
            <text:span style="italic"><c:cursor/> text</text:span>
        </text:p>
    </text:list-item>
</text:list>

5. For each paragraph intersecting the selection, merge the paragraph contents
   into the first paragraph touching the selection (line 220). After this step,
   the DOM looks like:
<text:list>
    <text:list-item>
        <text:p id="p1">
            <text:span style="bold">Paragr<c:anchor/></text:span>
            <text:span style="italic"><c:cursor/> text</text:span>
        </text:p>
    </text:list-item>
    <text:list-item>
        <text:p id="p2">
        </text:p>
    </text:list-item>
</text:list>

6. Finally, remove each paragraph that was merged except the first. Paragraph
   removal allows for the same container collapsing behaviour that text element
   removal does (line 166 & 169). Both a list, and a list-item fit the definition
   of an auto-collapsing container based on the previously defined rules, so
   these could get automatically cleaned up when the paragraph element is removed.

   After removing p2, the auto-collapse behaviour would see that the text:list-item
   is removable if it is empty. In this case, there are no more ODF characters or
   text content, so this element is collapsed. The list-item's parent list would
   also be checked to see if it can be removed. As the list still has another
   list-item, and the list-item contains a paragraph, and the paragraph contains
   text and character data, the list will NOT be collapsed.
<text:list>
    <text:list-item>
        <text:p id="p1">
            <text:span style="bold">Paragr<c:anchor/></text:span>
            <text:span style="italic"><c:cursor/> text</text:span>
        </text:p>
    </text:list-item>
</text:list>

7. Fix the cursor positions.
<text:list>
    <text:list-item>
        <text:p id="p1">
            <text:span style="bold">Paragr<c:cursor/></text:span>
            <text:span style="italic"> text</text:span>
        </text:p>
    </text:list-item>
</text:list>

Worked Example: Removing a list
===============================
Finally, removing a list entirely.

<text:p id="p1">
    <text:span style="bold">Paragr<c:anchor/>aph 1 t<text:s/>ext</text:span>
</text:p>
<text:list>
    <text:list-item>
        <text:p id="p2">
            <text:span style="italic">Paragraph 2<c:cursor/> text</text:span>
        </text:p>
    </text:list-item>
</text:list>

Assuming the user has selected the range between the anchor and cursor nodes:

1. Split text nodes at the selection boundaries. This ensures odtDocument.getTextElements
   will never return partially selected text content
2. Fetch all text elements (line 211). For this selection, this will return:
   ["aph 1 t", <text:s/>, "ext", "Paragraph 2"]
3. Fetch all intersecting paragraphs:
   [<text:p id="p1"/>, <text:p id="p2"/>]

4. For each text element, remove it from the DOM using mergeIntoParent (line 216).
   After this step the DOM looks like:
<text:p id="p1">
    <text:span style="bold">Paragr<c:anchor/></text:span>
</text:p>
<text:list>
    <text:list-item>
        <text:p id="p2">
            <text:span style="italic"><c:cursor/> text</text:span>
        </text:p>
    </text:list-item>
</text:list>

5. For each paragraph intersecting the selection, merge the paragraph contents
   into the first paragraph touching the selection (line 220). After this step,
   the DOM looks like:
<text:p id="p1">
    <text:span style="bold">Paragr<c:anchor/></text:span>
    <text:span style="italic"><c:cursor/> text</text:span>
</text:p>
<text:list>
    <text:list-item>
        <text:p id="p2">
        </text:p>
    </text:list-item>
</text:list>

6. Finally, remove each paragraph that was merged except the first. Remember
   that paragraph removal also auto-collapses containers. When p2 is removed,
   its text:list-item is checked to see if it can collapse. In this case, there
   is no content left, so the list-item will remove itself. It will then check
   if the list can collapse, which again is empty. As a result of these checks,
   the list is properly cleaned up once the content is removed.
<text:p id="p1">
    <text:span style="bold">Paragr<c:anchor/></text:span>
    <text:span style="italic"><c:cursor/> text</text:span>
</text:p>

7. Fix the cursor positions as per usual
<text:p id="p1">
    <text:span style="bold">Paragr<c:cursor/></text:span>
    <text:span style="italic"> text</text:span>
</text:p>
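
The difference between the two list examples comes down to how far the collapse
walks up the tree after p2 is removed. A rough sketch, again on a plain-object
tree with a simplified collapse rule (removeWithCollapse() and the name list are
hypothetical, not WebODF code):

```javascript
// Plain-object stand-in for the DOM (an assumption for illustration only).
function el(name, children = []) {
    const n = { name, value: "", children: [], parent: null };
    children.forEach((c) => { c.parent = n; n.children.push(c); });
    return n;
}
function text(value) {
    const n = el("#text");
    n.value = value;
    return n;
}

// Replace a node with its children and return the former parent.
function mergeIntoParent(target) {
    const parent = target.parent;
    const index = parent.children.indexOf(target);
    target.children.forEach((c) => { c.parent = parent; });
    parent.children.splice(index, 1, ...target.children);
    target.children = [];
    target.parent = null;
    return parent;
}

// Simplified collapse rule; the names below are illustrative, not WebODF's lists.
const NEVER_COLLAPSE = new Set(["text:p", "office:text", "text:s", "text:tab"]);
function hasContent(n) {
    return n.name === "#text"
        ? n.value.length > 0
        : n.name === "text:s" || n.name === "text:tab" || n.children.some(hasContent);
}
function isCollapsible(n) {
    return n.parent !== null && !NEVER_COLLAPSE.has(n.name) && !hasContent(n);
}

// Remove a (now empty) paragraph, then collapse each emptied ancestor in turn.
function removeWithCollapse(paragraph) {
    let parent = mergeIntoParent(paragraph);
    while (isCollapsible(parent)) {
        parent = mergeIntoParent(parent);
    }
}

function serialize(n) {
    if (n.name === "#text") return JSON.stringify(n.value);
    return n.children.length === 0
        ? n.name
        : `${n.name}(${n.children.map(serialize).join(",")})`;
}

// Removing a list item: the list still holds p1, so only the list-item collapses.
const emptyA = el("text:p");
const docA = el("office:text", [el("text:list", [
    el("text:list-item", [el("text:p", [text("Paragr text")])]),
    el("text:list-item", [emptyA])
])]);
removeWithCollapse(emptyA);
console.log(serialize(docA));
// office:text(text:list(text:list-item(text:p("Paragr text"))))

// Removing a list: nothing is left inside it, so list-item and list both collapse.
const emptyB = el("text:p");
const docB = el("office:text", [
    el("text:p", [text("Paragr text")]),
    el("text:list", [el("text:list-item", [emptyB])])
]);
removeWithCollapse(emptyB);
console.log(serialize(docB));
// office:text(text:p("Paragr text"))
```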

Well... if that hasn't put you to sleep yet I'm impressed. Any queries, questions,
or complaints... as ever, I'm happy to answer anything I can.

Happy hacking!

Philip

Philip Peitsch