A .docx file is a ZIP archive containing XML files. Task Approach
pandoc or unpack for raw XMLdocx-js - see Creating New Documents below.doc files must be converted before editing:python scripts/office/soffice.py --headless --convert-to docx document.doc# Text extraction with tracked changes pandoc --track-changes=all document.docx -o output.md # Raw XML access python scripts/office/unpack.py document.docx unpacked/ `### Converting to Images` python scripts/office/soffice.py --headless --convert-to pdf document.docx pdftoppm -jpeg -r 150 document.pdf page
python scripts/accept_changes.py input.docx output.docxnpm install -g docxconst { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun, Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink, TableOfContents, HeadingLevel, BorderStyle, WidthType, ShadingType, VerticalAlign, PageNumber, PageBreak } = require('docx'); const doc = new Document({ sections: [{ children: [/* content */] }] }); Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer));
python scripts/office/validate.py doc.docx// CRITICAL: docx-js defaults to A4, not US Letter // Always set page size explicitly for consistent results sections: [{ properties: { page: { size: { width: 12240, // 8.5 inches in DXA height: 15840 // 11 inches in DXA }, margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } // 1 inch margins } }, children: [/* content */] }]
size: { width: 12240, // Pass SHORT edge as width height: 15840, // Pass LONG edge as height orientation: PageOrientation.LANDSCAPE // docx-js swaps them in the XML }, // Content width = 15840 - left margin - right margin (uses the long edge)
const doc = new Document({ styles: { default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default paragraphStyles: [ // IMPORTANT: Use exact IDs to override built-in styles { id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true, run: { size: 32, bold: true, font: "Arial" }, paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // outlineLevel required for TOC { id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true, run: { size: 28, bold: true, font: "Arial" }, paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } }, ] }, sections: [{ children: [ new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] }), ] }] }); `### Lists (NEVER use unicode bullets)` // ❌ WRONG - never manually insert bullet characters new Paragraph({ children: [new TextRun("• Item")] }) // BAD new Paragraph({ children: [new TextRun("\u2022 Item")] }) // BAD // ✅ CORRECT - use numbering config with LevelFormat.BULLET const doc = new Document({ numbering: { config: [ { reference: "bullets", levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT, style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] }, { reference: "numbers", levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT, style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] }, ] }, sections: [{ children: [ new Paragraph({ numbering: { reference: "bullets", level: 0 }, children: [new TextRun("Bullet item")] }), new Paragraph({ numbering: { reference: "numbers", level: 0 }, children: [new TextRun("Numbered item")] }), ] }] }); // ⚠️ Each reference creates INDEPENDENT numbering // Same reference = continues (1,2,3 then 4,5,6) // Different reference = restarts (1,2,3 then 1,2,3)
columnWidths on the table AND width on each cell. Without both, tables render incorrectly on some platforms.// CRITICAL: Always set table width for consistent rendering // CRITICAL: Use ShadingType.CLEAR (not SOLID) to prevent black backgrounds const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" }; const borders = { top: border, bottom: border, left: border, right: border }; new Table({ width: { size: 9360, type: WidthType.DXA }, // Always use DXA (percentages break in Google Docs) columnWidths: [4680, 4680], // Must sum to table width (DXA: 1440 = 1 inch) rows: [ new TableRow({ children: [ new TableCell({ borders, width: { size: 4680, type: WidthType.DXA }, // Also set on each cell shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, // CLEAR not SOLID margins: { top: 80, bottom: 80, left: 120, right: 120 }, // Cell padding (internal, not added to width) children: [new Paragraph({ children: [new TextRun("Cell")] })] }) ] }) ] })
WidthType.DXA — WidthType.PERCENTAGE breaks in Google Docs.// Table width = sum of columnWidths = content width // US Letter with 1" margins: 12240 - 2880 = 9360 DXA width: { size: 9360, type: WidthType.DXA }, columnWidths: [7000, 2360] // Must sum to table width
WidthType.DXA — never WidthType.PERCENTAGE (incompatible with Google Docs)columnWidthswidth must match corresponding columnWidthmargins are internal padding - they reduce content area, not add to cell width// CRITICAL: type parameter is REQUIRED new Paragraph({ children: [new ImageRun({ type: "png", // Required: png, jpg, jpeg, gif, bmp, svg data: fs.readFileSync("image.png"), transformation: { width: 200, height: 150 }, altText: { title: "Title", description: "Desc", name: "Name" } // All three required })] }) `### Page Breaks` // CRITICAL: PageBreak must be inside a Paragraph new Paragraph({ children: [new PageBreak()] }) // Or use pageBreakBefore new Paragraph({ pageBreakBefore: true, children: [new TextRun("New page")] }) `### Table of Contents` // CRITICAL: Headings must use HeadingLevel ONLY - no custom styles new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" }) `### Headers/Footers` sections: [{ properties: { page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } // 1440 = 1 inch }, headers: { default: new Header({ children: [new Paragraph({ children: [new TextRun("Header")] })] }) }, footers: { default: new Footer({ children: [new Paragraph({ children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })] })] }) }, children: [/* content */] }]
width, long edge as height, and set orientation: PageOrientation.LANDSCAPE\n - use separate Paragraph elementsLevelFormat.BULLET with numbering configtype - always specify png/jpg/etcwidth with DXA - never use WidthType.PERCENTAGE (breaks in Google Docs)columnWidths array AND cell width, both must matchmargins: { top: 80, bottom: 80, left: 120, right: 120 } for readable paddingShadingType.CLEAR - never SOLID for table shadingoutlineLevel - required for TOC (0 for H1, 1 for H2, etc.)python scripts/office/unpack.py document.docx unpacked/“ etc.) so they survive editing. Use --merge-runs false to skip run merging.unpacked/word/. See XML Reference below for patterns.<!-- Use these entities for professional typography --> <w:t>Here’s a quote: “Hello”</w:t>
‘’“”comment.py to handle boilerplate across multiple XML files (text must be pre-escaped XML):python scripts/comment.py unpacked/ 0 "Comment text with & and ’" python scripts/comment.py unpacked/ 1 "Reply text" --parent 0 # reply to comment 0 python scripts/comment.py unpacked/ 0 "Text" --author "Custom Author" # custom author name
python scripts/office/pack.py unpacked/ output.docx --original document.docx--validate false to skip.durableId >= 0x7FFFFFFF (regenerates valid ID)xml:space="preserve" on <w:t> with whitespace<w:r> elements: When adding tracked changes, replace the whole <w:r>...</w:r> block with <w:del>...<w:ins>... as siblings. Don't inject tracked change tags inside a run.<w:rPr> formatting: Copy the original run's <w:rPr> block into your tracked change runs to maintain bold, font size, etc.<w:pPr>: <w:pStyle>, <w:numPr>, <w:spacing>, <w:ind>, <w:jc>, <w:rPr> lastxml:space="preserve" to <w:t> with leading/trailing spaces00AB1234)<w:ins w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"> <w:r><w:t>inserted text</w:t></w:r> </w:ins> `**Deletion:**` <w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z"> <w:r><w:delText>deleted text</w:delText></w:r> </w:del>
<w:del>: Use <w:delText> instead of <w:t>, and <w:delInstrText> instead of <w:instrText>.<!-- Change "30 days" to "60 days" --> <w:r><w:t>The term is </w:t></w:r> <w:del w:id="1" w:author="Claude" w:date="..."> <w:r><w:delText>30</w:delText></w:r> </w:del> <w:ins w:id="2" w:author="Claude" w:date="..."> <w:r><w:t>60</w:t></w:r> </w:ins> <w:r><w:t> days.</w:t></w:r> `**Deleting entire paragraphs/list items** - when removing ALL content from a paragraph, also mark the paragraph mark as deleted so it merges with the next paragraph. Add `<w:del/>` inside `<w:pPr><w:rPr>`:` <w:p> <w:pPr> <w:numPr>...</w:numPr> <!-- list numbering if present --> <w:rPr> <w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"/> </w:rPr> </w:pPr> <w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z"> <w:r><w:delText>Entire paragraph content being deleted...</w:delText></w:r> </w:del> </w:p>
<w:del/> in <w:pPr><w:rPr>, accepting changes leaves an empty paragraph/list item.<w:ins w:author="Jane" w:id="5"> <w:del w:author="Claude" w:id="10"> <w:r><w:delText>their inserted text</w:delText></w:r> </w:del> </w:ins> `**Restoring another author's deletion** - add insertion after (don't modify their deletion):` <w:del w:author="Jane" w:id="5"> <w:r><w:delText>deleted text</w:delText></w:r> </w:del> <w:ins w:author="Claude" w:id="10"> <w:r><w:t>deleted text</w:t></w:r> </w:ins>
comment.py (see Step 2), add markers to document.xml. For replies, use --parent flag and nest markers inside the parent's.<w:commentRangeStart> and <w:commentRangeEnd> are siblings of <w:r>, never inside <w:r>.<!-- Comment markers are direct children of w:p, never inside w:r --> <w:commentRangeStart w:id="0"/> <w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"> <w:r><w:delText>deleted</w:delText></w:r> </w:del> <w:r><w:t> more text</w:t></w:r> <w:commentRangeEnd w:id="0"/> <w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r> <!-- Comment 0 with reply 1 nested inside --> <w:commentRangeStart w:id="0"/> <w:commentRangeStart w:id="1"/> <w:r><w:t>text</w:t></w:r> <w:commentRangeEnd w:id="1"/> <w:commentRangeEnd w:id="0"/> <w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r> <w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="1"/></w:r>
word/media/word/_rels/document.xml.rels:<Relationship Id="rId5" Type=".../image" Target="media/image1.png"/>[Content_Types].xml:<Default Extension="png" ContentType="image/png"/><w:drawing> <wp:inline> <wp:extent cx="914400" cy="914400"/> <!-- EMUs: 914400 = 1 inch --> <a:graphic> <a:graphicData uri=".../picture"> <pic:pic> <pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill> </pic:pic> </a:graphicData> </a:graphic> </wp:inline> </w:drawing>
npm install -g docx (new documents)scripts/office/soffice.py)pdftoppm for images