CollateX Python documentation main page
Overview
This page documents the API for CollateX Python 2.2, with particular attention to the input and output formats.
Information about the Gothenburg model of textual variation and the variant graph data model is available at the main CollateX site at https://collatex.net.
Tutorial information about using CollateX Python is available at https://github.com/DiXiT-eu/collatex-tutorial. Those materials were written for an earlier release of CollateX Python, and may be superseded in part by the present page.
The latest stable version of CollateX can be installed under Python 3 with pip install --upgrade collatex (see below for additional installation information). The latest development version can be installed with pip install --upgrade --pre collatex. The CollateX source is available at https://github.com/interedition/collatex, where CollateX Python is in the collatex-pythonport subdirectory. Instructions for running CollateX Python from within a Docker container are available at https://github.com/djbpitt/collatex-docker.
Installation
Basic installation instructions for CollateX Python are available at https://github.com/DiXiT-eu/collatex-tutorial/blob/master/unit1/Installation.ipynb.
If you want to be able to render variant graphs in the Jupyter Notebook interface (which is optional), you must install both the Graphviz stand-alone program and the graphviz Python package. Graphviz (the stand-alone program) installation has been simplified since the time the basic installation instructions were written; use the newer method at https://graphviz.gitlab.io/download/. Install the graphviz Python package with pip install graphviz.
Getting started
Import and use the CollateX Python package as follows:
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The quick brown fox jumps over the dog.")
collation.add_plain_witness("B", "The brown fox jumps over the lazy dog.")
alignment_table = collate(collation)
print(alignment_table)
The preceding reads plain-text data, applies default tokenization and normalization, aligns the witnesses, and outputs an ASCII alignment table (see below). As described at https://github.com/DiXiT-eu/collatex-tutorial, you can replace the default tokenization and normalization with methods customized to fit the shape of your data, and you can read input from files, instead of specifying it literally. These procedures are not described further in this document. The input and output formats supported by CollateX Python are described below, as are the parameters that control the alignment process.
Alignment parameters
The first argument to the collate() function is the name of the Collation object. The output, layout, and indent parameters control the shape of the output, and are discussed below. There are also two parameters that control the way the alignment is performed or rendered independently of output format: segmentation and near_match. Both take Boolean values (True or False).
The segmentation parameter
The segmentation parameter determines whether each token is output separately (False) or whether adjacent tokens are merged into a single output node or cell when they are alike in showing, or not showing, variation (True). The default is True, so collate(collation), using the sample input above, produces:
+---+-----+-------+--------------------------+------+------+
| A | The | quick | brown fox jumps over the | -    | dog. |
| B | The | -     | brown fox jumps over the | lazy | dog. |
+---+-----+-------+--------------------------+------+------+
while collate(collation, segmentation=False) produces:
+---+-----+-------+-------+-----+-------+------+-----+------+-----+---+
| A | The | quick | brown | fox | jumps | over | the | -    | dog | . |
| B | The | -     | brown | fox | jumps | over | the | lazy | dog | . |
+---+-----+-------+-------+-----+-------+------+-----+------+-----+---+
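To make the effect of segmentation concrete, the following is a minimal sketch in plain Python that does not call CollateX. It starts from a token-level table like the one above and merges adjacent columns that are alike in showing, or not showing, variation. The function name merge_segments and the table representation are assumptions made for this illustration; CollateX itself works on the variant graph, so its results may differ in more complex cases.

```python
from itertools import groupby

def merge_segments(rows):
    """Merge adjacent columns that share the same variation status.

    rows maps a witness siglum to a list of cells; each cell is a token
    string (with its trailing whitespace) or None for a gap.
    Hypothetical helper for illustration only; not part of CollateX.
    """
    sigla = list(rows)
    columns = list(zip(*(rows[s] for s in sigla)))

    def has_variation(column):
        # A column varies if any witness has a gap or the readings differ.
        return None in column or len(set(column)) > 1

    merged = {s: [] for s in sigla}
    for _, group in groupby(columns, key=has_variation):
        group = list(group)
        for i, s in enumerate(sigla):
            text = "".join(col[i] or "" for col in group).strip()
            merged[s].append(text or "-")
    return merged

token_table = {
    "A": ["The ", "quick ", "brown ", "fox ", "jumps ", "over ", "the ", None, "dog", "."],
    "B": ["The ", None, "brown ", "fox ", "jumps ", "over ", "the ", "lazy ", "dog", "."],
}
segments = merge_segments(token_table)
for siglum, cells in segments.items():
    print(siglum, cells)
```

For this input the merged cells correspond to the five columns of the segmented table shown earlier.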
The near_match parameter
Understanding near matching (also called fuzzy matching) requires understanding how CollateX Python performs alignment. By default, CollateX aligns only tokens that are string-equal (after normalization). Additionally, some non-matching tokens may wind up aligned because they are sandwiched between matching tokens; we call this a forced match. For example:
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The gray koala")
collation.add_plain_witness("B", "The brown koala")
alignment_table = collate(collation)
print(alignment_table)
outputs:
+---+-----+-------+-------+
| A | The | gray  | koala |
| B | The | brown | koala |
+---+-----+-------+-------+
Although “gray” and “brown” are not string-equal, they are forced into alignment because they are sandwiched between exact matches at “The” (before) and “koala” (after).
The near_match parameter controls the behavior of CollateX Python in some situations where no exact alignment is possible because there is neither string-equality nor a forced-match environment. Consider:
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The big gray koala")
collation.add_plain_witness("B", "The grey koala")
alignment_table = collate(collation, segmentation=False)
print(alignment_table)
which outputs:
+---+-----+------+------+-------+
| A | The | big  | gray | koala |
| B | The | grey | -    | koala |
+---+-----+------+------+-------+
Because “gray” and “grey” are not string-equal, CollateX Python does not know to align them, which means that it does not know whether “grey” in Witness B should be aligned with “big” or with “gray” in Witness A. In situations like this, CollateX Python always chooses the leftmost option, which means that in this case it aligns “grey” with “big”, rather than with “gray”.
Turning on near matching instructs CollateX Python to scrutinize, after performing basic alignment (that is, as part of the Analysis step in the Gothenburg model), situations where the placement of a token is uncertain because 1) it is adjacent to a gap, and 2) it is not string-equal to any value in any of the columns in which it might be placed. In these situations, turning on near matching with near_match=True will cause CollateX Python to abandon its default rule to place tokens in the leftmost position. Instead, CollateX Python will adjust the placement of such tokens according to the closest match. As a result, changing the collation instruction above to:
collate(collation, near_match=True, segmentation=False)
produces:
+---+-----+-----+------+-------+
| A | The | big | gray | koala |
| B | The | -   | grey | koala |
+---+-----+-----+------+-------+
The definition of closest match is complicated because, in the case of multiple witnesses, a token may be closer to some readings in one column than to others. CollateX Python uses the closest match in each column, where “closest” is determined by the Levenshtein.ratio() function. This is not guaranteed to correct all initial misalignments that could be improved by identifying a nearest match.
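The column-selection rule can be sketched as follows. This is an illustration only, not CollateX's implementation: difflib.SequenceMatcher from the standard library stands in for Levenshtein.ratio() (the two measures are similar but not identical), and the function names similarity and closest_column are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in for Levenshtein.ratio(); returns a value between 0 and 1.
    return SequenceMatcher(None, a, b).ratio()

def closest_column(token, candidate_columns):
    """Return the index of the candidate column whose best reading is
    most similar to token. Ties go to the leftmost column.

    candidate_columns is a list of columns; each column is a non-empty
    list of the normalized readings that other witnesses attest there.
    """
    best_index, best_score = 0, -1.0
    for index, readings in enumerate(candidate_columns):
        # Use the closest reading in the column as that column's score.
        score = max(similarity(token, reading) for reading in readings)
        if score > best_score:
            best_index, best_score = index, score
    return best_index

# "grey" (Witness B) could sit under "big" or under "gray" (Witness A).
columns = [["big"], ["gray"]]
chosen = closest_column("grey", columns)
print(columns[chosen])
```

With more than two witnesses, each inner list would hold several readings, and the column is scored by whichever of them is closest to the token.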
Because near matching operates on individual tokens, segmentation must be set to False whenever near matching is used. Failure to specify segmentation=False when performing near matching will raise an error.
Input
CollateX Python accepts input as either plain text or pretokenized JSON. In either case, the input may be 1) specified literally in the code (as in the examples above); 2) read directly from the file system or elsewhere; or 3) computed dynamically. These three alternatives are discussed in the general tutorials at https://github.com/DiXiT-eu/collatex-tutorial.
Plain text input
Plain text input is illustrated above. The witnesses are added to the Collation object, which can then be passed as the first argument to the collate() function. Plain text input uses default tokenization (split on whitespace, treat punctuation as separate tokens) and default normalization (strip trailing whitespace). If you want to perform custom tokenization or normalization, you must create pretokenized JSON input.
Pretokenized JSON input
In the following example, a JSON string has been assigned to the variable json_input; once it has been parsed with json.loads(), the result can be passed directly as the first argument to the collate() function. The structure CollateX requires for JSON input is described and illustrated at https://collatex.net/doc/.
import json
from collatex import *
collation = Collation()
json_input = """{
"witnesses": [
{
"id": "A",
"tokens": [
{
"t": "The ",
"n": "The"
},
{
"t": "quick ",
"n": "quick"
},
{
"t": "brown ",
"n": "brown"
},
{
"t": "fox ",
"n": "fox"
},
{
"t": "jumps ",
"n": "jumps"
},
{
"t": "over ",
"n": "over"
},
{
"t": "the ",
"n": "the"
},
{
"t": "dog",
"n": "dog"
},
{
"t": ".",
"n": "."
}
]
},
{
"id": "B",
"tokens": [
{
"t": "The ",
"n": "The"
},
{
"t": "brown ",
"n": "brown"
},
{
"t": "fox ",
"n": "fox"
},
{
"t": "jumps ",
"n": "jumps"
},
{
"t": "over ",
"n": "over"
},
{
"t": "the ",
"n": "the"
},
{
"t": "lazy ",
"n": "lazy"
},
{
"t": "dog",
"n": "dog"
},
{
"t": ".",
"n": "."
}
]
}
]
}"""
print(collate(json.loads(json_input)))
The output is
+---+-----+-------+--------------------------+------+------+
| A | The | quick | brown fox jumps over the | -    | dog. |
| B | The | -     | brown fox jumps over the | lazy | dog. |
+---+-----+-------+--------------------------+------+------+
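Because custom tokenization or normalization requires pretokenized input, a structure like the one above is usually generated rather than typed by hand. The following minimal sketch builds it from plain strings using only the standard library. The regular expression approximates the default behavior described above (words keep their trailing whitespace, punctuation becomes a separate token), and lowercasing is included only as an example of custom normalization. The function names tokenize, normalize, and to_pretokenized are hypothetical.

```python
import json
import re

# Words keep trailing whitespace; runs of other characters form their own tokens.
TOKEN_PATTERN = re.compile(r"\w+\s*|\W+")

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

def normalize(token):
    # Example of custom normalization: strip whitespace and fold case.
    return token.strip().lower()

def to_pretokenized(witnesses):
    """Build the {"witnesses": [...]} structure from {siglum: text}."""
    return {
        "witnesses": [
            {
                "id": siglum,
                "tokens": [{"t": t, "n": normalize(t)} for t in tokenize(text)],
            }
            for siglum, text in witnesses.items()
        ]
    }

pretokenized = to_pretokenized({
    "A": "The quick brown fox jumps over the dog.",
    "B": "The brown fox jumps over the lazy dog.",
})
print(json.dumps(pretokenized["witnesses"][0]["tokens"][:2], indent=2))
```

The resulting dictionary has the same shape as json.loads(json_input) in the example above, so it could be passed to collate() in the same way.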
Output
Overview
CollateX Python supports the following output formats: ASCII table, HTML table (default and colorized, both only in the Jupyter Notebook interface), SVG variant graph (default and simple, both only in the Jupyter Notebook interface; SVG output requires the Graphviz executable and Python graphviz package), CSV, TSV, generic XML, TEI-XML, and JSON. Output support is planned for GraphML; support is also planned for saving HTML and SVG output for reuse outside the Jupyter Notebook interface.
Output formats
The output format is specified with the output parameter to the collate() function, e.g., collate(collation, output="svg"). The default is the ASCII table, which can also be specified as output="table". In the following examples, the variable collation is a Collation object.
ASCII table
collate(collation), without any output value, creates a horizontal ASCII table, along the lines of:
+---+-----+-------+--------------------------+------+------+
| A | The | quick | brown fox jumps over the | -    | dog. |
| B | The | -     | brown fox jumps over the | lazy | dog. |
+---+-----+-------+--------------------------+------+------+
The ASCII table output is not printed by default. The typical way to use it inside the Jupyter Notebook interface is:
alignment_table = collate(collation)
print(alignment_table)
You can create a vertical table (most useful when there are many witnesses) with collate(collation, layout="vertical"). The output looks like:
+----------------------+----------------------+
|          A           |          B           |
+----------------------+----------------------+
|         The          |         The          |
+----------------------+----------------------+
|        quick         |          -           |
+----------------------+----------------------+
| brown fox jumps over | brown fox jumps over |
|         the          |         the          |
+----------------------+----------------------+
|          -           |         lazy         |
+----------------------+----------------------+
|         dog.         |         dog.         |
+----------------------+----------------------+
HTML table
CollateX Python supports two HTML output methods, html and html2. Unlike the ASCII table, which must be printed with a print() statement, both HTML formats automatically write their output to the screen inside the Jupyter Notebook interface. These output formats are intended for use only inside Jupyter Notebook, and CollateX Python currently does not expose a method to save them for use elsewhere.
Create HTML output with:
collate(collation, output="html")
By default the html method, like the ASCII table method, creates a horizontal table. You can create a vertical table instead with:
collate(collation, output="html", layout="vertical")
The html2 method produces only vertical output (the layout parameter is ignored) and the output is colorized, which makes it easier to distinguish zones with variation (red background) and those without (cyan). The following is the beginning of the html2 output of a collation of the six editions of Charles Darwin’s On the origin of species published in the author’s lifetime:
SVG variant graph
Two types of SVG output are supported for visualizing the variant graph, svg_simple and svg. SVG output, like HTML output and unlike ASCII table output, is rendered automatically by the collate() function inside the Jupyter Notebook interface. The two SVG output formats are intended for use only inside Jupyter Notebook, and CollateX Python currently does not expose a method to save them for use elsewhere. (CollateX Python currently writes the SVG file to disk with the filename Digraph.gv.svg in the current working directory before rendering it inside Jupyter Notebook. This behavior is a side-effect and is not guaranteed to be supported in future releases.)
The svg output method outputs a two-column table for each node in the variant graph. The upper left cell contains the n (normalized) value of the token and the upper right cell contains the number of witnesses that share that n value. Subsequent rows contain the t (textual, that is, diplomatic) value in the left column and the sigla of witnesses that attest that t value in the right column. For example, the following code:
from collatex import *
import json
collation = Collation()
json_input = """{
"witnesses": [
{
"id": "A",
"tokens": [
{
"t": "The ",
"n": "The"
},
{
"t": "gray ",
"n": "gray"
},
{
"t": "koala",
"n": "koala"
}
]
},
{
"id": "B",
"tokens": [
{
"t": "The ",
"n": "The"
},
{
"t": "grey ",
"n": "gray"
},
{
"t": "koala",
"n": "koala"
}
]
},
{
"id": "C",
"tokens": [
{
"t": "The ",
"n": "The"
},
{
"t": "brown ",
"n": "brown"
},
{
"t": "koala",
"n": "koala"
}
]
}
]
}"""
collate(json.loads(json_input), output="svg")
produces this output:
The SVG output creates start and end nodes that mark the beginnings and ends of all witnesses. All three witnesses attest the same readings for “The” and “koala”. The readings diverge with respect to the color: Witness C attests “brown” and Witnesses A and B share an n value of “gray”, but with different t values (“gray” for Witness A and “grey” for Witness B). The edges are labeled according to the witnesses; the complete reading of any witness can be reconstructed by following the labeled edges for that witness.
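The content of a single node table can be modeled with a small data structure. The sketch below groups the readings of one aligned position first by n value and then by t value, mirroring the layout described above. It does not produce SVG, and the function name summarize_position and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

def summarize_position(readings):
    """Group (siglum, t, n) readings by n value, then by t value.

    Returns {n: {"witness_count": int, "t_values": {t: [sigla]}}}.
    Hypothetical helper; not part of CollateX.
    """
    nodes = {}
    for siglum, t, n in readings:
        node = nodes.setdefault(n, {"witness_count": 0, "t_values": defaultdict(list)})
        node["witness_count"] += 1
        node["t_values"][t].append(siglum)
    # Convert the inner defaultdicts to plain dicts for easy printing.
    return {
        n: {"witness_count": node["witness_count"], "t_values": dict(node["t_values"])}
        for n, node in nodes.items()
    }

# The second token of each witness in the example above.
color_readings = [("A", "gray ", "gray"), ("B", "grey ", "gray"), ("C", "brown ", "brown")]
nodes = summarize_position(color_readings)
for n, node in nodes.items():
    print(n, node["witness_count"], node["t_values"])
```

Each key of the result corresponds to one node, the witness count to the upper right cell, and each t value with its sigla to one of the subsequent rows.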
Separate information about n and t values is most important in cases involving complex custom normalization. For simpler output, the svg_simple type renders only the n values, and produces:
CSV and TSV
The output methods csv and tsv produce comma-separated-value (CSV) and tab-separated-value (TSV) output, respectively, following the layout of the basic ASCII table, where each row corresponds to a witness. For example, with the JSON input above, collate(json.loads(json_input), output="csv") produces:
A,The ,gray ,koala\nB,The ,grey ,koala\nC,The ,brown ,koala\n
and collate(json.loads(json_input), output="tsv") produces:
A\tThe \tgray \tkoala\nB\tThe \tgrey \tkoala\nC\tThe \tbrown \tkoala\n
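Since each row corresponds to a witness, output of this kind can be read back with the standard csv module. The following sketch parses the CSV string shown above into a dictionary keyed by siglum. The variable names are illustrative, and the string is copied from the example rather than produced by calling CollateX.

```python
import csv
import io

# The CSV string shown above (each \n ends the row for one witness).
csv_output = "A,The ,gray ,koala\nB,The ,grey ,koala\nC,The ,brown ,koala\n"

readings = {}
for row in csv.reader(io.StringIO(csv_output)):
    # The first field is the siglum; the remaining fields are the cells.
    siglum, *cells = row
    readings[siglum] = cells

print(readings["B"])

# For TSV output, pass delimiter="\t" to csv.reader instead.
```

Note that the cells keep the trailing whitespace of the t values, so strip them if only the words are needed.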
Generic XML
Using collate(collation, output="xml") creates the following string result (as a single long line; the pretty-printing in this example has been introduced manually):
<root>
<app>
<rdg wit="#A">The </rdg>
<rdg wit="#B">The </rdg>
<rdg wit="#C">The </rdg>
</app>
<app>
<rdg wit="#A">gray </rdg>
<rdg wit="#B">grey </rdg>
<rdg wit="#C">brown </rdg>
</app>
<app>
<rdg wit="#A">koala</rdg>
<rdg wit="#B">koala</rdg>
<rdg wit="#C">koala</rdg>
</app>
</root>
String values are the t properties; the n properties are not exported. The schema is based on TEI parallel segmentation, except that:
- All output is wrapped in <app> elements, even where there is no variation.
- Each witness is a separate <rdg> element, even where it agrees with other witnesses.
It is intended that users who require a specific type of XML output will postprocess this generic XML with XSLT or other means.
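As one example of such postprocessing in Python rather than XSLT, the following sketch uses xml.etree.ElementTree to merge <rdg> elements that agree and to unwrap <app> elements that show no variation. The function name collapse_readings is hypothetical, and the returned structure is just one possible target, not a format that CollateX prescribes.

```python
import xml.etree.ElementTree as ET

# The generic XML shown above, as a single line.
generic_xml = (
    '<root>'
    '<app><rdg wit="#A">The </rdg><rdg wit="#B">The </rdg><rdg wit="#C">The </rdg></app>'
    '<app><rdg wit="#A">gray </rdg><rdg wit="#B">grey </rdg><rdg wit="#C">brown </rdg></app>'
    '<app><rdg wit="#A">koala</rdg><rdg wit="#B">koala</rdg><rdg wit="#C">koala</rdg></app>'
    '</root>'
)

def collapse_readings(xml_text):
    """Return one item per <app>: a plain string where all readings
    present agree, otherwise a list of (witnesses, text) pairs in which
    agreeing witnesses have been merged."""
    items = []
    for app in ET.fromstring(xml_text).iter("app"):
        grouped = {}
        for rdg in app.iter("rdg"):
            grouped.setdefault(rdg.text or "", []).append(rdg.get("wit"))
        if len(grouped) == 1:
            items.append(next(iter(grouped)))
        else:
            items.append([(" ".join(wits), text) for text, wits in grouped.items()])
    return items

collapsed = collapse_readings(generic_xml)
for item in collapsed:
    print(item)
```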
TEI-XML
The following example illustrates different patterns of agreement and variation:
from collatex import *
collation = Collation()
collation.add_plain_witness("A","The big old gray koala:")
collation.add_plain_witness("B", "The big gray fuzzy koala.")
collation.add_plain_witness("C","The grey fuzzy wombat!")
table = collate(collation, segmentation=False, near_match=True)
print(table)
The ASCII table output looks like:
+---+-----+-----+-----+------+-------+--------+---+
| A | The | big | old | gray | -     | koala  | : |
| B | The | big | -   | gray | fuzzy | koala  | . |
| C | The | -   | -   | grey | fuzzy | wombat | ! |
+---+-----+-----+-----+------+-------+--------+---+
TEI output can be specified with collate(collation, output="tei"). When we apply the following to the same input:
tei = collate(collation, output="tei", segmentation=False, near_match=True)
print(tei)
it produces the following output as a single line. The line breaks below were introduced manually for ease of reading; they are inside tags to avoid deforming the whitespace handling inside text nodes.
<?xml version="1.0" ?><cx:apparatus xmlns="http://www.tei-c.org/ns/1.0"
xmlns:cx="http://interedition.eu/collatex/ns/1.0">The <app><rdg
wit="#A #B">big</rdg></app> <app><rdg
wit="#A">old</rdg></app> <app><rdg
wit="#A #B">gray</rdg><rdg
wit="#C">grey</rdg></app> <app><rdg
wit="#B #C">fuzzy</rdg></app> <app><rdg
wit="#A #B">koala</rdg><rdg
wit="#C">wombat</rdg></app><app><rdg
wit="#A">:</rdg><rdg
wit="#B">.</rdg><rdg
wit="#C">!</rdg></app></cx:apparatus>
CollateX combines information about trailing whitespace with the preceding token inside the t value, and in the TEI output those spaces are moved from inside the <rdg> to after the <app>. This is usually what users expect, except that information about whitespace differences in the input (e.g., the same word followed by a space character in one witness but not in another, or by a space vs two spaces, or by a space vs a newline) is not preserved in the TEI output.
CollateX Python TEI output wraps the collation information in a <cx:apparatus> element in a CollateX namespace (http://interedition.eu/collatex/ns/1.0) that is bound to the prefix cx:. The wrapper also creates a default namespace declaration for the TEI namespace, which means that all <app> and <rdg> elements are in the TEI namespace.
More legible output is available by setting the indent parameter to True, in which case running:
tei = collate(collation, output="tei", segmentation=False, near_match=True, indent=True)
print(tei)
on the same input produces:
<?xml version="1.0" ?>
<cx:apparatus xmlns="http://www.tei-c.org/ns/1.0" xmlns:cx="http://interedition.eu/collatex/ns/1.0">
The
<app>
<rdg wit="#A #B">big</rdg>
</app>
<app>
<rdg wit="#A">old</rdg>
</app>
<app>
<rdg wit="#A #B">gray</rdg>
<rdg wit="#C">grey</rdg>
</app>
<app>
<rdg wit="#B #C">fuzzy</rdg>
</app>
<app>
<rdg wit="#A #B">koala</rdg>
<rdg wit="#C">wombat</rdg>
</app>
<app>
<rdg wit="#A">:</rdg>
<rdg wit="#B">.</rdg>
<rdg wit="#C">!</rdg>
</app>
</cx:apparatus>
Pretty-printing should be used only for examination, and not for subsequent processing, since it incorrectly inserts XML-significant whitespace inside and around text nodes (see “The ”, near the beginning of the output).
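A consumer of the single-line (not pretty-printed) TEI output has to take the two namespaces into account. The following sketch uses xml.etree.ElementTree to rebuild the running text of one witness from the single-line output shown earlier. The function name witness_text is hypothetical, and the handling of whitespace after an <app> is an assumption that suits this example, where only whitespace occurs between <app> elements; shared text between <app> elements would need additional handling.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # default namespace of <app> and <rdg>

# The single-line TEI output shown above.
tei_output = (
    '<?xml version="1.0" ?><cx:apparatus xmlns="http://www.tei-c.org/ns/1.0" '
    'xmlns:cx="http://interedition.eu/collatex/ns/1.0">The '
    '<app><rdg wit="#A #B">big</rdg></app> '
    '<app><rdg wit="#A">old</rdg></app> '
    '<app><rdg wit="#A #B">gray</rdg><rdg wit="#C">grey</rdg></app> '
    '<app><rdg wit="#B #C">fuzzy</rdg></app> '
    '<app><rdg wit="#A #B">koala</rdg><rdg wit="#C">wombat</rdg></app>'
    '<app><rdg wit="#A">:</rdg><rdg wit="#B">.</rdg><rdg wit="#C">!</rdg></app>'
    '</cx:apparatus>'
)

def witness_text(xml_text, siglum):
    """Rebuild the running text of one witness from the TEI output."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]  # text before the first <app>
    for app in root.findall(TEI + "app"):
        for rdg in app.findall(TEI + "rdg"):
            if "#" + siglum in rdg.get("wit", "").split():
                parts.append(rdg.text or "")
                # Assumption: whitespace after an <app> belongs only to
                # the witnesses that have a reading inside it.
                parts.append(app.tail or "")
                break
    return "".join(parts)

for siglum in ("A", "B", "C"):
    print(siglum, repr(witness_text(tei_output, siglum)))
```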
JSON
Setting the output value to "json" produces JSON output. This is the most complete output format, and therefore a common choice for subsequent processing.
By default the output produced is equivalent to the horizontal alignment tables above. If the layout option is set to vertical then the export will be equivalent to the vertical alignment tables above. The latter produces a format that is structurally the same as the json output option from the Java microservices version of CollateX. The only difference between the two is that the Python export uses None (serialized as null in the JSON) for empty cells, whereas the Java microservices use an empty array.
Script
from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The quick brown fox jumps over the dog.")
collation.add_plain_witness("B", "The brown fox jumps over the lazy dog.")
alignment_table = collate(collation, output="json")
print(alignment_table)
Output
{
"table": [
[
[
{
"_sigil": "A",
"_token_array_position": 0,
"n": "The",
"t": "The "
}
],
[
{
"_sigil": "A",
"_token_array_position": 1,
"n": "quick",
"t": "quick "
}
],
[
{
"_sigil": "A",
"_token_array_position": 2,
"n": "brown",
"t": "brown "
},
{
"_sigil": "A",
"_token_array_position": 3,
"n": "fox",
"t": "fox "
},
{
"_sigil": "A",
"_token_array_position": 4,
"n": "jumps",
"t": "jumps "
},
{
"_sigil": "A",
"_token_array_position": 5,
"n": "over",
"t": "over "
},
{
"_sigil": "A",
"_token_array_position": 6,
"n": "the",
"t": "the "
}
],
null,
[
{
"_sigil": "A",
"_token_array_position": 7,
"n": "dog",
"t": "dog"
},
{
"_sigil": "A",
"_token_array_position": 8,
"n": ".",
"t": "."
}
]
],
[
[
{
"_sigil": "B",
"_token_array_position": 10,
"n": "The",
"t": "The "
}
],
null,
[
{
"_sigil": "B",
"_token_array_position": 11,
"n": "brown",
"t": "brown "
},
{
"_sigil": "B",
"_token_array_position": 12,
"n": "fox",
"t": "fox "
},
{
"_sigil": "B",
"_token_array_position": 13,
"n": "jumps",
"t": "jumps "
},
{
"_sigil": "B",
"_token_array_position": 14,
"n": "over",
"t": "over "
},
{
"_sigil": "B",
"_token_array_position": 15,
"n": "the",
"t": "the "
}
],
[
{
"_sigil": "B",
"_token_array_position": 16,
"n": "lazy",
"t": "lazy "
}
],
[
{
"_sigil": "B",
"_token_array_position": 17,
"n": "dog",
"t": "dog"
},
{
"_sigil": "B",
"_token_array_position": 18,
"n": ".",
"t": "."
}
]
]
],
"witnesses": [
"A",
"B"
]
}
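The JSON output can be reduced to a simple grid of strings for further processing. The sketch below assumes that the result is a JSON string, as the printed output above suggests, and that the layout is the default horizontal one (one row per witness). To keep the example short, the sample is trimmed to the first two columns of the output shown above; the names cell_text, grid, and java_style are illustrative.

```python
import json

# Trimmed to the first two columns of the output shown above.
json_output = """{
  "table": [
    [
      [{"_sigil": "A", "_token_array_position": 0, "n": "The", "t": "The "}],
      [{"_sigil": "A", "_token_array_position": 1, "n": "quick", "t": "quick "}]
    ],
    [
      [{"_sigil": "B", "_token_array_position": 10, "n": "The", "t": "The "}],
      null
    ]
  ],
  "witnesses": ["A", "B"]
}"""

def cell_text(cell, key="t"):
    # An empty cell is null in the JSON and None after parsing.
    if cell is None:
        return "-"
    return "".join(token[key] for token in cell).strip()

data = json.loads(json_output)
grid = {
    siglum: [cell_text(cell) for cell in row]
    for siglum, row in zip(data["witnesses"], data["table"])
}
print(grid)

# Replacing None with an empty list gives the empty-cell convention
# attributed above to the Java microservices.
java_style = [[cell if cell is not None else [] for cell in row] for row in data["table"]]
```

Passing key="n" to cell_text would build the same grid from the normalized values instead.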
Supplementary output parameters
The layout parameter
The layout parameter controls whether table output is “horizontal” (which is the default) or “vertical”. It is relevant only for output types table, html, and json. Otherwise it is ignored: html2 output is always vertical, and the other output types are not tabular.
The indent parameter
The indent parameter controls whether TEI-XML output is pretty-printed. The default is to serialize the entire XML output in a single line; setting indent to any value other than None will cause the output to be pretty-printed instead. As with the @indent attribute on <xsl:output>, pretty-printing inserts whitespace that may impinge on the quality of the output. The indent parameter is ignored for all methods except tei.
Summary of output types
In the following table, possible values of the output parameter are listed in the left column, and their ability to combine with the segmentation, near_match, layout, and indent parameters is indicated (“yes” or “no”) in the other columns.
output | segmentation | near_match | layout | indent |
---|---|---|---|---|
table | yes | yes | yes | no |
html | yes | yes | yes | no |
html2 | yes | yes | no | no |
svg_simple | yes | yes | no | no |
svg | yes | yes | no | no |
xml | yes | yes | no | no |
tei | yes | yes | no | yes |
json | yes | yes | yes | no |
Recall that near matching is incompatible with segmentation, so near_match=True requires segmentation=False.
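The summary table and the near-matching constraint can also be encoded as data and checked before calling collate(). The following is a sketch of a hypothetical helper, check_options, which is not part of CollateX; it only reports combinations that the table above marks as having no effect, plus the near_match and segmentation conflict.

```python
# Whether layout and indent affect each output type, per the table above.
OPTION_SUPPORT = {
    "table": {"layout": True, "indent": False},
    "html": {"layout": True, "indent": False},
    "html2": {"layout": False, "indent": False},
    "svg_simple": {"layout": False, "indent": False},
    "svg": {"layout": False, "indent": False},
    "xml": {"layout": False, "indent": False},
    "tei": {"layout": False, "indent": True},
    "json": {"layout": True, "indent": False},
}

def check_options(output="table", segmentation=True, near_match=False,
                  layout=None, indent=None):
    """Return a list of notes about a set of collate() options."""
    notes = []
    if near_match and segmentation:
        notes.append("near_match=True requires segmentation=False")
    support = OPTION_SUPPORT.get(output)
    if support is None:
        notes.append(f"output={output!r} is not listed in the summary table")
        return notes
    if layout is not None and not support["layout"]:
        notes.append(f"layout is ignored for output={output!r}")
    if indent is not None and not support["indent"]:
        notes.append(f"indent is ignored for output={output!r}")
    return notes

print(check_options(output="tei", near_match=True, layout="vertical"))
```

An empty list means that the combination is consistent with the table; for example, check_options(output="tei", segmentation=False, near_match=True, indent=True) returns no notes.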