In Literate Programming, Tim Daly wrote:
The best programming language is English. Everything else is notation.
Why? He went on to explain:
Literate programming is like explaining your code to a colleague. You have to declare what you were thinking when you wrote the code, step by step, piece by piece. You are telling the whole story: which step is essential, why the order must be so, or what each piece is used for. All materials are organized in a structure that makes sense to you (not to the compiler).
Consider the best possible world. You've been hired at a company and join a team that is already working on a program. They hand you a book, tell you to go home and read it over the next two weeks. At the end of the two weeks you can work on the program as effectively as anyone on the team. The team has successfully communicated from one human to another.
What is in the book? Remember our calculus textbook? It started from ideas like limits and gradually developed them until they could be expressed in equations. By the time you got to the equations you already understood the concepts. You could look at the equations and see why they matched the text. It is the why that is the important part. It is the part that our programs are missing.
The book you took home uses the same method. You started with the problem in chapter 1. Chapter 2 expresses the ideas needed to solve the problem. The next few chapters expand on each idea, gradually becoming more specific until the idea is reduced to code. By the time you get to the code it should be perfectly clear what the code should look like. Any part of the code you don't understand means that the book needs some additional words.
Technically, literate programming has two components: weave and tangle. Weaving converts the source into a printable document, such as PDF or HTML. Tangling extracts the code contained in the document and assembles it in the right order.
The basic building blocks are chunks and chunk references.
So, for example, we can define a chunk named hello.c:
#include <stdio.h>
int main() {
  printf("Hello, world!\n");
  return 0;
}
The original literate programming designed by Donald Knuth is independent of specific programming languages. What is presented in a chunk is pseudo code that represents the algorithm, instead of a language-specific implementation. The tangling process also translates the pseudo code into a language-specific implementation, such as Pascal (using WEB) or C (using CWEB). In this document, however, we write language-specific implementations instead of pseudo code in chunks. The reason is that different languages solve the same problem in intrinsically different ways. For example, the trick of using a dictionary to reduce time complexity is common in object-oriented languages. But this trick does not carry over directly to pure functional languages like Haskell, where mutable structures such as a dictionary backed by a hash table are not allowed in pure code. For this reason, we suggest using language-specific implementations instead of pseudo code in chunks. After all, it is the idea underlying the code (that is, the context), rather than the code itself, that really matters for our understanding.
The main reason for choosing HTML is that an HTML file can be displayed directly in a browser. So, there is no need for a separate weaving step. Yet another benefit is that HTML provides a standard for markup languages. Any other markup format (such as Markdown) can be safely converted to HTML.
Compared with TeX, HTML is more modern and more widely used. It can be displayed on any device, including mobile phones. There is no need to install a huge TeX Live environment (the minimal version is about 1GB). All you need is a browser.
The only issue is that some languages, such as HTML itself and C++, heavily use the characters "<" and ">" in the same way as HTML does, such as vector<string> in C++. We have to escape these "tag characters". But not every occurrence needs escaping. For example, the "<" character in x < 0 is not "tag-like", thus there is no need to escape it.
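As a minimal sketch of the escaping step, assuming we simply escape every tag character rather than only the "tag-like" ones, a small helper could look as follows. The name escapeTagChars is hypothetical and is not part of literate.js.

```javascript
// Hypothetical helper (an assumption, not part of literate.js):
// escape the characters that HTML would otherwise read as markup.
function escapeTagChars(code) {
  return code
    .replace(/&/g, "&amp;") // first, so the entities added below are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

console.log(escapeTagChars("vector<string> names;"));
// prints: vector&lt;string&gt; names;
```

Escaping everything is the conservative choice: it is never wrong, although it makes expressions such as x < 0 harder to read in the HTML source.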
As usual, we use the "class" attribute to mark chunks and chunk references. A block chunk is represented by a div element with class "chunk" and a name attribute. The chunk name shall be plain text. For example:
<div class="chunk" name="hello.c">
...code block...
</div>
An inline chunk is represented by a span element with class "chunk" and a name attribute. For example:
<span class="chunk" name="say hello"> ...one line code... </span>
A chunk reference, which is always inline, is represented by a span element with class "chunkref". The name of the referred chunk is the content of the span element. For example:
<span class="chunkref">say hello</span>
Different chunks may share the same name. For example:
<div class="chunk" name="include">
#include <stdio.h>
</div>
and then after some explanation,
<div class="chunk" name="include">
#include <math.h>
</div>
While tangling, chunks with the same name are concatenated in the order in which they appear in the document. Thus, the previous example is equivalent to:
<div class="chunk" name="include">
#include <stdio.h>
#include <math.h>
</div>
Additionally, we can add an "append-newline" attribute to a block chunk. If a block chunk has this attribute, a newline character will be appended to the end of the code block when it is tangled. For example:
<div class="chunk" name="hello.c" append-newline>
void say_hello () {
  printf("Hello, world!\n");
}
</div>
followed by
<div class="chunk" name="hello.c">
void say_hello_again () {
  printf("Hello, world!\n");
}
</div>
will be tangled into
void say_hello () {
  printf("Hello, world!\n");
}

void say_hello_again () {
  printf("Hello, world!\n");
}
There is an empty line after the first function, which makes the code more readable; otherwise, the functions would be densely packed together. The second block chunk does not have the "append-newline" attribute, so no empty line follows the second function.
If "append-newline" is set to a number, multiple newline characters will be appended. For example, append-newline="2" means appending two newline characters. This suits Python, where PEP 8 asks for two blank lines between top-level definitions.
While reading through this document, you will meet block chunks named literate.js and literate.css. Click the head of such a block chunk, and the tangled code will pop up in a new tab. This is the JavaScript and CSS code for writing literate programs in HTML, just like this one.
If you find this document too long to read, just download the JavaScript and CSS code from the repository hosted on GitHub (or Gitee for Chinese users). They are located in literate.js and literate.css respectively.
After obtaining the two files, include them in the head of your HTML file. For example,
<link rel="stylesheet" type="text/css" href="literate.css"/>
<script type="text/javascript" src="literate.js"></script>
Then, remember to weave when the page has loaded. For example,
<script>
window.onload = weaveAll;
</script>
Now, you can start your own journey of literate programming!
In the rest of this document, we implement the JavaScript code for our purpose. We also add some style to make it pretty. You can try clicking the head of each block chunk and get a surprise (a new tab will pop up). Clicking a chunk reference is fun too.
Let us continue and (try to) enjoy this trip.
Chunks are woven for display in the browser. We use JavaScript to convert the HTML elements of chunks and chunk references into ones that are more suitable for display and for cooperating with CSS.
Weaving modifies the innerHTML of chunks and chunk references in place. Then, when we want to tangle them, we need to unweave them first. To avoid this complexity, we store the original chunk or chunk reference in a new element (of the same type as the original) with the class "unwoven" and set it as hidden so that it is not displayed in the browser. Therefore, unweaving is nothing but extracting the element with class "unwoven" from the woven one. (If it is not found, we throw an error.)
We first deal with the chunk reference. A chunk reference is displayed in the format
⟨<a href="chunk-name">chunk name</a>⟩
within the original <span class="chunkref"> element.
The hyperlink links to the first chunk with the referred chunk name. To do so, we have to add an id attribute to that first chunk.
We create a span element for the unwoven, which stores the original innerHTML before weaving. So that the unwoven is not displayed, we set it to be hidden.
The link points to an id. Since there may be whitespace in the chunk name, which is not valid in an ID, we replace it with "-".
Finally, we take the chunkRef element and add the link and the unwoven span to it.
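The weaving of a chunk reference described above can be sketched with plain strings. This is only an illustration under stated assumptions: the helper names chunkId and wovenChunkRefHTML are hypothetical, the "#" in the href is our assumption for linking to an id, and the actual literate.js works on DOM elements rather than on strings.

```javascript
// Hypothetical helpers; names and exact markup are assumptions.

// Turn a chunk name into something usable as an id:
// every whitespace character becomes "-".
function chunkId(chunkName) {
  return chunkName.trim().replace(/\s/g, "-");
}

// Build the woven innerHTML of a chunk reference. The original text is
// kept in a hidden "unwoven" span so that tangling can recover it later.
function wovenChunkRefHTML(chunkName) {
  const link = `<a href="#${chunkId(chunkName)}">${chunkName}</a>`;
  const unwoven = `<span class="unwoven" hidden>${chunkName}</span>`;
  return `⟨${link}⟩${unwoven}`;
}

console.log(chunkId("say hello")); // prints: say-hello
console.log(wovenChunkRefHTML("say hello"));
```

Since chunk names are plain text, no escaping is done here; a name containing tag characters would need it.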
Next, consider the inline chunk. An inline chunk is displayed in the format
<span class="inline-chunk-head">⟨chunk name⟩≡</span>
<code>... one line code ...</code>
within the original <span class="chunk"> element.
We add a span element for the head:
The code itself goes into a code element:
The rest follows weaveChunkRef with a little adaptation.
Finally, consider the block chunk. Almost the same as an inline chunk, a block chunk is displayed in the format
<span class="block-chunk-head">⟨chunk name⟩≡</span>
<pre><code>
... code block ...
</code></pre>
within the original <div class="chunk"> element.
But, we add a link to the head, clicking which will tangle the chunk.
Thus, the head span turns into
⟨<a onclick="tangle(chunkName)">chunkName</a>⟩≡
For the tangle link, we have:
Everything else is the same, except that the container is a div element. So, the code for weaving a block chunk is simply an adaptation of weaveInlineChunk.
A widely known problem is the whitespace before the "code block". The browser displays this whitespace, which is not what we intend. To solve this, we have to regularize the code block wrapped by a pre element.
For example, a code block like this:
    <pre><code>
      if (i = 0) {
        i++;
      }
    </code></pre>
has the raw text (\n for newline and \s for whitespace):
<pre><code>\n\s\s\s\s\s\sif\s(i\s=\s0)\s{\n\s\s\s\s\s\s\s\si++;\n\s\s\s\s\s\s}\n\s\s\s\s</code></pre>
There are six extra whitespace characters in front of each line, and four after the whole code block. There is also an extra newline in front of the code block. We are going to remove these extra characters.
To do so, we count the whitespace in front of each line (whitespaceCount) and subtract indentation from it. Consider the line i++; in the previous example: it has whitespaceCount = 8 and indentation = 6, so this line is indented by 2 whitespace characters (whitespaceCount - indentation).
In addition, a flag that records whether we are currently counting turns out to be helpful.
Suppose that result is empty. In this case, we omit newlines. If we have not yet encountered any character of the code block other than whitespace, which indicates that this may be the first line of code, we increase indentation.
If a newline comes instead, we reset indentation, still expecting the first line of code.
Up to this point, result is empty, and we have encountered only newlines and whitespace. Now, if a code character appears, we append it to result.
From now on, result is no longer empty, and thus indentation is fixed.
We continue appending each new code character to result until we meet a newline. In this case, we set counting to true and start counting whitespace. The counting ends when we encounter a code character. Do not forget to append this newline character to result.
(Here we assume that whitespaceCount is no less than indentation, as it should be.)
In the end, we append the code character to result.
For the previous example, we obtain the following result.
<pre><code>if\s(i\s=\s0)\s{\n\s\si++;\n}\n</code></pre>
Notice that there is still an extra newline character at the end of the code block. And there may be multiple trailing newline characters in other examples. To remove them, we iterate over the result backwards and remove newlines until we encounter another character.
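Putting the steps above together, the regularization can be sketched as a small state machine over the raw text. This is a sketch under assumptions rather than the code of literate.js: the function name regularizeCode is hypothetical, only the space character is treated as whitespace (tabs are treated as code characters), and the trailing newlines are removed with a regular expression instead of a backward loop.

```javascript
// Hypothetical sketch of the regularization described in the text.
function regularizeCode(raw) {
  let result = "";
  let indentation = 0;     // leading spaces of the first line of code
  let whitespaceCount = 0; // leading spaces of the current line
  let counting = false;    // true while reading a line's leading spaces

  for (const ch of raw) {
    if (result === "") {
      // Still in front of the first code character.
      if (ch === "\n") indentation = 0;      // empty line: start over
      else if (ch === " ") indentation += 1; // maybe the first line of code
      else result += ch;                     // first code character
    } else if (ch === "\n") {
      result += ch;
      counting = true;
      whitespaceCount = 0;
    } else if (counting && ch === " ") {
      whitespaceCount += 1;
    } else {
      if (counting) {
        // First code character of a line: emit the relative indentation.
        result += " ".repeat(Math.max(0, whitespaceCount - indentation));
        counting = false;
      }
      result += ch;
    }
  }
  // Remove the trailing newline characters.
  return result.replace(/\n+$/, "");
}

console.log(regularizeCode("\n      if (i = 0) {\n        i++;\n      }\n    "));
```

For the example in the text, the call returns the three lines of code with i++; indented by two spaces and no surrounding whitespace. The Math.max guard only matters for a line that is indented less than the first line, which the text assumes does not happen.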
We have implemented the function for regularizing the code in block chunks. Other code blocks in the HTML file that are wrapped in <pre><code> elements also need regularization. So, we regularize both of them; the order is irrelevant.
Each chunk is woven according to whether it is inline (a span) or block (a div).
The main process of tangling is implemented by a function _tangle, which accepts a string chunkName and returns a string for the tangled code:
To define _tangle, we first collect all chunks that have the name chunkName. As stated in a previous section, there can be multiple chunks with the same name.
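Then we replace every chunk reference in the unwoven chunk with the code it refers to. Iterating by index, as in for (let i = 0; i < unwoven.getElementsByClassName("chunkref").length; i++), leads to mistakes, because each replacement changes the list of chunk references being iterated. Instead, we employ a function that gets the first child of the unwoven chunk that represents a chunk reference. After the replacement, the first chunk reference automatically becomes the next one.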
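The idea of always taking the first remaining chunk reference can be illustrated without the DOM. The sketch below works on strings and a Map from chunk names to code, both of which are assumptions made for the sake of a self-contained example; the names CHUNK_REF and tangleText are hypothetical, and indentation is ignored here.

```javascript
// Hypothetical string-based sketch; literate.js works on DOM elements.
// No "g" flag: exec always finds the FIRST remaining reference.
const CHUNK_REF = /<span class="chunkref">([^<]*)<\/span>/;

function tangleText(chunks, chunkName, depth = 0) {
  if (depth > 100) throw new Error(`Reference cycle near "${chunkName}"`);
  if (!chunks.has(chunkName)) throw new Error(`Unknown chunk "${chunkName}"`);

  let code = chunks.get(chunkName);
  let match;
  // After each replacement the first reference is gone,
  // so the next one automatically becomes the first.
  while ((match = CHUNK_REF.exec(code)) !== null) {
    const subCode = tangleText(chunks, match[1].trim(), depth + 1);
    code =
      code.slice(0, match.index) +
      subCode +
      code.slice(match.index + match[0].length);
  }
  return code;
}

const demoChunks = new Map([
  ["hello.c", 'int main() {\n  <span class="chunkref">say hello</span>\n  return 0;\n}'],
  ["say hello", 'puts("Hello, world!");'],
]);
console.log(tangleText(demoChunks, "hello.c"));
```

Because the referred code is tangled before it is spliced in, it no longer contains references, so the loop terminates once every reference of the outer chunk has been replaced. The depth limit is only a guard against chunks that refer to each other in a cycle.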
The code of each chunk is appended to the code string. Besides, we also add a newline character between different chunks that share the same chunk name.
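If the chunk has no "append-newline" attribute, appendNewline will be null. And we set appendNewline to zero (append zero newlines).
If the attribute is present but has no value (that is, appendNewline is an empty string), we append a single newline character.
If a number is given for appendNewline, we append that many newline characters to the end of code.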
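The handling of "append-newline" while stacking chunks of the same name can be sketched as follows. The function names newlineCount and stackChunks are hypothetical, and each chunk is modelled as a plain object holding its code and the raw attribute value (null when the attribute is absent), which is an assumption made to keep the example free of the DOM.

```javascript
// Hypothetical sketch. appendNewline is the raw attribute value:
// null (absent), "" (present without a value), or a number as a string.
function newlineCount(appendNewline) {
  if (appendNewline === null) return 0; // no attribute: append nothing
  if (appendNewline === "") return 1;   // bare attribute: one newline
  const n = Number.parseInt(appendNewline, 10);
  // Assumption: anything that is not a non-negative number counts as zero.
  return Number.isNaN(n) || n < 0 ? 0 : n;
}

// Concatenate chunks that share a name, in document order.
function stackChunks(chunks) {
  return chunks
    .map(({ code, appendNewline }) => code + "\n".repeat(newlineCount(appendNewline)))
    .join("\n"); // one newline between consecutive chunks
}

console.log(stackChunks([
  { code: "void say_hello () {}", appendNewline: "" },
  { code: "void say_hello_again () {}", appendNewline: null },
]));
```

With a bare attribute on the first chunk, exactly one empty line separates the two functions; with append-newline="2" there would be two.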
The implementation of getFirstChildByClass is straightforward.
For the replacement, we first get the referred chunk name. To distinguish it from the input chunkName, we call it subChunkName.
Then we tangle the chunk named subChunkName; the resulting code we call subCode.
Next, we indent the lines of subCode so that they share the same indentation. (The first line has the correct indentation, since it is written by you.) So, for a block chunk, we have to determine the indentation and add it to each line.
Finally, we replace the chunk reference with subCode. Since we pass an element (node) to the replaceWith method, we wrap the string subCode in a span element.
Clicking the "⟨chunk name⟩≡" triggers the function tangle(chunkName). We pop up a new window to show the tangled code, thus:
Before ending this section, we have to show how the functions getIndentation and indent are implemented. They turn out to be simple.
The getIndentation function just returns the indentation of the first line of the code. For example, in the following code,
void main() {
  <span class="chunkref">say hello</span>;
}
the chunk reference (the second line) has indentation 2. We have to figure it out. To do so, we first extract all the text before the chunk reference, which, in this example, is
void main() {
  (end here)
This is done by the following function. (We follow this answer. This implementation, however, cannot deal with HTML tags properly. Precisely, any HTML tags are excluded from the returned textBefore. But since it is called for tangling, where there are no HTML tags in the text before the node, this implementation is still valid. For the time being, I cannot figure out a better implementation.):
The indent function is also straightforward. We simply prepend indentation whitespace characters to each line.
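The two helpers can be sketched on plain strings as follows. This is a sketch under assumptions, not the code of literate.js: getIndentation here takes the text before the chunk reference as its argument and counts the leading spaces of its last line, and indent leaves the first line alone, on the assumption that the whitespace in front of the chunk reference is already present in the surrounding code.

```javascript
// Hypothetical sketches of the two helpers, working on strings.

// Indentation of the line on which the chunk reference sits:
// the leading spaces of the last line of the text before it.
function getIndentation(textBefore) {
  const lastLine = textBefore.slice(textBefore.lastIndexOf("\n") + 1);
  return lastLine.match(/^ */)[0].length;
}

// Prepend `indentation` spaces to every line after the first.
// Assumption: the first line is already indented by the text in
// front of the chunk reference.
function indent(code, indentation) {
  const pad = " ".repeat(indentation);
  return code
    .split("\n")
    .map((line, i) => (i === 0 ? line : pad + line))
    .join("\n");
}

const before = "void main() {\n  ";
console.log(getIndentation(before)); // prints: 2
console.log(before + indent('puts("a");\nputs("b");', getIndentation(before)) + "\n}");
```

In the usage above, both inserted lines end up indented by two spaces inside main.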
Congratulations! You are at the end of this journey. All that is left is to weave this HTML file when it is loaded in your browser. (Recall that regularization has been included as a part of weaveAll.)
So, we might write in literate.js:
onload = weaveAll;
But what if you want to run more JavaScript functions while loading the document? Who knows what you are going to do. So, it is left to you, the user, to determine what onload should be.
We have now completed the JavaScript code for literate programming.
Before finishing, let us add some style to make the document pretty. We use CSS to style the chunks and chunk references.
We add a thin border to inline code and pad it a little to make it stand out.
We put each code block into a block, of course. The block is wrapped by a dashed grey border. We also adjust the position of the code to move it toward the center of the border.
For a chunk reference, we remove the ugly underline of the hyperlink, which links to the first chunk with the referred name, and add a little padding to make it pretty. The original hyperlink color is too bright, so we use dark red instead. (Compared with dark blue, red is more distinguishable.)
For each chunk, we add a narrow margin to make it appear a little isolated from the content.
For a block chunk, we make the head bold, indicating a definition (we always make something bold when it is defined).
For an inline chunk, we simply make the head bold, indicating a definition, and add a very narrow padding to isolate it from the content.