Word Cloud with SvelteKit and D3

[Figure: word cloud of this post]

I mostly do Python and server-related work, so why suddenly a word cloud, in Svelte and JavaScript?

I made a mistake in my last post. I didn't know Blogger would use the first image inserted into a post as its thumbnail picture. This thumbnail appears both in my posts view and when I share the post elsewhere. The first picture in my last post was a screenshot of the Python extension in Visual Studio Code, from a discussion of how many children mistook it for the Python interpreter. So when I shared my post, it looked like this.

This is obviously less than ideal, and it gives the impression that I made the same mistake as the kids. But I had no idea what my thumbnails should look like yet. Word clouds are visually pleasing, even if not suitable for statistics. I think they will make good thumbnail pictures for my blog posts for a while, until I come up with a better idea. So I made a small web tool that makes a word cloud from a Markdown document, which is what I use to write posts. I could have used Jason Davies' Word Cloud Generator, but I would then miss the opportunity to practise.

Setting up a SvelteKit library project

I set up a SvelteKit library project following the wizard in the SvelteKit CLI. When asked what to create, I selected "Skeleton library project". The difference between a library project and a regular Svelte app project is that the former only packages the code in the lib folder when building a production bundle (i.e., when running npm run build). Although in both cases one can use page components and run npm run dev to preview the pages, the pages in a library project are only for preview or showcase. They are not in the final build.

This means I can experiment in page components, keeping the code as coupled as I like, and then refactor it into small components in the lib folder. This is especially useful when I'm not that familiar with the framework or libraries used, in this case D3 and d3-cloud. Although I have helped my clients with their web UIs before, I don't consider it my thing. Separating experimentation from production code is also somewhat more reassuring than refactoring "in place".

That said, separating responsibilities from the start is always a good idea, especially when the decision is obvious.

Getting word counts

Converting Markdown to text

The idea is to convert the Markdown text into HTML first, then get the text content from the resulting page, then count the words in that text, and finally draw the word counts as a word cloud.

SvelteKit allows developing both browser and server code in one project, although the files need to be kept separate, and the server has to run Node.js. JavaScript in the lib/server folder and in any +page.server.js and +layout.server.js files runs on the server only, while JavaScript in components and the rest of the lib folder runs on both (so-called "universal" code). On the server there's no document or window, while in the browser there's no file system access. Those are just some of the differences.
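
As a small illustration of the split (this snippet is mine, not from the project), universal code can branch on SvelteKit's browser flag:

import { browser } from '$app/environment';

if (browser) {
    // Safe to touch the DOM here; document does not exist on the server.
    console.log(document.title);
} else {
    // Server-only branch; Node.js APIs such as fs are available instead.
    console.log('running in Node.js');
}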

I put the Markdown parsing code on the server only because the library I used ("micromark") used the Node.js file system module to parse a Markdown file, so it had to run on the server. I didn't check whether it could run in the browser without that functionality, though. The DOM API will not run on the server, since there's no such thing as an HTML document there. For that reason I used JSDOM, which implements Web APIs on the server, to get the text content of the generated page. This then becomes the text to count words from.
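
A minimal sketch of that server-side pipeline, assuming micromark's string API and a hypothetical file path:

import { readFile } from 'node:fs/promises';
import { micromark } from 'micromark';
import { JSDOM } from 'jsdom';

const markdownToText = async (path) => {
    const markdown = await readFile(path, 'utf-8'); // server-only: file system access
    const html = micromark(markdown);               // Markdown to HTML
    const dom = new JSDOM(html);                    // Web APIs on the server
    return dom.window.document.body.textContent;    // the text to count words from
};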

Counting words

I've now got the document I want to count words in. It is a document, though, not a list of words yet. Getting a list of all the words in a document is a challenging task. One might think of a solution that splits the document by non-word characters.

const splitWords = (doc) => doc.split(/\W+/);

There are several problems with this. First, it won't work with languages like Japanese.

// "Your face is like this potato."
splitWords("あなたの顔は、このジャガイモに似ている。")
// Returns: ['', '']

Second, there are well-known expressions that end with non-word characters, the most famous being "C++".

splitWords("Are you still not learning C++ yet?");
// Returns: ['Are', 'you', 'still', 'not', 'learning', 'C', 'yet', '']

Third, abbreviations like "U.S." will be split into "U" and "S".

// Surprisingly, "U.S." has dots in between while "UK" does not.
splitWords("The U.S. has more bears than the UK.");
// Returns: ['The', 'U', 'S', 'has', 'more', 'bears', 'than', 'the', 'UK', '']

For these reasons, I used a tokeniser from an NLP library ("NLP.js"), since splitting a document into words ("tokens") is the first step in almost all NLP tasks. The library allows separate installation of its components, which is useful in my case since I'm not doing any NLP task other than tokenisation. The tokeniser installed only three packages altogether, while the whole NLP.js would install over 60. The library is not perfect. It does not support Japanese, and it makes the same mistake with the "U.S." example. But it does tokenise "C++" correctly, which is more important in my use case. For the "U.S." case, my solution was to remove the dots before tokenisation.
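
Something like the following, assuming NLP.js exposes an English tokeniser class named TokenizerEn (the exact package and import path may differ between versions):

import { TokenizerEn } from '@nlpjs/lang-en-min';

const tokenizer = new TokenizerEn();

const tokenize = (text) => {
    // Collapse dotted single letters before tokenising,
    // so "U.S." becomes "US" and survives as one token.
    const undotted = text.replace(/\b([A-Za-z])\./g, '$1');
    return tokenizer.tokenize(undotted);
};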

Removing the dots introduced another problem. Two common things to do when counting words in a document are to lowercase words and to remove stopwords. Lowercasing makes sure that "Bear" at the beginning of a sentence is counted together with "bear" in the middle of one. But it also means "US" from the last step becomes "us", which is a stopword in some dictionaries and may be removed. My American colleagues would become underrepresented because of this. For this reason, I only lowercased words whose first letter is their only capital, so "Bear" is lowercased but "UK" is not. This does cause problems for names like "Taylor Swift", where "swift" is an ordinary English word, but I consider that less important for now.
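
A minimal sketch of that rule:

// Lowercase a word only if its first letter is its only capital.
const normalise = (word) =>
    /^[A-Z][a-z]*$/.test(word) ? word.toLowerCase() : word;

normalise("Bear"); // "bear"
normalise("UK");   // "UK"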

(Afterthought: there is a better way of dealing with this, but I will explore that when I refactor the project.)

All the above was to make sure I have the right set of words. Counting them in JavaScript is straightforward. It is at this point that I started to miss Python's Counter. The Counter class counts objects using their hash codes, potentially in a single pass over the iterable. In JavaScript, most programmers do the same using something like this.

for (const word of uniqueWords) {
    const count = words.filter(w => w === word).length;
    // ...
}

The filter expression is a simple one-liner, but it traverses the word list once for each unique word, which is much less efficient than putting words as keys and their counts as values in a dictionary-like structure and updating it in a single traversal.
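
A single-pass alternative using a Map, the closest JavaScript analogue of Counter:

const countWords = (words) => {
    const counts = new Map();
    for (const word of words) {
        // One traversal: increment the count stored under each word.
        counts.set(word, (counts.get(word) ?? 0) + 1);
    }
    return counts;
};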

Understanding D3 and d3-cloud

It would be nice if we could just feed the word counts into a word cloud function and have it draw the cloud for us. In fact, I believe the wordcloud Python library does this. But that hides the steps needed to make a word cloud. In particular, the separation of source data, cloud definition and cloud presentation makes D3 and d3-cloud very flexible. I might even be able to use d3-cloud outside of a D3 context.

There are usually three steps in drawing a word cloud: selecting a canvas, running the placement algorithm and drawing the cloud.

Selecting the canvas

The first step is to select a place for D3 to draw visualisations on. The selection is used to reset the canvas and to be referenced in event handlers, in particular the handler for the "end" event, which fires when the layout calculation has finished. Usually the "canvas" element is defined completely in a d3.select(...) chain. But in Svelte it's possible to bind an element to a variable. Since I know the preview canvas will always be there, I can write it in my Svelte component's HTML...

<script lang="ts">
    let svg: SVGElement;
</script>

<svg bind:this={svg}></svg>

... and later select it like this.

const canvas = d3.select(svg);

This is particularly useful if, for example, the <svg> element also has a bunch of Tailwind classes in its class attribute and some viewBox attribute settings. We won't need to set those separately in the d3.select(...) chain any more.

Running the placement algorithm

The next thing to do is to feed the word counts into the placement algorithm. The algorithm takes a list of d3.cloud.Word objects, which include properties like text, font, font size, rotation and padding. It also requires the size of the canvas (but not the canvas element itself). It calculates the coordinates of each object and calls its "end" event handler with all the objects and their calculated positions. Note that the algorithm does not actually draw the objects, which is important because it's then possible to draw them on different canvases.
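
A minimal sketch of the layout step; the size formula and the drawCloud callback are assumptions for illustration:

import cloud from 'd3-cloud';

const layout = cloud()
    .size([800, 600]) // the canvas size, not the canvas element
    .words(Array.from(counts, ([text, count]) => ({ text, size: 10 + count * 5 })))
    .padding(2)
    .rotate(() => (Math.random() < 0.5 ? 0 : 90))
    .fontSize(d => d.size)
    .on('end', placedWords => {
        // Each word now carries x, y, rotate and size; draw them anywhere.
        drawCloud(placedWords);
    });

layout.start();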

Drawing the cloud

I actually need to draw this word cloud onto two canvases. One is the preview canvas using the SVG element, and the other is an off-screen canvas rendered to a PNG file for me to download and put in my blog post. The fact that the algorithm is not coupled to the selected canvas element makes this possible. Drawing on an SVG element and on a canvas element is also surprisingly similar.

When drawing on an SVG element, one sets the appropriate attributes on <text> elements inside the SVG element. Font and font size have their own attributes. Colour is set as a fill. Position and rotation are set in a transform attribute. When drawing on a canvas element, one first transforms the coordinate system with the position and rotation, similar to the transform attribute above. The text is then drawn at the new origin with the appropriate font and font size. The canvas' drawing context needs to be saved before drawing and restored afterwards.
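
A sketch of both strategies, given the placed words from the "end" handler above; the colour and the 800x600 centring are assumptions:

const drawSvg = (svg, words) => {
    d3.select(svg)
        .append('g')
        .attr('transform', 'translate(400,300)') // centre of an 800x600 canvas
        .selectAll('text')
        .data(words)
        .join('text')
        .attr('font-size', d => `${d.size}px`)
        .attr('fill', 'steelblue')
        .attr('transform', d => `translate(${d.x},${d.y}) rotate(${d.rotate})`)
        .text(d => d.text);
};

const drawCanvas = (ctx, words) => {
    ctx.translate(400, 300); // same centring as the SVG version
    for (const word of words) {
        ctx.save(); // save the context before transforming it...
        ctx.translate(word.x, word.y);
        ctx.rotate(word.rotate * Math.PI / 180);
        ctx.font = `${word.size}px sans-serif`;
        ctx.fillStyle = 'steelblue';
        ctx.fillText(word.text, 0, 0);
        ctx.restore(); // ...and restore it after drawing
    }
};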

Settings

Now that I've got a word cloud, I need a way to customise it on the page. I have implemented customisation options that are separate from a D3 context, so that it's possible to use them with other drawing strategies. They are not yet added to the page, since the current setup is enough to generate a meaningful word cloud picture for my blog posts. I will consider adding them to the page, but there is one complexity.

The customisation settings are separated into two groups: one around font, font size and rotation, the other around colours. They are separate because a change in the former requires a rerun of the placement algorithm, while a change in the latter does not. Since changing the colour does not require re-placement, it should not trigger one. So a blanket solution of re-running the whole thing whenever any setting changes is not appropriate. Since this code will need to be refactored anyway, I defer this consideration until then.
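
One hypothetical way to express that split with Svelte's reactive statements, where runPlacement and draw are placeholders:

<script>
    export let words;
    export let layoutSettings; // font, font size, rotation
    export let colourSettings; // colours only

    // Re-runs the placement algorithm only when the words or layout settings change...
    $: placedWords = runPlacement(words, layoutSettings);

    // ...while a colour change only triggers a redraw.
    $: draw(placedWords, colourSettings);
</script>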

