
Locale-sensitive text segmentation in JavaScript

Earlier this year, JavaScript Intl.Segmenter gained support in all three major browser engines, meaning it has achieved Baseline status of “newly available”. Now, your applications can natively retrieve meaningful information from strings in a variety of locales in the latest browsers. This is great news for developers building locale-aware apps or UIs who have been writing custom handling or relying on third-party libraries for this purpose. Let’s explore what this opens up with some hands-on examples.

This post is also published on the MDN Blog if you’d like to check it out there!

# What is text segmentation used for?

Text segmentation is a way to divide text into units like characters, words, and sentences. Let’s say you have the following Japanese text and you’d like to perform a word count:

吾輩は猫である。名前はたぬき。

If you’re unfamiliar with Japanese, you might try built-in string methods in your first attempt. For English strings, a rough way to count the words is to split by space characters:

const str = "How many words. Are there?";
const words = str.split(" ");
console.log(words);
// ["How","many","words.","Are","there?"]
console.log(words.length);
// 5

The punctuation is mixed in with the word matches, so the count is slightly inaccurate, but it’s a reasonable approximation for English. The problem is that there are no spaces separating the characters in the Japanese string. Maybe your next idea would be to reach for str.length to count the characters. Using string length, you’d get 15, and if you remove the full stops (。) you might guess 13 words. In fact, the string contains 8 words once punctuation is excluded:

'吾輩' 'は' '猫' 'で' 'ある' '名前' 'は' 'たぬき'

If you rely on string methods for a word count, you’ll quickly run into trouble: you can’t reliably split on a specific character, and you can’t use spaces as separators like you can in English.
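
To make the problem concrete, here’s a quick sketch of what those naive approaches return for the Japanese string (feel free to paste it into a console):

const jaString = "吾輩は猫である。名前はたぬき。";

// There are no spaces to split on, so we get the whole string back as one item:
console.log(jaString.split(" ").length);
// 1

// Counting characters overshoots: 15 with punctuation, 13 without,
// while the actual word count is 8.
console.log(jaString.length);
// 15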

This is what locale-sensitive segmentation is built for. The format for creating a segmenter in the Intl namespace is as follows:

new Intl.Segmenter(locales, options);

Let’s try passing the string into a segmenter with the ja-JP locale for Japanese, explicitly setting word-level granularity:

const jaSegmenter = new Intl.Segmenter("ja-JP", { granularity: "word" });
const segments = jaSegmenter.segment("吾輩は猫である。名前はたぬき。");

console.log(Array.from(segments));

This example logs the following array to the console:

[
  {
    "segment": "吾輩",
    "index": 0,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": true
  },
  {
    "segment": "は",
    "index": 2,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": true
  },
  {
    "segment": "猫",
    "index": 3,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": true
  },
  // etc.
]

For each item in the array, we get the segment, its index in the original string, the full input string, and a Boolean isWordLike that distinguishes words from punctuation and whitespace. Now we have a robust, structured, and locale-aware way to work with the words. The segmenter’s granularity is word in this example, so we can filter the items on isWordLike to ignore punctuation:

const jaString = "吾輩は猫である。名前はたぬき。";

const jaSegmenter = new Intl.Segmenter("ja-JP", { granularity: "word" });
const segments = jaSegmenter.segment(jaString);

const words = Array.from(segments)
  .filter((item) => item.isWordLike)
  .map((item) => item.segment);

console.log(words);
// ["吾輩","は","猫","で","ある","名前","は","たぬき"]
console.log(words.length);
// 8

This looks much better. Using the segmenter, we have an array of Japanese words, ready for adding a locale-aware word count to our application. We’ll explore that use case a bit more with a small example in the following sections. Before that, let’s take a look at the rest of the options that you can pass into a segmenter.

# Intl.Segmenter options and configuration

We’ve seen above that you can split input by word according to the locale. If you don’t pass any options, the default behavior is to split by grapheme, which is the user-perceived character. This is useful if you’re doing a character count on strings in languages where a single perceived character is made up of multiple code units or combining characters, such as किंतु in Hindi:

const str = "किंतु";
console.log(str.length);
// 5 <- oops

const hindiSegmenter = new Intl.Segmenter("hi");
const hindiSegments = hindiSegmenter.segment(str);
const hiGraphemes = Array.from(hindiSegments).map((item) => item.segment);

console.log(hiGraphemes);
// ["किं","तु"]
console.log(hiGraphemes.length);
// 2 <- looks better
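
If you’re ever unsure which settings a segmenter ended up with after locale negotiation, you can inspect them with resolvedOptions(). A quick sketch with a Hindi segmenter like the one above:

const hindiSegmenter = new Intl.Segmenter("hi");

console.log(hindiSegmenter.resolvedOptions());
// { locale: "hi", granularity: "grapheme" } <- grapheme is the default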

The last option you might need is to segment text by sentence, which is also very convenient if you don’t want to keep track of language-specific full stops. Some languages use the period character (.), but this is not consistent across locales. Let’s take the following example:

const hindiText = "वाक्य एक। वाक्य दो।"; // <- what do I split on here?

const hiSegmenter = new Intl.Segmenter("hi", { granularity: "sentence" });
const hiSegments = hiSegmenter.segment(hindiText);
const hiSentences = Array.from(hiSegments).map((item) => item.segment);

console.log(hiSentences);
// ["वाक्य एक। ","वाक्य दो।"]
console.log(hiSentences.length);
// 2

In another Hindi example, the sentences are separated by a character that looks similar to a pipe (।). Now you don’t have to track Western periods or other locale-specific equivalents to split text into sentences.
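
The same granularity works for other locales too. As a quick sketch, here’s the English equivalent of the example above (assuming the en locale is available):

const enSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const enSegments = enSegmenter.segment("Sentence one. Sentence two.");
const enSentences = Array.from(enSegments).map((item) => item.segment);

console.log(enSentences);
// ["Sentence one. ", "Sentence two."]
console.log(enSentences.length);
// 2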

If you want to check support for a locale, you can use supportedLocalesOf. This returns an array containing those of the provided locales that are supported for segmentation without having to fall back to the runtime’s default locale. The following checks whether the segmenter can use Hindi, Japanese, and German for segmentation:

console.log(Intl.Segmenter.supportedLocalesOf(["hi", "ja-JP", "de"]));
// Array ["hi", "ja-JP", "de"] <- all are supported

# Japanese locale word count example

If your browser supports Intl.Segmenter, you can try out the following example. There’s some Japanese text from Wikipedia, and a <pre> element below it to show the output of our script.
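
If you’d like to guard against browsers that haven’t shipped it yet, a minimal feature-detection sketch could look like this:

if (typeof Intl !== "undefined" && "Segmenter" in Intl) {
  // Intl.Segmenter is available, so it's safe to run the demo below.
} else {
  // Fall back to a simpler heuristic or a third-party library.
}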

If you’ve followed all of the snippets so far, you shouldn’t see anything surprising here. The only difference is that we’re getting the selected text using window.getSelection before passing it into the segmenter, which we’ve wrapped in a function. After that, we listen for the mouseup event on the paragraph and write the output of the countSelection function to the <pre> element:

function countSelection() {
  const selection = window.getSelection();
  const selectedText = selection.toString();

  const jaSegmenter = new Intl.Segmenter("ja-JP", { granularity: "word" });
  const segments = jaSegmenter.segment(selectedText);

  const words = Array.from(segments)
    .filter((item) => item.isWordLike)
    .map((item) => item.segment);

  document.getElementById(
    "word-count"
  ).textContent = `Word count: ${words.length}\n - "${words}"`;
}

document
  .getElementById("text-content")
  .addEventListener("mouseup", countSelection);

To try it out, select some of the Japanese text below with your mouse. On mouseup, we log the word count along with the output of the segmenter with word-level granularity.

# That’s a wrap

Locale-sensitive text segmentation with JavaScript is now more ergonomic in the latest browsers. This feature is particularly useful for handling non-Latin languages, where the usual string manipulation methods are unreliable. If your app needs to handle multiple locales and you’re regularly working with text manipulation, Intl.Segmenter can help you segment text by word, grapheme (character), or sentence based on locale. This simplifies tasks such as word or character counts, sentence splitting, string comparisons, and more advanced text processing.

Feel free to get in touch on Bluesky if you have any feedback or if you want to say hi. See you next time!

Honorable mentions: Thanks to these people who gave me a shout-out for my blog post:

# See also

If you want to learn more about Intl.Segmenter, you can have a read through these other resources:

Published: