Java 9 - Compact Strings and New String Methods

Compact String

Java Internal String Representation

Java was originally designed around UCS-2, which at the time was simply called Unicode: 16 bits per character, allowing for 65,536 characters. UTF-16 support arrived only in 2004 with Java 5, which added methods for extracting 32-bit code points from chars. Since then, a Java String has been encoded as UTF-16: each char is one 16-bit (2-byte) code unit, and characters outside the Basic Multilingual Plane take two code units (a surrogate pair). Until Java 9, the characters of a String were stored in a char array.
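To make the difference between chars (UTF-16 code units) and code points concrete, here is a minimal sketch. The class name CodeUnitsDemo and its two helper methods are made up for illustration; the String and Character calls are standard API.

```java
public class CodeUnitsDemo {

    // Number of 16-bit UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "A";                        // U+0041, one code unit
        String supplementary = "\uD83D\uDE00";   // U+1F600, stored as a surrogate pair

        System.out.println(codeUnits(bmp) + " / " + codePoints(bmp));                     // 1 / 1
        System.out.println(codeUnits(supplementary) + " / " + codePoints(supplementary)); // 2 / 1

        // The first char of the pair is a high surrogate, not a character on its own.
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0)));           // true
        // codePointAt (added in Java 5) reassembles the full 32-bit code point.
        System.out.println(Integer.toHexString(supplementary.codePointAt(0)));            // 1f600
    }
}
```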

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode. However, text stored as UTF-16 often takes up to twice as much space as the same text in ASCII or Latin-1. The first 256 code points of Unicode are identical to Latin-1 (see ASCII, ISO 8859, and Unicode), and in practice most strings contain only such characters, so 8 bits per character would be enough for them. For example, an ASCII character can be represented using a single byte.
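The size difference can be checked by encoding the same text both ways. This is only a sketch: EncodingSizeDemo and its helpers are hypothetical names, and UTF_16BE is used so that no byte order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizeDemo {

    // Bytes needed when every character is stored as one Latin-1 byte.
    static int latin1Size(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Bytes needed as UTF-16 (big-endian variant, which writes no byte order mark).
    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String text = "Java";
        System.out.println(latin1Size(text)); // 4
        System.out.println(utf16Size(text));  // 8
    }
}
```

Note that getBytes with ISO_8859_1 silently replaces characters outside Latin-1, so the comparison is only meaningful for text that fits in that range.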

UseCompressedStrings

The -XX:+UseCompressedStrings option was introduced in the Java 6 Update 21 Performance Release to use a byte[] for Strings that can be represented as pure ASCII.

The feature was experimental, not open source, and led to gains in only a small number of cases, because it had to convert the US-ASCII byte[] array to a UTF-16 char[] for most of its operations. Given the lack of real gains in production-like environments and the high maintenance cost, it was dropped from Java 7.

Compact Strings - Java 9

From Java 9 onward, the JVM can optimize strings using a feature called compact strings. Instead of a char[] array, a String is now backed by a byte[] array. Depending on the characters it contains, the String is encoded as either Latin-1, at one byte per character, or UTF-16, at two bytes per char. If a string contains only ISO-8859-1/Latin-1 characters, it uses just 1 byte per character internally.
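The choice between the two layouts can be sketched as follows. This is not the JDK's code: Latin1Check, fitsInLatin1 and pack are hypothetical names, and the big-endian byte order used for the two-byte layout is an assumption made only to keep the example simple.

```java
public class Latin1Check {

    // Hypothetical helper: true if every char fits in a single Latin-1 byte.
    static boolean fitsInLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical helper: pack the chars into the smaller layout when possible.
    static byte[] pack(String s) {
        int n = s.length();
        if (fitsInLatin1(s)) {
            byte[] out = new byte[n];             // one byte per char
            for (int i = 0; i < n; i++) {
                out[i] = (byte) s.charAt(i);
            }
            return out;
        }
        byte[] out = new byte[n * 2];             // two bytes per char
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            out[2 * i] = (byte) (c >> 8);         // high byte first (illustrative choice)
            out[2 * i + 1] = (byte) c;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pack("Java").length);        // 4
        System.out.println(pack("\u03A9mega").length);  // 10, one Greek letter forces two bytes per char
    }
}
```

A single character above U+00FF is enough to push the whole string into the two-byte layout, which is why the saving depends on the kind of text an application handles.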

Whether a String can be stored in compact form is determined when the string is created. Because a String is immutable once created, this is safe to do. The feature is enabled by default and can be switched off with the -XX:-CompactStrings option. Note that switching it off does not revert to a char[]-backed implementation; it just stores all Strings as UTF-16.

Most of the String operations now check the coder and dispatch to the specific implementation:

public String toLowerCase(Locale locale) {
    return isLatin1() ? StringLatin1.toLowerCase(this, value, locale)
                      : StringUTF16.toLowerCase(this, value, locale);
}

public String toUpperCase(Locale locale) {
    return isLatin1() ? StringLatin1.toUpperCase(this, value, locale)
                      : StringUTF16.toUpperCase(this, value, locale);
}

public String trim() {
    String ret = isLatin1() ? StringLatin1.trim(value)
                            : StringUTF16.trim(value);
    return ret == null ? this : ret;
}
                    

StringLatin1
package java.lang;

...

final class StringLatin1 {

    ...
	
    public static String toLowerCase(String str, byte[] value, Locale locale) {
        ...
        return new String(result, LATIN1);
    }

    ...

    public static String toUpperCase(String str, byte[] value, Locale locale) {
        ...
        return new String(result, LATIN1);
    }

    ...
	
    public static String trim(byte[] value) {
        ...
        return ((st > 0) || (len < value.length)) ?
            newString(value, st, len - st) : null;
    }
	
    ...
}
                    

StringUTF16
package java.lang;

...

final class StringUTF16 {

    ...
	
    public static String toLowerCase(String str, byte[] value, Locale locale) {
        ...
        if (bits > 0xFF) {
            return new String(result, UTF16);
        } else {
            return newString(result, 0, len);
        }
    }

    ...

    public static String toUpperCase(String str, byte[] value, Locale locale) {
        ...
        if (bits > 0xFF) {
            return new String(result, UTF16);
        } else {
            return newString(result, 0, len);
        }
    }

    ...

    public static String trim(byte[] value) {
        ...
        return ((st > 0) || (len < length )) ?
            new String(Arrays.copyOfRange(value, st << 1, len << 1), UTF16) :
            null;
    }

    ...
}
                    

where isLatin1() is defined as:

private boolean isLatin1() {
    return COMPACT_STRINGS && coder == LATIN1;
}
                    

The actual value of the COMPACT_STRINGS field is injected by the JVM.

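The coder field is a byte that holds one of two values, LATIN1 (0) or UTF16 (1), assigned when the String is constructed:

this.coder = LATIN1;
this.coder = UTF16;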
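The dispatch on coder can be illustrated with a toy class. TinyString is a made-up name and this is only a sketch under simplifying assumptions (big-endian byte pairs, no bounds checks, no COMPACT_STRINGS switch); it mirrors the idea of the excerpts above rather than the JDK implementation.

```java
public final class TinyString {

    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    private final byte[] value;   // encoded characters
    private final byte coder;     // LATIN1 or UTF16

    TinyString(byte[] value, byte coder) {
        this.value = value;
        this.coder = coder;
    }

    boolean isLatin1() {
        return coder == LATIN1;
    }

    // One byte per char for LATIN1, two for UTF16, so shifting by coder gives the char count.
    int length() {
        return value.length >> coder;
    }

    // Each operation checks the coder and reads the array accordingly.
    char charAt(int index) {
        if (isLatin1()) {
            return (char) (value[index] & 0xFF);
        }
        int hi = value[2 * index] & 0xFF;       // assumed big-endian layout
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }

    public static void main(String[] args) {
        TinyString latin = new TinyString(new byte[] {'J', 'a'}, LATIN1);
        TinyString greek = new TinyString(new byte[] {0x03, (byte) 0xB1}, UTF16); // U+03B1

        System.out.println(latin.length() + " " + latin.charAt(1));        // 2 a
        System.out.println(greek.length() + " " + (int) greek.charAt(0));  // 1 945
    }
}
```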

Java 9 String Methods

Java 9 adds two methods to the String class: chars() and codePoints(). Both return an IntStream. The same methods have been available since Java 8 as default methods of CharSequence; Java 9 adds String-specific overrides.

chars()

  • IntStream chars(): Returns a stream of int zero-extending the char values from this sequence.
import java.util.stream.IntStream;

public class StringChars {
    
    public static void main(String[] args) {
        String str = "Programming With Java";
        IntStream stream = str.chars();
        stream.forEach(x -> System.out.printf("-%s", (char)x));
    }
}
                    

-P-r-o-g-r-a-m-m-i-n-g- -W-i-t-h- -J-a-v-a
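Because chars() returns an IntStream, the usual stream operations apply to it. As a small sketch (CharsDemo and countUpperCase are hypothetical names), the following counts the uppercase letters of the same string:

```java
public class CharsDemo {

    // Counts the chars that are uppercase letters.
    static long countUpperCase(String s) {
        return s.chars()
                .filter(Character::isUpperCase) // resolves to Character.isUpperCase(int)
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countUpperCase("Programming With Java")); // 3
    }
}
```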

codePoints()

  • IntStream codePoints(): Returns a stream of code point values from this sequence.
import java.util.stream.IntStream;

public class StringCodePoints {
    
    public static void main(String[] args) {
        String str = "Greek Alphabets α-Ω";
        IntStream stream = str.codePoints();
        // Print each code point as text. appendCodePoint(int) also handles
        // supplementary characters, which need two chars.
        stream.forEach(x -> System.out.print(
                new StringBuilder().appendCodePoint(x).toString()));
    }
}
                    

Greek Alphabets α-Ω
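The difference from chars() only shows up with characters outside the Basic Multilingual Plane. As a sketch of one place where it matters, the hypothetical helper reverseByCodePoints below reverses a string one code point at a time, so a surrogate pair stays in the right order:

```java
public class CodePointsDemo {

    // Reverses the string one code point at a time, keeping surrogate pairs intact.
    static String reverseByCodePoints(String s) {
        int[] cps = s.codePoints().toArray();
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = cps.length - 1; i >= 0; i--) {
            sb.appendCodePoint(cps[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00";                     // 'a' followed by U+1F600
        System.out.println(s.chars().count());          // 3 chars
        System.out.println(s.codePoints().count());     // 2 code points

        String reversed = reverseByCodePoints(s);
        System.out.println(reversed.equals("\uD83D\uDE00a")); // true
    }
}
```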