20 March 2013

I have often seen HTML developers write one of the following lines inside the head tag of an HTML document.

<meta charset='utf-8'>

        OR

<meta http-equiv='Content-Type' content='text/html; charset=utf-8'>

What does charset='utf-8' mean? What happens if this tag is omitted from the document?

The charset attribute specifies the character encoding of the document.

Let's go back to basics. Text is a collection of characters, and it is stored in the computer as bytes. Anything we save to a computer exists as bytes. Characters are represented by numbers, and those numbers are stored as sequences of bytes. Sometimes more than one byte is used to represent a single character.
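
To make this concrete, here is a small Python sketch. The sample characters and the helper name utf8_bytes are my own picks for illustration; the point is only that UTF-8 stores some characters in one byte and others in two, three or four.

```python
# Sample characters picked for illustration: ASCII, accented Latin, euro sign, emoji.
samples = ["a", "é", "€", "😀"]

def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 byte sequence used to store one character."""
    return ch.encode("utf-8")

for ch in samples:
    encoded = utf8_bytes(ch)
    # Print the code point (the character's number) and its bytes in hex.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```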

Character encoding governs how these bytes are converted back to characters.

What is character encoding?

Each character is represented by a number called a code point, and code points are stored in memory as one or more bytes. A character encoding is a mapping between those bytes and the characters they stand for. It is the key to interpreting the data stored in bytes. Many character sets and character encodings exist, and they define different ways of mapping bytes to characters. For example, the value 97 maps to the lowercase letter a in ASCII, while the same value may map to a different character in a different character encoding.
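
A quick way to see this is to decode the same byte with different encodings. The sketch below is in Python; the byte 0xE9 and the two encodings are just examples I chose, and decode_with is a made-up helper name.

```python
import unicodedata

# One single byte, chosen for illustration.
raw = bytes([0xE9])

def decode_with(data: bytes, encoding: str) -> str:
    """Interpret the given bytes using the named character encoding."""
    return data.decode(encoding)

for encoding in ("latin-1", "cp1251"):
    ch = decode_with(raw, encoding)
    # Same byte, different character depending on the encoding used.
    print(f"{encoding}: U+{ord(ch):04X} {unicodedata.name(ch)}")

# On its own, this byte is not even valid UTF-8.
try:
    decode_with(raw, "utf-8")
except UnicodeDecodeError as exc:
    print("utf-8:", exc.reason)
```

In Latin-1 the byte is an accented e, while in Windows-1251 it is a Cyrillic letter.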

Now, what happens when we omit the encoding declaration? When a developer leaves out the meta tag declaring the character encoding, the encoding of the content is left for the browser to work out.

Have you ever noticed garbled characters on a web page? See the pic below.

[Image: garbled-text]

So the absence of a character encoding declaration may lead to garbled text. This hurts readability, and search engines may also fail to make sense of the text, which can keep the content out of search results.
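
The garbling is easy to reproduce. The Python sketch below takes a sample word (my own choice, not from any real page), stores it as UTF-8 and then reads the bytes back as Latin-1, which is the kind of wrong guess a browser might make.

```python
original = "café"

# The page author saved the text as UTF-8.
stored = original.encode("utf-8")

# The reader wrongly assumes Latin-1 and decodes every byte as one character.
misread = stored.decode("latin-1")
print(misread)  # cafÃ©

# As long as no bytes were lost, reversing the wrong step recovers the text.
recovered = misread.encode("latin-1").decode("utf-8")
print(recovered == original)  # True
```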

One more important thing: fonts are nothing but visual representations of characters. A font is a collection of glyph definitions, which define the shapes of characters.

Once the bytes have been interpreted as characters via a character encoding, the application looks for a font that can display those characters. If the encoding is wrong, the wrong characters are decoded, so the glyphs shown will be wrong too.

If a font does not have a glyph for a particular character, the application may look in other fonts. If none of them has it, it may display a square box, a question mark or some other placeholder character.

Browser's Role

Browsers identify the character encoding of a document via an algorithm. In the absence of a character encoding declaration, the browser may guess the encoding incorrectly and render the page with garbled characters.
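
To give a feel for what such an algorithm does, here is a heavily simplified Python sketch. It is not what any real browser runs: the function name guess_encoding, the order of checks as written here, the 1024-byte scan window and the windows-1252 fallback are all assumptions made for illustration, and real browsers do considerably more.

```python
import codecs
import re
from typing import Optional

# Byte order marks to look for first (a short, simplified list).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Matches both <meta charset=...> and the http-equiv form in the raw bytes.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

def guess_encoding(body: bytes,
                   http_charset: Optional[str] = None,
                   fallback: str = "windows-1252") -> str:
    """Guess a document's encoding from its bytes and an optional HTTP charset."""
    # 1. A byte order mark at the very start of the document.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 2. The charset sent by the server in the Content-Type header, if any.
    if http_charset:
        return http_charset.lower()
    # 3. A meta declaration near the top of the document.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Nothing declared: fall back to a default, which may well be wrong.
    return fallback

print(guess_encoding(b"<meta charset='utf-8'><p>hello</p>"))
print(guess_encoding(b"<p>no declaration here</p>"))
```

The last step is where garbled pages come from: with nothing declared, the result is only a guess.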

Specifying the character encoding can also speed up the rendering of a web page, as the browser does not have to spend time working out the encoding.

Different ways of specifying character encodings

The character encoding can be specified with the meta tags shown at the start of this article, or it can be set by the server in the HTTP Content-Type header.

In PHP it can be done using the header function, like this:

header('Content-type: text/html; charset=utf-8');

In Python (for example in a CGI script):

# The header line must be followed by a blank line
print("Content-Type: text/html; charset=utf-8\n")

In JSP:

<%@ page contentType="text/html; charset=UTF-8" %>

In XML:

<?xml version="1.0" encoding="UTF-8"?>
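
For the header-based variants above, whatever reads the response has to pull the charset parameter back out of the Content-Type value. Below is a minimal Python sketch of that step using the standard library's email.message.Message class. The helper name charset_from_content_type is made up for this example.

```python
from email.message import Message
from typing import Optional

def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

When the header carries no charset parameter, the helper returns None and the reader has to fall back on other hints.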


Apache Server configuration

It can also be configured in the Apache server via a .htaccess file. Just add the following line to the file:

AddCharset UTF-8 .html

We also need to configure our text editors to save files in whichever encoding we want our data to be in. In Sublime Text it can be done as shown in the image below.


[Image: encoding pref]
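
The reason the editor setting matters is that the bytes on disk must match the encoding the page declares. A small Python sketch of the idea follows. The sample markup and the helper name save_and_read_raw are assumptions for illustration only.

```python
import os
import tempfile

# Sample page content that declares UTF-8.
html = "<meta charset='utf-8'>\n<p>café</p>\n"

def save_and_read_raw(text: str, encoding: str) -> bytes:
    """Write text to a temporary file in the given encoding and return the raw bytes."""
    fd, path = tempfile.mkstemp(suffix=".html")
    os.close(fd)
    try:
        # newline="" keeps line endings exactly as written on every platform.
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)

saved_as_utf8 = save_and_read_raw(html, "utf-8")
saved_as_latin1 = save_and_read_raw(html, "latin-1")

# Saved as UTF-8, the accented letter takes two bytes, matching the declaration.
print(b"caf\xc3\xa9" in saved_as_utf8)    # True
# Saved as Latin-1, it is a single byte, so the declaration no longer matches.
print(b"caf\xe9" in saved_as_latin1)      # True
```

The second file still claims to be UTF-8 in its meta tag, but its bytes are not valid UTF-8, which is exactly the mismatch that produces garbled text.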




