2.1.10

Outline the way in which data is represented in the computer.

*Teaching Note:*

To include strings, integers, characters and colours. This should include considering the space taken by data, for instance the relation between the hexadecimal representation of colours and the number of colours available.

TOK, INT Does binary represent an example of a lingua franca?

S/E, INT Comparing the number of characters needed in the Latin alphabet with those in Arabic and Asian languages to understand the need for Unicode.


JSR Notes:

This is actually a HUGE topic to really understand, but ultimately it's only one little assessment statement, so the way I'll approach it is with three "tiers":

- the actual "canned" summary information you will need for an IB exam question
- the teaching notes themselves tweaked out a little further
- the full explanation for real understanding

In class, we'll have to work through these tiers in reverse order, but for study, if you can understand the top tier, and are able to write the same sort of answer in your own words, you're good to go; otherwise, keep on reviewing on down the notes.

__The Summary IB Exam Question & Answer__

**Q1A**: "Outline the way in which data is represented in the computer", to "include strings, integers, characters and colours".

**Q1B**: "(Discuss) the space taken by data, for instance the relation between the hexadecimal representation of colours and the number of colours available."

**Q1C**: "Compare the number of characters needed in the Latin alphabet with those in Arabic and Asian languages to understand the need for Unicode."

**Q1D**: "Does binary represent an example of a lingua franca?"

**Answer to Q1A** - All data is represented in the computer in binary form. In the RAM of a computer, this binary representation is in the form of open circuits (the 1s of the binary code) and closed circuits (the 0s of the binary code). On hard drives, the binary representation is in the form of little regions magnetized one way or the other.

So all data, be it numeric data or color data or whatever, ultimately has to be translatable into a binary form. Strings (i.e. words) are coded one character at a time, with each letter having a certain standard binary code. The original coding system for this was called ASCII, and in it capital A, for example, had the standard 8 binary digit representation 0100 0001. Integers are stored simply as the binary equivalent of their decimal value; so written in an integer context, the decimal number 6 would be (in an 8 bit form) 0000 0110. For colors, it depends on the color model, but using the common RGB model, one byte is used for each of the red (R), green (G) and blue (B) values. Red would then be 1111 1111 0000 0000 0000 0000. Since this model uses three bytes per pixel, it is referred to as 24-bit color.
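These three representations can be sketched quickly in Python (just an illustration, not something the IB requires):

```python
# Character: the ASCII/Unicode code for capital 'A' is 65, i.e. 0100 0001.
print(format(ord('A'), '08b'))   # 01000001

# Integer: decimal 6 stored as an 8-bit binary value.
print(format(6, '08b'))          # 00000110

# Colour: pure red in the 24-bit RGB model, one byte per channel.
r, g, b = 255, 0, 0
print(' '.join(format(v, '08b') for v in (r, g, b)))   # 11111111 00000000 00000000
```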

**Answer to Q1B**: The number of bits used per character or number or color determines how many characters/numbers/colors can be represented. With all data, each extra bit (binary digit) used per character/number/color doubles the number of characters/numbers/colors available. But since computers usually work with groups of 8 bits, we tend to think in terms of adding an extra 8 bits per item. An extra 8 bits doubles the initial number of colors etc. 8 times over, i.e. increases it by a factor of 2^8, or 256.

In the example of color, we usually work with hexadecimal values, in fact, and 8 binary digits are equivalent to two hex digits. So a color model can go from 8 bit color (for example, the color FF) to 16 bit color (for example, the color 33FF), to 24 bit color (for example, the color AA33FF) - that is to say, **with every two additional hex digits, 16^2 or 256 times more colors can be represented**. The number of colors available in those models is thereby 256^1, 256^2, and 256^3, which equals 256, 65,536, and over 16 million. So 24 bit color can represent over 16 million different colors.
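If you want to check that arithmetic yourself, here is a small Python sketch of the bit / hex digit / colour-count relationship:

```python
# Each bit doubles the number of patterns; each hex digit multiplies it by 16,
# so each *pair* of hex digits (one byte) multiplies it by 256.
for bits, hex_digits in [(8, 2), (16, 4), (24, 6)]:
    print(f'{bits}-bit colour: {2 ** bits} colours ({16 ** hex_digits} via hex digits)')
# 8-bit colour: 256 colours (256 via hex digits)
# 16-bit colour: 65536 colours (65536 via hex digits)
# 24-bit colour: 16777216 colours (16777216 via hex digits)
```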

**Answer to Q1C**: The number of characters needed for the Latin alphabet is far smaller than the number needed for Arabic and Asian languages. We only need the basic A-Z and a-z characters, the digits 0-9, and common punctuation symbols to be able to store Latin-alphabet text such as English. Arabic and Asian languages and other world languages would need thousands and thousands of characters to be represented; think about how written Chinese has tens of thousands of characters.

Remembering that computers are made to work primarily with groups of 8 bits (a byte), one group of 8 bits would be enough for Latin-character languages such as English. The number of characters which can be represented by 8 bits is 2^8, or 256. Extended ASCII, built from the original computer character set, uses exactly that, so it represents the most common 256 Latin characters and symbols. To go up into the thousands of available representations, it makes sense to stick with the 8 bit groupings model, and so with 16 bits, 2^16 (65,536) different characters can be represented. And this is exactly what the standard 16-bit UNICODE code set contains.
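A quick way to see why one byte is enough for Latin text but not for Chinese is to look at the code points themselves; a small Python sketch:

```python
# Latin letters have small code points that fit in a single byte (< 256);
# Chinese characters sit far beyond what any 8-bit character set can hold.
for ch in 'Az中語':
    verdict = 'fits in one byte' if ord(ch) < 256 else 'needs two bytes'
    print(ch, ord(ch), verdict)
```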

**Answer to Q1D**: A lingua franca is a common language used between people who do not share a native language. So, yes, in a way binary does represent a lingua franca, in so far as the computers most people around the world use are able to interpret the same character code sets, ASCII and UNICODE.

And if you were to draw a basic summary diagram, particularly for Q1A, the following would suffice nicely:

__Teaching Points Tweaked Out A Bit More__

*2.1.10 Teaching Note Point - Hex representation of colours*

*"To include strings, integers, characters and colours. This should include considering the space taken by data, for instance the relation between the hexadecimal representation of colours and the number of colours available."*

Definitely first make sure you have a general understanding of what is covered below in the big image at the bottom of these notes. But hopefully you got a good understanding of that from class.

To get this, first you need to get the straightforward relationship between binary and “hex” (hexadecimal) values representing color.

(Of course,) at the memory level, all colors (like all data) are represented by 0s and 1s, somehow.

But we wouldn’t want to be working in Photoshop, for example, with the color “0000 0000 1111 1111 0000 0000” (green in a 24 bit RGB color model).

Instead, a nice, short number (in hex) is much easier to work with; green in 24-bit RGB would be “00 FF 00”.

So the relationship between binary and hex is this:

------ one hex digit is the same as four binary digits ------

This is because in binary, four places can represent 16 total different values: 0000 (decimal 0) all the way up to 1111 (decimal 15),

and in hex it’s just one digit, which can represent exactly 16 total different values: 0 (decimal 0) all the way to F (decimal 15).

So every four binary digits (i.e., 6 groups of four in 24-bit RGB) are equivalent to 1 hex digit.

Examples (using a 24-bit color model):

| Color    | Decimal (R, G, B) | Binary                        | Hex      |
|----------|-------------------|-------------------------------|----------|
| green    | 0, 255, 0         | 0000 0000 1111 1111 0000 0000 | 00 FF 00 |
| red      | 255, 0, 0         | 1111 1111 0000 0000 0000 0000 | FF 00 00 |
| yellow   | 255, 255, 0       | 1111 1111 1111 1111 0000 0000 | FF FF 00 |
| sky blue | 0, 51, 204        | 0000 0000 0011 0011 1100 1100 | 00 33 CC |

So you can see that using the hex values is just easier. Green is 00FF00, Red is FF0000, Yellow is FFFF00, and Sky Blue is 0033CC - all fairly easy values to write, and even to remember.
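The decimal-to-hex conversion shown in those examples is easy to automate; a quick Python sketch (the helper name `rgb_to_hex` is just mine):

```python
def rgb_to_hex(r, g, b):
    # Pack three 0-255 channel values into the usual six-hex-digit colour code.
    return format(r, '02X') + format(g, '02X') + format(b, '02X')

print(rgb_to_hex(0, 255, 0))    # 00FF00 (green)
print(rgb_to_hex(255, 255, 0))  # FFFF00 (yellow)
print(rgb_to_hex(0, 51, 204))   # 0033CC (sky blue)
```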

Finally, then, to the assessment statement teaching note point. “The relationship between the hexadecimal representation of colours and the number of colours available” is as follows:

**For every hex digit added to the color model, there are 16 times more colors available**. It’s as simple as that...

*...Except that color models generally don’t go up by one hex digit; they go up by two*. And that makes sense, since the standard unit for working with data is the byte, and the byte is 8 bits, i.e. that which can be represented by 2 hex digits, not 1.

So the most common color models (8-bit color, 16-bit color, and 24-bit color) each go up in the number of colors possible by a factor of 256:

8-bit color (i.e. two hex digits) can represent 256 different colors.

16-bit color (i.e. four hex digits) can represent 65,536 different colors (which is 256 x 256).

24-bit color (i.e. six hex digits) can represent 16,777,216 different colors (which is 65,536 x 256).

*2.1.10 Teaching Note Point - Comparing the number of characters needed in the Latin alphabet with those in Arabic and Asian languages to understand the need for Unicode*

Basic ASCII (the American Standard Code for Information Interchange) uses a 7 bit set of characters to represent letters, and in various extended ASCII sets, a full 8-bit byte is used. That means that basic ASCII can represent 2^7 characters - i.e. 128 - and extended ASCII can represent 2^8, or 256, characters.

The Latin alphabet that we use in English has 26 lower case letters and 26 upper case letters. So those, along with the 10 numeric digit characters (0 - 9) and the 20 or so common symbols of the keyboard (!@#$%^&*( ) ;':"[]{},.< >/?\|), can all be represented within the 128 combinations of 0s and 1s of basic ASCII. Below is a picture of the first 128 basic ASCII characters. (The reason, by the way, that 7, not 8, bits were used is that in the original ASCII the 8th bit of the standard 8 bit byte was used for error checking during data transmission - get me to explain how the error checking worked in class sometime; there was a "check-bit".)

But for "Arabic and Asian languages", there's not enough room in a 128 set, or even a full 256 full ASCII extended set to fit all the characters. There are tens of thousands of Chinese characters, for example.

And there you have it, in terms of the teaching note "comparing the number of characters needed...": "Arabic and Asian", and indeed a whole bunch of other kinds of languages, demand more than 256 characters to be represented. So we now have computers which work with UNICODE, which can represent thousands of characters. UNICODE, in its basic 16-bit encoding, uses 16 bits per character. And in so doing, computers can still work with the basic 8-bit byte; it's just that working in UNICODE, two bytes (16 bits) are read for each character. And in using 16 bits, the calculation for the number of different combinations of 0s and 1s is 2^16, or 65,536.
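You can see the two-bytes-per-character idea directly in Python, using the big-endian UTF-16 encoding (a reasonable stand-in for the 16-bit scheme described above):

```python
# Every character in Unicode's basic 16-bit range occupies exactly two bytes.
for ch in 'A中':
    encoded = ch.encode('utf-16-be')
    print(ch, encoded.hex(), len(encoded), 'bytes')
# A 0041 2 bytes
# 中 4e2d 2 bytes
```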

Try to go online and find a UNICODE explorer kind of website or application which allows you to view the different categories and characters of UNICODE.

http://unicode.mayastudios.com/ - a good one for work with Java code

But the point of why so much more is possible is the same sort of idea as with going from 8-bit color to 16-bit color.

With every extra bit you add, you double the number of possible combinations of 0s and 1s, and therefore the number of possible things (colors, or in this case letters) you can represent.

The math of this is 2^numberOfBits.

2^8 = 256

2^16 = 65,536
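That 2^numberOfBits rule is one line of Python (the function name is mine):

```python
def combinations(number_of_bits):
    # Each extra bit doubles the number of distinct 0/1 patterns.
    return 2 ** number_of_bits

print(combinations(7))   # 128    (basic ASCII)
print(combinations(8))   # 256    (extended ASCII)
print(combinations(16))  # 65536  (16-bit UNICODE)
```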

So in extended ASCII, the 0s and 1s combinations go from

0000 0000

0000 0001

0000 0010

0000 0011

...

to

1111 1100

1111 1101

1111 1110

1111 1111

256 combinations in all.

And in UNICODE, the 0s and 1s combinations go from

0000 0000 0000 0000

0000 0000 0000 0001

0000 0000 0000 0010

0000 0000 0000 0011

...

to

1111 1111 1111 1100

1111 1111 1111 1101

1111 1111 1111 1110

1111 1111 1111 1111

65,536 combinations in all.

__The Actual Theory Behind All of This__ - *in order to be able to understand the answers above*.

(Jaime: Really good video from CMS years of the South Asian professor talking about the String pool on YouTube.... the point being that this is still one layer of abstraction at least higher than the way strings actually work, as arrays of chars, but each char is in the shared String pool.)

One last point that naturally quite often comes up is: "how does the computer know to interpret the 0s and 1s as characters, or numbers or colors etc.?"

The answer is simply that in the header of each file, the context of those 0s and 1s is stated - something along the lines of "what follows is text using the ANSI extended character set". You can see this easily with a TextEdit document if you check the box "Ignore rich text commands" when you open it. The simplest text document with "hello world" as the only words looks like this, and it is the first line which states that what follows is to be interpreted as text:

{\rtf1\ansi\ansicpg1252\cocoartf1265\cocoasubrtf210

{\fonttbl\f0\fswiss\fcharset0 Helvetica;}

{\colortbl;\red255\green255\blue255;}

\paperw11900\paperh16840\margl1440\margr1440\vieww18180\viewh15500\viewkind1

\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural

\f0\fs24 \cf0 hello world}
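To see how much that stated context matters, here is a tiny Python sketch: the very same byte read as text versus read as a number:

```python
# One byte, two interpretations: it is 'A' only if we *decide* it is text.
raw = b'\x41'
print(raw.decode('ascii'))         # A
print(int.from_bytes(raw, 'big'))  # 65
```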