Is Reinforcement Learning a Slow Learner?

A prominent AI researcher recently gave a webinar with the ACM (Association for Computing Machinery) expressing dismay at the current performance of AI systems and giving his thoughts on the directions research should take. Yann LeCun, Chief AI Scientist at Facebook and Professor at NYU, gave a talk titled “The Power and Limits of Deep Learning”. Current AI systems have no ability to model the real world. For example, babies learn quite early that a truck that drives off a platform and hovers in the air is unexpected. Current AI systems do not have this ability – they might, after many, many training examples, be able to predict this type of behavior for very specific vehicles. “Sure, I know a red fire truck will fall down, but I have no idea what this Prius is going to do. Let’s watch…” The same thing happens in the simpler task of image recognition. A human can get the idea of an elephant from a few images, but our most sophisticated image recognition systems need many thousands of training examples to recognize a new object. And even then, they will have difficulty recognizing a different view (elephant rear end? elephant with its trunk hidden behind a wall?) if they have not specifically been trained on those types of views.

Similarly, Reinforcement Learning, a technique used to train AI systems to do things like play video games at (or above) human levels, is a slow learner. It takes 83 hours of real-time play for an RL system to reach a level of performance that a human player can achieve in 15 minutes.

Two basic approaches used in AI are (1) supervised learning and (2) unsupervised learning. In supervised learning, an algorithm is trained by showing it an image (or other training example) along with the desired response. “Hello computer. This image is a car. This next image is a bird.” This goes on for millions of images (the ImageNet dataset, used in a lot of benchmark tests, has over 15 million images, and a subset of over 1 million images is often used for training). The training is also repeated over that same set many times (many “epochs”). Unsupervised learning, on the other hand, tries to make sense of the data without any human “supervised” advice. An example is a clustering algorithm, which tries to group items into clusters, or groups, so that items within each group are similar to each other in some way.
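As a tiny illustration of the unsupervised side, here is a clustering sketch using scikit-learn (my own toy example, not anything from the talk): given a handful of unlabeled 2-D points, k-means groups the similar ones together without being told what any of them are.

from sklearn.cluster import KMeans

# Six unlabeled 2-D points forming two obvious blobs.
points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

# Ask k-means for 2 clusters; it assigns a group id to each point with no labels given.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(points)
print(labels)   # e.g. [0 0 0 1 1 1] (the particular ids are arbitrary)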

Prof. LeCun’s suggestion is that unsupervised learning, or what he calls self-supervised learning, might provide a better approach. He said, “Prediction is the essence of intelligence.” We will see whether computers will be able to generate predictions from just a few examples.

References

  1. Karen Hao, Technology Review, The AI technique that could imbue machines with the ability to reason
  2. Yann LeCun, The Power and Limits of Deep Learning

Bitwise Compression: What Is It and How Does it Work?

File compression programs such as Zip, Pkzip, and Winzip can make a file smaller so that it takes up less space on your hard disk. These programs can then reverse the process, or uncompress the file(s), and restore them exactly as they were before. This is called lossless compression. There are also lossy compression programs, in which the restored files are close to, but not exactly the same as, the original files. Lossy compression is not acceptable for data files (like databases, text, or spreadsheets), but it can be critically important for things like movies or photographic images. We are only going to be discussing lossless compression today.

The very best programs (in terms of compression ratio) use advanced techniques. For many more details and a history of progress, see Matt Mahoney’s Data Compression Explained. Matt Mahoney also runs the Hutter Prize, a contest (with cash prizes) for the best programs in a specific data compression task. For reference, the current record for compressing the first 100MB of Wikipedia content achieves a compression ratio of 0.15285.

There is a specific technique that all the top compression programs use that I will explain in this blog post, called Bitwise Compression. We will get to that in a bit, but first we will need to discuss some basic data compression and information theory.

Basic Compression

The basic idea of data compression is that frequently repeated things can be assigned codes shorter than their original representation. In other words, things that are more likely should be assigned shorter codes. Morse Code tries to be efficient by assigning the most common letter, “E”, the shortest code 1. Let’s look at a toy example and suppose that we have just 4 symbols in a file that we want to compress. Suppose that the 4 symbols are A, B, C, and D and have the following probabilities:

SYMBOL    PROBABILITY
A         0.75
B         0.15
C         0.05
D         0.05

The probabilities can be thought of either as the likelihood of those symbols appearing in future files that we will see, or as the fraction of the symbols in a particular file that we are trying to compress. Note that in endeavors like the Hutter Prize, there is one specific file to compress, and we can optimize for that file as much as we want; general compression programs do not have that luxury.

Given the above symbols and probabilities, we might ask what is the optimal way to encode them in bits. A particular solution to this problem was invented by David A. Huffman in his 1952 paper, and is called Huffman Coding. He asked the question: if I assign a bit sequence to each of those symbols, what assignment will produce the shortest output bit sequence on average? Think of this like Morse Code, where instead of each symbol being assigned a certain sequence of dots and dashes, we assign a certain sequence of 0s and 1s.

For example, suppose we just decided to assign A=00, B=01, C=10, and D=11. That is, 2 bits for each symbol. Of course, this encoding method can be reversed, because we just take each pair of bits, and translate back into ABCD as we go. What is the average number of bits used per symbol for this code? By inspection, the answer is 2.0 bits per symbol. Dr. Huffman says we can do better. Using his algorithm, we get the following optimal assignment 2

SYMBOL    CODE    LENGTH    PROBABILITY    CONTRIBUTION
A         1       1         0.75           0.75
B         01      2         0.15           0.30
C         001     3         0.05           0.15
D         000     3         0.05           0.15
TOTAL                                      1.35 bits/symbol
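As a quick check on those numbers, here is a minimal Huffman construction in Python (a sketch written for this post, not taken from any particular library). The exact 0/1 labels it produces may differ from the table, but the code lengths, and therefore the 1.35 bits/symbol average, come out the same.

import heapq

def huffman(probs):
    # Build a Huffman code for a {symbol: probability} dict using a min-heap.
    heap = [(p, i, [s]) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    codes = {s: '' for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)   # the two least probable groups...
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1:                      # ...get distinguished by one more bit
            codes[s] = '0' + codes[s]
        for s in syms2:
            codes[s] = '1' + codes[s]
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return codes

probs = {'A': 0.75, 'B': 0.15, 'C': 0.05, 'D': 0.05}
codes = huffman(probs)
avg = sum(probs[s] * len(codes[s]) for s in probs)
print(codes)   # code lengths 1, 2, 3, 3
print(avg)     # 1.35 bits/symbol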

This assignment achieves 1.35 bits/symbol. Huffman showed that this is the best we can do by assigning a bit sequence to each symbol. Are there other methods that could do better? Claude Shannon, who invented the entire field of Information Theory with his two-part blockbuster paper published in the Bell System Technical Journal in 1948, said yes. He said that the information content of a symbol stream could be measured by something called Entropy, in units of bits. The formula for computing Entropy is

\[
E=\sum_{i=0}^{N-1}-p_i \log_2 p_i
\]

where \(\log_2\) is the logarithm to the base 2, which, in case you have forgotten, can be calculated as
\begin{align}
\log_2 x = \frac{\log x}{\log 2}
\end{align}
where log is the logarithm to your favorite base (such as the more common 10 or e). The units of Entropy are bits/symbol. The number of bits needed for an entire file would then be the Entropy times the number of symbols in the file.

For our toy ABCD example, the entropy we get is
\[
E=1.15402 \textrm{ bits/symbol}
\]
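If you want to check that number yourself, it is a one-liner in Python (math.log2 is the base-2 logarithm from the formula above):

import math

ps = [0.75, 0.15, 0.05, 0.05]
E = sum(-p * math.log2(p) for p in ps)
print(E)   # approximately 1.15402 bits/symbol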

Now, Shannon did not know a practical way to construct an encoding method that could achieve this, but he showed that it was possible. One way to do better would be to encode pairs of symbols into Huffman Codes. That is, we make a table like the one above, but the symbols are now AA, AB, AC, AD, BA, etc. We get the probabilities of these combinations by multiplying the probabilities of the individual symbols together (putting aside conditional probabilities for the time being). This gets us closer to the 1.15402 bits/symbol we are looking for, but not all the way there. The reason is that we are constrained to using a whole number of bits for each symbol, so we end up wasting output bits on each symbol.
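To see how much encoding pairs helps, we can reuse the huffman() function and the probs dictionary from the Huffman sketch earlier (so this fragment is a continuation of that sketch rather than a standalone program):

# Build the 16 pair symbols AA, AB, ..., DD with independent probabilities.
pair_probs = {a + b: probs[a] * probs[b] for a in probs for b in probs}
pair_codes = huffman(pair_probs)

# Average bits per pair, divided by 2 to get bits per original symbol.
avg_per_symbol = sum(pair_probs[s] * len(pair_codes[s]) for s in pair_probs) / 2
print(avg_per_symbol)   # about 1.17 bits/symbol: better than 1.35, still above 1.15402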

Arithmetic Coding

Enter Arithmetic Coding, invented in the 1970s at IBM and published in the IBM Journal of Research and Development. It basically allows us to output fractional bits for each symbol, so that overall the number of bits used approaches the Entropy we calculated above (the Arithmetic Coding algorithm gets us to within 1 byte of the Entropy over the entire sequence, so let’s call that good enough). There are many descriptions of how AC works, starting with Arithmetic Coding. We are not going to go into the details here.

Context

In any real file, the individual bytes do not show up with independent probabilities. For example, in an English language text file, after the letter “q”, the letter “u” is very likely to show up, far more likely than its overall frequency of occurrence in the file. How do we take advantage of this information? It turns out that, using Arithmetic Coding (AC), it is very straightforward. At each step, AC just needs the probabilities of each symbol, and it can output the proper bit sequence to encode those symbols. The decoder just needs the same probabilities provided to it, and it can recover the original symbol sequence. One way that file compression algorithms can take advantage of this information is by building models (or tables) as a file is processed, which try to predict what the next symbol will be. This prediction is done by giving a probability for each symbol. In a technique known as Prediction by Partial Matching (PPM), we look at the previously seen symbols and then predict the next symbol. For example, if we see “sequ” we might give a high probability to the next symbol being “e”. This is done by keeping track of how many times various symbols are seen after each context, i.e. each run of previous symbols. Where it gets difficult is that you want the probabilities to be adaptive, so they can adjust to varying parts of the file. We cannot just compute these probabilities by context for the entire file and use those for encoding, because they would have to be sent along to the decoder; while this would result in extremely good compression ratios, the size of that data (a probability for every context) would be huge and would negate the space savings.
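To make the idea of adaptive, context-based prediction concrete, here is a toy sketch (invented for this post, not code from any real compressor) of an order-1 byte model: it keeps counts of which byte followed which, and turns those counts into the probabilities that an arithmetic coder would be handed.

from collections import defaultdict

class Order1Model:
    # Toy adaptive model: predict the next byte from counts of what followed the previous byte.
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def prob(self, context, symbol):
        # Probability of symbol given the previous byte, with add-one smoothing
        # so that bytes never seen in this context still get a nonzero probability.
        c = self.counts[context]
        return (c[symbol] + 1) / (sum(c.values()) + 256)

    def update(self, context, symbol):
        self.counts[context][symbol] += 1

model = Order1Model()
prev = None
for b in b"sequence of sequences":
    p = model.prob(prev, b)   # the probability the arithmetic coder would be given
    model.update(prev, b)     # the decoder makes the same update, so it stays in sync
    prev = b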

For a bunch of reasons, a variation of this method, Bitwise Compression, was developed. It works basically the same way, but each symbol being encoded is a single bit from the source file. That is, we only have 2 symbols, “0” and “1”. This makes everything simpler, including the AC algorithm, which now only needs the probability of a 1 to encode each bit. A good question to ask is: are we losing anything by doing this Bitwise Compression? At first glance, it seems like we might be taking up extra space because each bit has to be encoded, rather than entire bytes.

Bitwise Compression

Now, one thing is obvious. If we did not use any context, this would not work. Without any context, each bit is equally likely to be a 0 or a 1, so we would not gain any compression benefit. So what I want to show here is that compressing by bytes is equivalent to compressing by bits, using the part of the byte encoded so far as the context.
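The intuition behind that claim is just the chain rule of probability: the probability of a byte \(B\) with bits \(b_1 b_2 \dots b_8\) is the product of the conditional probabilities of its bits, so the ideal code lengths add up to the same total either way:
\begin{align}
P(B) &= \prod_{j=1}^{8} P(b_j \mid b_1 \dots b_{j-1}) \\
-\log_2 P(B) &= \sum_{j=1}^{8} -\log_2 P(b_j \mid b_1 \dots b_{j-1})
\end{align}
As long as the bitwise model's conditional probabilities are consistent with the byte frequencies, the right-hand side is exactly the number of bits the bitwise coder spends on that byte.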

Let’s start by comparing two methods of compressing a text file:
(1) Compress each byte based on its probability of occurrence in the file, with no context. This is easily estimated by simply computing the probability of each byte (based on the number of times it occurs in the file), then computing the entropy of those probabilities.
(2) Compress each byte, one bit at a time, where the context of each bit prediction is the part of the current byte seen so far. For the first bit of a byte, the context is empty: that bit is predicted based on the probability of the first bit of any byte being 1. The second bit is predicted with the previous context being ‘0’ or ‘1’, depending on what the first bit was. A simple way to handle the bit context in Python is to make a context string consisting of a ‘1’ followed by the actual bits seen so far. This distinguishes 3 zeros seen from 1 zero seen: 1000 (0x8) means 3 zeros seen, 10 (0x2) means a single 0 seen, and 1 (0x1) means nothing seen yet (the first bit of a byte). In the accompanying code, this is called the bitcontext; a short sketch of the idea follows below.
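Here is a minimal sketch of that bookkeeping (my own illustration, assuming most-significant-bit-first bit order; only the name bitcontext comes from the accompanying code):

def bits_with_context(byte):
    # Walk through a byte MSB-first, yielding (bitcontext, bit) pairs.
    # bitcontext is a '1' followed by the bits of this byte seen so far, so
    # '1' means nothing seen yet, '10' means a single 0 seen, '1000' means three 0s seen.
    bitcontext = '1'
    for i in range(7, -1, -1):
        bit = (byte >> i) & 1
        yield bitcontext, bit
        bitcontext += str(bit)

for ctx, bit in bits_with_context(ord('A')):   # 'A' = 0x41 = 01000001
    print(ctx, bit)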

You can follow along with the python code in the related gitlab account, which you can get by doing

git clone https://gitlab.com/dgrunberg/bitwise-compress.git

If you have Python 3.5 or later, you should be able to run this code, like this:

python study-bits.py --meg 1 --input enwik8

Here enwik8 is the 100MB of Wikipedia data you can get from the Hutter Prize site. This will perform the analysis on the first 1,000,000 bytes of enwik8. The estimated compression bits it reports are computed by calculating the entropy of the probabilities generated as inputs to the AC.

The results are

METHOD (no previous-byte context, bitcontext only)            COMPRESSION RATIO
Entropy by byte: 5.0589 bits/byte                             0.6324
Encode by bit – estimated bits out by calculating entropy     0.6324
Encode by bit – actual bits output                            0.6324

So you can see, for this simple example file, that encoding by bit with a single byte bitcontext gives the same result as encoding based on byte probabilities.
Now, for a preview of how context can help with encoding, let us consider using the single previous byte as context. The results are:

METHOD (single previous byte as context, plus bitcontext)     COMPRESSION RATIO
Encode by bit – estimated bits out by calculating entropy     0.4793
Encode by bit – actual bits output                            0.4794

So we went from a compression ratio of 0.6324 to 0.4794 by using a single previous byte as context. If you follow the history of compression programs on Matt Mahoney’s excellent site, you will see that to get from here to the current record-holder ratio of about 0.15, a number of additional things get added:

  1. An adaptive probability model is used, where the count of occurrences of each context is encoded into a single byte (called a bit history); this gets updated, through a table lookup, each time the context is encountered.
  2. Each count is mapped to a probability through a single table for the entire context; the table gets updated as each bit is revealed.
  3. Many different models are used, including various numbers of previous bytes and some unigram word models
  4. These models are combined linearly (called the mixer), with the weights adjusted adaptively; a toy sketch of this idea appears after this list.
  5. This mixed probability is then run through a SSE (Secondary Symbol Estimation), which is another adaptively updated table that corrects this probability. Sometimes 2 or more SSE stages are used.
  6. More than 1 set of weights are used, where the particular weight to be used is chosen based on some other context characteristics.
  7. A preprocessor and dictionary are used, replacing many of the words in the file with dictionary entries and escape codes.
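To make item 4 a little more concrete, here is a toy sketch of adaptive mixing (a deliberate simplification written for this post: real PAQ-style mixers work with stretched, logistic-domain probabilities and gradient updates rather than the weighted-majority style update used here):

def mix(probs, weights):
    # Combine several models' probabilities of a 1 bit into one linear mixture.
    return sum(w * q for w, q in zip(weights, probs)) / sum(weights)

def update_weights(probs, weights, bit, rate=0.5):
    # Reward the models that gave a high probability to the bit that actually occurred,
    # then renormalize; badly predicting models gradually lose influence.
    new = [w * ((q if bit else 1.0 - q) ** rate) for w, q in zip(weights, probs)]
    total = sum(new)
    return [w / total for w in new]

# Two toy models disagree about the next bit; the mixer blends them, then adapts.
probs = [0.9, 0.2]               # model A says "probably 1", model B says "probably 0"
weights = [0.5, 0.5]
p = mix(probs, weights)          # 0.55, an even blend to start with
weights = update_weights(probs, weights, bit=1)   # the bit was 1, so model A gains weight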

-THE END

Notes:

  1. E is just one single dot. The second most common letter, T, is a single dash.
  2. One minor point I glossed over: since the codes in general are different lengths, we have to be able to decode them without any specific indication of where each code ends. That is, we need to be able to parse out the individual bit sequences from the overall stream. The property of the chosen codes that ensures this will happen is called the prefix property. This means that no code can be a prefix (first part) of any other code. Huffman Codes have this property.

Review of The Perfect Bet

From 1993 to 2010, Joan Ginther won four jackpots in the Texas scratchcard lottery, totaling over $20 million.

Although Ginther has never commented on the reason for her multiple wins, some have speculated that her statistics PhD might have had something to do with it.

The Perfect Bet, by Adam Kucharski, 2016, Hachette Book Group.

In his recent book, Adam Kucharski, an assistant professor in mathematical modeling at the London School of Hygiene and Tropical Medicine, amuses and informs about the ways math and gambling have been connected throughout history.

In 1892, Karl Pearson decided to study the Le Monaco newspaper’s record of roulette outcomes to see if they were random.  As he studied the data, he found a number of anomalies.  Unfortunately for aspiring gamblers, it turns out that the data had been faked by reporters who were too lazy to record the actual numbers.

Kucharski outlines a number of cases where math did serve the gambler.

He talks about MIT’s Ed Thorp, who basically invented blackjack card counting with his book Beat the Dealer 1.

Did you know the Eudaemons built a roulette beating computer device 2? Hint: It has to do with physics, centripetal acceleration and friction.

He covers how Mohan Srivastava cracked an instant tic-tac-toe lottery game due to a design flaw. He details how teams, including one from MIT, beat the Massachusetts state lottery in the 2000s. See 3 for even more details on this one.

He talks about horse racing, including the use of modern techniques like machine learning, and how they have been applied to gambling. It’s not enough to just predict winners from results of previous races. It depends on which horses were running in those races and how they performed based on the groups they had been in previously, and so on.  It quickly becomes complicated.  But, as they say, if it were easy, everyone would be doing it.

From there, he moves on to soccer and sports betting.

He then takes a few chapters to discuss the ins and outs of poker playing – both how computers have assisted players in improving their game, and how poker-bots have made a name for themselves in online poker rooms. There has also been an annual poker-bot competition at an academic artificial intelligence conference, where a team from the University of Alberta has dominated. Who knew?

While those looking for a handout of secrets to beating casino games will be somewhat disappointed, this book does provide some colorful background for those wanting to learn more about the ways math has been used in winning at gambling. It’s going to be up to you to figure it out.

Books mentioned here are available at Amazon:

The Perfect Bet, by Adam Kucharski
The Eudaemonic Pie, by Thomas Bass
Beat the Dealer, by Ed Thorp
How Not to be Wrong, by Jordan Ellenberg [My Review]






Starting a Fire With a Hotel

The Vdara hotel in Las Vegas gained some notoriety in 2010 when guests complained about getting severe burns from the pool deck area (see article). Note the curved facade and the pool and deck immediately below.  Image By Cygnusloop99 (Own work) [CC BY-SA 3.0 or GFDL], via Wikimedia Commons

Here is the view from Google maps. We will use this to get some measurements for the hotel:

We are going to build a simple simulation of the solar radiation reflecting off the hotel’s mirrored facade. To do this, we will need a few things:

  1. The geometry of the hotel and pool deck area
  2. The orientation of the sun by time and date
  3. Calculations for reflection of the sun’s rays and their intersection with the pool deck.

Let’s take them in order.

Hotel Geometry

From a hotel floor plan, easily obtained on the web, we trace out the pool, pool deck, and facade surfaces.  Using the above satellite photo to get the scale properly, we then create 2 circular surfaces that closely match the facade geometry.  The radii of the 2 facades are 91.2m and 112.2m.  This is what the mathematical curves overlaid on the image look like. The axes are in meters and North is facing upward.  Notice that the curved surfaces have a very nice southern exposure.

In this image, I have extended the second wall (orange) all the way in order to see how it fits on the floor plan. It is slightly above the floor plan at its right-most edge because there is a slight separation between the 3 semi-circular facades.
You can see the original floor plan here (rotated) Vdara Floor Plan.
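If you want to reproduce this kind of overlay, here is a minimal matplotlib sketch of drawing circular facade arcs to compare against a floor plan. The two radii come from the text above; the center positions and angular extents below are placeholders for illustration, not the values actually used for the Vdara model.

import numpy as np
import matplotlib.pyplot as plt

def arc(center, radius, theta_start_deg, theta_end_deg, n=200):
    # Points along a circular arc; angles are measured counterclockwise from the +x axis.
    t = np.radians(np.linspace(theta_start_deg, theta_end_deg, n))
    return center[0] + radius * np.cos(t), center[1] + radius * np.sin(t)

# Radii from the post; centers and angles are made-up placeholders.
plt.plot(*arc((0.0, 100.0), 91.2, 200, 340), label='facade 1 (r = 91.2 m)')
plt.plot(*arc((60.0, 110.0), 112.2, 200, 340), label='facade 2 (r = 112.2 m)')
plt.axis('equal')
plt.xlabel('meters')
plt.ylabel('meters')
plt.legend()
plt.show()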

Sun Orientation

The Wikipedia Sun Position article has a good introduction to the basic calculations.  We will review what we need here.  Basically, there are two coordinate systems that we will need.  The first is the ecliptic coordinate system, in which the sun and stars are (roughly speaking) constant.  Think of it as the stars’ coordinate system. In it, the ecliptic latitude is the angle from the plane of the Earth’s orbit around the Sun.  The ecliptic longitude of the Sun is zero when the Earth is at the vernal equinox.  So the ecliptic latitude of the Sun is 0, and the longitude depends on the day of the year, at least to the precision that we need for our solar convergence calculations.

The other coordinate system that we will need is the horizontal coordinate system.  This coordinate system is referenced to a viewer on a spot on the Earth, and has

  • Solar altitude: the angle above the horizon.  Its complementary angle is the Solar zenith.
  • Solar azimuth: the angle from the North, increasing clockwise towards the East.  Thus, 90 degrees is East and 180 degrees is South.

Rather than give all the equations for the calculations, which you can find on the Wikipedia page, I will give a Python function that will calculate the Sun’s position. The hour is the time before/after Solar Noon, which is when the Sun is at its highest point in the sky (close to noon local time); this is simpler to use here than trying to calculate from local time (which requires knowledge of time zones, etc.). Since we will be computing the Sun’s angle as it moves through the sky, the exact time is not important to us. If you want to check when solar noon occurs, and compare the function below for accuracy, see NOAA Solar Calculations. For Las Vegas, the NOAA site matched this function to within a few tenths of a degree.

import datetime
import math

# Observer latitude in radians; Las Vegas is at roughly 36.1 degrees North.
local_latitude=36.1/180*math.pi

def angle(x):
  #wrap an angle into the 0 to 2*pi range
  return x % (2*math.pi)

def sun(hour, dt_noon=None):
  #compute sun at hour before/after solar noon on dt_noon, e.g. hour=-1 is 1 hour before solar noon
  if dt_noon is None:
    dt_noon=datetime.datetime(2017,8,24,hour=12,minute=0,second=0)
  #ecliptic coordinates
  n=(dt_noon - datetime.datetime(2000,1,1,hour=12)).days  #number of days since 1/1/2000
  L=(280.46+.9856474*n)/180*math.pi
  g=(357.528+.9856*n)/180*math.pi
  L=angle(L)  # make sure in 0-2*pi range
  g=angle(g)
  ecliptic_longitude=L+(1.915*math.sin(g)+0.020*math.sin(2*g))/180*math.pi
  ecliptic_latitude=0.0
  obliquity_of_ecliptic=23.4/180*math.pi

  hour_angle=(hour*15)/180*math.pi
  sun_declination=math.asin(math.sin(obliquity_of_ecliptic) * math.sin(ecliptic_longitude))

  solar_altitude=math.asin(math.sin(local_latitude)*math.sin(sun_declination)+
    math.cos(local_latitude)*math.cos(sun_declination)*math.cos(hour_angle))
  solar_zenith=math.pi/2-solar_altitude
  #solar_azimuth=math.pi-math.asin(-math.sin(hour_angle)*math.cos(sun_declination)/math.sin(solar_zenith))
  #solar_azimuth is measured by 0=north, 90 degrees=East, 180=South.
  #Note: problem when asin argument is around 1, because there are 2 solutions (like 85 and 95), so
  #we might get the wrong answer around sunrise/sunset.
  #This alternate version does not have that problem, but we have to be careful about the sign of the hour angle
  n1=(math.sin(sun_declination)*math.cos(local_latitude) -
    math.cos(hour_angle)*math.cos(sun_declination)*math.sin(local_latitude))
  d1=math.sin(solar_zenith)
  if (n1/d1) >= 1.0:
    solar_azimuth=0
  elif (n1/d1) < -1.0:
    solar_azimuth=math.pi
  elif hour_angle < 0:
    solar_azimuth=math.acos(n1/d1)            #morning: sun is in the eastern half of the sky
  else:
    solar_azimuth=2*math.pi-math.acos(n1/d1)  #afternoon: western half
  return solar_altitude, solar_azimuth
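For example (assuming the imports and the local_latitude constant defined at the top of the listing), a quick check a couple of hours after solar noon looks like this; the printed altitude and azimuth can be compared against the NOAA calculator for the same date:

alt, az = sun(2)   # 2 hours after solar noon on the default date
print(math.degrees(alt), math.degrees(az))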

Reflections and Intersections

The last piece we need is the calculation for reflections off surfaces.  If a light ray vector L impinges on a surface with unit normal N, it will be reflected at the same angle on the opposite side of the normal.  Looking at this figure:

Ray Reflection

We can see that
\begin{align}
P=-(N \cdot L)N
\end{align}

because \(-(N \cdot L)N\) is the projection of \(-L\) onto the normal N (remember that \(N \cdot L\) is negative for an incoming ray).  Then we have

\begin{align}
R-L&=2P \\
R&=L-2(N \cdot L)N
\end{align}

Now that we have the reflected ray, we need to see how it intersects with the ground (or pool deck). For a plane described by a point \(p_0\) and normal \(n\) containing all points p such that:
\begin{align}
(p-p_0) \cdot n = 0
\end{align}

and a line described by a point \(l_0\) and a vector in the direction of the line \(l\) where the line is
\begin{align}
p=dl+l_0
\end{align}
points p for all real numbers d. If we solve these 2 equations for d, we get:
\begin{align}
d=\frac{(p_0-l_0) \cdot n}{l \cdot n}
\end{align}
which we can substitute back into the previous equation to get the intersection point \(p\).
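Here is how those two results might look in code: a minimal numpy sketch written for this post (not the exact functions from the simulation), where reflect assumes L points toward the surface and N and n are unit vectors.

import numpy as np

def reflect(L, N):
    # Reflect an incoming ray direction L off a surface with unit normal N: R = L - 2 (N . L) N
    return L - 2.0 * np.dot(N, L) * N

def intersect_plane(l0, l, p0, n):
    # Intersection of the line p = d*l + l0 with the plane (p - p0) . n = 0.
    # Returns None if the line is (numerically) parallel to the plane.
    denom = np.dot(l, n)
    if abs(denom) < 1e-12:
        return None
    d = np.dot(p0 - l0, n) / denom
    return l0 + d * l

# A ray coming down at 45 degrees hits a vertical mirror facing -x and then the ground (z = 0).
L = np.array([1.0, 0.0, -1.0]) / np.sqrt(2.0)   # incoming ray direction
N = np.array([-1.0, 0.0, 0.0])                  # mirror normal
R = reflect(L, N)                               # now heading in -x, still downward
hit = intersect_plane(np.array([5.0, 0.0, 3.0]), R, np.zeros(3), np.array([0.0, 0.0, 1.0]))
print(hit)   # the reflected ray lands at approximately (2, 0, 0)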

Putting It All Together

Using Python and matplotlib, we can easily piece together a simulation of the intensity of the Sun’s rays.  To start, I tested the program with a simple flat facade in roughly the orientation of the first Vdara curved wall.

Flat hotel facade

The red line, at approximately -18 degrees, is the mirror’s position on the floor plan.  The wall is 176m high, which is the height of the Vdara hotel.

The solar intensity is about 1.0 for the entire parallelogram image, which is roughly what we would expect for a flat mirror.  It is not exactly 1 because the sun is not hitting the pool deck at exactly 90 degrees.  The 3 yellow lines represent the Sun’s azimuth.  Because this is solar 12:00 (solar noon), the azimuth is exactly 180 degrees.

 

Now, let us look at the results for the curved facade.

Solar noon at the hotel

Wow.  You can see that things are starting to heat up.  Believe it or not, the peak solar intensity value is 15 in this plot.  A few things to be careful about:

  • I have not modeled the different heights of the 3 curved walls, nor the third wall that extends a little bit beyond the second wall
  • I have not modeled rays that undergo 2 or more reflections after their first intersection with the wall.  That is, some rays can hit the wall at an oblique enough angle to then hit the wall again.  You can see this in the figures to come, because some rays accumulate behind the wall, which would not happen in reality.
  • I have assumed that the walls are 100% reflective with no loss.  The architect did place some kind of plastic covering to reduce the reflection; how much it helps has not been reported.  Also, with glass windows, some of the light passes through the window; how much is transmitted vs. reflected depends on the angle of incidence and the index of refraction of the glass.  A more accurate model would take this into account.

Now, placing several of these figures in an animation around solar noon, we can see how the hot spots move across the pool deck:

Solar convergence animation