Visualizing Movie Data | VMD_07 | Movies by words

All films that we have seen in 2015 (and earlier) make use of a film script. Which words are used most often in such a script? To find out, I need to look for movie scripts. I found them in the ‘IMSDb’ (Internet Movie Script Database). The first script that I found was the script of ‘Tamara Drewe’. Certainly not the best movie we’ve seen but that doesn’t matter much in this case. I loaded the script into TextEdit. Then I imported the complete script of ‘Tamara Drewe’ into Processing. In Processing I used Ben Fry’s Treemap Library. In this version I have everything inverted. And I have added a sixth layout algorithm: PivotBySize. The words ‘the, a, is, you, to, tamara, nicholas, beth, and, in, i’ appear most often in this script.
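
A minimal sketch of how such a word count could work, written in plain Python rather than Processing. The function name and the sample sentence are made up for illustration. This is not the code behind the treemaps, only the counting step that comes before them.

```python
import re
from collections import Counter


def most_common_words(script_text, top_n=10):
    """Return the top_n most frequent lowercase words with their counts."""
    # Lowercase everything so 'The' and 'the' are counted as the same word.
    words = re.findall(r"[a-z']+", script_text.lower())
    return Counter(words).most_common(top_n)


# Hypothetical sample text, not a line from a real script.
sample = "The farm. Tamara looks at the farm and the house."
print(most_common_words(sample, 2))
```

The resulting (word, count) pairs are the kind of input a treemap needs: one rectangle per word, sized by its count.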

Here I used the script of ‘The Master’. The starting point is now to use a gradient in the background. I do that by placing horizontal lines. This is a less advanced version. The most frequent words in this script are ‘the, freddie, to, you, and, a, master’.

In this version, I’ve refined the gradient. I also changed the font to Avenir Light. And I turned off the frames. The most common words in this script of ‘Interstellar’ are ‘the, cooper, a, to, or, and, it, is, in’.

This variation is only a variation in the code. I have programmed a few things slightly more functionally. I used the script of the film ‘Blow’. The words ‘the, and, a, you, george, to, is, i, of’ are most frequently used in the script.

I started again with studying the Treemap library. I have added two new classes: BinaryTreeLayout and PivotByMiddle. And I used a less harsh color scheme. I used the script from the film ‘Foxcatcher’. The most frequently used words are ‘the, to, mark, du pont, a, of, dave’.

A very colorful variation. The largest value that w (width) can take in this version is 190 (pixels). Higher numbers have no influence because w can never exceed that. Suppose I want to use six colors. Then 190 divided by 6 colors (rounded up) is 32. So I can pick 6 hues in HSB color mode, one every 32 steps. As a script I used ‘Fargo’ here. The most frequent words are ‘the, a, and, to, margin, or, you, his’.
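
The hue arithmetic can be sketched like this, in plain Python. The function names are hypothetical, and the way a width is turned into a hue index is my assumption; the text only gives the numbers 190, 6 and 32.

```python
import math


def hue_steps(max_width, n_colors):
    """Return the step size and the list of hues, one per color."""
    step = math.ceil(max_width / n_colors)  # 190 / 6 = 31.67, rounded up
    return step, [i * step for i in range(n_colors)]


def hue_for_width(w, max_width, n_colors):
    """Pick the hue whose band contains the rectangle width w (assumed rule)."""
    step, hues = hue_steps(max_width, n_colors)
    index = min(int(w // step), n_colors - 1)  # clamp the widest rectangles
    return hues[index]


step, hues = hue_steps(190, 6)
print(step, hues)
```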

There are no scripts available anymore from movies that we have seen in 2015. So I have to search for an alternative. Films that we have seen but not in 2015. Also the range of colors has to be more advanced. If I want to use the entire 360 HSB range then I have 203 pixels for w (width) available. I have to divide 203 by 12 in this release. That’s 16 (rounded down). In the end this calculation doesn’t work. So I made increments of 40 in the HSB color mode. The script is from ‘12 Years a Slave’. Most common words are ‘the, a, to, and, solomon, or, is, in’.

In this final version I have used a better color scheme. I also used the script of ‘Inglourious Basterds’. Most common words are ‘the, a, to, and, in, or, you, i, his’.

Visualizing Movie Data | VMD_06 | Films by flags

The idea is very simple. From which countries did we see films during 2015? Represent the number of movies with the flags of those countries.

The size in which I work is 1000 x 1000 pixels. There are exactly 50 countries that have made films. And the largest score of 69 films is achieved by the USA. Looking at the display size, in width, I have 1000 : 50 = 20 pixels available. In the height I have 1000 : 69 = 14 pixels available. One flag has a size of 20 x 14 pixels. That is very small. I have made a Processing test file. The first column is made up of svg-files which are reduced in Processing. Processing or Adobe Illustrator is doing crazy things with this flag file. The stars of the American flag are randomly placed outside the boundaries of the flag. But because the flag is reduced you cannot see the stars, and the errors, anymore. The second column is a png-file which blurs the stars and lines rather randomly together. The third column is a bmp-file, manually corrected in Adobe Photoshop. The fourth column is the same bmp-file but now it has a black line between each separate flag. I chose the last one. Although it’s a lot of manual editing work.

The next step I’m going to make is to find the positions of the flags. There, too, I’m going to use the American flag. It’s just a placeholder. I need to make a minor adjustment to the image file. When the flags are 20 pixels in width I must sacrifice one pixel in width for a black line. Otherwise, the flags will visually not be separated from each other. And if you look at this example, you will see that it’s not a good visualization. We are dealing with very big differences in the rather small data set. The minimum score is 1 and the largest score is 69. In addition, the flags are not looking good. And I have no space to add extra information such as headers, country names or scores.

I have to solve this differently. Let’s start all over again. I have 50 flags. That is two rows of 25 flags. Or two columns of 25 flags. Or ten rows of five flags. That seems most appropriate. Flags are usually wider than tall. And that would eventually produce a square. I pick a random image of a flag. In this case it’s the United Kingdom’s flag. Five flags in width and ten in height.

I also need margins on all four sides of the display window. I take 100 pixels margin. This produces flag sizes of 160 x 80 pixels. And that results in an exact square of 800 x 800 pixels.
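
The grid arithmetic above can be checked with a small sketch in plain Python. The function name and its defaults are mine; the numbers come from the text (1000 pixel window, 100 pixel margin, five columns, ten rows).

```python
def flag_grid(window=1000, margin=100, cols=5, rows=10):
    """Return flag width, flag height and the top-left corner of every flag."""
    usable = window - 2 * margin  # 800 pixels in both directions
    flag_w = usable // cols       # 160
    flag_h = usable // rows       # 80
    positions = [
        (margin + c * flag_w, margin + r * flag_h)
        for r in range(rows)
        for c in range(cols)
    ]
    return flag_w, flag_h, positions


flag_w, flag_h, positions = flag_grid()
print(flag_w, flag_h, len(positions))
```

Each position could then be handed to an image-drawing call together with the flag width and height.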

How can you visualize the number of films that we saw in 2015 using these flags? One possibility is Processing’s tint function. But because this tint function does not work the way I want it to work, I go for a black rectangle with a little transparency. All flags are 160 pixels wide. A 100% score is 69 films seen (the American ones). The width of a flag, 160, divided by the score of 69 is 2.3. Thus the multiplication factor is 2.3 for all the scores. Put into practice, the result is visually extremely disappointing. Although it represents the exact data.

The disadvantage of this representation is that the differences are too far apart from each other. I would like to keep these differences but the relationship needs to be adjusted. I think I need to use Processing’s map function. It re-maps a number from one range to another, which provides a more interesting image.
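
To compare the two approaches, here is a sketch in plain Python. The remap function follows the same formula as Processing’s map(value, start1, stop1, start2, stop2). The target range of 40 to 160 pixels is a made-up example, not the range used in the actual sketch.

```python
def remap(value, start1, stop1, start2, stop2):
    """Re-map value from the range start1..stop1 to the range start2..stop2."""
    return start2 + (stop2 - start2) * ((value - start1) / (stop1 - start1))


FLAG_WIDTH = 160
MAX_SCORE = 69

for score in (1, 35, 69):
    linear = score * FLAG_WIDTH / MAX_SCORE                # factor of about 2.3
    remapped = remap(score, 1, MAX_SCORE, 40, FLAG_WIDTH)  # assumed minimum of 40
    print(score, round(linear, 1), round(remapped, 1))
```

With the plain factor, a country with one film gets a bar of about 2 pixels. With a remapped range it starts at the chosen minimum, so it stays visible while the order of the scores is kept.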

I have applied the map function in an exaggerated way. Otherwise, the countries from which we have seen only one film stay almost invisible. Furthermore, I found little refinement in the image. I solved that with a small transparent black gradient. Finally, I have rotated the total image 90 degrees. This gives a better picture because it seems that the light comes from above. Which is much more natural. And why not? Flags on their side remain equally recognizable.

Visualizing Movie Data | VMD_05 | Small comparisons

At this stage of our Visualizing Movie Data project, I want to see how far you can go with a reduced view of our movie data. How far can you reduce the decoration in exchange for the functional display of data? However, the readability must remain intact. What kind of problems can we expect? And perhaps more importantly … what are the solutions to those problems? In fact, the charts I post every week on Facebook are my starting point. This is just a smaller variation of it.

What determines the smallest size of a single graph? I think that’s the longest movie title! In this case, the longest movie title is: ‘What happened Miss Simone?’. I now use a point size of 10 pixels. As a typeface I use Futura Bold. I reserve a square of 100 x 100 pixels for every movie. So for a hundred movies I need a display window of at least 1000 x 1000 pixels. But I also need some margin. And I have to position a legend somewhere. The lines of the bars could be formed by dots. Dots you can count easily. 10 dots stand for 10 points. 5 dots stand for five points. But now some interesting effects pop up. I can now draw 13 dots in width. And 13 dots in height. But this layout seems to form vertical lines of dots instead of horizontal lines. A side effect which is not what I was looking for. Furthermore, it seems that the text at the top of the square belongs to the graphic which is above the text. Also not a desirable effect.

13 dots in width would be just right for the categories. But three dots disappear in height because my scores range from 0 to 10. So the solution could be to display the movie titles under the graphics. And I have to stack the points vertically instead of horizontally. Because the points only reach 10 in a few cases, the movie titles will visually belong to the dots that are displayed above them.

I have changed the generation of the movie data into a function. The positioning is resolved and becomes easier than it was in the previous chapter of this project. There are still some parts that need more attention. The height of the column of dots. The color and the placing of the movie title. When the dots have to display a 10 then the topmost dot slightly overlaps the rectangle of the graphic.

All data is used. A hundred movies as a dot graph. Right now you cannot distinguish the categories. But what you can see is that a number of films are top-rated. If the rectangle were completely filled with dots then the film would have scored only tens. Unfortunately, there is not one movie that has made it. There are now five films which scored very high: Locke, Mr. Turner, From What Is Before, ’71 and Lilting. Amour Fou is slightly below that score.

At the moment, the program displays a point for each dot. I’m going to turn it into a circle. In the longer term that gives me more possibilities to make variations.

I have used a number of colors for the different columns. It’s a multiple of 30 in the 360 hue scale. But I find it looks cheap. In addition, some colors are too close together. The green series for example. And the dark blue disappears into the background color. And, some colors are hard to distinguish from others.

I have made a few variations of the existing color scheme. A number of these variations are not acceptable. The number of colors must be reduced. But perhaps I can do something with the brightness of the colors. The 13 columns are very difficult to distribute in different colors. The red column pops out. It appears to me that it means something special. But that is not the case. That is also the problem with the 3 degrees of brightness in the columns. Those seem to belong together. But that is also not true.

Now I am starting to look for a range of colors where each color stands on its own. But the range should still be acceptable as a fine color range. A color may not have a relationship with the other colors. I replaced the circles with squares.

The squares can also be replaced by rectangles. I’ve also added a legend. Otherwise, the addition of color is absurd. I give the category texts in the legend the same color as the bars to which they belong in the graphic.

Just a few variations with broken lines, and rectangles, and transparency.

And the last variation. Totally non-functional. But it does make an interesting picture.

Visualizing Movie Data | VMD_04 | Reviews by categories

As a next step, I find it interesting to see what our reviews are telling us when I show each category of a movie. I can imagine that the titles of the films are on the left. Suppose we start from the first 100 movies we have watched from the beginning of 2015. What does it look like? And what conclusions can we draw? I’ll try programming this version slightly smarter than the earlier version.

I start by creating a grid of numbers. There are 13 categories (13 columns) with decreasing numbers from top to bottom and from 10 to 0. The size of the display window is a bit of guesswork. I now work on a size of 800 by 800 pixels. On the left side of the display window film titles have yet to be placed. And all 13 category labels should still come on top. I expect that I need much more space than 800 pixels in width and height. In the program I have added an empty draw block. Otherwise functions such as keyReleased and timeStamp do not work.

Placing the film titles is a matter of creating a text file with 100 titles of films that we have seen since the beginning of 2015. Then reading this text file into Processing and displaying it in the display window. The order (from top to bottom) corresponds to the viewing order. The list starts with the film ‘Boyhood’. Which is the first film that we saw in 2015. The list ends with the film ‘Restless’. And that’s the hundredth film we’ve seen. However, only about one-third of the list is visible. This is up to the film ‘Calvary’. And that is film number 38. Putting another 62 films in this display height makes no sense because the point size would become too small to read.

To get all the movie titles on the left in the picture, I have a few options. Reduce the line spacing. Reduce the point size of the font. Or I can increase the size of the display window. In this case I have used all three possibilities. I end up with 1500 x 1300 pixels. I also added the names of categories.

Another stage where I further optimize the distances. The category names (the labels of the columns) are still too far from the category columns. I’m going to put them closer and place them at an angle of 45º. The category numbers are now placed on an imaginary square. The display window is now 1460 x 1228 pixels. And the grid is built with squares of 90 x 90 pixels. Testing a first line which is drawn through the numbers that the film ‘Boyhood’ scored. That does not look good. The lines are too stiff. It should be more fluid. VMD_04_04

In order to make more fluid lines I made one attempt with the curveVertex function. The problem here is that the curveVertex function uses Catmull-Rom splines. It does not make beautiful curves. In the end I opted for bezier curves. For the quality of the curve that is the best solution, but it requires more data to describe the curve. Each bezier call takes eight numbers: an x and a y for two anchor points and two control points. That means 13 x 8 numbers per movie. That is 104 numbers for the first movie. Thus, in total 10,400 numbers must be worked out to make the final visualization.
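
A sketch of that bookkeeping in plain Python. Each tuple holds the eight numbers in the order that Processing’s bezier function expects: first anchor, first control point, second control point, second anchor. Putting the control points halfway between the columns is my assumption, not necessarily how the curves in the images were positioned, and the coordinates are invented.

```python
def bezier_segments(points):
    """Turn a list of (x, y) points into eight-number bezier segments."""
    segments = []
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        mid_x = (x1 + x2) / 2
        # Control points share the y of their own anchor, so the curve
        # leaves and arrives horizontally (assumed design choice).
        segments.append((x1, y1, mid_x, y1, mid_x, y2, x2, y2))
    return segments


# Hypothetical positions: the end of a movie title plus 13 category columns.
points = [(300, 50)] + [(400 + 90 * i, 100 + 10 * i) for i in range(13)]
segments = bezier_segments(points)
numbers_per_movie = sum(len(s) for s in segments)
print(len(segments), numbers_per_movie, numbers_per_movie * 100)
```

Fourteen points give 13 segments of eight numbers each, which matches the 104 numbers per movie and 10,400 numbers for 100 movies.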

The first six films drawn using bezier curves.

I have now drawn 26 films with bezier curves. And it shows directly the weakness of this visualization method. Since all lines have the same color and thickness it is difficult to see which movie has scored which number in which category. At a later stage I will do something about that. But the problem is not completely solvable.

About half way with the positioning of bezier curves. I place the curves in a very straightforward way. I know that this can be done with more intelligence but I will not have time enough to solve this problem now. I think it requires an additional study which I might do in a later stage.

And now I am about to place the last quarter of the bezier curves.

All bezier curves are now positioned. On the left, it has become a pretty organized chaos. Looking at the line patterns you can conclude that most movies have brought us a 6, 7 or 8. What might also be said of our rating. Is our rating mediocre?

With all the lines in their place, it is now the time to bring in the Futura font. I have changed the background color to black. Font color is white. The color of the lines is gray with 50% transparency.

Time for a number of tests with line widths. Some are absolutely exaggerated. Others are functional. These variations also show that the number columns have to be drawn last. Otherwise they will be overwritten by the bezier lines. And I shifted the column with movie titles slightly to create some space between the start of the bezier lines and the end of the movie titles.

Trying to solve a problem that popped up in VMD_04_07. To what extent is it possible to get more distinction between the bezier curves themselves? I start with two colors. Red and green. A strange effect seems to occur. When a certain amount of red and green lines overlap it creates an additional color. It looks like orange. At least that seems to be orange but if you make the lines thicker it seems to be some light version of something brown-ish.

Added a blue color. Now it seems that there are many more shades of additional color variations possible.

What happens if I make an ascending color scale from 0 to 360? I switch to color mode HSB. HSB is easier to work with (as a human).

Which movies have received the highest possible value of 10 points at least once?

Which movies have been awarded a score of 9 points or higher at least once?

And finally: which movies have been rewarded with a score of 8 points or more at least once?

A quick conclusion. I am tempted to say that if a film did not score a single 8, 9 or 10 in the assessment it is not a good movie. That means it is of a lower level than films that scored at least one 8. Or one 9. Or one 10. This visualization is showing the worst films of all 100 films we have seen since the beginning of 2015. In total these are only 27 movies. So a little over a quarter. That means that three-quarters of the 100 films that we have seen always had something of good quality in them. And that’s very reassuring. For the filmmakers, the film industry and for us.
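
The selection behind these last images can be sketched as a simple filter, here in plain Python. The titles and scores below are invented placeholders, not our real review data, and the function name is hypothetical.

```python
def never_reached(scores_by_movie, threshold=8):
    """Return the titles whose highest category score stays below threshold."""
    return [
        title
        for title, scores in scores_by_movie.items()
        if max(scores) < threshold
    ]


# Invented example data: 13 category scores per film.
scores_by_movie = {
    "Film A": [6, 7, 7, 5, 6, 7, 4, 6, 7, 6, 5, 6, 5],
    "Film B": [8, 7, 9, 8, 7, 8, 6, 7, 9, 8, 7, 8, 7],
    "Film C": [7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7],
}
print(never_reached(scores_by_movie))
```

Changing the threshold to 9 or 10 gives the complement of the other two selections.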

Visualizing Movie Data | VMD_03 | Waltzing with Bezier

When I started this assignment I was interested in how much money actually goes around in the film industry. What does a movie cost? What is the budget? How much money does it bring in? And how do these figures compare with our ratings? I thought it would be easy to check the data on the IMDb site. But unfortunately all I found was very incomplete data. I checked all 150 films that we have seen since January 2015. And guess what. There are only 56 films that show both the budget and the profits. In addition, the amounts are given in different currencies. So I have to convert them to dollars or another currency unit. Additionally, in the descriptions of movies that are not from the United States, there is almost no sign of costs and benefits to be found. So I have to check other websites for additional information.

That extensive check resulted in 69 films with complete financial information. I think I should leave out the series. These often run over several years and have varying budgets. While a film only runs once and receives just one budget. Another thing is that these figures only cover the periods in which the movies were shown. Some play for longer periods than others. Because they are more popular they bring in more money. But that says nothing about the quality. Our list shows that there are only three films that cost less than one million dollars. However, there are 14 films that earned less than 1 million. I made two text files of them. One with the highest budget at the top. The other list has the highest gross at the top.

Then it is important to read the text file into Processing and display it in the display window. A simple task. But that turned out to be more complicated than I thought. The tutorials pay a lot of attention to getting a text file into Processing’s console. But how to get the data into the display window I could not find anywhere. I got half of my question answered through the Processing Forum. And I solved the rest myself. I have been busy with it for one afternoon. And this is the first result. Not very impressive but all data that is in the text file is displayed in my Processing display window. And that was the first goal I had in mind.

The next step I need to take is to get the data lists separated. It should be possible to reposition the movie-titles, budget and revenue. If I cannot do that I cannot deal with the layout. Incidentally, at this moment the sort and reverse functions are quite handy. And I have changed the font to Futura Book.

How does the program know which budget and income are associated with a movie? That is a question for me too. Because the two number columns are mixed up. The budget and the income lists are both sorted from large to small amounts. The budget list thus does not have the same order as the income list. So I have added film titles to both the budget and the income list. That way it is easy for me to check whether each budget line is connected to the right amount in the income list.
VMD_03_03

Changed the background colour to a very dark grey. Furthermore, the budget and income lists are now connected to one another by a line. Everything looks pretty cluttered. But that will change in the next design. What’s striking is that the biggest blockbuster has a horizontal line. ‘Interstellar’ with a budget of 165,000,000 dollars and a total income of 675,020,017 dollars.

I’ve started checking the film titles. Whether they are written correctly and without mistakes. I translated all non-English film titles. Les Petits Mouchoirs is Little White Lies. Loin des Hommes is Far from Men. Relatos Salvajes: Wild Tales. Marie Heurtin: Marie’s Story. Eldfjall: Volcano. And that is one side of data visualization. You must be an administrator, Sherlock Holmes, graphic designer, translator, animation designer and programmer at the same time. I have given the chart some more space. And the distance to the lists of numbers is increased. Which suddenly brings me to a new idea.

I now work in Processing 2. It’s time to download the new Processing 3 and fund the Processing Foundation. That is the least I can do because I work daily with Processing. In Processing 3 you can use the Table class. It’s easier to work with because everything is now in one text file.

The $ sign was added but I do not find it successful. Maybe find another solution. Right now you do not see what the amounts of the lists are. I know that the left-hand amounts are for the budget. The right column represents the amount of income.

Because ‘Interstellar’ is misrepresented I have thrown this film out. I think the columns should have proper labels. And I need room for doing that. I have also added the sequence of 0-10 to the right. The numbers 0-10 represent the ratings we have given to the films. The idea is that I’m once again going to draw the lines but now from the income-list to our ratings.

I have adapted the total graph a bit. Lines start and stop now slightly closer to the lists of numbers. I have added the vertical text ‘Amounts in American Dollars’. The overall chart remains somewhat chaotic but I think the result is not disappointing.

Added colour. I chose green for the films that cost less than their revenue. And I chose red for the films that cost more than their revenue. Now the graph begins to show a disadvantage. Because the lines are thicker it is difficult to see to what amounts they belong.

I replaced the line function with the bezier function. Now it is easier to see which amount belongs to which line. And the overall chart looks slightly smoother. Of the 64 films, 26 films have made a loss. 38 films have made a profit. Mr. Turner eventually made a loss but was still on top of our rating. Locke is a movie made for 2,000,000 dollars. It made a profit of 5,000,000 dollars and received a 10 in our rating. The Salvation cost 11,524,796 dollars. To our knowledge it has brought in 5000 dollars (which I strongly doubt). But it still gets a 7 in our rating. In short, data visualization is very interesting, very time-consuming and precise puzzling. Actually I should have coded the program much smarter. But that would cost even more time.

Visualizing Movie Data | VMD_02 | Time Series

A dozen years ago I heard the word ‘ubiquitous’ for the first time. I wondered what it stands for. I looked it up and the word ‘ubiquitous’ means: present, appearing, or found everywhere. So these time series graphs are a type of graph that you can find anywhere. Because this project is about visualizing our movie data I need three or more data sets. The idea is that I will compare these data sets with our own data set. I hope to find out how our qualifications relate to, for instance, the IMDb (Internet Movie Database), Metacritic and/or Rotten Tomatoes. If I can see the first one hundred films compared to the results of these websites, I should be able to draw the necessary conclusions. This is going to be a lot of handwork. But that’s okay because I’m in a learning process.

I could imagine that you have the numbers 1 to 10 on the left side of the graphic. And at the bottom are all the film titles. That seems logical. But it is not. Movie titles may be very long. For example: ‘A Pigeon Sat on a Branch Reflecting on Existence’. So you would expect the film titles to the left side of the graph. And then, the numbers 1 to 10 at the bottom. At this moment I think the best solution would be if you place your mouse cursor on a data point that the movie title is displayed at that point. But maybe I run too much ahead of myself. I have now read the original data of Ben Fry’s Time Series chapter in the program and I changed the display format.

Let me concentrate on the data. The first thing you notice about the IMDb, Metacritic and Rotten Tomatoes reviews is that they work with floats. Our own movie data also works with floats but the end result is an int. So I actually have to run all 100 film programs again and see what the end result is using floats. When I have those results, I have to type them in a text file. And then I do the same with the results of IMDb, Metacritic and Rotten Tomatoes. I left out Metacritic in the end. It sometimes happens that we have seen a film but that it is not found on IMDb or Rotten Tomatoes. In that case, the film gets a zero. The first thing I noticed in our chart, which uses our own data, is that it looks quite messy. There is not really any logic to be found in the positioning of the points. The reason for this is that our films are chosen randomly. This results in random positions for the set of points. The sequence is the real sequence of the first 100 films we have seen in 2015 though. Furthermore, the points are positioned at the bottom. This is caused by the largest value in the other data series. Our data set ranges from 0.0 to 10.0, while the other two data sets range from 5.1 to 46.4. Therefore these other two sets still have to be adjusted. But I do not have the right data for them yet.

At this moment I have added all the scores from IMDb and Rotten Tomatoes. I can now hit the “]” key and the “[” key to go through the three different graphs. It all looks a bit sparse. But you do get an impression of how the scores are distributed. I’ve also added titles as a placeholder (We, IMDb and RT (Rotten Tomatoes)).

I have increased the number of films to 150. It now looks somewhat less sparse. Eleven films are not evaluated on Rotten Tomatoes. That puts them at zero, at the bottom of the chart. However, these films are evaluated on IMDb and by us.

At the bottom, I added numbers for the films we have seen. I also reduced the white background space slightly. This ensures that everything is shown less cramped in the display window. It would be even better if you could read the titles of the movies instead of our numbering. But perhaps I can add that at a later stage. And perhaps not at all. Maybe. Because after all these graphs are only about comparing our voting behavior with IMDb and RT. A quick conclusion is that our scores are spread slightly wider. They range from 3.3 to 5.9 points. IMDb ranges from 4.1 to 9.3. The Rotten Tomatoes series goes from 4.5 to 9.8 (if you do not count the 0.0).

I have added horizontal and vertical grid lines that may be helpful to compare the data points better. On the left side of the graph the scores of 0.0 to 10.0 are now displayed. And as a result, there is no need for additional tick marks. The horizontal and vertical lines do their work instead. I think that the score numbers are displayed with too many digits. I now have four digits after the point because we are working with floats. The function ceil does not help in this case. Because that rounds everything upwards. Floor rounds everything downwards. The function I’ve used now is nf. This means that there is just one digit shown after the point. I use two versions of Futura: Futura Medium and Bold. Furthermore, I also labeled the numbers. That makes the chart clearer.
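
The difference between the three options can be sketched in plain Python. Python’s string formatting stands in here for Processing’s nf function; I am not claiming that the two round in exactly the same way in every edge case. The score is an invented example.

```python
import math


def one_decimal(value):
    """Format a float with exactly one digit after the point."""
    return f"{value:.1f}"


score = 7.3846  # hypothetical average with four digits after the point
print(math.ceil(score))    # rounds upwards to a whole number
print(math.floor(score))   # rounds downwards to a whole number
print(one_decimal(score))  # keeps one digit after the point
```

Both ceil and floor throw away the fraction completely, which is why they do not help for labels like 7.4.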

I now go ahead replacing the points with a line. Actually this is a bit rubbish. The scores of the films have nothing to do with each other. Each film’s score is a value on its own. So there is no need to connect them with a line. But as a variation it is perhaps interesting. I also changed the colors. The white field is replaced with a dark gray. Because then the colored lines stand out better.

In this version all scores are displayed on top of each other to see where the differences are. The title of the data sets should change with it if you choose another data set. But I don’t like it anyway. It is a poor and chaotic whole. So this seems to be not a good option.

I now have retrieved some items from one of the earlier sessions. The line connections remained blue and the points themselves are white. The points are most important so they are allowed to stand out. I’ve made them a little smaller. This has as a result that (when points are close to each other) they overlap each other less.

This proposal introduces rollovers. I now get feedback that I can already see on the x and y axes but much more precisely. But actually you would like to see the movie title when your cursor is at a data point. I think I’m going to do that at a later stage. But I am unsure about it. I think it’s more important that I get some sense of what you can do with the data.

I do have the feeling that the lines have become too dominant. Especially now that you’re getting direct feedback at the cursor. The lines are no longer functional. I will also try if I can make the middle block more square. The downside is that the smaller rectangles are not square anymore. However, it does create more room in the width. I also reduced the proximity of the cursor and increased the point size from 10 to 12. And Futura Bold is used for the values under the cursor.

Replacing vertex in drawDataLine with curveVertex actually does not make much sense. The data points are most of the time so close together that no fluid line between the points can be made. But if you close the shape into a filled plane by adding the lower right and the lower left corner points, it makes more sense and it gives a different picture. The question then is whether the horizontal and vertical lines are still functional. So I have removed them. I think this looks better than all the previous versions. And along with the feedback you get when you stand with your cursor on a data point it looks just fine.

I have made the background of the chart the same color as the background color. That gives a completely different picture. I initially had accentuated the vertical lines. But I think the horizontal lines can better be accentuated. These lead you to much more meaningful data. I gave the horizontal lines 50% transparency in the beginning. But afterwards I got a better result by decreasing the line width to 0.5 pixels. Which is, logically speaking, impossible.

It seems silly to transform this graph to a bar graph. I must then let the program draw rectangles instead of one flat plane. But then I have a problem. Because I have 150 bars in a width of 600 pixels. This means that the width of one bar can be 3 pixels at most. At 4 pixels, the total lower surface is filled again by bars that touch each other. But with 3 pixels I think it’s just about acceptable and it even has some form of sophistication.
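
The bar arithmetic can be sketched in plain Python. The function name and the one pixel gap are my own choices for illustration; the 600 pixels and 150 bars come from the text.

```python
def bar_layout(chart_width, n_bars, gap=1):
    """Return slot width, bar width and the x offset of every bar."""
    slot = chart_width // n_bars  # 600 / 150 = 4 pixels per film
    bar_w = slot - gap            # leave a gap so the bars stay separate
    xs = [i * slot for i in range(n_bars)]
    return slot, bar_w, xs


slot, bar_w, xs = bar_layout(600, 150)
print(slot, bar_w, xs[-1] + bar_w)
```

Every film gets a slot of 4 pixels. A bar of 4 pixels fills the whole slot, so neighbouring bars merge into one plane again. At 3 pixels one pixel of background stays visible between them.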

As a last proposal I introduced tabs for the three different data sets. But I found the Futura Bold far too heavy in these white tabs. So I opted for the Futura Medium.

Now I have to do a few more things. The white area behind the title is way too loud and is almost visually independent of the graph. Plus, the bar chart layout is not the best I’ve seen so far. As a final detail I go back to the design of VMD_02_12. I now use only Futura Medium. I also adjusted the color. I chose red and green, two distinctly different colors. The strong contrast between them makes the separation line between the two planes stand out even more. And thus it seems to me that this session is finished. But there is one more thing.

I have made a very simple animation of the three datasets. The points interpolate between our own dataset and those of the IMDb and Rotten Tomatoes. Unfortunately, the interactive version is not available. I captured the animation so that there is at least something to see.


Visualizing Movie Data | VMD_01 | Mapping Movie Data

In 2008 we began reviewing movies. In the beginning we did that with scores between 0 and 10. Later we used a more narrative approach. That led to detailed reviews of movies that we posted on Facebook. But over time that became too much work and took too much time, so I decided to introduce a more precise way of reviewing the movies. On 5 January 2015 I started to use 13 categories for each movie: storyline, originality, cinematography, involvement, sound, editing, educational, title design, acting, interesting, unusual, exciting and superior. Every category earns a score between 0 and 10. A simple Processing program adds all scores together and divides the result by the number of categories. That gives an average score for one movie. The aim of Visualizing Movie Data is to give us more insight into the choices we make in evaluating a film. Ultimately, this should tell more about ourselves than about the films. If that’s true, that would be a positive spin-off. Through Visualizing Movie Data I try different ways of visualizing data collected by us. And I get help from Ben Fry’s book ‘Visualizing Data’, published by O’Reilly Media Inc. Unlike the Generative Design Variations project, Visualizing Movie Data is not about making as many variations as possible. The intention now is to use the movie data as functionally and as plainly as possible. I will skip any form of decoration. In this chapter I try to use a very simple and basic way of reading, displaying and interacting with a number of small data sets. These datasets consist partly of the data we found in our movie reviews.
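The scoring described above comes down to a plain average. A minimal sketch in Python (not the original Processing program) could look like this. The category names are taken from the text, while the function name `average_score` and the example scores are assumptions for illustration.

```python
# The 13 review categories named in the text.
CATEGORIES = [
    "storyline", "originality", "cinematography", "involvement",
    "sound", "editing", "educational", "title design", "acting",
    "interesting", "unusual", "exciting", "superior",
]


def average_score(scores):
    """Add up one score (0 to 10) per category and divide by the category count."""
    missing = [name for name in CATEGORIES if name not in scores]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    for name in CATEGORIES:
        if not 0 <= scores[name] <= 10:
            raise ValueError(f"score for {name} must be between 0 and 10")
    return sum(scores[name] for name in CATEGORIES) / len(CATEGORIES)


# Hypothetical review: every category scores 6, except two.
example = {name: 6 for name in CATEGORIES}
example["storyline"] = 10
example["acting"] = 9
movie_average = average_score(example)  # 85 points over 13 categories
```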

I started looking for a world map. Eventually I found a world map that uses the Mercator projection. The Mercator projection is a conformal cylindrical projection with large area distortions at higher latitudes. In this projection Europe is shown slightly larger, which is an advantage because a lot is going to happen in this ‘small’ part of the world. Perhaps it will ultimately be necessary to use a separate map of Europe as an inset. But I don’t know that at this moment. I decided to omit all color because I want to use color for the markers of the countries. At this stage I only display a world map.

This is a first version in which the program reads coordinates from a text file to place red dots. I wanted to know if that worked. If it does, then it should also work with exact coordinates.
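For exact coordinates, latitude and longitude have to be converted to pixel positions on the Mercator map. The sketch below is in Python, not the original Processing code, and rests on several assumptions: the map image spans the full 360 degrees of longitude, the equator runs through the vertical middle of the image, and the text file has one tab-separated line per country (code, latitude, longitude). The file layout, the function names and the sample values are all hypothetical.

```python
import math


def mercator_xy(lat_deg, lon_deg, map_width, map_height):
    """Convert latitude/longitude in degrees to pixel coordinates on a Mercator map."""
    x = (lon_deg + 180.0) / 360.0 * map_width
    lat = math.radians(lat_deg)
    # Standard Mercator formula for the vertical position.
    merc = math.log(math.tan(math.pi / 4 + lat / 2))
    y = map_height / 2 - map_width * merc / (2 * math.pi)
    return x, y


def parse_locations(text):
    """Read 'code<TAB>latitude<TAB>longitude' lines into (code, lat, lon) tuples."""
    points = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        code, lat, lon = line.split("\t")
        points.append((code, float(lat), float(lon)))
    return points


# Hypothetical file contents with rough country centers.
sample = "FR\t46.0\t2.0\nDE\t51.0\t10.0\n"
dots = [(code, *mercator_xy(lat, lon, 800, 600))
        for code, lat, lon in parse_locations(sample)]
```

If the map image is cropped at some latitude or shifted, the vertical offset has to be adjusted to match, so the dots should always be checked against a few known locations.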

Now I need to gather a list of all 46 countries that have produced films. I happened to find a list of two-letter country codes from the International Organization for Standardization (ISO 3166 Country Codes). I can use these abbreviations for display at a later stage. The overall list is much too extensive, so I have to select only those countries that have made films that we have actually seen since the beginning of 2015. If you look at this visualization there are a few things that stand out. Europe is very busy. Some countries are completely covered by a red dot. And some dots overlap other dots. Actually, it’s a mess. On all other continents there are only a few dots. It looks empty. Furthermore, the overall color of the image looks too dark. I have also made an outline version of the dots. But this does not solve the problem. In Europe the small countries are slightly more visible, but it is a minimal improvement and no real solution.

For this version I was going to make an inset for Europe. At least, so I thought at the time. After I had made a rough sketch, it actually delivered more problems than solutions. In the first place, Europe is out of proportion when you compare it with the rest of the world map. Which is the case anyway with every world map. An exception is the Goode projection, which is an equal-area map projection. And in the second place, an inset always covers a portion of the world map, so that portion has to be moved again. And actually you create five continents, which are all out of proportion. The question is whether that’s good. So it seemed best to leave everything as it is for the time being. I will solve problems when they become relevant. The dots are slightly reduced and everything becomes pretty clear. I have also added a title plus additional information, which makes it even more complete.

A few years ago I worked on patterns and photography in Processing. I could apply one of those patterns to the world map. Since a realistic world map is not possible anyway, it is just as good to make an abstraction of the world map. The question now is: is abstraction decoration? The world map is now completed with dots with a diameter of 4 pixels. The locations are displayed (as far as possible) in the middle of the country, although in some countries the middle is hard to find. Where is the middle of the USA with its territories and various possessions?

Those dots are of course meaningless. They only give the central locations of countries where movies are made. It would, for instance, be more helpful if you could see where the most films are made. This would be possible if you could vary the size of the dot. A large dot means a lot of movies. A small dot means a few films. This is a version that uses randomly generated numbers.
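Varying the dot size with the number of films is a linear mapping from a count range to a diameter range, comparable to what Processing’s map() function does. A minimal Python sketch, assuming a smallest diameter of 4 pixels (the dot size mentioned earlier) and a largest diameter of 40 pixels (an arbitrary choice), with the hypothetical function name `dot_diameter`:

```python
def dot_diameter(count, min_count, max_count, min_d=4.0, max_d=40.0):
    """Map a film count linearly onto a dot diameter in pixels."""
    if max_count == min_count:
        return min_d  # all countries have the same count
    t = (count - min_count) / (max_count - min_count)
    return min_d + t * (max_d - min_d)


# Assumed range: 1 film for the smallest country, 47 for the largest.
small = dot_diameter(1, 1, 47)
middle = dot_diameter(24, 1, 47)
large = dot_diameter(47, 1, 47)
```

Note that this scales the diameter, so the area of a dot grows with the square of the count. Scaling the area instead would be a different design choice that makes large values look less dominant.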

This version uses our movie data. The number of films from a country determines the size of its dot. It is immediately clear that most films come from France, the USA, the UK and Germany. This of course says nothing about the quality of the movies. It only shows information about the quantity. And it is only partly the truth, as we will see later. And what will this visualization look like at the end of 2015 or 2016, when we have seen more movies? It was a surprise that France plays such a leading role in the field of film production. But it also immediately raises questions. Maybe we’re being manipulated? Might there be another variable that makes France appear so high in movie production?

It is also possible to interpolate between two colors and make all dots the same size. I use red for the low numbers and green for the high numbers. But I find that this version does not show the smaller differences very well. In fact, there are only two green dots and two interpolations between green and red tending towards brown. The rest are nuances of red. And what does greenish or reddish-brown mean then?
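Why the in-between values turn brownish becomes visible when the interpolation is written out. Processing offers lerpColor() for this. The sketch below is a plain Python equivalent working on RGB tuples, with the hypothetical name `lerp_color`. It is an illustration under those assumptions, not the original code.

```python
def lerp_color(c1, c2, t):
    """Interpolate linearly between two RGB colors; t runs from 0.0 to 1.0."""
    t = max(0.0, min(1.0, t))  # keep t inside the valid range
    return tuple(round(a + (b - a) * t) for a, b in zip(c1, c2))


RED = (255, 0, 0)
GREEN = (0, 255, 0)

low = lerp_color(RED, GREEN, 0.0)      # pure red for the lowest count
halfway = lerp_color(RED, GREEN, 0.5)  # a muddy olive-brown tone
high = lerp_color(RED, GREEN, 1.0)     # pure green for the highest count
```

Halfway between red and green both channels sit at medium strength, which reads as olive or brown rather than as a clear ‘in between’ value. When most counts are low, nearly all dots end up as slightly different reds.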

All countries from which we have seen five or more films are green. The largest dot on the world map indicates that we have seen many more than five movies. A small dot indicates that we have seen five movies. All countries from which we have seen fewer than five movies are red. A larger red dot indicates that we have seen almost five movies. A small dot indicates that we may have seen only one movie. In short, very complicated and not very clear. And you really need some textual information here.

Of course you want to be more accurate in displaying this data. So I’ve changed the program so that the abbreviated name of the country is displayed when you move the cursor over the country’s dot. This does not work flawlessly, though. Sometimes the name of the country disappears under a dot. And when countries are small and close together, both country names are activated. I made one version with Futura Medium 12 and one with Futura Bold 12. I think that the Futura Bold version is better suited because it is more readable.

I have optimized the mouse interaction. Now there are never two countries selected at the same time, and the name of a country is always drawn last, so it can never be overwritten by a dot.
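One way to guarantee a single selection is to pick only the dot closest to the cursor. The Python sketch below shows that idea. It is an assumption about how such a fix could work, not a reconstruction of the actual program, and the function name, the distance limit and the dot positions are hypothetical.

```python
import math


def pick_country(mouse_x, mouse_y, dots, max_distance=10.0):
    """Return the code of the single dot nearest to the cursor, or None.

    dots is a list of (code, x, y) tuples in pixel coordinates.
    """
    best_code = None
    best_distance = max_distance
    for code, x, y in dots:
        distance = math.hypot(x - mouse_x, y - mouse_y)
        if distance <= best_distance:
            best_code = code
            best_distance = distance
    return best_code


# Two hypothetical dots that lie very close together.
dots = [("NL", 100.0, 100.0), ("BE", 104.0, 104.0)]
near_first = pick_country(101, 101, dots)
near_second = pick_country(103, 103, dots)
nothing = pick_country(300, 300, dots)
```

Drawing all dots first and the selected name afterwards then keeps the label on top.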

I have replaced the floating-point numbers with integers. And I replaced all the colors with green. I think this is a reasonable version. You can easily see which country most films come from. And for the exact quantities you get feedback from the cursor.

I’m going a little deeper into the movie productions made in France, the United Kingdom, Germany and the USA. I begin with France. That data does not look very spectacular. Apparently, most of the movies were filmed in Paris. In addition there are a few film locations in the south and center of France. Furthermore, it seems that of the 47 French films that we have seen only 20 were filmed on location in France. The other 27 were made through partnerships with other countries, and their film locations are all situated outside France.

There is a problem with the data. When I run the program I get an ArrayIndexOutOfBoundsException: 8 error. I tried to find out what causes this error. It appears that you get it when you read data from text files of different lengths. That seems logical, because then the arrays cannot all be filled in the same way. If you have one array with 10 lines and another with 8 lines you get this error. And that was the case when I adapted the French data files for the United Kingdom. The 8 in the error message is the index that could not be found: an array for which 8 lines were set aside only has the indices 0 to 7.
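A length check right after loading catches this kind of mismatch before any drawing starts. The sketch below is Python, where the comparable runtime error would be an IndexError instead of Java’s ArrayIndexOutOfBoundsException. The function name `pair_lines` and the file contents are hypothetical.

```python
def pair_lines(names_text, values_text):
    """Combine two line-based data files and refuse files of different lengths."""
    names = names_text.splitlines()
    values = values_text.splitlines()
    if len(names) != len(values):
        raise ValueError(
            f"files differ in length: {len(names)} names, {len(values)} values"
        )
    return list(zip(names, values))


# Hypothetical file contents: 10 location names but only 8 values.
names_text = "\n".join(f"location {i}" for i in range(10))
values_text = "\n".join(str(i) for i in range(8))

try:
    pair_lines(names_text, values_text)
except ValueError as error:
    message = str(error)  # names both lengths, so the faulty file is easy to find
```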

Germany looks empty. But I think it will all be fine in the long term. Also, I think I’ve done something wrong. Some movies, of course, have several film locations. If I can trace those, the image is bound to become more interesting.

This US version is not really interesting either. But something else is happening. If the first two characters of a line in my text file are not unique (and thus are duplicates in the list), the positioning of the dots and text goes wrong. And because the first two characters have no further function (but apparently do have influence), it might be better to use abbreviations of this form: AA, AB, AC, AD, AE, etc., and after AZ continue with BA, BB, BC, BD.
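Such a series of unique two-letter keys does not have to be typed by hand. A small Python sketch, with the hypothetical function name `two_letter_codes`, generates them in the order described:

```python
from string import ascii_uppercase


def two_letter_codes(count):
    """Return `count` unique keys in the order AA, AB, ..., AZ, BA, BB, ..."""
    limit = len(ascii_uppercase) ** 2  # 676 combinations from AA to ZZ
    if count > limit:
        raise ValueError(f"only {limit} two-letter codes exist")
    return [
        ascii_uppercase[i // 26] + ascii_uppercase[i % 26]
        for i in range(count)
    ]


codes = two_letter_codes(30)
```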

I have now found all film locations for all American films we have seen since January 2015. My first search was far too superficial, so I found only 18 locations. Now I have 218 locations available. A number of them will not be used because the film locations are outside the USA. And there are several movies that were filmed at the same location. So the final list will be shorter than 218 but longer than 18. Now I need to make a list and avoid duplication. All the movies filmed in Los Angeles are to be added up to one total. The same goes for all films in New York, all films in Detroit, and all the other cities and towns in the USA. I ended up with a final list of 87 cities and villages. Los Angeles tops the list with 61 productions. Which is actually not very surprising.
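Adding up the films per city is a counting job that a few lines of code can take over. The sketch below is in Python and uses a short hypothetical list, not the real 218 locations. The function name `count_by_city` is an assumption as well.

```python
from collections import Counter


def count_by_city(locations):
    """Collapse a raw list of film locations into (city, total) pairs, largest first."""
    cleaned = (city.strip() for city in locations)
    return Counter(city for city in cleaned if city).most_common()


# Hypothetical raw list with duplicates and stray spaces.
raw = ["Los Angeles", "New York", "Los Angeles", "Detroit", " Los Angeles "]
totals = count_by_city(raw)
```

Different spellings of the same city still count as separate entries here, so the raw list needs a consistent spelling first.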

I now go one level deeper. I started on a global scale. Then on a country scale. And now I’m going to work on an urban scale. After some research it seemed suitable to use the film locations of the series ‘Breaking Bad’. I display an abstract version of the map of Albuquerque in the background. And I used a reduced version of the Breaking Bad wordmark. But that doesn’t work at all. I also find the number of film locations insufficient.

After some more research I could trace a lot more ‘Breaking Bad’ film locations. I changed the title and background colors. But I think it’s really ugly. Something that is designed to be simple and basic does not have to look ugly. So I have to work on that.

This is the state in which I want to finish this exercise for the time being. All locations can now be read and the number of scenes is shown after the location name. There is a title and a subtitle. You can immediately see that Walter White’s House is the most-used film location (with 80 scenes). Then follow Jesse Pinkman’s House (with 42 scenes), Hank and Marie’s House (with 29 scenes), the DEA offices (with 27 scenes), the Car Wash (with 22 scenes), Jesse Pinkman & Jane’s House (with 21), Gus’s Laundry Service (with 20) and Los Pollos Hermanos (with 17). Too bad I cannot show this in a JavaScript version, because that is unfortunately not working.

A rough conclusion: data visualization is much harder than dreaming up nice effects as I did in the previous Generative Design Variations project. In data visualization you have to limit yourself in order not to come up with lots of nonfunctional decoration. A lot of time must be invested in research before, during and sometimes after the design phase. And it is an iterative process. After every improvement further improvements are possible. It also takes more time than a general design job. You must be very precise and constantly look for better data and better interpretation and visualization of the data. Data visualization involves a high level of detective work.