Invitation to register: CMG BRASIL - 7th National Technical Meeting (7º Encontro Técnico Nacional) - 15/08/2012

Dear members and associates,

 

CMG BRASIL invites all professionals in the disciplines and sub-disciplines of IT capacity planning and performance analysis, as well as anyone in the wider IT community with an interest in this area, to its next national technical event:

 

Date: 15/08/2012

Time: 8:30 to 18:00

Venue: Faculdade Oswaldo Cruz

Address: Rua Brigadeiro Galvão, 540 - São Paulo

Parking: next to the Faculdade

Metro: Marechal Deodoro station

 

All details of the registration procedure are in the attached invitation.

 

Registration is open from 14/05/2012 to 10/08/2012. We recommend registering as early as possible, since places are limited.

 

 

Paula Iochpe

CMG BRASIL

Computer Measurement Group Brasil

11 4193-5096 – 11 7366-6598

paulaiochpe@gmail.com

 

Attachments:

  • ChamadaInscricao-EVENTOCMGBR-15Ago2012.pdf (65 KB)
  • FichaInscricao- EMPRESA.pdf (70 KB)
  • FichaInscricao-PF.pdf (68 KB)

CMG BRASIL - CALL FOR PAPERS 2012 - NATIONAL TECHNICAL MEETING

Dear members and associates,

 

We are preparing for the 7th National Technical Meeting of CMG BRASIL, to be held this coming August, and we are counting on your valuable intellectual support.

Attached is the invitation with all the details for submitting papers, along with the reference templates in the international standard format.

 

Thank you in advance for your attention and participation. We remain at your disposal.

 

 

Paula Iochpe

CMG BRASIL

11 4193-5096 – 11 7366-6598

paulaiochpe@gmail.com

 


Attachments:

  • CMGBR_CallForPapers2012-EncontroTecnicoNacional.pdf (81 KB)
  • 002-templateCMG.ppt.ppt (32 KB)

Examining File System Latency in Production


Brendan Gregg, Lead Performance Engineer, Joyent

 

This paper introduces file system latency as a metric for understanding application performance. With the increased functionality and caching of file systems, the traditional approach of studying disk-based metrics can be confusing and incomplete. The different reasons for this will be explained in detail, including new behavior that has been caused by I/O throttling in cloud computing environments. Solutions for measuring file system latency are demonstrated, including the use of DTrace to create custom analysis tools. We also show different ways this metric can be presented, including the use of heat maps to visualize the full distribution of file system latency, from Joyent’s Cloud Analytics.

Attachment: Joyent_File_System_Latency.pdf (1.51 MB)

Christmas and New Year message to the CMG BR community


Dear friends and members of CMG Brasil,

 

On behalf of the Board, the Directorates and the Coordination teams,

 

I wish you all a Merry Christmas and a New Year of peace and health!

 

We did well this year, and next year we will do even better with everyone’s help!

My thanks to all the volunteers, participants and sponsors who made possible the growth of our country’s intellectual capital through the sharing of ideas and information.

Onward!


João Natalino,

President of CMG Brasil

 

 

 

In Memory Data Grid Technologies

After winning a CSC Leading Edge Forum (LEF) research grant, I (Paul Colmer) wanted to publish some of the highlights of my research to share with the wider technology community.

What is an In Memory Data Grid?

It is not an in-memory relational database, a NOSQL database or a relational database.  It is a different breed of software datastore.

In summary an IMDG is an ‘off the shelf’ software product that exhibits the following characteristics:

  • The data model is distributed across many servers in a single location or across multiple locations.  This distribution is known as a data fabric.  This distributed model is known as a ‘shared nothing’ architecture.
  • All servers can be active in each site.
  • All data is stored in the RAM of the servers.
  • Servers can be added or removed non-disruptively, to increase the amount of RAM available.
  • The data model is non-relational and is object-based. 
  • Distributed applications written on the .NET and Java application platforms are supported.
  • The data fabric is resilient, allowing non-disruptive automated detection and recovery of a single server or multiple servers.

There are also hardware appliances that exhibit all these characteristics.  I use the term in-memory data grid appliance to describe this group of products and these were excluded from my research.

There are six products in the market that I would consider for a proof of concept, or as a starting point for a product selection and evaluation: 

  • VMware Gemfire                                                (Java)
  • Oracle Coherence                                             (Java)
  • Alachisoft NCache                                             (.Net)
  • Gigaspaces XAP Elastic Caching Edition           (Java)
  • Hazelcast                                                          (Java)
  • Scaleout StateServer                                         (.Net)

And here are the rest of the products available in the market now that I consider IMDGs:

  • IBM eXtreme Scale
  • Terracotta Enterprise Suite
  • Jboss (Redhat) Infinispan

Relative newcomers to this space, and worth watching closely, are Microsoft and Tibco.

Why would I want an In Memory Data Grid? 

Let’s compare this with our old friend the traditional relational database:

  • Performance – using RAM is faster than using disk.  No need to try and predict what data will be used next.  It’s already in memory to use.
  • Data Structure – using a key/value store allows greater flexibility for the application developer.  The data model and application code are inextricably linked.  More so than a relational structure.
  • Operations – Scalability and resiliency are easy to provide and maintain.  Software / hardware upgrades can be performed non-disruptively.

How does an In Memory Data Grid map to real business benefits?

  • Competitive Advantage – businesses will make better decisions faster.
  • Safety – businesses can improve the quality of their decision-making.
  • Productivity – improved business process efficiency reduces waste and is likely to improve profitability.
  • Improved Customer Experience – provides the basis for a faster, reliable web service which is a strong differentiator in the online business sector.

How do I use an In Memory Data Grid?

  1. Simply install your servers in a single site or across multiple sites.  Each group of servers within a site is referred to as a cluster.
  2. Install the IMDG software on all the servers and choose the appropriate topology for the product.  For multi-site operations I always recommend a partitioned and replicated cache.
  3. Set up your APIs or GUI interfaces to allow replication between the various servers.
  4. Develop your data model and the business logic around the model.

With a partitioned and replicated cache, you simply partition the cache across the servers in the way that best suits the business needs you are trying to fulfil, and the replicated part ensures there are sufficient copies across all the servers.  This means that if a server dies, there is no effect on the business service, provided you have provisioned enough capacity of course.
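As a rough illustration of the partitioned and replicated idea, the sketch below assigns each key to a primary server by hashing, then places backup copies on the next servers in the list. This is an assumption made for illustration, not how any particular IMDG product places data; the function name `owners`, the server names and the key are all hypothetical.

```python
import hashlib


def owners(key, servers, copies=2):
    """Return the servers that hold `key`: one primary plus (copies - 1) backups."""
    if copies > len(servers):
        raise ValueError("not enough servers for the requested number of copies")
    # A stable hash, so every client maps the same key to the same partition.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    primary = int.from_bytes(digest[:8], "big") % len(servers)
    # Backups go on the following servers, wrapping around the list.
    return [servers[(primary + i) % len(servers)] for i in range(copies)]


servers = ["london-1", "dublin-1", "frankfurt-1"]  # hypothetical server names
placement = owners("booking:12345", servers, copies=2)
```

With two copies on distinct servers, losing any single server still leaves one copy of every key, which is the property the paragraph above relies on.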

The key here is to design a topology that mitigates all business risk, so that if a server or a site is inoperable, the service keeps running seamlessly in the background. 

There are also some tough decisions you may need to make regarding data consistency vs performance.  You can trade the performance to improve data consistency and vice versa.

Are there any proven use cases for In Memory Data Grid adoption?

Oh yes, and if you’re a competitor in these markets, you may want to rethink your solution.

Financial Services: Improve decision-making, profitability and market competitiveness through increased performance in financial stock-trading markets. Reduction in processing times from 60 minutes to 60 seconds.

Online Retailer: Providing a highly available, easily maintainable and scalable solution for 3+ million visitors per month in the online card retailer market.

Aviation: Three-site active / active / active flight booking system for a major European budget-airline carrier. Three sites are London, Dublin and Frankfurt.

Check out the VMware Gemfire and Alachisoft NCache websites for more details on these proven use cases.

About the Author:

Paul Colmer is a technology consultant working for CSC and director and active professional musician for Music4Film.net.  He specialises in Cloud Computing, Social Business and Solution Architecture. He is based in Brisbane, Australia. http://www.linkedin.com/pub/paul-colmer/6/894/539

Linux Systems Capacity Planning

As infrastructure costs rise, there's an urgent need to squeeze more performance from the same hardware. After several years of measuring and managing the capacity of thousands of Linux servers, we have learned that most typical tools and metrics are not sufficient to predict performance bottlenecks, particularly during traffic spikes. By using queueing theory formulas and instrumenting our applications we were able to find the limits of our systems, improve reliability, and maximize throughput and performance.
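The abstract does not say which formulas were used. As one textbook illustration only, the M/M/1 queue gives a mean response time of R = S / (1 - ρ), where S is the mean service time and ρ is the utilization. The sketch below assumes that model; the function name and the 10 ms service time are made up for the example.

```python
def mm1_response_time(service_time, utilization):
    """Mean response time of an M/M/1 queue: R = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


# Assumed 10 ms service time at increasing utilization levels.
for rho in (0.5, 0.9, 0.99):
    print(f"rho={rho:.2f}  R={mm1_response_time(0.010, rho) * 1000:.0f} ms")
```

Under this model, response time grows sharply as utilization approaches 100%, which is one reason a traffic spike can hurt far more than a utilization graph suggests.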

 

Visualizing Device Utilization


Device utilization is a key metric for performance analysis and capacity planning. In this post, I’ll illustrate different ways to visualize device utilization across multiple devices, and how that utilization is changing over time.

As a system to study, I’ll examine a production cloud environment that contains over 5,000 virtual CPUs (over 600 physical processors). I’ll show how well different visualizations work for an environment of this scale, including:

  1. Command Line Interface Tools
  2. Tabulated Data
  3. Highlighted Data
  4. 3D Surface Plot
  5. Animated Data
  6. Instantaneous Values
  7. Bar Graphs
  8. Vector Graphs
  9. Line Graphs
  10. Ternary Plots
  11. Quantized Heat Maps

Definition

Device utilization can be defined as the time a device was busy processing work during an interval, so a device at 100% utilization is active doing work 100% of the time. Such a device may not accept more work, becoming a system bottleneck. Or it may accept more work but do so with higher latency, causing poor performance.
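The definition above amounts to a single division. A minimal sketch, with a hypothetical function name:

```python
def utilization_percent(busy_seconds, interval_seconds):
    """Percent of the interval during which the device was busy."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return 100.0 * busy_seconds / interval_seconds


# A device that was busy for 0.25 s of a 1 s interval is 25% utilized.
quarter_busy = utilization_percent(0.25, 1.0)
```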

While utilization may be an imperfect metric, depending on how it is defined and measured (I listed some reasons on page 75 of Solaris Performance and Tools), it’s still tremendously useful for quickly identifying (or eliminating) devices as sources of performance issues.

Problem Statement

For any given device type (CPUs, disks, network interfaces), and any number of devices (from a single device to a cloud of servers), we’d like to identify the following:

  • single or multiple devices at 100% utilization
  • average, minimum and maximum device utilization
  • device utilization balance (tight or loose distribution)
  • time-based characteristics

By including the time domain, we can identify whether utilization is steady or changing, and various finer details. These may include short bursts of high utilization, where it is useful to know the length of the bursts and the interval between them. Longer patterns over time may also be observed, such as how load changes during a 24 hour or weekly cycle. Time-based patterns may also be compared to other metrics, and correlations observed that illuminate complex system interactions. This may all be possible by studying how utilization varies across a time-series (the examples below use 60 x 1 second values).

Finally, we’d like to observe this all in realtime.
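For a single interval, the non-time-based items in the list above can be computed directly. The sketch below is one possible reading, using the population standard deviation as a stand-in for "balance"; the function name and sample values are made up.

```python
from statistics import mean, pstdev


def summarize(utilizations):
    """Summarize one interval of per-device utilization percentages."""
    return {
        "hot": sum(1 for u in utilizations if u >= 100),  # devices at 100%
        "avg": mean(utilizations),
        "min": min(utilizations),
        "max": max(utilizations),
        "spread": pstdev(utilizations),  # small = tight, large = loose
    }


sample = [100, 3, 0, 12, 7, 0, 1, 5]  # made-up per-CPU values for one second
summary = summarize(sample)
```

A summary like this discards the time domain, which is what the visualizations in the rest of this post try to recover.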

1. Command Line Interface

Device utilization is usually available via command line tools. These may show per-device utilization numerically, with interval summaries printed in realtime. If the output scrolls, changes over time can be identified by reading and comparing previous summaries.

These tools usually don’t handle scale. I’ll illustrate this for CPU utilization.

One Server, 1 second

CPU utilization on Unix/Linux systems can be examined with the mpstat(1) tool. It prints a single line of output for each virtual CPU, with various useful metrics (the columns vary on different systems):

This one second summary shows 16 virtual CPUs. CPU 1 is at 100% utilization (calculated by inverting the last column, %idle), which can be evidence of a software scalability issue (single hot thread).
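Inverting the idle column is a one-line calculation. The sketch below assumes a per-CPU row whose first field is the CPU ID and whose last field is the integer idle percentage; the row shown is a made-up, heavily abbreviated stand-in, since real mpstat(1) output has more columns and they vary between systems.

```python
def cpu_utilization(mpstat_row):
    """Return (cpu_id, utilization %) from one per-CPU row ending in idle %."""
    fields = mpstat_row.split()
    cpu_id = int(fields[0])
    idle = int(fields[-1])
    return cpu_id, 100 - idle


row = "1 97 3 0 0"  # hypothetical row: CPU id, usr, sys, wt, idl
cpu_id, busy = cpu_utilization(row)
```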

One Server, 60 seconds

We’d also like to see how this changes over time. This is 60 x 1 second summaries from mpstat(1):

For a sense of scale, I’ve highlighted the 1 second summary shown earlier.

This amount of output is already difficult to digest. On a terminal, this would be many pages to scroll through. And this is just one server.

Data Center, 60 seconds

This data center has over 300 servers. Showing 60 seconds across all of them:

This gives an impression of the amount of data involved, in terms of mpstat(1) output. The output is so small that the whitespace between rows and columns creates an effect that appears like fabric.

The rectangle represents the amount of data that a single server contributes. Each server’s data in this image is actually placed in a horizontal line. One such line is darker than the others, in the middle top. The darkness was caused by high multi-digit values in many of the mpstat(1) columns, replacing whitespace with numbers (this prompted me to investigate further on the server; the issue turned out to be a misconfigured sendmail calling 3000 exec()s per second, which caused high values for the minf, xcal, migr, smtx and syscl columns).

2. Tabulated Data

I’ll now visualize just the per-CPU utilization values, per-second, as a table of values. This strips the mpstat(1) output down to just the (inverted) %idl column, which provides a better sense of the volume of the utilization data we are trying to understand.

One Server, 60 seconds

16 CPUs and 60 x 1 second summaries of utilization only:

Unlike the mpstat(1) server summary, this time the font size is (almost) large enough to read. In a few places a single CPU hits 100% utilization, visible as an unbroken line of digits (100100100).

Data Center, 60 seconds

Over 300 servers:

This image is more interesting than I would have guessed. Faint darker patterns are caused by areas with double- and triple-digit (100%) utilizations, where the digits themselves give a darkness effect. This differs from the mpstat(1) dark patterns, as these highlight utilization only.

3. Highlighted Data

Utilization values could be highlighted deliberately by coloring the background relative to the value.

One Server, 60 seconds

Distinct patterns now emerge. While there are bursts of CPU load across many CPUs, only CPU 0 seems to be busy the entire time (perhaps mapped as a device interrupt CPU).

This visualization can be thought of as having three dimensions, as pictured on the right, with the third the utilization value represented as color saturation. (I’m using the HSV definition of saturation.)
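One way to turn a utilization value into a background color by saturation is shown below, using Python's standard `colorsys` module. The function name is hypothetical, and the choice of full brightness (value = 1.0) with a red hue is an assumption that matches the description here rather than the exact code used for these images.

```python
import colorsys


def utilization_to_rgb(percent, hue=0.0):
    """Map 0..100% utilization to RGB: white at 0%, fully saturated hue at 100%."""
    saturation = max(0.0, min(100.0, percent)) / 100.0
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    return tuple(round(c * 255) for c in (r, g, b))


idle_color = utilization_to_rgb(0)    # white
hot_color = utilization_to_rgb(100)   # pure red for hue 0.0
```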

Data Center, 60 seconds

Over 300 servers:

This time it’s not necessary to highlight a single server – some busier servers are clearly visible as red rectangles. Other observations:

  • Servers with a single hot thread appear in one of two ways, depending on how well the thread stays on one CPU (affinity) or skips around.
  • Some servers have multiple hot CPUs, but also idle CPUs, which may be a sign that load is not balanced (either due to thread scalability or CPU resource caps).
  • Most servers show consistent CPU load over time, with only a few showing high variance (one of these is the same server that was shown in the previous One Server example).
  • Idle servers can clearly be seen, which often contain one or two short bursts of single CPU usage (monitoring software). The entire image is speckled with these short bursts.

Also apparent is that most servers in this data center are idle at this time of day (off peak).

Limitations

The server images answer some of the trickier questions from the problem statement: they can identify single and multiple hot devices, unbalanced utilization, and, as a bonus, utilization that shifts between devices. They aren’t good at expressing exact utilization values, which rely on how well our eyes can differentiate color. Average utilization across devices is also hard to determine.

The data center image provides a great impression across all 332 servers. Were this to be used as a tool, the numbers could be dropped (the highlighting is sufficient) and it could be made interactive: mouse over servers for expanded details. However, scaling this much further will be difficult. This example has little more than 1 pixel per data element. If each server had 64 virtual CPUs instead of 16, the number of elements would increase by a factor of 4.
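The scale involved can be checked with a few multiplications, using the server count, CPUs per server and 60 one-second samples mentioned in this post:

```python
servers = 332
cpus_per_server = 16
seconds = 60

cpu_count = servers * cpus_per_server   # virtual CPUs in the data center
elements = cpu_count * seconds          # utilization values in the image

# The same data center with 64 virtual CPUs per server.
elements_64 = servers * 64 * seconds

print(cpu_count, elements, elements_64 // elements)
```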

While these images look similar to heat maps (covered later), they aren’t the same. One reason is that heat maps (usually) have scalar dimensions. Within the server images, the x-axis is scalar (time), but the y-axis is the set of CPU IDs – which may have no relative meaning (operating systems can enumerate their virtual CPUs in odd ways). On the data center image, the x- and y-axes span repeated ranges. Servers that are nearby on the image are also nearby physically, just due to the way the data was collected, but this is in no way a reliable scalar dimension.

One Server, 1 hour

This is the same server example from before, but for a full hour:

Each horizontal strip represents 8 minutes. I’ve included this just to show what could happen when scaling time. What’s interesting about this image isn’t the CPU utilization, but the lack of CPU utilization – idle time – shown as white patches.

4. 3D Surface Plot

Three-dimensional plots can be created from the dimensions: CPU ID, time and utilization. Given that two of the dimensions are provided in the data set as regular steps, CPU ID and time, a surface plot may be suitable as these map to regular latitude and longitude points. The utilization value becomes surface elevation.

One problem may already be expected, as shown on the right. The utilization value can change steeply from one point to the next, making the surface difficult to follow. This can be improved by reducing the elevation of the utilization dimension.

One Server, 60 seconds

16 CPUs, 60 x 1 second averages:

Time is the x-axis from left to right, CPU ID is the y-axis, and the z-axis (elevation) is the utilization value. This has also been colorized based on the utilization value, so, utilization is represented here by both elevation and color saturation.

An issue when scaling this plot type is that the grid lines (polygon edges) for this wireframe visualization can become too dense, resulting in a black surface. They can be removed:

Data Center, 60 seconds

Over 300 servers:

This is similar to the visualization from the Highlighted Data section (the server ordering is aligned differently).

Zooming in:

Now returning the grid lines:

This line width creates an extra effect highlighting subtle changes in elevation. If the lines are too fine, the visualization approaches the previous version; if the lines are too thick, it appears black.

5. Animated Data

Over 300 servers, over time:

The utilization data for each second is shown by one frame, which consists of just the 5,312 CPUs as highlighted pixels (similar to before, but the digits were dropped). This animation has been sped up to 10x normal time, and shows 20 seconds in a loop (the full version covers 60 seconds).

The advantages are similar to those in the Highlighted Data example above. Some additional disadvantages are that it cannot be included in printed text, and that identifying time-based patterns relies on memory and patience. Memory, to identify differences between a sequence of frames, and patience, to consume information at the rate of the animation. This may become irritating – if one frame out of a six-second loop is interesting, it’s difficult to study if it’s only visible once every loop. Additional controls could be added to slow or pause the animation.

6. Instantaneous Values

A highlighted table of current utilization values is a simple visualization that answers some questions, without the density of including historic data. Here is current utilization across 5,312 CPUs:

This is actually a single frame in the previous animation. Here, the utilization digits have returned. Servers appear as vertical columns, and there are two rows of servers.

Microsoft Windows 8 will include this type of visualization in the Task Manager, to show instantaneous utilization values on systems with more than 64 virtual CPUs (logical processors). The screenshot on the right is from an MSDN blog post that explains the move. This shows 113 CPUs, and has a scrollbar to reveal more.

I switched the screenshot from blue to red to fit better with the other visuals. I also think red better suggests hot CPUs.

While this doesn’t include historic data, it’s worth including for consideration. Now that it will be in Windows, this type of visualization for device utilization may become widely understood. Also note that Microsoft calls these “heat maps”. I’ll show a different type of heat map in the Quantized Heat Maps section.

7. Bar Graphs

Bar graphs can be used to show a single utilization value, which scales the length of the bar. Mac OS X has Activity Monitor, which can provide a “Floating CPU Window” bar graph that can be placed anywhere on the screen. An example is pictured here, from a laptop with two CPU cores, with a bar graph for each. (The original is green.)

The advantage is that utilization can be understood at a glance, instead of reading utilization digits or examining color. This visualization could also be enhanced by placing watermarks at the recent minimum and maximum values.

Using a bar graph for each device will become difficult when scaling to 5,312 CPUs, at which scale a bar graph may be better suited for just the average utilization across all devices.

8. Vector Graphs

Angle can be used as a visualization device. In the example below, a gauge similar to a car tachometer shows server CPU utilization (average for all CPUs). This is from a commercial monitoring product intended for cloud and other environments.

As with bar graphs, this only shows a single utilization value. Other values may be indicated around the edge: this example has a green to red spectrum. I’m not sure what this coloring means in the context of CPU utilization. At the very least, it shows which end of the range is which. It could also just be decorative, which along with the whitespace around the round shape makes this visualization type one of the least dense, which may be a problem for scaling.

To see how this could scale to a cloud environment, I’ve created a couple of montages from this example image. Here are 332 servers and 5,312 CPUs, both 600 pixels wide. Both images include a single CPU at 60% utilization and another at 100% (can you find them?).

9. Line Graphs

Showing time on the x-axis allows the passing of time to be visualized intuitively from left to right. The level of utilization, shown on the y-axis, can be understood at a glance, and can be compared quickly and accurately from one second to the next. Such a comparison is difficult with color brightness, and requires reading when digits are used.

One Server, 60 seconds

16 CPUs, with each drawn as a separate line:

This is the same server as previously visualized separately. Single CPUs hitting 100% can be clearly seen, although the lines from multiple CPUs remaining at 100% overlap.

For comparison, here is a busier server (also visualized previously):

CPU utilization is loosely grouped around 50%, with a CPU hitting 100% every eight seconds. Also noticeable is that there is no longer a flat bottom edge of idle CPUs, showing that usually all CPUs are doing work.

A server that is mostly idle (also visualized previously) looks like this:

Activity with two periods can be seen: a large spike in single-CPU utilization every 30 seconds, and a smaller burst every eight seconds.

Data Center, 60 seconds

Scaling this to 5,312 CPUs:

This visualization hasn’t really worked (using the full range of random colors doesn’t help either). The pattern of horizontal lines at about 20% utilization is only visible because those CPUs were drawn last, on top of the previous lines (you can see them at the bottom of data center visualization in the Highlighted Data section). If the CPU ordering is shuffled when drawn, those lines vanish.

Average, One Server

Taking the first server from before, and showing just the average utilization across all CPUs:

This works well. The average – and how it changes over time – is often used for capacity planning.

Averages hide the presence of devices at 100% utilization. For this 16 CPU system, each CPU only contributes 1/16th to the average. Even if a CPU were to change from 0% to 100% utilized, the line only moves by 6.25%. This is much smaller and harder to see across multiple systems.
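The arithmetic behind this is simple: one device changing by some amount moves the average by that amount divided by the device count. A small sketch, with a hypothetical function name:

```python
def average_shift(device_count, change_percent=100.0):
    """How far the average moves when one device changes by change_percent."""
    return change_percent / device_count


one_server = average_shift(16)       # one 16-CPU server
data_center = average_shift(5312)    # all 5,312 CPUs averaged together

print(f"{one_server:.2f}% vs {data_center:.4f}%")
```

Averaged across the whole data center, a single CPU going from idle to fully busy moves the line by less than 0.02%, which no line graph will show.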

Adding a maximum line can show the hottest device:

While this works, in practice we’d often like to know how many devices were hot. Disks under the ZFS filesystem, for example, will often hit 100% for periods of seconds during transaction group flushes – which is perfectly normal. One or two disks at 100% utilization, however, is not normal (and can be from a particularly difficult pathology with unreported drive failure that I’ve seen many times). So, I’d like to know more than just knowing something hit 100% — how many? And were other devices close to 100%, or not?
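The gap a maximum line leaves can be shown with two made-up disk snapshots that share the same maximum but tell different stories. The function name, the 90% "near" threshold and the sample values are assumptions for illustration.

```python
def hot_device_counts(utilizations, hot=100, near=90):
    """Return (devices at or above `hot`, devices from `near` up to `hot`)."""
    at_hot = sum(1 for u in utilizations if u >= hot)
    near_hot = sum(1 for u in utilizations if near <= u < hot)
    return at_hot, near_hot


flush_burst = [100, 100, 100, 98, 100, 97]  # made-up: every disk busy
single_disk = [100, 4, 6, 3, 5, 4]          # made-up: one disk pinned

same_max = max(flush_burst) == max(single_disk)
burst_counts = hot_device_counts(flush_burst)
single_counts = hot_device_counts(single_disk)
```

Both snapshots have a maximum of 100%, but the counts separate a pool-wide burst from a lone pinned disk.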

10. Ternary Plots

The barycentric coordinate system can be used to create a ternary plot (“Barry3”) showing three components of CPU utilization: user, system and idle. User-time is time spent in application code, and system-time is time spent in the kernel: processing system calls, interrupt routines and asynchronous kernel threads. These breakdowns of CPU utilization are commonly used by Unix and Linux operating systems, and are useful for a better understanding of the CPU workload. The mpstat(1) tool prints them out by default (%usr, %sys, %idl).
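A ternary plot places each point as a weighted average of three triangle corners. The sketch below assumes an equilateral triangle with idle at the bottom-left corner, user at the bottom-right and system at the top; which corner Barry3 assigns to each component is not stated here, so treat that layout and the function name as assumptions.

```python
import math


def ternary_xy(usr, sys_, idl):
    """Convert (usr, sys, idl) to x, y inside a unit-width equilateral triangle.

    Assumed corners: idle (0, 0), user (1, 0), system (0.5, sqrt(3)/2).
    """
    total = usr + sys_ + idl
    if total <= 0:
        raise ValueError("components must sum to a positive value")
    u, s = usr / total, sys_ / total   # normalize so the weights sum to 1
    x = u + 0.5 * s
    y = (math.sqrt(3) / 2) * s
    return x, y


all_idle = ternary_xy(0, 0, 100)
all_user = ternary_xy(100, 0, 0)
all_sys = ternary_xy(0, 100, 0)
```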

Data Center, 15 seconds

The Barry3 plot on the right includes 5,312 CPUs, and is also animated with a frame for each second. Just the first 15 seconds have been included here to keep the GIF small.

The advantage of this visualization is that any one of the three dimensions can be picked, and then all points considered based on that dimension (the plot can be rotated to aid this). Also, all three dimensions can be read directly from each point. Another potential advantage is that patterns between user and system CPU time could be identified (they could also be identified from an x-y plot of %usr and %sys). Other device types that can break down utilization into three components could also be visualized using the Barry3.

This was created using the cpuplayer tool, and an awk program to reprocess the previously collected mpstat(1) data. I modified cpuplayer to handle the high CPU count, and made some cosmetic changes to simplify the look (original here, which includes grid-lines to aid reading each dimension).

Limitations

One issue with the Barry3 visualization is that CPUs can overlap, especially at the corners, making it difficult to know how many CPUs were in that state. This is made worse by the data set I’m using, which has integer values for utilization (due to a limitation of mpstat(1), not the underlying operating system statistics, which in this case are high resolution CPU microstates). There are also issues with making this an animation (memory and patience), as was discussed previously.

The triangle shape leaves much room unused in the top corners, which may become more noticeable if multiple ternary plots were drawn (instead of the animation). Note that a rectangular x-y plot of %usr and %sys would have a similar degree of unused space for the triangular area where %usr + %sys was greater than 100.

11. Quantized Heat Maps

Finally, the device utilization heat map uses a column quantization to visualize utilization in three dimensions: time (x-axis), percent utilization (y-axis), and number of CPUs (z-axis, color saturation) within the time/utilization range. This is perhaps the most useful visualization that I’ve created to date. Bryan Cantrill first developed it into Sun Storage Analytics while we worked on the advanced products team at Sun Microsystems. I mentioned this towards the end of my ACMQ article on Visualizing Latency under the heading “Other Applications”, which summarizes the concept:

Utilization of components can also be visualized as a heat map showing the percent utilization of individual components, instead of displaying an average percent utilization across all components. Utilization can be shown on the y-axis, and the number of components at that utilization can be shown by the color of the heat-map pixel. This is particularly useful for examining disk and CPU utilization, to check how load is balanced across these components. A tight grouping of darker colors shows that load is balanced evenly, and a cloud of lighter pixels shows that it isn’t.

Outliers are also interesting: a single CPU at 100 percent utilization may be shown as a light line at the top of the heat map and is typically the result of a software scalability issue (single thread of execution). A single disk at 100 percent utilization is also interesting and can be the result of a disk failure. This cannot be identified using averages or maximums alone: a maximum cannot differentiate between a single disk at 100 percent utilization and multiple disks at 100 percent utilization, which can happen during a normal burst of load.

We’ve developed this idea further with Joyent Cloud Analytics, where it is used to analyze the performance of devices across multiple systems in the cloud. Such utilization heat maps may become a standard tool for visualizing device utilization, especially in light of CPU scaling and cloud computing environments.

Data Center, 60 seconds

Over 300 servers (5,312 CPUs):

Recapping: the x-axis is time, the y-axis is CPU utilization percent, and the z-axis (saturation) shows how many CPUs were at that time and utilization level. This is shown on the right, and is different than all previous visualizations (color no longer represents utilization; here it is used for the CPU count).

Each rectangle that makes up the heat map is a “bucket” spanning a time and utilization range, and is colored based on the CPU count (darker means more).
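The bucketing step can be sketched as a count per (time, utilization range) pair. This is a minimal illustration with equal-width buckets and one-second columns, which are assumptions; the function name and sample values are made up.

```python
from collections import Counter


def quantize(samples, buckets=10):
    """Count CPUs per (second, utilization bucket).

    samples: iterable of (second, utilization_percent), one entry per CPU per second.
    """
    counts = Counter()
    width = 100 / buckets
    for second, util in samples:
        # 100% would land one past the end, so clamp it into the top bucket.
        bucket = min(int(util / width), buckets - 1)
        counts[(second, bucket)] += 1
    return counts


samples = [(0, 0), (0, 3), (0, 100), (1, 55)]  # made-up values
heat = quantize(samples, buckets=10)
```

Each key of `heat` is one rectangle of the heat map, and its count is what the color represents.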

The darker colors at the bottom of this heat map show a constant concentration of idle CPUs. The red line at the top shows a constant presence of CPUs at 100% utilization. The dark color of the 100% line shows that multiple CPUs were at 100%, not just one. (I’ll explain the exact color algorithm in the Saturation section).

Apart from identifying multiple CPUs at 100%, this shows generally that CPUs are idle, near the bottom of the plot. A subtle band can also be seen around 50%.

One Server, 60 seconds

Seeing how this looks for a single server (the same as has been selected previously), with 16 CPUs:

CPUs hitting 100% can be seen at the top of the plot. Periods when CPUs are and are not idle are also clearly visible at the bottom.

The number of quantize buckets on the y-axis is so high for only 16 CPUs that this appears as a scatter plot. Reducing the number of buckets to ten (as well as reducing the height):

Now that each quantize range is more likely to span multiple CPUs, more shades are chosen. This can help create patterns.

Other server examples follow. As these include much whitespace, a simple border has been added.

Idle Server

This is the same idle server shown as a line graph earlier:

 

Light Load

A server with light load, and a single CPU sometimes hitting 100%:

 

Busy Server

This busier server has a tighter distribution of CPU utilization, grouped around 50%:

Saturation

The color saturation of each bucket reflects the relative number of CPUs that were quantized in that time/utilization range. The more CPUs, the more saturated (darker) the color.

The actual algorithm used here is non-linear, which helps identify subtle patterns. A linear algorithm could be used that makes the bucket with the highest CPU count the darkest shade available, and the bucket with the lowest CPU count (probably zero) the lightest, and all buckets in-between scaled linearly by those. In practice, this can wash out details as some buckets (in the case of CPU utilization, the idle buckets representing 0% utilization) would have such a high CPU count that the others only use much lighter shades – and appear washed out. An example of linear application of saturation based on value is on the right.

For the heat maps above, the buckets are first sorted from least to most CPUs, and then the full spectrum of shades is applied to the sorted list. This ensures that the full spectrum of shades is used, making best use of that dimension, and allowing subtle patterns to be seen that would otherwise be washed out. This approach was devised by Bryan Cantrill for the heat maps used by the Sun Storage products. He named it rank-based coloring.
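To make the difference concrete, here is a small Python sketch of both approaches. It is my own illustration, not the Sun Storage or Cloud Analytics code. The 256 shade levels, the function names, and the choice to give buckets with equal counts the same shade are all assumptions. Shade 0 is the lightest and 255 the darkest.

```python
def linear_shades(counts, levels=256):
    """Scale each count linearly between the smallest and largest count."""
    lo, hi = min(counts), max(counts)
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return [(c - lo) * (levels - 1) // span for c in counts]

def rank_shades(counts, levels=256):
    """Shade each count by its position in the sorted list of distinct counts."""
    distinct = sorted(set(counts))
    top = max(len(distinct) - 1, 1)
    rank = {c: i for i, c in enumerate(distinct)}
    return [rank[c] * (levels - 1) // top for c in counts]

# One bucket (the idle CPUs) dwarfs all the others.
counts = [0, 1, 2, 3, 5000]
linear = linear_shades(counts)
ranked = rank_shades(counts)
```

With these counts, `linear` is `[0, 0, 0, 0, 255]`: the buckets holding 1, 2 and 3 CPUs are indistinguishable from empty ones. `ranked` is `[0, 63, 127, 191, 255]`, so every distinct count gets its own shade.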

Hue and Value

The hue used above is red, merely to stay consistent with other images here. When used in the Highlighting Data section, more red meant hotter CPUs, which seems intuitive. Here, more red means a concentration of CPUs, even when the line we’re looking at represents idle CPUs. This probably isn’t a good choice of color, and can be easily changed (Sun Microsystems Analytics chose blue; Joyent Cloud Analytics chose orange).

The heat map could be adjusted to retain the intuitive nature of “red means hot” (100%). With the first example on the right, I’ve allowed only the top utilization range to be red, with the lower utilization ranges in grayscale. Beneath that is a different example, where the value of the red hue is scaled based on utilization (this may be referred to as saturation, depending on the color model used).

Another use of hue can be to reflect a fourth dimension. In Joyent Cloud Analytics, the make-up of the heat map can be investigated by highlighting components in different hues, which is collected as a fourth dimension to the data. For the data center heat map above, individual servers could be highlighted with their own hues. David Pacheco wrote Heatmap Coloring to explain this, which also provides examples of rank-based vs linear coloring.

Background

I thought of using heat maps for device utilization after being burned by performance issues during the development of the Sun Storage appliance, including:

  • sloth disks: these are disks that mysteriously begin returning very, very slow I/O, over 1 second, yet do not return error counts (hard or soft). Their percent utilization as reported in the operating system (which is a percent busy) would stay at 100% for seconds at a time, while other disks (in the same RAID stripe) were idle. Sloth disks would kill performance, and I needed a way that I, the field engineers, and customers could all easily identify them. A constraint was that this couldn’t just look for the max utilization: the ZFS file system often drove all disks to 100% utilization during transaction group flushes. This had to identify the presence of one or two such disks only.
  • hot threads: this is usually where the software has not been designed to scale across all available CPUs, and some CPUs are idle while others are at 100% utilization. It could be as simple as a codepath that should be multi-threaded but isn’t. One particular issue I ran into was the ZFS pipeline, where originally stages could only be processed by up to eight threads (CPUs). A hot stage (compression) could limit ZFS performance as only eight CPUs could be used (this was since fixed).

For this type of issue, the workload can become bounded by the performance of the few busy devices, while the majority of the devices are idle. I’ve seen this type of problem across all device types (CPUs, disks, network interfaces, storage controllers, etc). The device utilization heat map quickly proved an excellent way to identify this type of issue, as well as show many other useful characteristics.

Summary

These visualizations have been created to illustrate different ways to observe device utilization on large scale environments.

I frequently need to analyze urgent performance issues on these environments using a variety of tools, and with varying degrees of success. Sometimes a customer has been unable to resolve a crippling issue because their visualization hides important details (the most common problem is a line graph showing average device utilization, making it impossible to identify that single or multiple devices are at 100%). I’ve condensed years of such pain and frustration into the problem statement at the start of this post, and then presented various visualizations that could satisfy those needs.

I’d suggest using:

  • Quantized Heat Map: to identify single or multiple devices at 100% utilization, minimum and maximum device utilization, and device utilization balance, all over time (performance analysis)
  • Line Graph: to observe average utilization across multiple devices over time (capacity planning)

The visualizations should be realtime, so that any change to the environment can be analyzed immediately and repaired sooner. Dave Pacheco and I showed how Joyent Cloud Analytics did this in the OSCON 2011 presentation Design and Implementation of a Real-Time Cloud Analytics Platform.

These visualizations can be interactive. For example, the user could click on the 100% devices in the quantized heat map, and be shown information to explain them further: how many devices there were and on which servers.

I’d also consider including both of the above visualizations plus text, for times when it’s important to verbalize the state of performance quickly (the emergency concall). Text could include average utilization across all devices for different time intervals (previous minute, hour, day, week), and maximum utilization. The 95th or 99th percentile could also be included, to convey details about the upper distribution.
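A rough sketch of how such a text summary could be computed, in Python. The nearest-rank percentile definition, the function names, and the sample data are assumptions for illustration; a real tool may define percentiles differently.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]

def summarize(utils):
    """Return the figures you might read out on a call."""
    return {
        "avg": sum(utils) / len(utils),
        "max": max(utils),
        "p95": percentile(utils, 95),
        "p99": percentile(utils, 99),
    }

# 100 devices: 95 nearly idle, 4 moderately busy, 1 pegged at 100%.
utils = [5.0] * 95 + [60.0] * 4 + [100.0]
summary = summarize(utils)
```

For this made-up data the average is about 8%, the 95th percentile is 5%, the 99th is 60%, and the maximum is 100%. The maximum reveals that something is pegged, but not whether it is one device or many, which is what the heat map adds.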

I’d love to see quantized heat maps show up in more places where currently bar graphs or line graphs are used.

Acknowledgments

Many tools were used to create the images in this post; by type:

  • Command Line Interface Tools: mpstat(1) on Solaris was originally by Jeff Bonwick. I visualized the data using some shell scripting, awk(1), and Firefox with the Screengrab plugin.
  • Tabulated Data: same tools as above.
  • Highlighted Data: same tools as above.
  • 3D Surface Plot: these were made using R with the lattice package, inspired by Dominic Kay’s Visualizing Performance work.
  • Animated Data: the same tools as before, with ImageMagick to assemble the animation.
  • Instantaneous Values: includes a screenshot from Microsoft Windows 8.
  • Bar Graphs: includes Mac OS X’s Activity Monitor.
  • Line Graphs: were created by gnuplot (after trying other tools that couldn’t handle 5,312 lines).
  • Ternary Plots: The Barry3 CPU visualization type was created by Dr Neil J. Gunther, and cpuplayer was written by Stefan Parvu.
  • Quantized Heat Maps: this type for device utilization was created by myself at Sun Microsystems, while working with Bryan Cantrill who developed it in the Sun Storage ZFS appliance. It has been developed further by Dave Pacheco, Robert Mustacchi, and others on the Cloud Analytics team at Joyent.

Gimp was used for post processing images.

Style is inspired by Edward Tufte, including clearing “chart junk” from the line graphs and the use of high definition graphics in text (the micro heat maps). From Tufte, I’d recommend reading “Visual Explanations”, “Beautiful Evidence”, “Envisioning Information” and “The Visual Display of Quantitative Information”; and from William S. Cleveland, “The Elements of Graphing Data” and “Visualizing Data” (from which I’m tempted to reassemble this blog post to categorize visualizations into univariate, bivariate and multivariate types).

Thanks to Deirdré Straughan for editing another one of my long blog posts, and for suggestions to improve the content; and to all the people (particularly Jason Hoffman) who have been referring me to books, articles and links to read during the past year.

Posted on December 18, 2011 at 1:47 pm by Brendan Gregg · Permalink


AWS Getting Started Guides for Linux and Microsoft Windows

We've created three new documents to make it even easier for you to get started with AWS:

The first two documents (Getting Started Guide: AWS Web Application Hosting for Linux and Getting Started Guide: AWS Web Application Hosting for Microsoft Windows) are designed to help you create scalable, robust web applications that handle sophisticated demands and workloads using AWS. Each provides an example architecture diagram of a web application hosted on AWS and a step-by-step walkthrough of how to deploy your web application using AWS services and follow best practices.

The guides walk you through each step of the process. You'll sign up for the services and install the command-line tools. Then you will create an Elastic Load Balancer, EC2 Security Group, and a Key Pair. Next, you will use Auto Scaling to launch a load-balanced array of Amazon EC2 instances and set up a CloudWatch alarm to drive the Auto Scaling process. You will add database capabilities by launching an Amazon RDS DB Instance along with the associated DB Security Group. With the infrastructure in place, you will install and launch your web application. Finally, you will use the CloudFormer tool to capture your setup as a reusable CloudFormation template. The guides also cover the use of Route 53 for DNS hosting and CloudFront for content distribution.

We also have a brand new Microsoft Windows Guide. This guide contains conceptual information about Amazon EC2, as well as information about how to use the service to create new web applications on Windows instances. Separate sections describe how to program with the command line interface (CLI) and the Query API.

-- Jeff;

 

17 Techniques Used to Scale Turntable.fm and Labmeeting to Millions of Users

In How to launch in a month and scale to a million users, Joseph Perla, former VP of Technology and member of the founding team of Turntable.fm, shares techniques he used to build and quickly scale his startups. The post is very well written and a must read. Here are the essentials:

  1. Keep it simple. Build APIs before making the website or mobile apps. Keep interfaces small and single-purpose.
  2. Get it right. Build in automated tests from the start. Create function tests, module level tests, and full integration tests. Run tests on every commit. No new code written while bugs exist.
  3. Don't hide power. Use Pebbles, a library that creates complicated AJAX interactions from a few extra HTML tags, so you write no JavaScript at all and avoid JavaScript bugs.
  4. Use procedure arguments to provide flexibility in an interface. Pass functions instead of parameters to support complicated scenarios. For example, pass a filter function that returns a boolean.
  5. Leave it to the client. Keep the server simple and move as much functionality as possible to the client.
  6. Continuity. Keep interfaces stable. Version interfaces from the start.
  7. Keep secrets of the implementation. Keep service implementations entirely independent to provide maximum flexibility to handle requirement changes, even though it means a slight performance decrease.
  8. Use a good idea again instead of generalizing it. It's OK to replicate and specialize similar code instead of creating a more generalized library. 
  9. Handle normal and worst cases separately as a rule. Code should clearly handle special cases rather than use a more general algorithm that would remove the special cases.
  10. Split resources in a fixed way if in doubt. Servers should be single purposed. For example, keep the database index and search index on separate machines. They can then be scaled independently and won't stomp on each other. 
  11. Use static analysis if you can. On check-in, run static analysis tools on code to find bugs and performance issues.
  12. Dynamic translation from a convenient representation to one that can be quickly interpreted. For example, a Python domain-specific language for tweet filtering was easy to program and could be directly translated to Python bytecode.
  13. Cache answers to expensive computations. Self explanatory, but be careful of cache invalidation issues.
  14. When in doubt, use brute force. It's better to complete a feature faster using a simple algorithm than it is to delay implementing a clever algorithm. 
  15. Compute in background when possible. Do as little work as possible in the web server; queue the rest to background processes.
  16. Use batch processing if possible. Loading individual data items is slow; load them in large batches.
  17. Shed load to control demand. It's OK to have limits. Pick limits that make your software work without having to go through heroic efforts or change stacks.
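
Three of these techniques (items 4, 13 and 16) are easy to show in a few lines of Python. This is only an illustrative sketch and is not taken from Joseph Perla's post or from the Turntable.fm code; every function name here is hypothetical.

```python
from functools import lru_cache

def select(items, keep):
    """Item 4: the caller passes a predicate instead of a fixed set of options."""
    return [item for item in items if keep(item)]

@lru_cache(maxsize=1024)
def expensive_score(n):
    """Item 13: cache answers. The sum stands in for an expensive computation."""
    return sum(i * i for i in range(n))

def load_in_batches(ids, fetch_many, batch_size=100):
    """Item 16: fetch records in batches rather than one at a time.

    fetch_many is a caller-supplied function that takes a list of ids
    and returns a list of records (a hypothetical interface).
    """
    results = []
    for start in range(0, len(ids), batch_size):
        results.extend(fetch_many(ids[start:start + batch_size]))
    return results

evens = select([1, 2, 3, 4], lambda x: x % 2 == 0)

calls = []
def fake_fetch(batch):
    calls.append(len(batch))  # record the size of each round trip
    return [i * 2 for i in batch]

records = load_in_batches(list(range(250)), fake_fetch)
```

Here 250 ids are loaded in three round trips (100, 100 and 50) instead of 250. As item 13 warns, a cache like `lru_cache` never invalidates on its own, so it only suits results that do not change.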

5 ways to protect against vendor lock-in in the cloud

Two weeks ago, Google announced a significant price increase for use of its App Engine Platform-as-a-Service. The increase itself was not a huge surprise. Google had been making noises that something like this was in the offing for a number of months. But the size of the increase shocked the Web development and cloud applications community. For most users, the cost of using the Google runtime environment effectively increased by 100% or more.

A huge online backlash ensued. For its part, Google put off the increase by a month and moderated some of the increases. But the whole incident brought many nagging doubts about the cloud to the surface. Said one poster on one of the many threads that lit up the Google Groups forums after the increase:

I like so many of us have spent a lot of time learning app engine – i have been worried like so many that using app engine is a mistake because any app you invest/build can only be run on… app engine.

Because the Google PaaS requires that developers customize code specifically to run in that environment and nowhere else, rewriting that code takes a lot of time, effort and money. With salaries for programmers hitting record highs in the Bay Area and recent CS graduates pulling in $120,000 or more to code, any big move that forced major code rewrites would ultimately wallop the bottom line. Ironically, these increases disproportionately affected numerous hobbyists and small developers running interesting applications – the creators of the next proverbial Google. Certainly corporate IT departments took notice, as well.

Vendor lock-in will make you vulnerable

Unquestionably, the Google App Engine price increase revealed a fundamental weakness of many cloud businesses: vendor lock-in does exist in the cloud. This seems odd, because one of the promised benefits of the cloud was to eliminate vendor lock-in and make applications more portable. In that worldview, no cloud rules them all (not even Amazon) and companies operating applications in the cloud can quickly and easily port their applications to other PaaS offerings or to other IaaS providers.

With vendor lock-in comes vulnerability to price increases. In all likelihood, Google – a data-driven business if there ever was one – was rebalancing pricing to reflect its own need for profitability. But for developers and app makers, this drastic shift effectively turned their decision to go with Google App Engine into what may have been a “bet-the-company” decision, without their ever realizing it. For the PaaS industry in general, the move raises significant uncertainty. If Google has to raise its prices this much, who’s next?

Start thinking defensively before you choose a platform

In a similar vein, developers who put their applications up on Heroku may not have realized that their business fate depended on the fidelity of the Amazon EC2 cloud. If a company had been planning a big sales event or promotion during the extended EC2 outage, those three days of hard downtime may have had an outsized impact.

So clearly the rules of the game have changed for anyone who wants to put an app in the cloud and run a real business. Defensive thinking is in order. Here are five key rules to avoid getting gouged by Google App Engine or eviscerated by an EC2 outage:

  1. Avoid vendor lock-in at all costs. This is now a no-brainer. Make sure that your app can be easily ported to other clouds if you need to move due to service outages. If you must write apps that require serious customization, make sure you have a back-up plan and, if you can swing the cost, an alternative cloud running your code as a backup.
  2. Know thy PaaS. Spreading the risk among multiple PaaS providers makes a lot of sense – unless they are all totally dependent on one big cloud to deliver your applications and cloud business. Explore installable PaaS options that you yourself control. And ask pointed questions about where your PaaS is running and how the provider manages the risk of a big cloud failing.
  3. Ask hard questions about redundancy and system architecture. Deep under the covers of most clouds are core system architectures that may replicate single points of failure. That’s because, at its core, the cloud infrastructure ecosystem is not a terribly diverse environment. Only a few hardware and software companies rule the roost. Similarly, ask your cloud provider to completely open their architecture and software kimono and let you examine everything. If they won’t, then caveat emptor. If they will, you can judge their redundancy steps for yourself. So ask for specific architecture diagrams if you are going to be dependent on a cloud environment and its reliability. And get a network engineer or system architect buddy to review the diagrams. Think this is overkill? Ask FourSquare, Reddit and the other huge sites that have corporate backing or VC money and went down hard in the EC2 outages.
  4. Pick code that’s easier and faster to modify. Not all runtime environments and frameworks are alike. Certain flavors and types of frameworks and Web scripting environments are more difficult to change in a pinch due to the core architecture of the way the scripting language works. Until recently, PHP was far harder to clean up than RoR, and Python, pre-Django, was more unwieldy.
  5. The most popular code may not be the cheapest code. Think about the availability of coders. Many applications companies have a horror story about how their iOS app needed modifications and they either had to pay a high-end dev shop $200 per hour or had to wait for weeks to make the mods. At the same time, some runtime environments like Node.js let you use JavaScript code throughout the application stack. (We’re biased, as we are strong backers of Node.js.) That means you eliminate the need for differentiated front- and back-end coding teams, in a best case scenario. When building your cloud app, think hard about the code selection before you start filling up your GitHub repository.

By no means are these five steps comprehensive. And for the most part they are obvious. But in the cloud things move pretty quickly and sometimes slowing down to think about what your cloud application will be in six, 12 or 24 months is hard to do. So put on your crash helmet, watch your wallet, and be careful out there, people.

Alex Salkever is Director of Product Marketing at Joyent Cloud (@Joyent). He was formerly a technology editor at BusinessWeek.com.

Image courtesy of Flickr user kreg.steppe.


noSQL Brasil Event, 2nd Edition

The 2nd edition of noSQL Brasil, the leading conference on the subject in Latin America, will take place in São Paulo on October 21 and 22. This year the event has an even richer program, covering not only noSQL technologies but also newSQL and Cloud Computing.

It will be two intense days with 12 talks and 6 Lightning Talks. Some of the featured international speakers:

- Alistair Jones, Software Engineer at Neo Technology, the company behind Neo4j
- Frank Weigel, Director of Products at Couchbase, the company behind the development of the popular CouchDB and Membase
- Benjamin Coverston, Director of Operations at Datastax, the company behind Apache Cassandra
- Mathias Stearn, Sr. Software Engineer at 10gen, the company behind MongoDB

Other Brazilian and international speakers will also be at the event. The preliminary program is already available on the event website.

Promotional prices are valid until September 30.

site: http://nosqlbrasil.com
twitter: @nosqlbrasil

CALL FOR REGISTRATION - CMG BRASIL - 3rd SOUTH REGIONAL TECHNICAL MEETING - November 4, 2011


Registration is open for the 3rd South Regional Meeting, to be held on November 4, 2011, in Curitiba.
Registration closes on October 31, 2011.
The registration invitation, with all the necessary details, is attached. If you have any questions, please contact Paula by phone at 11 4195-4406 or 11 9283-1744, or by e-mail at paula@maffei.com.br.

We look forward to your attendance.

Paula Iochpe
Computer Measurement Group Brasil
CMG Brasil
Tel/Fax 55 11 4195-4406 – Mobile 11 9283-1744

Click here to download:
CHAMADA_Inscricoes_EventoRegionalSul_2011.pdf (83 KB)

Making Facebook Self-Healing

When your infrastructure is the size of Facebook’s, there are always broken servers and pieces of software that have gone down or are generally misbehaving. In most cases, our systems are engineered such that these issues cause little or no impact to people using the site. But sometimes small outages can become bigger outages, causing errors or poor performance on the site. If a piece of broken software or hardware does impact the site, then it's important that we fix it or replace it as quickly as possible. Even if it's not causing issues for users yet, it could in the future so we need to take care of it quickly.


Facebook's Site Reliability team is dedicated to keeping the site up and fast and stable. We handle everything from the smallest outages on individual servers to the largest outages across the entire site. When I joined the Site Reliability team a couple of years ago, it was clear that the infrastructure was growing too fast for us to be able to handle small repetitive outages manually. We had to find an automated way to handle these sorts of issues so that the human engineers could focus on solving and preventing the larger, more complex outages. So, I started writing scripts when I had time to automate the fixes for various types of broken servers and pieces of software.


Introducing FBAR (Not FUBAR)

Over time, I developed the scripts more and more. As they got better, they saved me more time, which I used to continue improving them. Eventually my team started benefiting from my scripts enough that I was asked to work on them full time. I separated out the common parts into generic APIs that model our infrastructure and I turned the rest into remediation modules that use these APIs to implement the business logic for individual components of the Facebook back end. Then I wrote a daemonized service that executes workflows comprising these remediation plugins to handle outages detected by our monitoring system. I named the whole system "Facebook Auto-Remediation" or "FBAR" for short. (I originally wanted to name it "FUBAR", but I couldn't come up with anything good for the "U" to stand for, so "FBAR" it is.)


To understand how FBAR works, let's look at what happens when an individual server goes down. Imagine a hard drive goes bad on one of our Web servers. First, the monitoring system will detect the failed hardware and report this outage as an "alert". FBAR's "Alert Fetch Loop" runs continuously in the background querying the monitoring system to find new alerts. When it finds alerts, it processes them and calculates appropriate workflows to execute to handle the outages. The workflows get placed on a job queue for the FBAR Job Engine to execute.


FBAR Alert Fetch Loop
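
To make the loop concrete, here is a toy Python sketch of one pass through it. This is not FBAR's actual code: the alert fields, the plugin names, the workflow table, and the de-duplication of alerts by id are all assumptions made for illustration.

```python
import queue

def choose_workflow(alert):
    """Map an alert to an ordered list of plugin names (hypothetical mapping)."""
    workflows = {
        "hardware": ["verify_hardware", "remove_from_service", "flag_for_repair"],
        "ssh": ["restart_sshd"],
    }
    return workflows.get(alert["type"], ["escalate_to_human"])

def fetch_once(fetch_alerts, jobs, seen):
    """One pass of the loop: enqueue a job for each alert not seen before."""
    for alert in fetch_alerts():
        if alert["id"] in seen:
            continue  # already queued on an earlier pass
        seen.add(alert["id"])
        jobs.put({"host": alert["host"], "workflow": choose_workflow(alert)})

jobs = queue.Queue()
seen = set()
alerts = [
    {"id": 1, "host": "web001", "type": "hardware"},
    {"id": 2, "host": "web002", "type": "ssh"},
]
fetch_once(lambda: alerts, jobs, seen)
fetch_once(lambda: alerts, jobs, seen)  # a second pass adds nothing new
```

After both passes the queue holds two jobs, one per alert. A real service would run `fetch_once` continuously against the monitoring system while a separate job engine drains the queue.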


The FBAR Job Engine will then pull the job for this server off of the job queue and begin executing remediation plugins in precedence order. Each plugin is written against the FBAR API. This API gives the plugin access to hardware and configuration data about the host and to the alert that describes the detected outage. The API also provides access to power control, command execution on the host and to the host's entries in our site-wide service configuration database.


FBAR API Stack

When the job runs on our hypothetical Web server, the first remediation plugin would verify that the machine has damaged hardware, classify the failure type as hard_drive, then return that data to FBAR. At this point the workflow would branch. Rather than moving on to handle the next outage (like SSH or HTTP), FBAR would execute the plugin to remove the Web server from production service and then flag the machine as needing a part replacement.


Remediation Workflow

When the data center technician has replaced the bad drive on the machine, they would flag the machine as repaired. At this point, FBAR again takes control of the machine and verifies that it is ready for production service and re-enables it. The only human interaction with the machine is when a person replaces the physical hard drive. The rest of the process happens automatically without any manual intervention.
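
The hard drive walkthrough above can be sketched in Python as follows. Again, this is an illustration and not FBAR's implementation: the host dictionary, the plugin functions, and the flag names are hypothetical.

```python
def verify_hardware(host):
    """Hypothetical plugin: report a failure type if the host has bad hardware."""
    if host.get("bad_disk"):
        return {"failure": "hard_drive"}
    return {}

def run_job(host, plugins):
    """Run plugins in precedence order; branch to repair on a hardware failure."""
    log = []
    for name, plugin in plugins:
        result = plugin(host)
        log.append(name)
        if result.get("failure"):
            # Branch: skip the remaining checks and take the host out of service.
            host["in_production"] = False
            host["needs_part"] = result["failure"]
            log.extend(["remove_from_service", "flag_for_repair"])
            break
    return log

def return_to_service(host):
    """After a technician flags the repair, re-verify and re-enable the host."""
    if host.get("repaired") and not verify_hardware(host):
        host["in_production"] = True
        host.pop("needs_part", None)
    return host.get("in_production", False)

plugins = [
    ("verify_hardware", verify_hardware),
    ("check_ssh", lambda host: {}),
    ("check_http", lambda host: {}),
]
web = {"name": "web001", "bad_disk": True, "in_production": True}
steps = run_job(web, plugins)

# The technician swaps the drive and flags the machine as repaired.
web["bad_disk"] = False
web["repaired"] = True
back = return_to_service(web)
```

For the failed host, `steps` is `["verify_hardware", "remove_from_service", "flag_for_repair"]`, so the SSH and HTTP checks never run. Once the drive is replaced, `back` is `True` and the host is in production again.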


Automating the Work of Hundreds

Today, the FBAR service is developed and maintained by two full-time engineers, but according to the most recent metrics, it’s doing the work of approximately 200 full-time system administrators. FBAR now manages more than 50% of the Facebook infrastructure, and we’ve found that services become dramatically more reliable when they come under FBAR control. Recently, we've opened up development of remediation plugins to other teams working on Facebook's back-end services so they can implement their service-specific business logic. As these teams write their own remediation plugins, we're expanding FBAR coverage to more and more of the infrastructure. This is making the site more and more reliable for end users while reducing the workload of the supporting engineers.


Facebook is an amazing place to work for many reasons but I think my favorite part of the job is that engineers like me are encouraged to come up with our own ideas and implement them. Management here is very technical and there is very little bureaucracy, so when someone builds something that works, it gets adopted quickly. Even though Facebook is one of the biggest websites in the world it still feels like a start-up work environment because there's so much room for individual employees to have a huge impact.


Like building infrastructure? Facebook is hiring infrastructure engineers. Apply here.


Patrick is a software engineer at Facebook.

Call for Papers for CMG Brasil - 3rd South Regional Technical Meeting - November 4, 2011


Dear Sir or Madam,

CMG Brasil, counting on your intellectual support, is pleased to send, for the third time, the “Call for Papers” for the South Regional Technical Meeting, to be held on November 4, 2011, in Curitiba, Paraná.

We thank you in advance and look forward to receiving your topic and abstract for review by the technical board.

Paula Iochpe

CMG Brasil

Tel/Fax 55 11 4195-4406 – Mobile 11 9283-1744

Click here to download:
CallForPapers2011-EventoRegionaSul-04Nov.pdf (60 KB)