Chapter Ten

Performance Considerations

SAS Visual Analytics was built for speed. This design goal was strongly applied to the SAS LASR Analytic Server, which is an in-memory analytics engine. Working with the full set of data placed in RAM allows LASR to crank through large numbers of records at dizzying velocities. And while LASR is certainly central to the performance of the overall SAS Visual Analytics solution, there are also other areas that you’ll want to consider when planning a new deployment or attempting to squeeze the most out of the deployment that you’ve already got.

This chapter will look at several deployment options that can have a significant impact on the performance of SAS Visual Analytics at your site.

LASR performance

Did you know that when the technology for the SAS LASR Analytic Server was originally built, it was expressly designed for operation in massively parallel processing (MPP) environments? That is, LASR was designed for maximum scalability right from the start.

Over time, as LASR matured, the minimum system requirements to support a LASR Analytic Server were reduced. They are now:

•   Distributed LASR in MPP environments requires a minimum of 4 server hosts with a total of 16 cores (4 servers × 4 cores) with a combined 256 GB of RAM (16 GB per core).

Figure 10.1 A distributed LASR Analytic Server acts as a single service while running in parts across multiple host machines


•   Non-Distributed LASR in symmetric multi-processing (SMP) environments requires a minimum of 1 server host with 4 cores and 16 GB of RAM per core (64 GB in total).

Figure 10.2 A non-distributed LASR Analytic Server runs on a single machine as part of a SAS deployment


Non-Distributed LASR (SMP)

A non-distributed SAS LASR Analytic Server runs using the SMP compute model. In other words, it runs on a single machine, like most other applications you’re familiar with.

This means that an instance of the non-distributed LASR Analytic Server can only scale up as far as the host hardware allows. If your data sets grow over time, they might eventually exceed what a non-distributed LASR Analytic Server on your host machine can adequately process. If possible, you can add more RAM to the host machine to extend LASR’s working size. Eventually, though, the host machine will reach the maximum amount of RAM it can hold. If your data continues to grow beyond that point, you need to either buy a whole new server – a higher class with larger RAM capacity – or consider upgrading to Distributed SAS Visual Analytics and the distributed LASR Analytic Server, which can scale much higher by running across multiple machines simultaneously.

Scaling up in size using this SMP approach can get expensive very quickly. Shop around on the retail sites of various server hardware vendors, and you will see that moving from one machine to another with twice the CPU and RAM can cost three, four, or more times as much.

Distributed LASR (MPP)

MPP offers a much longer path for scalability than SMP. The idea with an MPP approach is to use smaller, relatively cheap commodity machines. Then, as your computing power needs increase, you add more of those commodity machines to your environment and extend the software cluster to run on them. In this way, there is effectively a much, much higher limit on the number of CPUs and the amount of RAM that can be thrown at your computing tasks.

The distributed SAS LASR Analytic Server was built from the ground up specifically to get the most out of the MPP compute model.

When data is copied into a distributed LASR server, it is broken up into chunks and divided across the multiple servers that host LASR. The LASR components that hold and process this data are known as the LASR Workers. Their actions are coordinated from another host machine known as the LASR Root Node. When a request comes to LASR for analytic processing of the data, the LASR Root sets the LASR Workers to work. Each LASR Worker acts independently on its chunk of the data and sends its answer back to the LASR Root. Once all LASR Workers have reported back, the LASR Root finalizes the answer and returns the response to the requestor. In a properly configured environment, SAS has demonstrated that LASR can process a billion records in just a few seconds. Because of this incredible speed, LASR does not cache any results. Every request that comes in is processed against the entire in-memory data set every time.

Load balancing by data distribution

When data is loaded into a Distributed LASR Analytic Server from SAS, the LASR Root Node accepts the incoming data, and then evenly distributes it across all of the LASR Workers.

Figure 10.3 LASR distributes incoming data equally across the LASR Workers


This approach to load balancing – that is, equitable data distribution to achieve similar workloads – is predicated on the assumption that each of the machines hosting the LASR Workers has equivalent hardware specifications as well as a similar workload unrelated to LASR.

For example, let’s suppose that server Host_3 in Figure 10.3 above is running some other process that consumes 50% of its CPU and that the other three LASR Worker hosts are comparatively idle. Now a request comes into LASR to run some big number-crunching on a table that already resides in memory. The LASR Root will direct the Workers to perform that task. But the LASR Worker on Host_3 only has 50% of the CPU left to work with. Therefore, in the race to complete the analytic task, the LASR Worker on Host_3 will return its results to LASR Root last. Of course, LASR Root cannot respond to the incoming request until all of the answers are received. So the delay from Host_3 has affected the overall LASR response time. This example illustrates that LASR is only as fast as its slowest node.

Keep this consideration in mind when planning your SAS Visual Analytics software deployment in association with other SAS and third-party software required by the enterprise.

High-volume access to smaller tables

Not all of the data tables in your enterprise environment are large. And yet you still might want to put some of this smaller data into your Distributed SAS LASR Analytic Server. With relatively small data volumes, it can be inefficient to break a table into chunks and distribute them across multiple hosts for MPP processing, because the communication and coordination between nodes consumes a significantly larger percentage of LASR’s overall response time. In those situations, you might see faster response times if that data is placed in a non-distributed LASR Analytic Server, because having the entire table in one place eliminates the multi-node coordination that a distributed LASR Analytic Server requires.

Figure 10.4 Smaller tables copied to non-distributed LASR Analytic Server for more efficient processing


What is a small table in terms of size? The answer varies depending on several factors, but generally any table less than 2 GB in size is likely to be an ideal candidate for this approach. Tables ranging in size between 2 GB and 20 GB are possibly good candidates, depending on the current workload, the RAM available to each host, the number of nodes, and so on. Tables that are over 20 GB in size are probably best served by a distributed LASR Analytic Server.

To meet this need, a distributed LASR Analytic Server can also run individual non-distributed LASR Analytic Server instances within the cluster.

There are two approaches for enabling this high-volume access to smaller tables:

1.   LASR Libraries

In the SAS metadata, create a LASR library dedicated to hosting smaller tables, and set the extended attribute named VA.TableFullCopies. This attribute takes a positive integer as its value, which represents the number of copies to create in the cluster.

Figure 10.5 Enabling full copies of smaller tables in a distributed LASR Analytic Server


2.   SAS Program Code

The LASR Procedure provides a FullCopyTo option which, when used to load a table into LASR, specifies the number of copies of the smaller table to create in the cluster. A simple example:

/* Load small_table1 into the distributed LASR server listening on port 10010,
   creating three full copies spread across the LASR Worker hosts */
proc lasr add data=hdfs.small_table1 fullcopyto=3 port=10010;
run;

When using the full-copy functionality for small tables with a distributed LASR Analytic Server, keep the following considerations in mind:

•   In some cases, the Distributed LASR Server must also have a copy of the small table. Tasks that use more than one input table (such as the SCHEMA and the SAVE statements) run on the distributed server only. If the table exists only on a non-distributed server, then the table is copied to the distributed server before the requested processing begins.

•   Table requests are load-balanced across machines that are hosting full table copies.

•   Full table copies are read-only. UPDATE and APPEND statements will return an error.

The following administrative actions are recommended when working with full copy tables in LASR:

•   Start with a small number of copies (fewer than the total number of LASR Worker hosts), and then incrementally increase as needed.

•   Train users to use the LASR library (or the FullCopyTo option for PROC LASR) with care because inadvertently loading a very large table in this way could quickly consume all RAM resources on affected hosts.

•   Non-distributed LASR servers continue to run until the distributed LASR server is stopped.

•   If logging is enabled, only the distributed LASR server will capture activity. The non-distributed LASR servers do not log any activity directly.

Fast loading of data to distributed LASR Analytic Server

The SAS LASR Analytic Server can be loaded with data from any source that Base SAS in your environment has access to. Base SAS comes with the built-in ability to work with data from sources such as the following:

•   Local text files

   Formatted, like CSV

   Raw, like log files

•   SAS data sets

•   SAS Scalable Performance Data Engine (SPD Engine) data tables

   On direct-attached storage

   On HDFS

Furthermore, Base SAS can be extended with optional SAS/ACCESS engines for native access to third-party data providers such as Oracle, DB2, and SQL Server, as well as Hadoop-based storage such as Hive, Impala, Spark, and more.

All of these data sources have one thing in common: The Base SAS instance acts as a data proxy that connects to the source data, siphons it out, and then sends it to LASR. Data transferred in this manner is all sent directly to the LASR Root Node, which then distributes it across the LASR Workers.

Figure 10.6 SAS supports a wide variety of data sources for serially loading data into LASR

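As a minimal sketch of this serial path, a load from a SAS data set through the SAS LASR Analytic Server engine (SASIOLA) might look like the following. The host name, port, server tag, and source table shown here are placeholder assumptions for illustration.

/* Assign a libref to the LASR server with the SASIOLA engine.
   The host name, port, and server tag are placeholders for your environment. */
libname valasr sasiola host="lasr-root.example.com" port=10010 tag="hps";

/* Base SAS reads the source table and streams the rows to the LASR Root Node,
   which then distributes them across the LASR Workers */
data valasr.cars;
   set sashelp.cars;
run;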

While this approach works well in support of a very wide array of supported data providers, its performance is constrained by the serial distribution points: Base SAS and the LASR Root Node. For better throughput, it is possible to load data using multiple parallel streams directly to the Worker Nodes of a distributed LASR Analytic Server. The architecture and deployment of software in your environment determines exactly which of these parallel loading techniques are supported.

LASR and a remote data provider (asymmetric)

Chances are that, after deploying Distributed SAS Visual Analytics at your site, there is a data provider elsewhere in your environment that hosts data that you want in LASR. The serial loading technique illustrated in Figure 10.6 above will certainly work, but for a large volume of data, it might take a long time to complete the transfer.

For supported data providers, SAS offers In-Database technology. This technology delivers the ability to load data in parallel from multiple nodes of the remote data provider directly to each of the worker nodes of a distributed LASR Analytic Server. To gain this ability, your site needs to license the appropriate SAS/ACCESS product and deploy the SAS Embedded Process into the remote data provider.
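As a rough sketch, a parallel load from a Hive table might look like the following. It assumes that SAS/ACCESS Interface to Hadoop is licensed and that the SAS Embedded Process is deployed on the Hadoop cluster; the server names, schema, table, and PERFORMANCE options are placeholders to adapt to your environment.

/* Assign a libref to the remote Hive data provider (placeholder server and schema) */
libname hivelib hadoop server="hive.example.com" schema=sales;

/* With the SAS Embedded Process in place, rows can stream in parallel from the
   Hadoop nodes directly to the LASR Worker nodes instead of through Base SAS */
proc lasr add data=hivelib.transactions port=10010;
   performance host="lasr-root.example.com" nodes=all;
run;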

One benefit of keeping the data provider separate from LASR is that it enables you to customize each environment for its specific service objectives. Each environment can scale independently of the other, and each can have its own maintenance operations with minimal impact on the other.

LASR symmetrically co-located with HDFS

At some sites, the distributed LASR software is placed on a cluster of machines that also hosts Hadoop. In that case, LASR is said to be co-located with Hadoop. If we take this concept a step further and carefully deploy the distributed LASR service components alongside their equivalent Hadoop Distributed File System (HDFS) counterparts (that is, place the LASR Root on the same host as the HDFS NameNode, and a LASR Worker on each host with an HDFS DataNode), then LASR is symmetrically co-located with HDFS.

With distributed LASR symmetrically co-located with HDFS, a symbiotic relationship between the two services becomes possible:

•   LASR can save in-memory tables directly to disk in HDFS on each Worker Node in the SASHDAT format (or as plain-text comma-separated values, CSV, files).

•   LASR can read SASHDAT format (and CSV) directly from HDFS.

SASHDAT Tables

SASHDAT is the SAS high-performance data structure optimized for MPP environments. The SASHDAT format is binary, compressible, and encryptable. It explicitly avoids fractional rows (which is a challenge for most HDFS-stored items). As a rule, SASHDAT provides the fastest and most efficient way to (re-)load data into LASR.

SASHDAT is provided as a function of the distributed LASR server, so it is not a feature provided directly by Base SAS or any SAS/ACCESS engine. Also, SASHDAT is not available for the non-distributed LASR Analytic Server because it is expressly designed for MPP environments. Notice that we wrote “(re-)load” above: data must first be loaded into in-memory LASR from an external source before LASR can write it down to SASHDAT on HDFS.

SASHDAT is not intended to act as a primary data store. This is due in part to the typical limitations of working with files stored in HDFS. Because HDFS takes an immutable approach to file state, it’s not possible to modify the contents of a file; only appends are allowed. If you need to delete a single line from a file in HDFS, you basically write a new copy of the entire file, minus that one line, and then delete the old file. This makes HDFS slow and ponderous for transactional updates. Therefore, SASHDAT is better positioned as a staging area to aid in the rapid (re-)loading of data into a distributed LASR Analytic Server. If your environment must drop some in-memory tables to make room in RAM for others and then later swap them back, then having fast, efficient SASHDAT to cut down on transition time can be a real help.
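As a sketch of that staging pattern, reloading a previously saved SASHDAT table into the distributed LASR server might look like the following. The SASHDAT engine options, HDFS path, TKGrid install location, and table name are assumptions for illustration; saving from LASR down to SASHDAT is typically handled with a SAVE statement or from the administration interface.

/* Assign a libref to SASHDAT files in the co-located HDFS.
   The host name, TKGrid install location, and HDFS path are placeholders. */
libname hdat sashdat host="lasr-root.example.com" install="/opt/TKGrid"
        path="/sashdat/staging";

/* Reload the staged SASHDAT table into the distributed LASR server in parallel */
proc lasr add data=hdat.sales port=10010;
run;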

LASR co-located with dedicated HDFS and loading data from remote HDFS

The SAS LASR Analytic Server is an in-memory analytics engine that relies heavily on RAM. The Hadoop Distributed File System relies heavily on disk. They both use CPU and network when performing most actions.

So if your site is currently using Hadoop services such as HDFS, MapReduce, Spark, Hive, and more, then you probably will not want to co-locate a distributed LASR Analytic Server in the same cluster because it will compete for resources that you’ve already allocated in support of Hadoop. With your Hadoop services running in one cluster of machines, it’s likely that you’ll want to procure a new set of servers to host the distributed LASR Analytic Server. That way, you maintain control over resource allocation for Hadoop and LASR because you’ve physically separated them.

But then what about SASHDAT? Since SASHDAT offers the fastest, most efficient way to (re-)load data into LASR, it is a pretty tempting feature. It’d be great if we could keep it to use as needed.

The solution is to symmetrically deploy a co-located instance of HDFS on the cluster of hosts where the dedicated LASR Analytic Server is running. This second instance of HDFS should be dedicated solely to hosting SASHDAT files as a staging area for LASR.

Figure 10.7 Dedicated HDFS for storing SASHDAT


In this way, your primary deployment of Hadoop (or other remote data provider) can be optimized to best perform for your enterprise. The LASR cluster will be optimized for super-fast, in-memory processing, and it can provide a secondary service acting as host to SASHDAT files in HDFS. Any supported distribution of Hadoop can be used for storing SASHDAT files.

References

When dealing with a software solution like SAS Visual Analytics, which offers new levels of speedy performance, there is much information to consider and many decisions to make. After acquainting yourself with the topics illustrated here, you’ll find more details about your exact situation in the SAS documentation, which is available online at support.sas.com:

SAS Institute Inc. 2016. SAS Visual Analytics 7.3: System Requirements. Cary, NC: SAS Institute Inc.

SAS Institute Inc. 2016. SAS Visual Analytics 7.3: Administration Guide. Cary, NC: SAS Institute Inc.

SAS Institute Inc. 2016. SAS LASR Analytic Server 2.7 Reference Guide. Cary, NC: SAS Institute Inc.

SAS Institute Inc. 2016. SAS 9.4 Supported Hadoop Distributions. Cary, NC: SAS Institute Inc.
