Mark Bolinsky · Feb 3, 2016

I tried using the <!--break--> tag in a recent post, and no matter where I placed it there was no change.  Can you provide a snippet showing a sample in the Filtered HTML editor?

Mark Bolinsky · Mar 8, 2016

Yes.  Database mirroring within cloud infrastructure is possible.  As you point out, the use of the built-in virtual IP address (VIP) is not doable in most cases.  This is because cloud network management, assignments, and rules generally do not allow IP addresses to change outside of the cloud management facilities.

Having said that, third-party load balancers offer a solution in the form of a virtual appliance available in most cloud marketplaces in a Bring-Your-Own-License (BYOL) model; one example is F5 LTM Virtual Edition.  With these appliances there are usually two methods available to control network traffic flow.

The first option uses an API called from ^ZMIRROR during failover to instruct the load balancer that a particular server is now the primary mirror member.  The API methods range from CLI type scripting to REST API integration.
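
For illustration only, here is a minimal ObjectScript sketch of what that callout might look like.  The load balancer hostname, API path, and SSL configuration name are all placeholders, and the exact ^ZMIRROR entry-point names should be confirmed against the mirroring documentation for your version:

   ZMIRROR ; user-defined mirror event callbacks - illustrative sketch only
   NotifyBecomePrimary() PUBLIC {
       ; Called when this instance has just become the primary mirror member.
       ; Tell the (hypothetical) load balancer REST API to direct traffic here.
       Set req = ##class(%Net.HttpRequest).%New()
       Set req.Server = "lb.example.com"         ; placeholder load balancer address
       Set req.Https = 1, req.Port = 443
       Set req.SSLConfiguration = "LB_SSL"       ; placeholder SSL/TLS configuration name
       Set req.ContentType = "application/json"
       Do req.EntityBody.Write("{""primary"":"""_$SYSTEM.INetInfo.LocalHostName()_"""}")
       Do req.Post("/api/pools/cache-mirror/primary")  ; placeholder API path
       Quit
   }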

The second option uses load balancer polling to determine which mirror member is primary.  This involves creating a simple CSP page or listening socket to respond whether a given server in the load balanced pool is the primary mirror member.

The second option is more portable and load balancer agnostic, since it doesn't rely on specific syntax or integration methods from a given load balancer vendor or model.  However, the limitation is the polling frequency.  In most cases the polling interval can be as low as a few seconds, which is acceptable in most scenarios.
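
As a rough illustration of the second option, the status page can be as simple as the following sketch (the class name is a placeholder, and the use of $SYSTEM.Mirror.IsPrimary() should be verified against the class reference for your version):

   Class App.MirrorStatus Extends %CSP.Page
   {

   /// Set the HTTP status before any output is written.
   ClassMethod OnPreHTTP() As %Boolean
   {
       If '$SYSTEM.Mirror.IsPrimary() {
           ; Non-primary members answer 503 so the load balancer pulls them from the pool
           Set %response.Status = "503 Service Unavailable"
       }
       Quit 1
   }

   ClassMethod OnPage() As %Status
   {
       Write $Select($SYSTEM.Mirror.IsPrimary():"SUCCESS",1:"FAILED")
       Quit $$$OK
   }

   }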

I will soon be posting a longer article here on the Community detailing some examples using F5 LTM VE, and providing a sample CSP status page and REST API integration to cover both options mentioned above.  I will also be presenting a session during our upcoming Global Summit.

Mark Bolinsky · Mar 9, 2016

Hi Alexey,

Thank you for the post on your deployment.  I'm very interested to understand more about how a virtual router helped in your deployment.  If I'm understanding correctly, because VMware vSphere was used and the network rules allowed it, the actual VIP within database mirroring was used as normal, meaning Caché was able to remove/assign the VIP to whichever node was the primary mirror member.

As a side note: with ECP clients in the mix, the VIP is not actually a requirement, because ECP clients are "mirror-aware", unless some portion of the application needs to access the database server directly.

I'm curious to learn more about how you used the virtual router and which components were NAT/PAT'd to and from.  For example, did the vRouter sit between a single external address and an internal load balancer, or a server pool of ECP clients or web servers?

It's great to hear about alternative solutions.  I look forward to hearing more about your deployment.

Kind regards,
Mark B-

Mark Bolinsky · Mar 9, 2016

ECP clients are "mirror-aware" meaning when you create remote databases on a given ECP client, they are marked as "mirrored".  When the ECP client connects to either mirror member it will be redirected to whichever is the active/primary mirror member.  It will also reconnect to a new primary member during failover.  Our documentation has good detail about this available here:

http://docs.intersystems.com/cache20152/csp/docbook/DocBook.UI.Page.cls…

Specifically, in the Notes (1):

ECP application servers do not use the VIP and will connect to any failover member or promoted DR member that becomes primary, so the VIP is used only for users' direct connections to the primary, if any.

Mark Bolinsky · Mar 21, 2016

Not just for test/dev/demo either...  Caché can support highly resilient enterprise applications in the cloud.  I recently posted an article on how to use database mirroring in a cloud without the built-in Virtual IP (VIP) to provide rapid failover for high availability and disaster recovery, even between availability zones and/or geo-regions.

https://community.intersystems.com/post/database-mirroring-without-virtual-ip-address

Mark Bolinsky · Mar 25, 2016

Hi Francis,

You are absolutely right that memory access performance is vital; however, this is not only about bandwidth but also latency.  With most new systems employing NUMA-based architectures, both memory speed and bandwidth have a major impact.  This requirement continues to grow as more and more processor cores are crammed into a single socket, allowing for more and more concurrently running processes and threads.  In addition, inter-NUMA-node memory access plays a major role.  I agree that clock speed alone is not a clear indicator of being "the fastest", since clock speeds haven't changed all that much over the years once in the 2-3 GHz+ range.  Rather, items such as overall processor and memory architecture (e.g., Intel QPI), on-board instruction sets, memory latency, memory channels and bandwidth, and on-chip L2/L3 cache sizes and speeds all play a role.

What this article is demonstrating is not CPU sizing specifics for any given application, but rather one of several (not the only) useful tools for comparing one processor to another.  We all agree there is no substitute for real-world application benchmarking, and what we have found through benchmarking real-world applications based on Caché is that SPECint (and SPECint_rate) numbers usually provide a safe relative comparison from processor model to processor model.  Things become more complicated when applications are not optimally written and impose unwanted bottlenecks such as excessive database block contention, lock contention, etc.  Those items tend to negatively impact scalability at the higher end and prevent linear or predictable scaling.

This article is meant to serve as the starting point for just one of the components in the "hardware food group".  The real proof is gained from doing proper benchmarking of your application, because that encapsulates all components working together.

Kind regards...

Mark Bolinsky · Apr 4, 2016

Hi Alexey,

WAN connectivity varies significantly and many factors play into the requirements and latency.  You can get very good (fast and reliable) WAN connectivity, however distance impacts latency, so you need to be careful in your planning.

As for deciding which mirror to promote...  This is one of the reasons we do not recommend automating the promotion of a DR Async member to become primary.  You will want to evaluate the state (or reported latency) within the ^MIRROR utility on each DR Async member to determine which one (maybe both?) is current.  If they are out of sync with each other, you will need to manually rebuild the "new backup" in the secondary data center based on the newly promoted DR Async member.

Regards,

Mark B-

Mark Bolinsky · Apr 4, 2016

Yes.  Latency is a major factor when considering geographically splitting synchronous mirrors.  You will need to really understand the given application and workload to know how much latency can be tolerated.  Some applications can accept latency (to a certain level); however, others may not.

We do have deployments with the synchronous members located in different locations, separated by about 100 miles with single-digit-millisecond latency, so the latency is tolerable in that configuration for that application.

Unfortunately, there is no absolute formula here to determine whether a particular application can leverage that type of deployment strategy.  The first thing to consider is monitoring the current journal physical write rate of the application with ^mgstat or ^pButtons during peak workloads.  You also need to understand whether ECP is heavily used, because this will have an impact on the number of journal sync calls required for ECP durability guarantees.  Usually, looking at the IO rates of the journal volume with iostat (Linux or UNIX) or PERFMON.EXE (Windows) will give you a good indication of the mirror throughput you will need.  Using that figure you can work out what the maximum latency should be as a starting point.

Here is an example:

Say on a given system you see that the journal write rate from pButtons/mgstat is relatively low at only 10-20 journal writes per second.  Let's assume these are full 64KB journal buffer writes, so bandwidth requirements will be in the neighborhood of 1.3 MB/second (or roughly 10 Mbit/second) as a minimum.  I would recommend allocating at least 20 Mbit/second or more to ensure spikes can be handled efficiently.  However, when looking at the iostat output, you notice the journal volume is doing 200 writes per second because the application is using ECP clients (application servers).

So with this example, we know that at a minimum synchronous mirroring will need at least 20 Mbps of bandwidth and latency of less than 5 milliseconds.  I came to the 5 millisecond figure by taking 1000 milliseconds (1 second) and dividing by 200 journal IOPS, which gives a maximum latency of 5 ms to sustain 200 IOPS.  This is by no means the absolute requirement for the application; it is a simple starting point for understanding the scope of the WAN connectivity requirement, and the application needs to be thoroughly tested to confirm transaction/processing response times are adequate.
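
Purely as a back-of-the-envelope illustration of the arithmetic above (the input numbers are the assumed example values, not measurements), the same calculation in a terminal session looks like this:

   ; Bandwidth from the journal write rate seen in ^pButtons/^mgstat
   Set writesPerSec = 20, bufferBytes = 64*1024
   Write "Minimum bandwidth: ",writesPerSec*bufferBytes*8/1000000," Mbit/s",!  ; ~10.5 Mbit/s

   ; Maximum tolerable round-trip latency from the journal IOPS seen in iostat
   Set journalIOPS = 200
   Write "Maximum latency: ",1000/journalIOPS," ms",!                          ; 5 ms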

I hope this helps.

Regards,

Mark B-

Mark Bolinsky · Apr 5, 2016

Hi Alex,

You are correct that latency is only a major consideration for synchronous (failover) mirror members.  For an async member, latency to/from the primary mirror member does not slow down the primary's processing.  As you mentioned, it only affects the delay in the async mirror member being "caught up".  Your example is perfectly fine for DR Async, and if the DR Async should fall behind for any reason, it will put itself into "catch-up mode".  In all cases this does not impact the primary mirror member's performance.

I'd like to mention that for DR Async mirror members we also use compression to reduce bandwidth requirements, so if you are sizing a WAN link for DR Async, consider that the bandwidth requirements will be lower due to compression.

As for cascading mirrors, that is not a feature we support today.

Thanks again for your excellent questions.

Kind regards,
Mark B-

Mark Bolinsky · Apr 10, 2016

Hello,

I cannot name specific customers; however, this is a configuration used with TrakCare and TrakCare Lab deployments (prior to TrakCare Lab Enterprise, which now integrates lab directly as a module within a single TrakCare instance).  In that configuration, TrakCare and TrakCare Lab are each separate failover mirror sets, and TrakCare Analytics is defined as a single Reporting Async mirror member that serves as the source data to build and support the TrakCare Analytics DeepSee cubes and dashboards in a single instance.

This is our standard architecture for TrakCare based deployments.  I hope this helps.  Please let me know if there are specific questions or concerns with this deployment model.

Kind regards,

Mark B-

Mark Bolinsky · Jun 17, 2016

Setting the TZ environment variable needs to be done in a system-wide profile such as /etc/profile.  This should define it properly for you.  I would recommend a restart of Caché after setting it in /etc/profile.
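
For example (the timezone value below is just an illustration; use your own):

   # /etc/profile (system-wide), so processes started afterward inherit TZ
   export TZ=America/New_York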

Also the impact of the TZ environment variable not being set should be reduced (eliminated) with the current 2016.1+ releases where we have changed the way this operates.

Kind regards,
Mark B-

Mark Bolinsky · Jul 8, 2016

Hi Matthew,

Thank you for your question. Pricing is tricky and best discussed with your Microsoft representative.  When looking at premium storage accounts, you only pay for the provisioned disk type, not for transactions; however, there are caveats.  For example, if you need only 100 GB of storage, you will be charged for a P10 disk at 128 GB.  A good Microsoft article to help explain the details can be found here.

Regards,

Mark B

Mark Bolinsky · Jul 8, 2016

Hi Ron,

There are many options available for many different deployment scenarios.  Specifically, for a multi-site VPN you can use the Azure VPN Gateway; Microsoft's documentation provides a diagram showing this.

Here is the link as well to the multi-site VPN details.

As for Internet gateways, yes, they have that concept, and the load balancers can be internal or external.  You control access with network security groups, and you can also use Azure Traffic Manager and Azure DNS services.  There are tons of options here, and it's really up to you what and how you want to control and manage the network.  Here is a link to Azure's documentation about how to make a load balancer Internet facing.

For some reason the link to the code wasn't marked as public in the GitHub repository.  I'll take care of that now.

Regards,

Mark B-

Mark Bolinsky · Jul 26, 2016

Thank you for your comment.  You will need to establish your own monitoring and, ultimately, the range of IO response times for your application using tools like iostat.  This article gives you a starting point for monitoring.  Your specific application may have higher or lower requirements.

Using iostat, you want to continuously monitor storage device performance (specifically with the iostat -x <device> <time between samples in seconds> <number of iterations> command) over a particular range of time, for example only during peak business hours from 8am to 12pm.  What is most important is the average response time; typically I like using iostat -x <devices> 2 1000 to report 1000 two-second samples, which is useful when diagnosing a performance issue.

To reduce the amount of data collected, you can use a longer time between samples, such as iostat -x <devices> 5 1000 for 5-second samples, or even longer if you wish.  It's really a function of why you are monitoring: if you are doing an in-depth performance analysis, you want a short time between samples to better observe spikes in response times; if you are just doing daily statistics collection, you can go with a longer interval.  The objective here is to get familiar with your specific application's needs; this article just provides a baseline for what is typical for most applications.
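
For example, to capture roughly an hour of 5-second samples during a morning peak (the device names and output path below are placeholders for your own database and journal volumes):

   # 720 x 5-second samples of the database and journal devices (~1 hour window)
   iostat -x /dev/sdb /dev/sdc 5 720 > /tmp/iostat_peak_$(date +%Y%m%d).log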

Kind regards,

Mark B-

Mark Bolinsky · Aug 2, 2016

Thank you for your question.  It is recommended with any InterSystems product based on version 2014.1 (including Caché, Ensemble, and HealthShare) to remain on SMT4 (or SMT2).  Not until running a version based on 2015.1 or higher would SMT8 be advisable and provide any potential gain.

Mark Bolinsky · Feb 22, 2017

Hi Alexey,

Thank you for your comment.  Yes, both THP and traditional/reserved huge pages can be used at the same time; however, there is no benefit, and in fact systems with many (thousands of) Caché processes, especially if there is a lot of process creation, have shown a performance penalty in testing.  The overhead of instantiating THP for those processes at a high rate can be noticeable.  Your application may not exhibit this scenario and may be OK.

The goal of this article is to provide guidance for those who may not know which option is best to choose, and/or to point out that this is a change in recent Linux distributions.  You may find that THP usage is perfectly fine for your application.  There is no replacement for actually testing and benchmarking your application.  :)

Kind regards,

Mark B-

Mark Bolinsky · Feb 22, 2017

Hi Alexander,

Thank you for your post.  We are only relying on what the Red Hat documentation states as to when THP was introduced into the mainstream kernel (2.6.38) and enabled by default, as noted in the RH post you referenced.  The option may have existed in previous kernels (although I would not recommend trying it), but it may not have been enabled by default.  All the documentation I can find on THP support in RH references the 2.6.38 kernel, where it was a merged feature.

If you are finding it in previous kernels, confirm whether THP is enabled by default or not.  That would be interesting to know.  Unfortunately, there isn't much we can do other than to do the checks for enablement as mentioned in the post.  As the ultimate confirmation, RH and the other Linux distributions would need to update their documentation to confirm when this behavior was enacted in the respective kernel versions.
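
A quick way to check on a given kernel (note the sysfs path varies; older RHEL 6 kernels use redhat_transparent_hugepage instead of transparent_hugepage):

   # The bracketed value is the active setting
   cat /sys/kernel/mm/transparent_hugepage/enabled
   [always] madvise never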

As I mentioned in other comments, the use of THP is not necessarily a bad thing and won't cause "harm" to a system, but there may be performance impacts for applications that do a large amount of process creation.

Kind regards,

Mark B-

Mark Bolinsky · Feb 22, 2017

I will revise the post to be clearer that THP is enabled by default in the 2.6.38 kernel but may be available in prior kernels, and to reference your respective Linux distribution's documentation for confirming and changing the setting.  Thanks for your comments.

Mark Bolinsky · Nov 3, 2017

Hi Dean - thanks for the comment.  There are no changes required from a Caché standpoint; however, Microsoft would need to add similar functionality to Windows to allow Azure Backup to call a script within the target Windows VM, similar to how it is done with Linux.  The scripting from Caché would be exactly the same on Windows, except for using .BAT syntax rather than Linux shell scripting, once Microsoft provides that capability.  Microsoft may already have this capability; I'll have to look to see whether they have extended it to Windows as well.

Regards,
Mark B-

Mark Bolinsky · Nov 7, 2017

We are receiving more and more requests for VSS integration, so there may be some movement on it; however, there are no guarantees or commitments at this time.

In regard to the alternative of a crash-consistent backup: yes, it would be safe as long as the databases, WIJ, and journals are all included and have a consistent point-in-time snapshot.  The databases in the backup archive may be "corrupt" until Caché is started and the WIJ and journals are applied, at which point they will be physically accurate.  Just like you said, it is a crash-consistent backup, and the WIJ recovery is key to a successful recovery.

I will post back if I hear of changes coming with VSS integration.

Mark Bolinsky · Jun 14, 2018

Hi Paul,

The call-out method is highly customized and depends on the API features of a particular load balancer.  Basically, the code is added to the ^ZMIRROR routine to call whatever API/CLI is available from the load balancer (or the EC2 CLI).

For the appliance polling method (the one I recommend because it is very simple and clean), here is a section from my AWS reference architecture article, found here.  The link also provides some good diagrams showing the usage.

AWS Elastic Load Balancer Polling Method

A polling method using the CSP Gateway's mirror_status.cxw page, available in 2017.1, can be used in the ELB health monitor for each mirror member added to the ELB server pool.  Only the primary mirror member will respond 'SUCCESS', thus directing network traffic only to the active primary mirror member.

This method does not require any logic to be added to ^ZMIRROR.  Please note that most load-balancing network appliances have a limit on the frequency of running the status check.  Typically, the highest frequency is no less than 5 seconds, which is usually acceptable to support most uptime service level agreements.

An HTTP request for the following resource will test the mirror member status of the LOCAL Caché configuration.

 /csp/bin/mirror_status.cxw

For all other cases, the path to these mirror status requests should resolve to the appropriate Caché server and namespace using the same hierarchical mechanism as that used for requesting real CSP pages.

Example:  To test the Mirror Status of the configuration serving applications in the /csp/user/ path:

 /csp/user/mirror_status.cxw

Note: A CSP license is not consumed by invoking a Mirror Status check.

Depending on whether the target instance is the active primary member, the Gateway will return one of the following responses:

** Success (Is the Primary Member)
===============================

   HTTP/1.1 200 OK
   Content-Type: text/plain
   Connection: close
   Content-Length: 7

   SUCCESS

** Failure (Is not the Primary Member)
===============================

   HTTP/1.1 503 Service Unavailable
   Content-Type: text/plain
   Connection: close
   Content-Length: 6

   FAILED

** Failure (The Caché server does not support the mirror_status.cxw request)
===============================

   HTTP/1.1 500 Internal Server Error
   Content-Type: text/plain
   Connection: close
   Content-Length: 6

   FAILED

Mark Bolinsky · Aug 3, 2018

Thanks Thomas.  Great article!  

One recommendation I would like to add: with VM-based snapshot backups, we recommend NOT including the VM's memory state as part of the snapshot.  This greatly reduces the time the VM is "stunned" or paused, which could otherwise approach or exceed the QoS value.  Not including the memory state in the VM snapshot is OK for the database because recovery never relies on information in memory (assuming the appropriate ExternalFreeze and ExternalThaw APIs are used), since all database writes are frozen during the snapshot (journal writes still occur).
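
For reference, a minimal sketch of how a snapshot script typically wraps the snapshot with those APIs (commonly invoked from the OS-level script via csession against the %SYS namespace; error handling is intentionally simplified here):

   ; Freeze database writes (journaling continues), take the snapshot, then thaw
   Set sc = ##class(Backup.General).ExternalFreeze()
   If $SYSTEM.Status.IsOK(sc) {
       ; ... trigger the VM/storage snapshot here, without the memory state ...
       Set sc = ##class(Backup.General).ExternalThaw()
   }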

Mark Bolinsky · Aug 15, 2018

Hi Jason,

We are working on a similar utility for writes now, to support either a write-only or a mixed read/write workload.  I hope to have it posted to the community in the next few weeks.

Kind regards,

Mark B-

Mark Bolinsky · Oct 29, 2018

Hi Ashish,

We are actively working with Nutanix on a potential example reference architecture, but nothing imminent at this time.  The challenge with HCI solutions, Nutanix being one of them, is that there is more to the solution than just the nodes themselves.  The network topology and switches play a very important role.

Additionally, performance with HCI solutions is good... until it isn't.  What I mean by that is that performance can be good with HCI/SDDC solutions; however, maintaining the expected performance during node failures and/or maintenance periods is the key.  Not all SSDs are created equal, so consideration of storage access performance in all situations, such as normal operations, failure conditions, and node rebuild/rebalancing, is important.  Data locality also plays a large role with HCI, and in some HCI solutions so does the working dataset size (i.e., a larger dataset with random access patterns can have an adverse and unexpected impact on storage latency).

Here's a link to an article I authored regarding our current experiences and general recommendations with HCI and SDDC-based solutions.

https://community.intersystems.com/post/software-defined-data-centers-sddc-and-hyper-converged-infrastructure-hci-–-important

So, in general, be careful when considering any HCI/SDDC solution not to fall for the HCI marketing hype or promises of being "low cost".  Be sure to consider failure/rebuild scenarios when sizing your HCI cluster.  Many times the often-quoted "4-node cluster" just isn't ideal, and more nodes may be necessary to support performance during failure/maintenance situations within a cluster.  We have come across many of these situations, so test, test, test.  :)

Kind regards,

Mark B

Mark Bolinsky · Nov 28, 2018

Hi Scott,

Have you looked at using the Ensemble Enterprise Monitor?  This provides a centralized "single pane of glass" dashboard-type display across multiple productions.  Details on using it can be found here in the Ensemble documentation.

Regards,
Mark B-

Mark Bolinsky · Jul 31, 2019

Hi Alexey,

I can help with your question.  The reason it works this way is that you can't (or at least shouldn't) have a database file (CACHE.DAT or IRIS.DAT) opened in contending modes (both unbuffered and buffered) without risking file corruption or stale data.  The actual writing of the online backup CBK file can be a buffered write because it is independent of the DB, as you mentioned, but the reads of the database blocks by the online backup utility will be unbuffered direct IO reads.  This is where the slowdown may occur: from reading the database blocks, not from writing the CBK backup file.

Regards,
Mark B-