Service State Migration in MCN Project

Mobile Cloud Networking (MCN) is a large EU project involving several leading companies, research centers, and universities. It aims to explore how a wide range of on-demand telco services can be delivered efficiently at very large scale. The main MCN goal is to provide innovative and effective solutions that enable dynamic network functions and self-adaptation to mobility by exploiting and extending cloud computing techniques, in order to ease the deployment and operation of future mobile telco services through self-management, self-maintenance, on-premises design, and operations control functions. In particular, the typical key requirements of highly dynamic and distributed systems (e.g., latency, mobility, and scalability) have been carefully explored by exploiting cutting-edge cloud computing technologies and a smart on-demand deployment and distribution of mobile network functions, providing mobile services independently of physical location.

The main MCN objectives are to develop a novel mobile architecture and technologies to create a fully cloud-based system, and to extend cloud computing beyond datacenters to the edge of the network, towards mobile end-users. Cloud networking is explored as a mechanism to support on-demand and elastic provisioning of mobile services, implementing a platform that processes and stores data near the end-points in order to enhance performance and deliver services in an elastic and dynamic way. The MCN architecture is highly modular, and its key concept is to combine different services into more complex end-to-end (E2E) services. The MCN Service Management Framework provides the means to compose and orchestrate the MCN operations across multiple domains and service types, creating the E2E composition. MCN enables developers to build upon MCN services so that they can compose and orchestrate their own services, delivering much-needed additional value and new revenue streams. Let us note that the MCN project started as a parallel activity to NFV and covers what NFV does, while going beyond NFV in some key areas such as end-to-end service composition and the exposure of scalability effects, when needed, between the different services [28]. These dependencies are acted upon specifically by the Service Orchestrator (SO) “Resolver” component, as detailed in the following. The MCN architecture has two key aspects: the lifecycle and the architectural entities. The technical phase of the lifecycle includes all activities from the technical design all the way through to the technical disposal of a service:

Design of the architecture, implementation, deployment, provisioning and operation solutions. Supports the Service Owner to “design” their service.
Implementation of the designed architecture, functions, interfaces, controllers, APIs, etc.
Deployment of the implemented elements, e.g., data centers, cloud controllers, etc.
Provisioning of the service environment (e.g., network functions, interfaces, etc.).
Operation and Runtime Management. In this stage the service instance is ready and running. Activities such as scaling and reconfiguration of SICs are carried out here.
Disposal of the service. The release of the SICs and of the service instance itself is carried out here.

Fig. 1 illustrates the MCN architecture and Fig. 2 illustrates the MCN architecture portion related to the services we use.

Design

We adopted service instance migration in order to migrate the whole state of a service on the fly. The overall goal of the proposed system is to realize a coordinated set of mechanisms that move all the internal state of a service instance to another instance created from scratch at runtime, with minimal interruption of the service. For our proposal, we designed and implemented, apart from the placement decision, all the orchestration and migration activities outlined previously and commonly used during service state migration:

The service is continuously monitored through the MaaS functionality, integrated with Zabbix in our testbed environment; monitoring is the main potential trigger for migrations.
Based on the collected data, Zabbix can activate triggers that send alerts or run unsupervised actions to resolve issues automatically.
To manage the values gathered by the monitoring system, we implement black-box and gray-box techniques. This keeps us completely agnostic of the application: decisions are made by observing each VM from the outside, without any knowledge of the service provided.
When the migration trigger is activated, the migration procedure takes place. It starts with the SO creating a new state skeleton, which prepares the target to receive the state to be moved.
Once the new VMs are ready, we start the actual migration of all data from the previous, overloaded service instance to the new target one. We migrate the service state with an initial push phase, in which all the data on the origin service is moved to the destination service, followed by a phase that sends the residual data to the destination.
After the whole state has been moved, our implementation deletes the stack of the old service instance to release its resources.
Finally, we reset the buffered monitoring values read from Zabbix so that future triggering actions are possible.

Going into more detail, Figure 3 illustrates the steps of the service state migration process.

Migration Sequence Diagram — Figure 3. State migration process.

Apart from the typical initial SM/SO activities that deploy and provision a new RCBaaS service (1) (2), the first step of the state migration is the triggering of the whole migration. The actual resource usage values provided by Zabbix (3.1) (3.2) are stored in a sliding-window array with a fixed buffer length that can easily be configured programmatically. Buffering a time series of monitoring samples enables algorithms for resource usage analysis and prediction. In our case, as an external triggering decision algorithm, we implemented a lightweight first-order Grey Model filtering module (4). When the migration trigger is activated, the steps already explained in Figure 2 start: preparing the target to receive data (5); migrating data to the target VMs (6); disposing of the old stack (7); resetting the buffered monitoring values (8). Finally, we go back to storing monitoring data from the hosts and restart the loop.
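The loop in Figure 3 can be sketched roughly as follows. This is a minimal illustration and not the project's SO code: every callable passed in (`read_load`, `trigger`, `prepare_target`, `migrate_data`, `dispose_old_stack`) is a hypothetical placeholder for the corresponding step, and a real loop would also wait between two consecutive readings.

```python
from collections import deque
from typing import Callable, Deque, List


def new_window(length: int) -> Deque[float]:
    """Fixed-length sliding window: appending to a full deque drops the oldest value."""
    return deque(maxlen=length)


def monitor_and_migrate(
    read_load: Callable[[], float],          # hypothetical: one monitoring sample
    trigger: Callable[[List[float]], bool],  # hypothetical: migration decision
    prepare_target: Callable[[], str],       # hypothetical: returns a target identifier
    migrate_data: Callable[[str], None],     # hypothetical: moves the state
    dispose_old_stack: Callable[[], None],   # hypothetical: releases the old stack
    window: Deque[float],
    max_samples: int,
) -> bool:
    """Read up to max_samples values; run one migration if the trigger fires.

    Returns True if a migration was performed, False otherwise.
    """
    for _ in range(max_samples):
        window.append(read_load())       # (3) buffer the monitoring value
        if not trigger(list(window)):    # (4) triggering decision
            continue
        target = prepare_target()        # (5) prepare the target to receive data
        migrate_data(target)             # (6) migrate data to the target
        dispose_old_stack()              # (7) dispose of the old stack
        window.clear()                   # (8) reset the buffered monitoring values
        return True
    return False
```

Passing the steps in as callables keeps the sketch independent of any particular monitoring or orchestration backend.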

We can divide the implementation into three parts. We start from the division of RCBaaS into smaller and more specific VM instances, which lets us monitor the effective resource consumption more efficiently and accurately. We then detail the definition of the monitoring that keeps track of this consumption. Finally, we concentrate on the service state migration that moves data on the fly during service usage.

The source code is available here

PRELIMINARY WORK ON SERVICE SEPARATION

RCBaaS is composed of two main components that interact very frequently: Cyclops and InfluxDB.

Cyclops is the core component of RCBaaS and contains all the accounting and billing logic of the service. Cyclops is divided into three micro-services: user data records (udr) collects the usage data from a source (e.g., OpenStack, CloudStack, SaaS, PaaS) and stores it in the database; rating and charging (rc) uses the usage data records generated by the udr to calculate the charge data records according to the cloud resource rate; billing interfaces with the rc to generate an invoice.

InfluxDB is an open-source time-series database, particularly suitable for keeping track of large amounts of data such as sensor data, application metrics, and real-time analytics. It is therefore the backend of RCBaaS: the repository of the service monitoring metrics, which keeps the whole history of the service measurements.

To apply the service state migration procedure and mechanisms described above to a practical and valuable case, we focused on the RCBaaS monitoring service, which we had treated as a monolithic VM. For the sake of maximum separation, we needed to split it into a pair of disjoint, only dynamically bound VMs: RCBaaS-VM, which performs the core operations of the monolithic service and contains only the Cyclops component; and InfluxDB-VM, which contains the backend component where data is stored. We separate the service through OpenStack Heat, the main orchestration program inside OpenStack; its orchestration engine launches multiple composite cloud applications based on a text file, i.e., the Heat template file. We configure the Heat template so that, when launched, it creates two different VMs for Cyclops and InfluxDB and, after the creation, invokes two script files located on the Cyclops-VM that automatically configure the IP address of the InfluxDB-VM to which data is sent for storage. Figure 4 outlines the script that configures the IP address in the Cyclops-VM by editing three configuration files, one for each Cyclops micro-service.
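The idea of rewriting the database address in the three micro-service configuration files can be sketched as below. This is not the script of Figure 4: the `InfluxDBURL=http://host:port` line layout and the file names used in the usage note are assumptions made for illustration only, and the real Cyclops configuration format may differ.

```python
import re
from pathlib import Path
from typing import Iterable

# Assumption: each config file holds the database URL on a line such as
#   InfluxDBURL=http://10.0.0.5:8086
# The key name and layout are illustrative, not the real Cyclops format.
_URL_LINE = re.compile(
    r"^(?P<key>\s*InfluxDBURL\s*=\s*http://)[^:/\s]+", re.MULTILINE
)


def set_influxdb_ip(config_files: Iterable[Path], new_ip: str) -> int:
    """Replace the host part of the InfluxDB URL in every given file.

    Returns the number of files whose content actually changed.
    """
    changed = 0
    for path in config_files:
        text = path.read_text()
        # A function is used as replacement so digits in the IP are never
        # misread as regex group references.
        updated = _URL_LINE.sub(lambda m: m.group("key") + new_ip, text)
        if updated != text:
            path.write_text(updated)
            changed += 1
    return changed
```

A call could look like `set_influxdb_ip([Path("udr.conf"), Path("rc.conf"), Path("billing.conf")], "192.168.1.7")`, where the three file names are hypothetical.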

MONITORING SYSTEM

The monitoring system is based on two main components: MaaS, which uses Zabbix to retrieve resource information from the physical hosts; and the Grey Model, which predicts the next value one step ahead. MaaS runs a Zabbix server that communicates with distributed monitoring agents instantiated during service provisioning and aggregates the resource information retrieved from them. Each monitoring agent is designed to collect networking statistics, process and normalize the raw monitoring data retrieved, and expose it to the Zabbix server provided by MaaS. Every deployed service that needs to integrate MaaS for resource monitoring requires the installation and configuration of a Zabbix agent, as shown in Figure 5. This is done through the Heat template file when the VM is created; the agent then actively monitors resources by interacting with the Zabbix server on MaaS.

Next, in the SO implementation we set which monitoring information to retrieve, apply the Grey Model, and define the parameters that activate the trigger starting the full state migration. Figure 6 shows a snippet of the SO implementation for the monitoring part. We monitor the CPU load to evaluate whether the host is overloaded. The CPU load reflects the length of the queue of processes that are waiting to be processed, and it is a common and widespread parameter for accurately detecting the host workload. We set 10 as the CPU load threshold for the InfluxDB-VM; given that the VM has 2 vCPUs, the trigger is activated when at least 5 processes per core are waiting to be processed. The CPU load value considered is the one returned by the Grey Model, computed on the last five values read from MaaS and stored in a sliding-window array. At startup, we require a minimum of 3 readings before invoking the Grey Model, in order to avoid false positives, i.e., a trigger activation caused by a few anomalous readings. Finally, the monitoring values are retrieved from MaaS every minute, which is the sensitivity of Zabbix and the minimal time interval for retrieving data; this gives a relatively fine-grained periodicity that balances overhead and responsiveness.
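To make the trigger logic concrete, here is a minimal sketch of a first-order Grey Model, GM(1,1), used as a one-step-ahead predictor over the sliding window. It is not the code of Figure 6: the function names are illustrative, and the standard textbook GM(1,1) formulation is assumed, since the text does not specify which variant is used. The threshold, window length, and minimum number of readings are the values stated above.

```python
import math
from collections import deque
from typing import Iterable

CPU_LOAD_THRESHOLD = 10.0  # from the text: threshold for the InfluxDB-VM
WINDOW_LENGTH = 5          # from the text: last five readings
MIN_READINGS = 3           # from the text: minimum readings at startup


def gm11_next(samples: Iterable[float]) -> float:
    """One-step-ahead GM(1,1) forecast for a non-negative series."""
    values = list(samples)
    n = len(values)
    if n < 3:
        raise ValueError("GM(1,1) needs at least three samples")
    # Accumulated series x1 and the means z of its consecutive terms
    x1, total = [], 0.0
    for v in values:
        total += v
        x1.append(total)
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]
    y = values[1:]
    # Ordinary least-squares fit of y = -a * z + b
    m = len(z)
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(p * q for p, q in zip(z, y))
    denom = m * szz - sz * sz
    if abs(denom) < 1e-12:       # degenerate case, e.g. an all-zero series
        return sy / m
    slope = (m * szy - sz * sy) / denom
    a, b = -slope, (sy - slope * sz) / m
    if abs(a) < 1e-12:           # no growth or decay: forecast is constant
        return b
    c = values[0] - b / a
    # Difference of two consecutive terms of the fitted accumulated series
    return c * (math.exp(-a * n) - math.exp(-a * (n - 1)))


def should_migrate(window: Iterable[float],
                   threshold: float = CPU_LOAD_THRESHOLD) -> bool:
    """True when enough readings exist and the forecast reaches the threshold."""
    readings = list(window)
    if len(readings) < MIN_READINGS:
        return False
    return gm11_next(readings) >= threshold
```

With `window = deque(maxlen=WINDOW_LENGTH)`, each new reading is appended and `should_migrate(window)` is evaluated once per monitoring interval.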

SERVICE INSTANCE MIGRATION IMPLEMENTATION

In this RCBaaS case, the service instance migration step consists of two phases: i) a complete InfluxDB dump from the old to the new instance; ii) storage and migration of the new data inserted into the old database instance while the migration is occurring and after the dump operation. The database dump is the core operation of the state migration: all the data of the old InfluxDB instance is moved to the new instance. To reduce the duration of this phase, we followed the main design guideline of compressing all the data of the old InfluxDB into an archive that is moved to the new InfluxDB VM, where it is extracted and placed into the target InfluxDB data folders.
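The compress-and-extract part of the dump can be sketched with the Python standard library as follows. This is only an illustration under assumptions: the directory paths are whatever the caller passes in, the transfer of the archive between the two VMs (for example with scp) is left out, and the actual migration script may rely on shell tools instead.

```python
import tarfile
from pathlib import Path


def pack_data_dir(data_dir: Path, archive: Path) -> Path:
    """Compress the whole data directory into a .tar.gz archive."""
    with tarfile.open(archive, "w:gz") as tar:
        # arcname="." stores member paths relative to data_dir
        tar.add(data_dir, arcname=".")
    return archive


def unpack_data_dir(archive: Path, target_dir: Path) -> None:
    """Extract the archive into the target data folder, creating it if needed."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Only extract archives you created yourself: extractall trusts member paths
        tar.extractall(target_dir)
```

On the old instance one would call `pack_data_dir` on the database data folder, and on the new instance `unpack_data_dir` on the received archive before restarting the database.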

The dump operation can have a non-negligible duration, even tens of seconds, depending on the number of database instances and records to transfer. This is due both to external operations (e.g., data compression, movement, extraction, and database restart) and to the internal database operations needed to synchronize to the new status. To minimize the database unavailability time and thus preserve overall service continuity, we perform the second phase as follows. As soon as the old data has been copied into the archive, all dumped data in the old databases is dropped. This guarantees that any data found there afterwards was inserted after the dump and has not yet been transferred, and it also further relieves the load on the old database. When the database dump has completed and the target InfluxDB has been configured and made available, we select all the new data at the old instance and move it to the target InfluxDB, merging it with the data already migrated during the dump. At a finer level of implementation detail, this mechanism required saving these “during-migration” entries into a JSON file and then converting them, through a Python script, into a LineProtocol format file used by InfluxDB to insert data on the fly; this copy of the new data to the target InfluxDB instance completes the data migration step. Figure 7 graphically summarizes all the operations performed during our data migration process, distinguishing the actions executed on the old InfluxDB VM instance from those run on the new one. In particular, the first three blocks from the left refer to the database dump operation (phase i) and the last two blocks to the storage of the new data inserted during the migration process (phase ii).
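The JSON-to-LineProtocol conversion can be illustrated with the sketch below. It is not the project's script. It assumes that the JSON file has the shape of an InfluxDB HTTP query response (`results` → `series` → `name`, `columns`, `values`), that timestamps were requested as integer epochs, and that every non-time column is a field; tags are only taken from a `tags` object when the series carries one.

```python
import json
from typing import Iterator


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each of the given characters."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _field_value(value) -> str:
    """Render one field value in line protocol syntax."""
    if isinstance(value, bool):  # bool first: it is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def json_to_line_protocol(payload: str) -> Iterator[str]:
    """Yield one line protocol entry per row of the query response."""
    doc = json.loads(payload)
    for result in doc.get("results", []):
        for series in result.get("series", []):
            name = _escape(series["name"], ", ")
            tags = "".join(
                f",{_escape(k, ',= ')}={_escape(str(v), ',= ')}"
                for k, v in sorted(series.get("tags", {}).items())
            )
            columns = series["columns"]
            for row in series["values"]:
                point = dict(zip(columns, row))
                timestamp = point.pop("time")
                fields = ",".join(
                    f"{_escape(k, ',= ')}={_field_value(v)}"
                    for k, v in point.items()
                    if v is not None  # null columns are simply skipped
                )
                if fields:
                    yield f"{name}{tags} {fields} {timestamp}"
```

One caveat of this simplification: a float field whose value happens to be integral may come back from JSON as an integer and would be written with the `i` suffix, so a production converter would need the real field types.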

Migration Steps — Figure 7. Essential steps of the proposed data migration process.

Figure 8 shows the code used inside the SO component to invoke the migration script and move data between the two instances. We use a Python library that sends a command to a remote host over an SSH connection. In this way, the SO implementation can access the new instance as in a typical SSH session and execute the script already prepared on the newly created InfluxDB-VM instance, passing as a parameter the IP address of the old instance to migrate. We place the code in a loop because the creation of a new VM may take a few seconds, so we keep trying to connect to the newly created VM until it is fully started.
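The retry loop can be sketched as follows. This is not the code of Figure 8: instead of the (unnamed) Python SSH library used in the SO, the sketch calls the system `ssh` client through `subprocess`, and the number of attempts and the delay are arbitrary illustrative values. It relies on the documented behaviour that `ssh` exits with status 255 when the connection itself fails.

```python
import subprocess
import time
from typing import Callable, Sequence

SSH_CONNECTION_ERROR = 255  # exit status of ssh when it cannot connect


def run_over_ssh(host: str, command: Sequence[str]) -> int:
    """Run a command on a remote host with the system ssh client."""
    result = subprocess.run(
        ["ssh", "-o", "ConnectTimeout=5", "-o", "BatchMode=yes", host, *command]
    )
    return result.returncode


def run_when_reachable(
    host: str,
    command: Sequence[str],
    runner: Callable[[str, Sequence[str]], int] = run_over_ssh,
    attempts: int = 30,        # illustrative value
    delay_s: float = 5.0,      # illustrative value
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry until the VM accepts the connection and the command succeeds.

    Returns False if the host never became reachable. A failure of the remote
    command itself is not retried, to avoid running the migration twice.
    """
    for _ in range(attempts):
        status = runner(host, command)
        if status == 0:
            return True
        if status != SSH_CONNECTION_ERROR:
            raise RuntimeError(f"remote command failed with status {status}")
        sleep(delay_s)  # VM probably still booting: wait and try again
    return False
```

A call such as `run_when_reachable(new_vm_ip, ["./migrate.sh", old_vm_ip])` would mirror the behaviour described above; the script name `migrate.sh` is hypothetical.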

Focusing on the most technically challenging step, data migration, Figure 9 shows the script on the InfluxDB-VM that migrates data. As described above, it consists of two phases in this RCBaaS case: i) a complete InfluxDB dump from the old to the new instance; ii) storage and migration of the new data inserted into the old database instance while the migration is occurring and after the dump operation.

TEST ENVIRONMENT

We performed several tests that cover all the steps and phases discussed above, deploying stacks on the Bart OpenStack platform and using RegionOne as the default region. Bart is an OpenStack testbed provided by the MCN consortium that runs the basic OpenStack services, based on the Kilo release.

ICCLab's Bart OpenStack cloud consists of 1 controller node and 4 compute nodes, each a Lynx CALLEO 1240 server with the following characteristics:

Model: SA1240A304R (1U)
Processor: 2x Intel® Xeon® E5620 (4 cores)
Memory: 8x 8 GB DDR3 SDRAM, 1333 MHz, reg. ECC
Disk: 4x 1 TB Enterprise SATA-3 hard disk, 7200 rpm, 6 Gb/s (Seagate ST1000NM0011)

Each node of this testbed is connected through 1 Gbps Ethernet links to an HP ProCurve 2910AL switch and, through a 1 Gbps link, to the ZHAW university network. The testbed has been allocated 32 public IPs in the 160.85.4.0/24 block, which allows collaborative work to be conducted on it.

This testbed can be easily modified to add more capacity if needed.

RESULTS

We stressed RCBaaS with a wide range of different workloads in order to observe the performance of the newly introduced function and the limited perceived unavailability time that we are able to provide notwithstanding dynamic state migration and synchronization. All the performance results reported in the following are average values measured across multiple runs, with an overall low variance (<5%).

Table 1 shows the performance of: service initialization, Zabbix monitoring, Grey Model usage, creation of the new target stack, and RCBaaS-VM creation. Note that only the monitoring operations, whose cost is negligible in our case, could potentially cause performance issues, because they are repeated continuously during the service life-cycle. The other operations reported are performed only at startup, so they do not introduce any latency during system operation at runtime.

Figure 10 shows the performance of the data migration for different amounts of data to migrate. We report average values measured over multiple runs because the overall performance varies slightly from test to test, mainly depending on the network conditions and on the load of the physical host where the VM is running.

Figure 10. Experimental results.

We divide the overall latency into several components that distinguish the duration of the different phases; in particular, we measure and define the main times as follows:

Tvmconn: time the SO takes to connect to the InfluxDB-VM, or in other words, the latency time between when the trigger becomes active and the data migration starts;
Tcompress_move: time to compress data into a tar.gz archive and move to the new VM instance;
Tdelete: time to delete all the data from the old InfluxDB-VM instance, directly proportional to the number of databases to delete;
Textract: time to extract the archive into the InfluxDB folders of the new VM instance;
Trestart: time to restart the InfluxDB service so that it picks up the new data;
Tsync: time used by the InfluxDB process for internal synchronization after the dump.

Other latencies are related to the storage and insertion of the new data during the migration, that is, the time necessary to: get all measurements, save the retrieved data into a JSON file, convert the JSON file into the LineProtocol format, and insert the data into the databases. We do not report these latencies in the chart in Figure 10 because we assume that the amount of data inserted during the migration is limited and, thus, the associated time is negligible (in the order of 0.1-0.2 s to move a dozen records). Let us stress that, under this assumption that ignores the retrieval time of the new data, the database unavailability during the overall state migration procedure is limited to the deletion of the measurements (Tdelete). This time is proportional to the number of databases present but is always very low: for a typical execution and an average migration it is below 1 s. This guarantees a relatively negligible unavailability and shows the effectiveness of the proposed state migration function and its wide applicability to the state migration of stateful services. Summing up, depending on the size of the state to be migrated, the overall service migration time ranges from 112 seconds for up to two million records (namely, 100 seconds to set up the target VM and 12 seconds for data migration) to 590 seconds for 100 million records. In any case, we achieve a fully scalable behavior with good overall performance, mainly limited by the InfluxDB internal operations (Tsync). These are the real bottleneck of the solution, even though they affect only the duration of the migration and not the unavailability time.
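The relation between the measured phases, the overall duration, and the unavailability time can be written down as a small sketch. It assumes that the phases run strictly one after another, so that their durations add up; the class and method names are illustrative and not taken from the project code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationTimes:
    """Phase durations in seconds, named after the definitions above."""

    t_vmconn: float
    t_compress_move: float
    t_delete: float
    t_extract: float
    t_restart: float
    t_sync: float

    def total(self) -> float:
        # Assumption: phases are sequential, so the overall latency is their sum
        return (
            self.t_vmconn
            + self.t_compress_move
            + self.t_delete
            + self.t_extract
            + self.t_restart
            + self.t_sync
        )

    def unavailability(self) -> float:
        # Per the text, only the deletion phase makes the database unavailable
        return self.t_delete
```

Filling the fields with the per-phase measurements of one run makes the point of the paragraph explicit: `t_sync` can dominate `total()` without changing `unavailability()`.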

Alessandro Zanni Ph.D. student in Computer Science

Cloud Computing is an emerging computing platform that brings together, under a single umbrella, research results from the areas of Grid Computing, Hardware and Network Virtualization, and Data Center administration. Research in Cloud Computing pursues the ambitious "computing as a utility" paradigm through public clouds, but also aims at improving private and community cloud infrastructures to answer the specific needs of many industrial settings. Challenges in Cloud Computing include improving allocation algorithms for virtual machines and virtual networks, QoS provisioning and SLA enforcement for cloud users, and enabling interoperability between different cloud vendors through the development of standards for cloud federations.

More information can be found here.

For any suggestion, comment or further detail do not hesitate to contact me.

alessandro.zanni3 AT unibo.it

Department of Computer Science Engineering (DISI), University of Bologna

Via del Risorgimento 2, Bologna, Italy
