Inside Vexor

Vexor is a cloud continuous integration service for developers that lets teams test their projects effectively and pay only for the resources they actually use.

History

The project grew out of internal development at Evrone. The company used Jenkins to run its tests, but when it became obvious that the CI service would need to be adapted to the team's specific tasks, it was replaced with GitLab CI. There were several reasons for this:

  • GitLab CI is written in Ruby, the team's native language;
  • It is small and simple, which means it can easily be modified.

Over time, GitLab CI mutated quite strongly in the team's hands, and when it no longer had much in common with the original product, the team simply rewrote everything from scratch. That rewrite became the first version of Vexor, which was used only inside the company.

Evrone develops several projects simultaneously. Some of them are very large and need many tests on each commit, so a lot of servers always have to be kept running for the workers, and paid for in full.

But if you think about it, two things become clear:

  • At night and on weekends, there is no need to rent servers;
  • If the team is large and the workflow means many commits land at the same time, the number of parallel test runs grows very quickly. For example, with weekly iterations, several features are usually released at the end of an iteration, 5-20 callbacks happen at the same time, and every callback runs the tests. In a situation like that you may need 20+ workers.

Obviously, workers need to be brought up and removed automatically, based on current demand.

The first version of automatic scaling was written in a couple of hours on top of Amazon EC2. The implementation was very naive, but even so it reduced the server bills, and CI became much more stable because a sudden "flood" of tests could no longer cause a worker shortage. The cloud integration has since been reworked several times.

Now the server pool lives in the cloud and is controlled by a separate application to which the workers connect. The application monitors their status (alive / dropped / failed to start / no work) and automatically resizes the pool based on the current state of the workers, the size of the task queue, and an approximate estimate of how long the queued tasks will take to execute.
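
As an illustration only (not Vexor's actual code), a pool-sizing decision like the one described above might look roughly like this in Ruby; the names, bounds and wait-time target are hypothetical:

# Hypothetical pool-sizing heuristic: choose a worker count that can drain
# the current queue within a target wait time, bounded by fixed limits.
MIN_WORKERS     = 2
MAX_WORKERS     = 50
TARGET_WAIT_SEC = 300   # longest a queued build should have to wait

# queue_size         - number of builds waiting in the queue
# avg_build_time_sec - rolling estimate of how long one build takes
# busy_workers       - workers currently running builds
def desired_pool_size(queue_size, avg_build_time_sec, busy_workers)
  needed_for_queue = (queue_size * avg_build_time_sec / TARGET_WAIT_SEC.to_f).ceil
  [[busy_workers + needed_for_queue, MIN_WORKERS].max, MAX_WORKERS].min
end

# Example: 30 queued builds averaging 8 minutes each, 5 busy workers.
puts desired_pool_size(30, 480, 5)  # => 50 (capped at MAX_WORKERS)

The real service also has to account for workers that dropped or failed to start, but the core idea is the same: compare queue demand against current capacity and resize the pool accordingly.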

In the cloud, the relationship between hardware configuration and cost is strictly linear. You want a machine twice as powerful, you pay $10; twice as powerful again, another $20, and so on. It's simple.
 

Oleg Balbekov
Vexor CEO

Originally, Amazon EC2 was used as the cloud. But on Amazon, the disks attached to a server aren't physically located on the same host; they sit in separate storage connected over the network. When the disks are used intensively (and test run speed depends heavily on disk speed), throughput is limited by the bandwidth of the channel to the storage, and Amazon solves this only for an additional fee. Evrone considered other options: Rackspace, DigitalOcean, Azure and GCE. After comparing them, the team chose Rackspace.

Architecture

Vexor architecture

Vexor CI is not a monolithic application, but a set of related applications.

RabbitMQ is used for communication between them. In our case, Rabbit is a good fit:

  • It supports message acknowledgements and publisher confirms. This solves a lot of problems and, in particular, makes it possible to write applications in the popular Erlang-style "let it crash" manner: if anything goes wrong, the whole system "drops", but as soon as the service returns to a normal state all the tasks are processed and none of them are lost (see the sketch after this list).
  • RabbitMQ is a broker that allows building branched topologies of queues and exchange points and configuring the routing between them. This makes it easy, for example, to test new versions of services in the production environment against current production tasks.
  • RabbitMQ handles large messages reliably. The current record is 120 MB in a single message. Vexor doesn't need to process millions of messages per minute, but a single message can weigh tens of megabytes or more (for example, when logs are passed along).
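
As a minimal sketch (not Vexor's actual code), the acknowledgement pattern with the bunny gem looks roughly like this; the queue name and the run_build step are hypothetical:

require "bunny"
require "json"

conn = Bunny.new(ENV.fetch("AMQP_URL", "amqp://guest:guest@localhost"))
conn.start

channel = conn.create_channel
channel.confirm_select                      # enable publisher confirms
queue = channel.queue("builds", durable: true)

# Publish a task and block until the broker confirms it has accepted it.
queue.publish({ build_id: 42 }.to_json, persistent: true)
channel.wait_for_confirms

# Consume with manual acknowledgements: if the worker crashes before acking,
# the broker re-queues the message, so the task is not lost.
queue.subscribe(manual_ack: true, block: false) do |delivery_info, _props, payload|
  run_build(JSON.parse(payload))            # hypothetical processing step
  channel.ack(delivery_info.delivery_tag)
end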

RabbitMQ also has known shortcomings that have to be dealt with:

  • It requires a perfectly stable network between the clients and the server. Ideally, the server should be on the same physical host as the clients. Otherwise Rabbit's clients behave like canaries in a coal mine: they "drop" on any problem that no other service even notices.
  • It's difficult to provide high availability with RabbitMQ. There are three solutions for this, but only federation and shovel provide real high availability, and unlike a cluster they aren't easy to integrate into an existing application architecture, since they do not guarantee data consistency.

Since the servers are physically located in several data centers, and the worker pool can switch to another data center in case of problems with Rackspace, federation was used to ensure stable operation of RabbitMQ.

Logs

An SOA architecture brings another difficulty: gathering logs becomes a non-trivial task. With only a couple of applications you don't worry about it: the logs sit on a few hosts that you can reach and pull what you need from. But when there are many applications and a single event is processed by several services, a dedicated logging service is needed.

In Vexor, the elasticsearch + logstash + logstash-forwarder stack is responsible for this. All application logs are written directly in JSON format, every application event is logged, and PostgreSQL, RabbitMQ and Docker logs as well as other system messages (dmesg, mail and so on) are collected as well. The team tries to log everything, because workers only live for a limited time: once a worker's server is gone, the logs are the only way to learn anything about a problem.
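
A minimal sketch of the kind of JSON-per-line application logging such a stack can ingest (the field names and service name are illustrative, not Vexor's actual schema):

require "logger"
require "json"
require "time"

logger = Logger.new($stdout)
logger.formatter = proc do |severity, time, _progname, msg|
  # One JSON object per line is trivial for logstash to parse and ship.
  JSON.dump(
    "@timestamp" => time.utc.iso8601,
    "severity"   => severity,
    "service"    => "worker",     # illustrative service name
    "message"    => msg
  ) + "\n"
end

logger.info("build started")
# emits a line like: {"@timestamp":"...","severity":"INFO","service":"worker","message":"build started"}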

Container

Docker is used to run tests on the workers. It's a great solution for working with isolated containers and provides all the necessary tooling. Docker now works very stably and causes a minimum of problems (especially on an OS with a fresh kernel), although bugs do still occasionally turn up.

Tests in Vexor are launched in a container based on Ubuntu 14.04 with all popular services and required libraries preinstalled. The image is updated regularly, so the set of preinstalled software stays fresh.

To use a single image for all supported languages without making it too big, the required language versions (Ruby, Python, Node.js, Go and the rest of the supported languages) are installed from packages when a build starts. This takes a few seconds and makes it easy to maintain a large set of language versions without bloating the image.

The image's deb packages are rebuilt at least once a week. If, for example, you use Ubuntu 14.04 amd64, you get 12 versions of Ruby, already compiled with the latest versions of bundler and gem and fully ready to install.

To avoid running apt-get update when installing packages at runtime, and to support fuzzy matching of versions, a utility was written that quickly installs packages of the required versions from the team's repository. For example:

$ time vxvm install go 1.3
Installed to /opt/vexor/packages/go-1.3
...
real 0m3.765s

Configuration

Ideally, Vexor figures out on its own what it needs to run your project. The team is working on automatically detecting which technologies a project needs and running them, but this isn't always possible, so for unrecognized cases the app asks the user to add a configuration file to the project.

In order not to reinvent the wheel, the .travis.yml configuration format is used. Travis CI is one of the most popular and best-known services today, and it helps if users face minimal friction when moving from it. After all, if a .travis.yml is already in the project's root directory, everything starts working instantly.
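
For illustration, a minimal .travis.yml in the standard Travis format; the language, Ruby version and commands below are only an example, not a recommended Vexor configuration:

language: ruby
rvm:
  - 2.2
services:
  - postgresql
before_script:
  - bundle exec rake db:setup
script:
  - bundle exec rspec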

We believe people shouldn't have to write the configuration themselves. The user simply adds a project to us, and we determine what to run for the tests and how. Our service is not only about providing resources for tests, but also about saving people time.
 

Oleg Balbekov
Vexor CEO

Servers

A lot of servers are used and a lot of tasks run on them, so tools such as Ansible, Packer and Vagrant are actively used. Ansible is responsible for provisioning and configuring servers and handles these tasks well. Packer and Vagrant are used to build and test the Docker images and the worker servers. Vexor creates the proper images automatically.
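
As a rough illustration of the Ansible side (the host group, package and service below are hypothetical, not Evrone's actual playbooks), provisioning a worker host might look like this:

# Minimal Ansible play for provisioning a worker host (illustrative).
- hosts: workers
  become: true
  tasks:
    - name: Install Docker
      apt:
        name: docker.io
        state: present
        update_cache: true

    - name: Make sure Docker is running and starts on boot
      service:
        name: docker
        state: started
        enabled: true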

Who is Vexor a good fit for?

Small projects that run a small number of tests infrequently, don't want to pay much or think about system administration and deployment, and still want to enjoy the benefits of continuous integration.

Large projects with many tests get an unlimited amount of resources and test parallelization, which speeds up their runs several times over.

Projects with a big team have the problem of queueing for test runs solved: any number of tests can now run simultaneously, eliminating long waits.

Use Vexor to get rid of these restrictions and unfair pricing. Sign up and receive $5 on your account to try Vexor out.

Vexor.io

Which CI do you use?

In software engineering, continuous integration (CI) is the practice of merging all developer working copies to a shared mainline several times a day. There are many different continuous integration solutions, each with its strengths and weaknesses.
Take part in our portal's survey: which continuous integration system do you use?

Vexor at HighLoad++ 2017

Alexandr Kirillov spoke about building a cluster to compute thousands of high-CPU / high-MEM tasks at one of the biggest Russian IT conferences.

HighLoad++ is a professional conference for developers of high-load systems and the key professional event for everyone involved in creating large, heavily trafficked, and complex projects.

The main purpose of the event is the exchange of knowledge and experience among leading developers of high-performance systems that support millions of users simultaneously.

The agenda covers all crucial aspects of web development, such as:

  • large-scale architectures,
  • databases and storage systems,
  • system administration,
  • load testing,
  • project maintenance, etc.

This year the conference program was full of current trends: IoT, blockchain, neural networks, and artificial intelligence, as well as architecture and front-end performance.

The 11th HighLoad++ conference took place on the 7th and 8th of November 2017. 

  • 66% of participants work in large companies (30+ employees),
  • 60% earn above the market,
  • 55% hold leadership positions and have subordinates,
  • 9% of conference visitors work as technical directors,
  • 12% work as heads of technical departments, and 29% work as lead developers and team leads.

Alexandr Kirillov, CTO at Evrone, gave a talk at HighLoad++ 2017: "How to build a cluster to calculate thousands of high-CPU / high-MEM tasks and not go broke".

Alexandr Kirillov at HighLoad++ 2017
 

Our project is a cloud-based CI service where users run tests for the projects they develop.
This year our auto-purchasing system bought 37,218 machines (Amazon instances). This allowed us to process 189,488 "tasks" (test runs) for our customers.
 

Tests are always resource-intensive tasks that consume the maximum of CPU and memory. We cannot predict how many parallel computations there will be, or when. We faced the task of building a system architecture that can very quickly increase, and just as rapidly reduce, the capacity of the cluster.
 

All of this was complicated by the fact that the resource-intensive computations didn't let us use the standard AWS or Google Compute Engine tools. We decided to write our own automatic scaling system that takes the requirements of our domain into account.
 

Alexandr Kirillov
CTO, Evrone

In his talk, Alexandr described how they designed and built the architecture of the service, that is, the system that automatically procures machines.

He also went into more detail about the main architectural building blocks of projects that solve similar problems.