Fish Magic

LJ.Rossia.org makes no claim to the content supplied through this journal account. Articles are retrieved via a public feed supplied by the site for this purpose.

Tuesday, July 19th, 2016
9:09 pm
How do you debug a segmentation fault inside a large C macro?
Here's how you can debug a lengthy #define. You will need to know the full compilation command for the object file in your build tool (run VERBOSE=1 make if you're using CMake), and to install GNU indent.

Then:

cc -E file.c | grep -v '^#' | indent > out  # preprocess with your file's real build flags, drop #line markers, reformat
mv out file.c                               # replace the source with its expanded form
make                                        # rebuild your project with the new file


Now it's easy to see which part of the macro is causing the problem.
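
For illustration, here is a hypothetical macro with a buffer overflow hidden inside it. To a debugger, the entire do/while body is a single source line, so a backtrace points at the invocation, not at the faulting statement:

#include <stdio.h>
#include <string.h>

/* Hypothetical example: the bug is the unchecked strcpy() inside the macro. */
#define PROCESS(p) do {                                      \
        char buf[8];                                         \
        strcpy(buf, (p));  /* overflows for long strings */  \
        printf("%s\n", buf);                                 \
} while (0)

int main(void)
{
        const char *msg = "a string much longer than eight bytes";
        PROCESS(msg);      /* gdb reports the crash on this one line */
        return 0;
}

After the cc -E | indent round-trip, the expanded do/while body becomes ordinary multi-line code, and the strcpy() gets a line number of its own.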
Saturday, July 25th, 2015
7:57 am
Concurrent programming is difficult
Pursuing your vision is difficult, especially if that vision is a multi-threaded in-memory database.

A couple of days ago I got a working implementation of a 3-thread layout for the Tarantool core.

Mm... what's that supposed to mean, if Tarantool is single-threaded? First, it's not. The binary log
is written in a separate thread. Checkpointing (snapshotting) is done in a separate thread. Replication
relays run in their own threads. But the transaction processor still runs in one thread only. You're
supposed to shard anyway, so why not begin with sharding on a single multi-core server? That's the idea.

One thing, however, that lived in the transaction processor thread but didn't belong there was the
client/server protocol, which handles the socket I/O.

So, now we have 3 major threads: the network thread, the transaction processor thread, and the
binlog thread. Once in a while a checkpoint thread gets involved as well.
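
To make the layout concrete, here is a minimal sketch, not Tarantool's actual code, of the kind of hand-off between the network thread and the transaction thread: a mutex-protected request queue. As the tuning story below suggests, a naive queue like this makes the mutex a hot spot; the real fix is batching and finer-grained synchronization.

#include <pthread.h>
#include <stdio.h>

/* Toy bounded queue; overflow handling is omitted for brevity. */
#define QSIZE 1024

struct queue {
    int buf[QSIZE];
    unsigned head, tail;            /* head: next read, tail: next write */
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
};

static struct queue q = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .nonempty = PTHREAD_COND_INITIALIZER,
};

static void queue_put(struct queue *q, int req)
{
    pthread_mutex_lock(&q->lock);   /* the "hot mutex" */
    q->buf[q->tail++ % QSIZE] = req;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

static int queue_get(struct queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->head == q->tail)
        pthread_cond_wait(&q->nonempty, &q->lock);
    int req = q->buf[q->head++ % QSIZE];
    pthread_mutex_unlock(&q->lock);
    return req;
}

static void *network_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10; i++)
        queue_put(&q, i);           /* pretend these came from a socket */
    queue_put(&q, -1);              /* shutdown marker */
    return NULL;
}

static void *tx_thread(void *arg)
{
    (void)arg;
    for (;;) {
        int req = queue_get(&q);
        if (req < 0)
            break;
        printf("tx: processed request %d\n", req);
    }
    return NULL;
}

int main(void)
{
    pthread_t net, tx;
    pthread_create(&net, NULL, network_thread, NULL);
    pthread_create(&tx, NULL, tx_thread, NULL);
    pthread_join(net, NULL);
    pthread_join(tx, NULL);
    return 0;
}

Compile with -pthread. Every request pays for two lock/unlock pairs and a context switch; batch the hand-offs and the cost amortizes.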

The benchmark numbers were supposed to go up. But not only that, they were supposed to go up
while still maintaining at least comparable CPU utilization. I wasn't interested in a 20% performance
boost for a 100% increase in CPU usage, even though that's exactly what I got at first.

So, finally, after a day of tuning and patching, and cooling off the hot mutexes, I got my numbers:
a +25% performance increase for +30% higher CPU utilization. It's only +25% because in this specific
benchmark network I/O is only 25% of the original performance profile; the rest is B-tree operations
and transaction management, stuff that stayed in the transaction thread.

Happily, I pushed the patch to the next-release tree. And yesterday we ran the first YCSB benchmarks with it.

Now, YCSB is a stupid idea for a benchmark. For example, the YCSB read-only benchmark is N clients (say, N is 16)
issuing tiny requests in a synchronous fashion. Essentially, with few clients it means the client and the
server operate in lock-step fashion: the clients fire a bunch of requests over the network and wait for results.
The server kicks in, handles the input, and responds. 30% user time, 70% sys time during the benchmark.

Now, in this particular test Tarantool's results got worse, while CPU usage went significantly up. On top of a lot of switching between kernel space and user space, the patch added a bunch of switching between the network and transaction threads. The same lock-step, one step further.

And now I have to think about what to do with it. We can't afford to look silly even on a silly benchmark.

Multi-threading, evidently, needs more work.
Sunday, February 8th, 2015
6:30 pm
pthread_self() is stable across fork(), well, for the main thread
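A quick check, assuming Linux with glibc (POSIX does not promise anything about thread ids across fork()):

#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pthread_t before = pthread_self();
    pid_t pid = fork();
    if (pid == 0) {
        /* Only the calling thread survives fork(); in the child the
         * main thread's id compares equal to the pre-fork value. */
        printf("child : %s\n",
               pthread_equal(before, pthread_self()) ? "same" : "different");
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent: %s\n",
           pthread_equal(before, pthread_self()) ? "same" : "different");
    return 0;
}

Compile with -pthread; both lines print "same" here.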
Wednesday, January 14th, 2015
6:41 pm
inotify_add_watch()/inotify_rm_watch() performance depends on the path
Why is /tmp so much slower? The filesystem is the same (ext4). My only guess is that it sees so many events that inotify_rm_watch() has to do a lot of work to clear them.
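For reference, roughly how I would measure it; a sketch that times add/remove cycles on whatever path it is given (the default path and the iteration count are arbitrary):

#include <stdio.h>
#include <sys/inotify.h>
#include <time.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/tmp";
    int fd = inotify_init();
    if (fd < 0) { perror("inotify_init"); return 1; }

    struct timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < 10000; i++) {
        int wd = inotify_add_watch(fd, path, IN_ALL_EVENTS);
        if (wd < 0) { perror("inotify_add_watch"); return 1; }
        inotify_rm_watch(fd, wd);   /* removal is the suspected hot spot */
    }
    clock_gettime(CLOCK_MONOTONIC, &stop);
    printf("%s: %.3fs for 10000 add/rm cycles\n", path,
           (stop.tv_sec - start.tv_sec) + (stop.tv_nsec - start.tv_nsec) / 1e9);
    return 0;
}

Run it as ./a.out /tmp and then against some quiet directory to compare.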

Wednesday, December 3rd, 2014
10:15 am
One of our project's contributors wrote a letter about these assholes. Whether it will help or not, I don't know, but repost:
Originally posted by unera at An open letter to the President of the Russian Federation on information policy
In the modern world, the institution of private property at times serves as an engine of social development, and at times as a brake on it.

Well-known world thinkers, such as Richard Stallman, recognized decades ago the threat that extending private property law to the products of intellectual work poses to the world.

Having assessed humanity's future prospects, these people offered the world community their own way of overcoming these problems, a way that fits well within the legislation of most countries of the world. It consists in this: the authors of various works in art, science, and so on may voluntarily release their works into free access.

The license agreements that the authors of these works draw up in such cases usually restrict the possibility of restricting access to the works.

As these thinkers' ideas spread, first individuals and then groups of people began doing something useful for all of humanity entirely free of charge.

These ideas took hold especially in the IT industry.

Today it can be said without any doubt that things like the Linux operating system, the free encyclopedia Wikipedia, or projects such as OpenStreetMap are all part of the world's common heritage.

I am sure you have heard of many such projects, or have even used them.
You have probably also read Richard Stallman's well-known story, The Right to Read.


Looking at these projects more closely, one can see, for example, that several million people work on the popular encyclopedia, Wikipedia, entirely for free. Several tens of thousands of people work on the Linux operating system. Billions of people use it (the kernel of this operating system runs in most modern smartphones, TVs, printers, and other household devices).

There are also unique commercial projects which, despite their commercial basis, have become very popular among the people who work for the public good for free.

For example, GitHub, the huge portal for storing, accumulating, and developing projects from the IT world. This undeniably commercial project grew out of a free project designed to simplify software development, Git, and is an enormous venue where millions of people from all over the world develop all kinds of software, hold discussions with one another, and build commercial and open products.

Projects developed on GitHub include the Linux operating system, many modern programming languages, databases, and much more.

I do not think I would be mistaken in calling GitHub the largest scientific and technical venue in the IT field. Humanity has no other like it. I would very much like to see a similar venue that gathered in one place, say, most of the planet's physicists or chemists. But, alas, no such project has been built yet. Physicists and chemists also use GitHub when they have to work at the intersection of different professions.

Many commercial firms (including my small business) also use GitHub as a tool for development and project management. The GitHub project earns money by providing its resources to these firms, but its main value is that it has gathered a multitude of non-commercial projects.

To draw an analogy with the modern world, GitHub can be pictured as an enormous research institute with many entrances, buildings, even whole blocks and little towns. Some buildings are giant towers, some are tiny dugouts. In most of them, people labor to create something new.

And, as usually happens in the world, graffiti appears on the walls of some of this huge number of buildings, sometimes even obscene words.

Dear Mr. President!

Today, December 2, 2014, RosKomNadzor blocked access to the GitHub portal for Russian citizens.

This was done under the pretext of fighting those few obscene graffiti that a handful of hooligans left on the walls of this huge global research institute.

From today, Russian citizens can no longer fully participate in the development of projects that are the common heritage of humanity. Legally, at any rate.

Many Russian IT entrepreneurs will suffer losses because of this block.

Thus, today's action by RosKomNadzor isolates Russian citizens from participation in international free software projects, and also causes financial losses.

It is hard to call RosKomNadzor's actions described above anything other than outright sabotage.
It is deeply dispiriting that the existence of numerous sites distributing, for example, pornography, including child pornography (such as redtube.com, tube8.com, and others), does not trouble RosKomNadzor in the least. Sites spreading viruses and malware, sites overflowing with negative information about our country, are of no interest to RosKomNadzor whatsoever. Yet RosKomNadzor blocks Russian developers', scientists', users', and businessmen's access to the resource that concentrates the lion's share of the world's free software development.

I ask you to intervene urgently in this situation and stop this sabotage.

I fully agree that blocking certain sites is necessary.
However, using the laws of the Russian Federation to restrict Russian citizens' access to the world's common heritage, and (above all) to the process of creating it, will obviously lead to nothing good.


With great respect,

a contributor to a number of open source projects (Debian, Tarantool, etc.), an employee of one of the miraculously surviving Russian research institutes, an entrepreneur, and simply a citizen of the Russian Federation

Dmitry Obukhov
Tuesday, December 3rd, 2013
6:07 pm
Urgent and important vs. anything else
Making people do what you want them to do is impossible. They always seem to do what *they* want to do.
Do we need syntax highlighting in the command line client? Probably. Do we need it now? Definitely not.
But do we need colors in the *test runner*? Yeah, in 2025, perhaps! But we do have it now. And there is nothing I can do about it. Except this little revenge.
Monday, October 21st, 2013
3:35 pm
The video from NoSQL matters about Tarantool
http://vimeo.com/66713654
Wednesday, October 2nd, 2013
7:34 am
Performance of stdarg.h
Most discussions I was able to find online about functions with a variable number of arguments in C and C++ focus on syntax and type safety. Perhaps it has to do with C++11 support for such functions. But how much slower are they, actually?

I wrote a small test to find out:

https://github.com/kostja/snippets/blob/master/stdarg.c

kostja@olah ~/snippets % gcc -std=c99 -O3 stdarg.c; time ./a.out
./a.out 0.18s user 0.00s system 99% cpu 0.181 total
kostja@olah ~/snippets % vim stdarg.c
kostja@olah ~/snippets % gcc -std=c99 -O3 stdarg.c; time ./a.out
./a.out 0.31s user 0.00s system 98% cpu 0.320 total

The 64-bit ABI allows passing some function arguments in registers. Apparently this is not the case for functions with a variable number of arguments. I don't know for sure how many registers can be used, but the speed difference between a standard and a variadic function call grows as the number of arguments increases.
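
The linked snippet is the authority; the idea, reconstructed from memory, is roughly the following (the function names and the argument count are mine):

#include <stdarg.h>
#include <stdio.h>

/* noinline keeps -O3 from optimizing the calls away entirely. */
__attribute__((noinline)) static int
add_fixed(int a, int b, int c, int d)
{
    return a + b + c + d;
}

__attribute__((noinline)) static int
add_variadic(int n, ...)
{
    va_list ap;
    int sum = 0;
    va_start(ap, n);
    for (int i = 0; i < n; i++)
        sum += va_arg(ap, int);
    va_end(ap);
    return sum;
}

int main(void)
{
    long sum = 0;
    for (int i = 0; i < 100000000; i++)
        sum += add_fixed(i, 1, 2, 3);
        /* second run: sum += add_variadic(4, i, 1, 2, 3); */
    printf("%ld\n", sum);
    return 0;
}

The vim invocation in the transcript above is exactly that one-line edit between the two runs.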
Friday, September 27th, 2013
10:27 pm
Launchpad bug tracker
The issue tracker on our source code host, GitHub, has matured enough for the team to decide to move.
It's probably not the best idea to criticize a free home for an open source project; after all, Launchpad wasn't making any money from hosting us. But, truth be told, it has fallen behind in features and usability, and perhaps the lack of a business model is the reason.

Just for the record, the most important problems with bugs at Launchpad for us were:
- 7-digit bug ids. Tarantool is a small project and will probably never grow out of 4 digits, and you often need a quick and easy "handle" for a bug during a conversation or in an email
- too many attributes on a bug. The milestone and series system was, again, designed for a large project, and only complicated matters for us
- bug states were quite nice, but then again we only used a few of them. At the same time there was no "legal" way to mark a bug as a duplicate - perhaps something related to the internal policies at Canonical.
- no way to cross-link a bug and a commit, unless (I guess) you're using Bazaar
- no bulk operations on bugs.

GitHub issues solve a lot of the above, plus, and this is actually the main reason, the issue tracker and the code both benefit from being close to each other.
9:55 pm
New algorithm for taking snapshot in Tarantool
Just merged in a patch which I think gives Tarantool one more small but important edge over any other free in-memory database on the market.
The patch changes the algorithm of snapshotting (consistent online backup in Tarantool) from fork() + copy-on-write to delayed garbage collection. The overhead per tuple (the Tarantool name for a record) is only 4 bytes, to store the added MVCC version. And since delayed garbage collection works on the record level, not the page level, it is way more fine-grained than the page splits that follow a fork(), so the extra memory "headroom" required for a snapshot is now within 10% of all the memory dedicated to an instance.
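
Not the actual Tarantool source, but a toy sketch of the accounting involved, with all concurrency control stripped out; tuple_delete, snapshot_start, and snapshot_end are names I made up:

#include <stdint.h>
#include <stdlib.h>

struct tuple {
    uint32_t version;            /* the 4 extra bytes: creation version */
    struct tuple *next;          /* link in the delayed-garbage list */
    /* ... tuple data ... */
};

static uint32_t current_version = 1;
static uint32_t snapshot_version = 0;   /* 0 means no snapshot in progress */
static struct tuple *garbage = NULL;    /* tuples the snapshot may still read */

static void tuple_delete(struct tuple *t)
{
    if (snapshot_version != 0 && t->version <= snapshot_version) {
        t->next = garbage;              /* visible to the snapshot: delay */
        garbage = t;
    } else {
        free(t);                        /* invisible to any reader: free now */
    }
}

static void snapshot_start(void) { snapshot_version = current_version++; }

static void snapshot_end(void)
{
    snapshot_version = 0;
    while (garbage != NULL) {           /* snapshot done: collect the backlog */
        struct tuple *t = garbage;
        garbage = t->next;
        free(t);
    }
}

The headroom is then bounded by how much garbage accumulates while one snapshot is written, record by record, instead of by page-granular copy-on-write.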

This feature goes into 1.5, which is, technically speaking, frozen :), but the patch has quite good locality and has been tested in production for a few months already, so I couldn't resist the temptation to make it available ASAP.

Speaking of our master branch, 1.6: it has already got online add/drop of spaces and indexes, space and index names, and is now getting ready to switch to msgpack as the primary data format. But since we have refrained from making incompatible changes for almost 3 years, there is still a long list of wants and wishes for 1.6. So the current best bet is to get 1.6 out of alpha by the end of the year.
9:47 pm
open_memstream()
Have you heard about open_memstream()? It is a nice addition in POSIX 2008.
A good little step towards bringing down the number of different string classes in an average C/C++ program.
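
In short, it gives you a FILE * whose backing buffer grows in memory, so all the usual stdio calls double as string builders. A minimal example:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *buf = NULL;
    size_t size = 0;
    FILE *f = open_memstream(&buf, &size);
    if (f == NULL)
        return 1;

    fprintf(f, "pi is roughly %.2f", 3.14159);  /* any stdio call works */
    fclose(f);                    /* flushes and finalizes buf and size */

    printf("%zu bytes: %s\n", size, buf);
    free(buf);                    /* the caller owns the buffer */
    return 0;
}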
Monday, September 16th, 2013
11:32 am
Relevance of regression test failures on exotic platforms
Back in my days at MySQL we had a lot of issues with test failures. We had lots of platforms, and would try to run and maintain our regression test suite on all of them. I remember spending days investigating issues on some obscure OS (Mac OS, mainly; Windows was taken care of) or hardware (big-endian, mainly).
With Tarantool, we never got to do that. We do run builds on lots of platforms, and someone always screams when they break, since we only run builds on platforms which are in actual use. And they do break, so it's a lot of hassle. But we haven't had time to maintain the regression tests on some of these platforms. Ugly? Yes. Yet we know which systems people use in production, and we do take care of those. This set is much narrower than the set of systems which people play with.
And also, we don't pay attention to test failures caused by, essentially, bad tests. If a test fails once in a while on a busy box, well, this is kind of bad, but tolerable. One day we'll rewrite the test.
It turns out that these test failures have very little relevance to what people experience in production. In the course of these 3 years I've never seen a test failure on an exotic platform being relevant to any production bug we've had.
Perhaps this is all possible because the Tarantool team is so much smaller than MySQL's. But it spares us all from lots and lots of boring and unneeded work.
Thursday, August 8th, 2013
7:25 am
Notes from a test plan
It's been a month or so since I began looking at the new data dictionary implementation for Tarantool. Roman created a first version 3 months ago, but I, blame my perfectionism, thought that some flexibility in how system spaces can be manipulated wouldn't harm. The idea is that all space metadata is stored in spaces, not in a configuration file. A kind of "kill the .frms in your mind" feature. A user simply updates the "system" spaces, and that results in the creation or destruction of spaces, indexes, and data types.
Where this whole thing becomes really twisted is that a system space definition also resides in system spaces. There are 3 dedicated spaces: _space defines space names and ids, _index defines indexes on spaces, and _format defines tuple formats. And these spaces, from the very beginning, contain their own definitions.

Now, here's what I wrote in the test plan for this feature yesterday:

Check that when replacing a tuple in the _index system space, thus redefining the primary key of this very system space, the new tuple ends up in (can later be found through) the new primary key it defines.
Sunday, June 2nd, 2013
10:43 am
Evaluating a MySQL database connector
Since the Tarantool stored procedure API was extended with socket I/O, a whole universe of data-enriched networking applications (routing, proxying, push notifications, and so on) has become possible.

But there is one case which doesn't lend itself to this so easily: anything MySQL. The first scenario I'd love to support is Tarantool working as a smart write-back cache for MySQL, providing a higher update RPS while automatically maintaining a slightly dated copy of all the data in the relational database.

One dramatic shortcoming of the MySQL universe, which, IMHO, if addressed properly, could spark a whole new set of uses and third-party solutions, is the clumsiness of the client-server protocol.

The MySQL client-server protocol is unnecessarily hard to implement: it is built on a layered design, with built-in compression and transport-level tricks for communicating over an unreliable transport such as UDP.

A separate issue that has still not been done right is that replication has never been considered part of the protocol or the client library (there is even an open source project solving exactly this problem).

In Tarantool, a user of the connector can read the replication stream in just the same way as he/she would read the result set of an ordinary query, and this adds a whole new set of ways in which the database can be used.

Finally, the MySQL library itself lacks the necessary modularity: socket I/O, SSL encryption, character set support, prepared statement support, zlib compression, even reading client passwords from the command line, and, recently, plug-in support are all intermixed with the binary protocol, which is at the core of what the library does, and are all part of one thick bundle.

What if I want something tiny to just be able to connect to the server and send simple queries?
What if I want to use it inside an event loop or in coroutine environment?
What if I want to write a protocol mapper between, say, HTTP and MySQL?
What if I don't want to add a dependency on OpenSSL?
In the best case, there is only one answer to a question like this; in the worst case, one has to re-implement the protocol from the ground up.

In Tarantool, we learned that the protocol must stand alone after we re-designed our own library 3 times, since people would simply ignore the "official" library and just go ahead and write their own.

Back to MySQL, the situation has begun to change in the last three years.

First, the Drizzle project implemented libdrizzle, a wholly new client library to talk to the Drizzle server. The unobvious part is that the Drizzle binary protocol is fully compatible with MySQL, so libdrizzle can talk to MySQL as well.
A good thing about libdrizzle is that it is built around an event loop, so the entire code base is callback-based. That makes it easier to embed into a callback-based environment such as node.js. It's of little advantage for Tarantool, which, while using an event loop under the hood, hides it completely behind lightweight green threads, and is thus able to execute sequential networked code.

Another good thing about libdrizzle is that the code base is small and is easy to read, even if you're new to MySQL world.

The shortcomings of libdrizzle are that it doesn't support prepared statements (indeed, why would you add support for prepared statements when Drizzle itself doesn't have them :) and that it is completely character-set-unaware: in Drizzle, everything is utf-8.

The second recently created library is MariaDB's native client.
I've taken a quick look at the code, and it seems to be pretty much the same as good old MySQL: full support of the API in one thick bundle.

Which library should we choose? Whatever it is, we'll need to patch it, since even libdrizzle is not modular enough to be integrated into the Tarantool core without changing any of the upstream code. The advantage of the MariaDB library is that it has prepared statement support. On the other hand, prepared statements still don't work properly with connection pools, and hence are still not used widely. Indeed, most of the shops I know simply wrap their statements into stored procedures, which also gives extra security, and use the old direct API to invoke them.

Next week we'll be looking at the two libraries more closely.
Monday, April 29th, 2013
9:16 pm
Importance of intra-query parallelism.
Oracle Database has a feature which allows it to query millions of rows in parallel while executing a join which has a big fanout.
How important is it that a database server has a lot of intra-query concurrency? Does it still make a lot of sense to run an analytical query in parallel threads, on a single machine?

While at Percona Live, there was a lot of talk about the future of MySQL, and some even mentioned this as being part of the future.

The reason for intra-query parallelism has always been to fill up the pipeline to disk with lots of parallel I/O requests. Indeed, this pipe is thick and long, and if it is used, it had better produce a lot of data at once. Efficiency of CPU utilization is sacrificed for the efficiency of a rotating disk drive.

Yet in the DaaS world this all fails to make sense to me. In the cloud, one execution unit is not one CPU but one instance, and one database instance equals a cluster of virtual machines. Map/Reduce was only the first sign of the change: it is stupid, indeed, but the network is faster than disk, and if a query needs to inspect a million rows, they had better be on thousands of disks, not on a single one.

It's funny how MySQL technology is steadily being pulled up-market. I haven't seen a single project use MySQL stored procedures, which were created for SAP R/3 integration, in the applications they were created for. Perhaps, when parallel query in MySQL is ready, it will also be used for something completely different.

Meanwhile, I think the task of coming up with an efficient join algorithm to run across sharded data is more in line with the way hardware is going to look in the future. Sharding is done best when not done at all. But so is concurrency.
11:26 am
Draft spec for automatic sharding and Tarantool/Proxy: request for comments.
The idea of proxying access to Tarantool/Box is obvious, and a closed-source proxy has existed within Mail.Ru for a long time. Now that socket I/O is part of server-side Lua, proxying in Tarantool/Box is easier and more manageable than anywhere else. I published a draft spec for a proxy whose job is to hide data sharding from the end user: https://github.com/mailru/tarantool/wiki/Tarantool-proxy. There is no universal solution to the sharding problem, and when creating the spec, I tried to avoid the pitfall of making it a single-user solution. I hope, with your help, we'll be able to avoid that.
Friday, February 8th, 2013
10:30 am
Socket I/O in a stored procedure
The next feature added to Tarantool/Box development branch is box.socket, a luasocket-style API to work with TCP and UDP sockets.

Initially this was requested for monitoring and audit of the server, for example, to send a UDP packet to a statistics server on every connect/disconnect, or simply once every few seconds.

But there are other very interesting uses:
- a node.js-like mode of operation, where the HTTP server and the script server are a single application. In our case: a database, a Web server, and a scripting engine (there is a prototype of JavaScript stored procedures for Tarantool, too),
- proxying - a replica can automatically proxy an update request to the master server,
- custom client-server protocols.

Apparently the lack of network I/O in traditional databases is a restriction imposed by the secure client/server operation model. Once you're not as worried about security, it becomes a very interesting addition. I'm sure we'll learn about all the drawbacks of the approach in the coming months :)
Friday, January 11th, 2013
8:31 am
Why Objective C?
I'm often asked why Tarantool is written in Objective C. Damien Katz, my ex-colleague from MySQL AB :), wrote a very good post on the strengths of C. We use Objective C as "C with exceptions". Objective C's @finally clause allows for simple integration of exception-aware code with C code. In contrast, the only sensible way to deal with exceptions in C++ is RAII, and that pretty much means you forget about C the moment you decide to use exceptions in your program.

One serious "deficiency" of C is that it doesn't bring along the programming paradigms and patterns found in modern programming languages. In other words, it doesn't teach you programming culture. This is why, I think, it is
much better to return to C after a few years with other languages. There is something unique to learn in almost every modern programming language.

Often, a large project uses a ton of languages and tools. Tarantool is no exception: apart from C and Objective C we use Lua, Ragel, Bison, a configuration file parser of our own breed, and this is just for the server itself. For tests, we use Python, Perl, PHP, and shell. Some of our benchmarks are written in Java.
Wednesday, January 9th, 2013
2:52 pm
Tarantool on_connect/on_disconnect triggers.
I've just pushed on_connect/on_disconnect trigger support into the master branch. This was mainly done to be able to keep track of abnormally disconnected clients - an activity necessary when Tarantool is used as a persistent asynchronous messaging server.

When stored procedures are added early in a product life cycle, they quickly permeate all parts of server functionality.

Tarantool doesn't have authorization, but an on_connect trigger can be used for it: the client address is available in the trigger, and if the trigger throws an error, the error is sent to the client and the connection is dropped.

The next in line for merging into the master is the box.io library: event-driven TCP and UDP I/O, again available in our stored procedures.

Thanks to Lua, the project is melding into a mix of an application server and a database server, and our users seem to like it a lot.
Tuesday, December 4th, 2012
7:44 pm
MySQL: multiple user level locks per connection

People say that to have a good vacation, you need to do something else, something you don't do every day at work.
So, instead of hacking on Tarantool, I did some good old MySQL hacking. Thanks to Alexey Rybak from Badoo I had a nice opportunity for it -- a task to improve MySQL user level locks.

The GET_LOCK() function in MySQL allows a connection to hold at most one user level lock. Taking a new lock automatically releases the old lock, if any.
The limit of one lock per session existed because early versions of MySQL didn't have a deadlock detector for SQL locks. The MDL patches in MySQL 5.5 added a deadlock detector, so starting from 5.5 it became possible to take multiple locks in any order -- a deadlock, should it occur, would be detected and an error returned to the client that closed the wait chain.
So, thanks to MDL, implementing multiple user level locks seemed to be an easy task, and one in line with the general MySQL strategy of moving all hand-crafted lock implementations to a single system. A code cleanup, too.

The implementation indeed turned out to be rather straightforward, but as it always happens with MySQL, not without issues.
By now I've finished working on the patch and published the tree; it's available here:
https://code.launchpad.net/~kostja/percona-server/userlock

I intend to contribute the patch to all MySQL forks - MySQL at Oracle, Percona, MariaDB. I'm publishing the patch under BSD licence, so any other fork (Twitter, Facebook, Google) is welcome to pick it up too.

Now let me list some less obvious aspects of the new user level locks:

  • it has become possible not only to take distinct locks in the same connection, but also to take the same lock twice. In this case, the lock is granted, and each instance of it needs to be released afterwards. In other words, the new user level locks are recursive (see the sketch after this list).
  • the documented (and preserved) behaviour of GET_LOCK() is to return 0 in case of a lock wait timeout and NULL in case of error. This doesn't look right to me: when a lock is not granted, I'd personally prefer to get an error, not a 0 or NULL. This starts to matter when a user lock is taken inside a stored function or a trigger: if an error is returned, the statement is usually aborted, but a 0 or NULL from GET_LOCK() will keep it going. So, as long as a GET_LOCK() timeout doesn't return an error, it's possible that a trigger is invoked for each row, and the lock times out for some rows and doesn't for others. But oh well, this is the current MySQL behaviour, so it's a matter for separate consideration.
  • if a connection which is waiting on a user level lock is killed by KILL CONNECTION/KILL QUERY, its wait is aborted. This is alright, and works with MDL too. GET_LOCK() returns NULL in this case, and I preserved this behaviour. But if a connection is simply gone (the client has disappeared, closed its socket, crashed, etc., all while waiting on a user lock), the old user lock wait implementation would eventually detect the abandoned socket and abort the wait.

    MDL, however, didn't look at session sockets while waiting on a lock. I thought this matter important enough, and fixed MDL to look at the session socket state during long waits on any lock. Indeed, the whole check for a disconnected client was done in the scope of the fix for Bug#10374 by Davi Arnaut. (Hello, Oracle: if not for an open bugs database, I would never be able to find or understand this!) At some point this was considered important enough, so why break it.

  • the last issue is with the variable @@lock_wait_timeout. In theory, @@lock_wait_timeout should affect all locks in SQL, and I could make it work for user locks as well. But I decided not to do that yet, since there is always an explicit timeout, and honouring @@lock_wait_timeout would mean checking which of the two is smaller -- the explicitly provided timeout or the session-global one -- and honouring that. This perhaps needs to be done.
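
Here is a sketch, via libmysqlclient, of the recursive behaviour mentioned above; the connection parameters are placeholders, and the expected results in the comments follow the semantics described in this post:

#include <mysql.h>
#include <stdio.h>

static void run(MYSQL *conn, const char *sql)
{
    if (mysql_query(conn, sql) != 0) {
        fprintf(stderr, "%s: %s\n", sql, mysql_error(conn));
        return;
    }
    MYSQL_RES *res = mysql_store_result(conn);
    if (res == NULL)
        return;
    MYSQL_ROW row = mysql_fetch_row(res);
    printf("%-35s -> %s\n", sql, row && row[0] ? row[0] : "NULL");
    mysql_free_result(res);
}

int main(void)
{
    MYSQL *conn = mysql_init(NULL);
    /* host, user, password, and database below are placeholders */
    if (mysql_real_connect(conn, "localhost", "test", "test", "test",
                           0, NULL, 0) == NULL) {
        fprintf(stderr, "connect: %s\n", mysql_error(conn));
        return 1;
    }
    run(conn, "SELECT GET_LOCK('demo', 0)");   /* 1: granted */
    run(conn, "SELECT GET_LOCK('demo', 0)");   /* 1: granted again, recursively */
    run(conn, "SELECT RELEASE_LOCK('demo')");  /* 1: first instance released */
    run(conn, "SELECT RELEASE_LOCK('demo')");  /* 1: second instance released */
    run(conn, "SELECT RELEASE_LOCK('demo')");  /* NULL: the lock is gone */
    mysql_close(conn);
    return 0;
}

Build with gcc $(mysql_config --cflags --libs), or the equivalent for your install.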

Fixes in tests

It was a surprise to see that actually no test relies on (or tests) the fact that there could be only one lock per session. There is not even a test which checks all the return values of GET_LOCK() or RELEASE_LOCK(). For example, if a lock is not owned by this session, RELEASE_LOCK() returns either NULL or 0, depending on whether the lock exists at all. And I haven't found tests for IS_USED_LOCK()/IS_FREE_LOCK() either.
The main test suite actually passed after the first draft, and most surprises came from the replication tests.
For example, rpl_err_ignoredtable.test in 5.5 apparently works according to the intent of its author, but in spite of, not thanks to, some of its obscure details.
In particular, this test takes a user lock in an UPDATE to make sure that the UPDATE blocks at some point, so that it can be aborted while it's blocked. But to detect that the UPDATE has blocked, an impossible condition is used, so the detection code actually oversleeps the lock wait timeout.
This test started to fail when the lock implementation changed, so I had to provide a correct wait condition.
rpl_stm_000001 (why would you use five leading zeros in a test name, especially considering there is only rpl_stm_000002? :-)) has a hard-coded sleep instead of a synchronous wait, so I fixed it too.
Another replication test -- rpl.rpl_rewrt_db -- failed because it relied on the order of subsystem destruction in server session cleanup (THD::~THD()).
Before my patch, the user level locks of a session were destroyed last, in particular after closing temporary tables. So this replication test would do the following trick to synchronously wait until a temporary table is closed:

  • take a lock in a session
  • kill it
  • take a lock in a concurrent session, and, as soon as this lock is granted, assume that the other session is destroyed and, in particular, its temporary tables are closed (the side effect which was ultimately desired).

How clever! Except that at first I put the user level lock subsystem destruction slightly higher in THD::~THD(), closer to its new home, the MDL subsystem. Well, I had to put everything back, plus move the MDL subsystem destruction to the end of THD::~THD(), to make this test work.

Rant

I doubt I would have been able to make my way through the test suite if I hadn't had previous experience on the MySQL team.
Writing the patch was moderately fun (I'm not going to bash the MySQL Item class hierarchy one more time), but groveling through a huge test suite and fixing silly errors only barely related to my patch was extremely tedious.

