Open Source Thinking

Monday, 18 September 2017

MySQL 8 - Roles

With the next version of MySQL that is MySQL 8, there is a very nice feature of creating "roles" which can be assigned certain privileges and then these roles can be assigned to users, thus helping us in maintaining the principle of least privilege.

It makes our life as developers and DBAs easier as we do not have to remember what are the specific privileges assigned to different users. These "roles" can simply be assigned or de-assigned from users as necessary.

The setup process is pretty easy and intuitive.

1. Create Role:

CREATE ROLE 'read', 'write', 'read_write';

Which essentially creates a set of named roles which do not have any privileges assigned to them, as yet.

2. Grant/Assign Privileges to a Role:

GRANT SELECT ON sample_db.* TO 'read';

3. Assign Role to user:

GRANT 'read' TO 'john'@'localhost', 'mary'@'localhost';

Reference: https://dev.mysql.com/doc/refman/8.0/en/roles.html

Monday, 5 June 2017

Random human recognizable dataset

We all do need sometimes to generate raw valid dummy data for our use cases and applications as we start them. Obviously, one can write their own scripts to generate random data, but it is much better to have data, to which human beings can associate with like names, addresses instead of having them filled with random "lorem ipsum" string data :)

While searching for such a tool, I found a site which does exactly this: http://www.generatedata.com/

Documentation: http://benkeen.github.io/generatedata/

This can also be downloaded and installed locally. It supports three types of installations:
- A single, anonymous user account
- A single user account, requires login
- Multiple accounts

Below is the set of wide varied data types it supports for generating random data in different categories:

Well, for the sake of demo, I am planning to create a dataset for my Employee table.

Starting on it, I want to have a unique column for which I will use GUID type. Then, I want to have a first_name and last_name column for which I select the type as "Names" which then provides many options in the right examples dropdown as to what kind of naming pattern I want like full name or first name or surname etc of which I select first name and surname for my use case.

Then, I go on to my next column date_joined as "date" type which since I am intending use in MySQL database, have chosen MySQL datetime in the examples dropdown list, which then provides me the range of date from which the values should be randomly selected. There are other popular date formats provided too to choose from.

Then I select email type which does not have any options to it and will be populates with some valid human friendly emails. Next is phone, for which it provides me which country region specific format I want. It has a limited list as of now but it would suffice for now. I choose UK from the list and it shows me the various formats in the options column.

Next is zip which shows me that it will select zip values specific to the "Indian States & UT" since I had selected in the very start to localize my dataset to India.

Then I select a few more useful columns like phone, city, region and credit card. For credit card, it also provides many options to choose from.

I chose to download the data in SQL format and below is how it looks in web view.

Now, the ice on the cake offering is that you get to download the dataset in many well supported popular formats like csv, excel, html, json, ldif, programming language (hurrays!), sql and xml.

Below is how my final schema looks like:

Just to give an idea of the dataset, I downloaded it in JS format for a clear view as to how the data looks.

Looks good to me and I am more than satisfied with it :)

So to round up, the features you get from this tool are:
- Generate random human recognizable dataset.
- Random dataset spans across many categories like name, phone, email, date, company, street, city, country, pan, pin etc.
- The dataset can be localized to a specific country.
- It supports downloading the generating data in many well-known formats like csv, excel, html, json, ldif, programming language specific, sql and xml.

One of the limitations is that you can generate only 100 rows as of now from the site, but I believe you can generate more if you run it from your local setup.

Hope this tool helps others as it has helped me. Give it a try for the sake of fun :)

Monday, 30 January 2017

MySQL Procedure Analyse Use

Sometimes we are not sure if the database design we have created and our site has been using is correct and optimised. We all do have this feeling at times don't we :)

Well, MySQL provides with PROCEDURE ANALYSE() to help us detect inconsistencies in our database design by suggesting for an optimal datatype and data length for columns.

The syntax for using PROCEDURE ANALYSE() is as below:

SELECT ... FROM ... WHERE ... PROCEDURE

ANALYSE([max_elements,[max_memory]])

Ref: http://dev.mysql.com/doc/refman/5.7/en/procedure-analyse.html

We will create a sample database for our case here and try out some things.

Lets create a database called analyse_db.

mysql> CREATE DATABASE `analyse_db`

DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Then lets create a table called employee where we can keep our data.

mysql> CREATE TABLE `analyse_db`.`employee` (

`id` INT NOT NULL AUTO_INCREMENT,

`name` VARCHAR(100) NOT NULL,

`dept_id` INT NULL DEFAULT NULL,

`country` VARCHAR(100) NOT NULL,

PRIMARY KEY (`id`))

ENGINE = InnoDB;

Lets insert some data into it, so that finally we have some records in it as displayed below:

mysql> SELECT * FROM `employee`;

Let us now run PROCEDURE ANALYSE() on the table.

mysql> SELECT * FROM `employee` PROCEDURE ANALYSE();

So from the analysis result we can take away below points:

First row output

1. From the first row output above, employee.id column seems to be fine, but the output suggests that we have enum as datatype for it. The suggestion makes sense given the less number of records we have in our table right now but in the long run our table would be much larger and each row value would be unique and enum would not suit our purpose. So we can neglect this suggestion.

Second row output

2. From the second row output above, employee.name has a min and max length of 8 and 13, average length of 10 and we have assigned a max varchar length of 100 while creating the table. This seems like a wrong decision on our part to assign it 100, when the range of length for values for this column is in between 8 and 13, but taking into consideration the very less data and our future estimate that there will be longer names with some even having middle names, our column design is fine. Also taking into consideration point 1 above, enum type can similarly be neglected for this column too.

Third row output

3. From the third row output above, employee.dept_id has repeated values of same departments which if are not mapped to another department table and do not have probability to change then they can be added as a enum type, but since there may be a probability that a new department can be added in future, our consideration for the data type is fine.

In addition to this, the result also shows that we have allowed for NULL in our schema while there are no NULL values in any of the values for the column, so we can mark it as NOT NULL, which makes SQL operations faster, provided we also have in our requirement that every employee should have a department value assigned.

Excerpt from the MySQL site manual page related to table column optimisation:

"- Use the most efficient (smallest) data types possible. MySQL has many specialized types that save disk space and memory. For example, use the smaller integer types if possible to get smaller tables. MEDIUMINT is often a better choice than INT because a MEDIUMINT column uses 25% less space.

- Declare columns to be NOT NULL if possible. It makes SQL operations faster, by enabling better use of indexes and eliminating overhead for testing whether each value is NULL. You also save some storage space, one bit per column. If you really need NULL values in your tables, use them. Just avoid the default setting that allows NULL values in every column."

Ref: https://dev.mysql.com/doc/refman/5.7/en/data-size.html

Fourth row output

4. From the fourth row output above, our country list consists of only four set of values for this example but in a real world example this may vary with an increased constant list which is very rarely modified. So in this case we can change this column to enum with the list of countries, since this will help us save some storage space and will provide us an additional constraint on the values stored for country.

Note: Even phpMyAdmin has an interface to run this check and access this info. In the table structure page, you will find a "Propose table structure" link which does this action for you as shown below:

Propose table structure link.

Propose table structure action result.

So to conclude PROCEDURE ANALYSE() provides some very important insight into schema design for an optimal data type like column length, type and NULL value usage, which we can consider, but not all of it needs to be taken into account considering application requirement, future data possibilities and usage :)

Tuesday, 13 September 2016

MySQL DATETIME VS TIMESTAMP

A question which would come sometimes to mind when starting with MySQL is whether I should use DATETIME or TIMESTAMP data type since both appear to store same date and time component.

Similarities between datetime and timestamp:
1. Values contain both date and time parts.
2. Format of retrieval and display is "YYYY-MM-DD HH:MM:SS".
3. Can include a trailing fractional seconds part in up to microseconds (6 digits) precision.
4. With the fractional part included, the format for these values is "YYYY-MM-DD HH:MM:SS[.fraction]".
5. Both the TIMESTAMP and (as of MySQL 5.6.5) DATETIME offer automatic initialization and updating to the current date and time.

But both differ in some ways as mentioned below:

Differences between DATETIME and TIMESTAMP data types.

DATETIME:
1. Supported range is from '1000-01-01 00:00:00' to '9999-12-31 23:59:59'.
2. Storage Required Before MySQL 5.6.4 was 8 bytes. *
3. Storage Required as of MySQL 5.6.4 is 5 bytes + fractional seconds storage. *
4. Preserves the textual representation of the date and time.
5. A value in the supported range is saved as it is given to MySQL so lets say if you change the timezone of MySQL, the value remains same that is it stores no timezone information and is timezone independent.

Example:

By default MySQL uses the SYSTEM timezone as its timezone which is IST in my system.

mysql> CREATE TABLE `dt` (
`dt` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

mysql> SELECT @@time_zone;
+-------------+
| @@time_zone |
+-------------+
| SYSTEM |
+-------------+

mysql> INSERT INTO dt VALUES ('2016-09-12 12:12:00');

mysql> SELECT * FROM dt;
+---------------------+
| dt |
+---------------------+
| 2016-09-12 12:12:00 |
+---------------------+

Now lets change the session timezone to CDT which has an offset -05:00 hours with respect to UTC.

mysql> SET @@session.time_zone = '-05:00';

mysql> SELECT * FROM dt;
+---------------------+
| dt |
+---------------------+
| 2016-09-12 12:12:00 |
+---------------------+

The result above is same irrespective of timezone.

TIMESTAMP:
1. Supported range is from '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07' UTC. **
2. Storage Required Before MySQL 5.6.4 was 4 bytes. *
3. Storage Required as of MySQL 5.6.4 is 4 bytes + fractional seconds storage. *
4. Preserves values relative to the timezone in use.
5. A value in the supported range is saved in UTC timestamp value when the value is supplied to MySQL, so the value contains a timezone reference. While fetching the value again, MySQL will convert that value from UTC to the timezone specific value. If the timezone of MySQL is changed it has no effect on the UTC value stored but when the value is fetched it is displayed as per the current timezone of MySQL and not in the original timezone value which was stored first time. This occurs because the same time zone was not used for conversion in both directions.

An example of this would be if timezone of MySQL is currently IST and I save a value of "2016-09-12 12:12:00" into the TIMESTAMP datatype field, so when I fetch this record value from MySQL I will get the same value "2016-09-12 12:12:00". Now if I change the timezone value to CDT and fetch this record value, I will get "2016-09-12 01:42:00" which is the CDT equivalent time of IST "2016-09-12 12:12:00".

Example:

By default MySQL uses the SYSTEM timezone as its timezone which is IST in my system.

mysql> CREATE TABLE `ts` (
`ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

mysql> SELECT @@time_zone;
+-------------+
| @@time_zone |
+-------------+
| SYSTEM |
+-------------+

mysql> INSERT INTO ts VALUES ('2016-09-12 12:12:00');

mysql> SELECT * FROM ts;
+---------------------+
| ts |
+---------------------+
| 2016-09-12 12:12:00 |
+---------------------+

Now lets change the session timezone to CDT which has an offset -05:00 hours with respect to UTC.

mysql> SET @@session.time_zone = '-05:00';

mysql> SELECT * FROM ts;
+---------------------+
| ts |
+---------------------+
| 2016-09-12 01:42:00 |
+---------------------+

The result above is the CDT date time equivalent of IST date time "2016-09-12 12:12:00".

References:
- https://dev.mysql.com/doc/refman/5.7/en/datetime.html
- https://dev.mysql.com/doc/refman/5.6/en/datetime.html
- http://stackoverflow.com/questions/409286/should-i-use-field-datetime-or-timestamp

* As of MySQL 5.6.4, storage for YEAR and DATE remains unchanged. However, TIME, DATETIME, and TIMESTAMP are represented differently. DATETIME is packed more efficiently, requiring 5 rather than 8 bytes for the nonfractional part, and all three parts have a fractional part that requires from 0 to 3 bytes, depending on the fractional seconds precision of stored values.

https://dev.mysql.com/doc/refman/5.7/en/storage-requirements.html

** Why is the TIMESTAMP datatype limited to 2038 years and not beyond?
- Excerpt from https://en.wikipedia.org/wiki/Year_2038_problem:
"The Year 2038 problem is an issue for computing and data storage situations in which time values are stored or calculated as a signed 32-bit integer, and this number is interpreted as the number of seconds since 00:00:00 UTC on 1 January 1970 ("the epoch").[1] Such implementations cannot encode times after 03:14:07 UTC on 19 January 2038, a problem similar to but not entirely analogous to the "Y2K problem" (also known as the "Millennium Bug"), in which 2-digit values representing the number of years since 1900 could not encode the year 2000 or later. Most 32-bit Unix-like systems store and manipulate time in this "Unix time" format, so the year 2038 problem is sometimes referred to as the "Unix Millennium Bug" by association."

- It is also known as the Y2K28 bug.

A nice illustration from wikipedia showing the Year 2038 bug.

Caution: While designing applications please consider the range limitation of TIMESTAMP datatype [1970-2038] before using it or consider using DATETIME datatype instead.

Monday, 22 August 2016

Apertium: Free open source language translator

Came across a very cool "free" and "open source" tool to translate text from one language to another.

It's "Apertium" => "A free/open-source machine translation platform".

https://www.apertium.org

Excerpt from the site:

Apertium is a free/open-source machine translation platform, initially aimed at related-language pairs but expanded to deal with more divergent language pairs (such as English-Catalan). The platform provides a language-independent machine translation engine tools to manage the linguistic data necessary to build a machine translation system for a given language pair and linguistic data for a growing number of language pairs.

You can try the UI at https://www.apertium.org to translate sample text from one language to another.

Wiki and documentation related to it can be found at http://wiki.apertium.org/wiki/Main_Page

It also has a variety of tools for users / translators and developers: http://wiki.apertium.org/wiki/Tools

Apertium tools for users / translators

Apertium tools for developers

The installation steps are very easy. Just follow the one suitable for your OS at: http://wiki.apertium.org/wiki/Installation

Here is a video explaining the download, install process and command line usage of it for Ubuntu: https://www.youtube.com/watch?v=vy7rWy2u_m0

Below are the steps to install and use in a fresh new Ubuntu system:

1. Download stable-release sh script and run it to add Apertium package key to Ubuntu and to update the packages list.

$ wget https://apertium.projectjj.com/apt/install-release.sh -O - | sudo bash

2. To install all the core Apertium tools run:

$ sudo apt-get install apertium-all-dev

3. Download a language pair for your conversion use. You can search for your language pair here at http://wiki.apertium.org/wiki/List_of_language_pairs. For this example case, we will be downloading spanish-english language pair for conversion:

"apertium-en-es" => English <-> Spanish

$ sudo apt-get install apertium-en-es

4. Now finally time to test it:

The command syntax is apertium <language-of-given-text>-<language-to-be-translated>

In our case, we will be translating from Spanish to English:

$ echo 'gracias' | apertium es-en

o/p: thank you

Now we will try to convert this spanish sentence "Tengo Un Mes Estudiando Español" which in english means "I've been learning Spanish for 1 month".

$ echo 'Tengo Un Mes Estudiando Español' | apertium es-en

o/p: Have A Month Studying Spanish

Same translation from google translate results in:

I have a month studying Spanish

Seems like Apertium translation result is quite similar to google translate result. This is great, coming from a free and open source tool. Hope you find use for it in your applications :)

Wednesday, 17 August 2016

Google dev tools Security and Audits tab

Two great tools provided by Google developer tools browser console are:

- The Security tab:

Security validation

Security inspection result for a site using https

It provides below checks for your website:

1. Valid certificate
Whether the connection to this site is using a valid, trusted server certificate.

2. Secure TLS connection
Whether the connection to this site is using a strong protocol version and cipher suite.

3. Secure Resources
Checks if all resources like js, css and others are fetched through https.

Security inspection result for a site using http and fetching sub-resources through http

- The Audits tab:

The Audit tab

It provides a great tool to check and provide useful suggestions for network utilization like using gzip compression, combine js files, minify css and js files and web page performance like remove unused css styles for your web site which can be considered in your application context and used accordingly. You can see in detail the changes suggested for each category by clicking on the leftmost arrow against each category.

Below is result audit output from some public sites:

Audit result-1

Audit result-2

Audit result-3

These two tools will definitely help us monitor the security and improve network and web performance of our sites.

Sunday, 26 June 2016

MySQL auto update date_created and date_updated columns

It's usually a good practice to have two columns called date_created and date_updated in every table. One can always use it in application and it helps in debugging too as to when a record was created and last updated in various circumstances and contexts.

This responsibility can be given to MySQL to automatically assign current time values to these columns.

In MySQL 5.6 onwards, this can be done by setting the data type of the columns to be either date time or timestamp and creating date_created column with NOT NULL DEFAULT CURRENT_TIMESTAMP schema and date_updated column with NOT NULL DEFAULT '0000-00-00 00:00:00' as schema with attribute ON UPDATE CURRENT_TIMESTAMP.

Below is a sample schema of a table containing date_created and date_updated columns:

CREATE TABLE `time_stamp` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`date_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`date_updated` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

But there is a catch here, since this is not applicable to all MySQL versions. Earlier to 5.6 version, MySQL allows a table to have only one TIMESTAMP column with an automatic TIMESTAMP value, that is you can either have date_created or date_updated auto updated to CURRENT_TIMESTAMP not both.

Related to this the MySQL documentation has some very good info at this page: https://dev.mysql.com/doc/refman/5.5/en/timestamp-initialization.html

Excerpt: "By default, TIMESTAMP columns are NOT NULL, cannot contain NULL values, and assigning NULL assigns the current timestamp."

So the workaround this is to have date_created column with timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' schema and date_updated column with timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' ON UPDATE CURRENT_TIMESTAMP schema.

With above, the date_updated field is correctly updated by MySQL automatically when there is an update and to update date_created field with current timestamp value, we have to explicitly pass NULL value to date_created field which will then store the CURRENT_TIMESTAMP value in the field.

Below is a sample schema for above changes:

CREATE TABLE `time_stamp` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`date_created` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`date_updated` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

So for a structure as above the insert query would be as below:

INSERT INTO `test`.`time_stamp`
(`id`, `date_created`)
VALUES
(NULL, NULL);

For update the query would be:

UPDATE
`test`.`time_stamp`
SET
`id` = '4'
WHERE
`time_stamp`.`id` = 1;

Above will just update the date_updated with the CURRENT_TIMESTAMP and the date_created value will remain same as earlier. Caution needs to be taken while updating records so as not update date_created field with NULL value.

One could also say that instead of MySQL automatically updating the date_updated field, I would like the date_created field to be updated automatically by it. Yes, it is just a reverse case and can be used, in which case while updating the record one has to pass NULL value to date_updated field. But the thing is that since there will be many updates and only a one time insert so passing a NULL while a one time insert reduces the overhead on us of passing a NULL for every subsequent update and let MySQL handle it for us. But yes this is debatable and there are cases where this reverse structure can be used.

Another workaround this is to have date_created field updated with current timestamp value using a trigger when a record is inserted.

Hope this helps :)