Suryachoudhury’s Weblog

March 16, 2009

Building Highly Scalable, Reliable, Rich-Content Sites

Filed under: Web Development — Surya Choudhury @ 10:37 am

Suppose we are building a reliable, massive website that holds petabytes of rich content data and is served under a SaaS model as part of a business solution suite (I mean a product).

That means you are heading toward a Brobdingnagian market, and this is a business where you need to be accurate. Of the various parameters that are considered while building a site, I'll cover two major ones, along with the designs and solutions that go with them when developing a rich site.

1. Scalability.

2. Performance.

Scalability: Any software or application design needs to evaluate, up front, the scale of the industry it is targeting. Let's discuss this with an example.

In early 2004, the Facebook.com site was mostly used by Harvard students as a glorified on-line yearbook. One can imagine that the entire storage requirements and query load on the database could be handled by a single beefy server. Fast forward to 2008 where just the Facebook application related page views are about 14 billion a month (which translates to over 5,000 page views per second, each of which will require multiple backend queries to satisfy). Besides query load with its attendant IOPS, CPU and memory cost there’s also storage capacity to consider. Today Facebook stores 40 billion physical files to represent about 10 billion photos which will cover a petabyte of storage. Even though the actual photo files are likely not in a relational database, their metadata such as identifiers and locations still would require a few terabytes of storage to represent these photos in the database. Do you think the original database used by Facebook had terabytes of storage available just to store photo metadata?

At some point during the development of Facebook, they reached the physical capacity of their database server. The question then was whether to scale vertically by buying a more expensive, beefier server with more RAM, CPU horsepower, disk I/O and storage capacity or to spread their data out across multiple relatively cheap database servers. In general if your service has lots of rapidly changing data (i.e. lots of writes) or is sporadically queried by lots of users in a way which causes your working set not to fit in memory (i.e. lots of reads leading to lots of page faults and disk seeks) then your primary bottleneck will likely be I/O. This is typically the case with social media sites like Facebook, LinkedIn, Blogger, MySpace and even Flickr. In such cases, it is either prohibitively expensive or physically impossible to purchase a single server to handle the load on the site. In such situations sharding the database provides excellent bang for the buck with regards to cost savings relative to the increased complexity of the system.

Now that we have an understanding of when and why one would shard a database, the next step is to consider how one would actually partition the data into individual shards. There are a number of options, and their individual tradeoffs are presented below.

How Sharding Changes your Application

In a well designed application, the primary change sharding adds to the core application code is that instead of code such as

// string connectionString = @"Driver={MySQL};SERVER=dbserver;DATABASE=CustomerDB;"; <-- should be in web.config
string connectionString = ConfigurationSettings.AppSettings["ConnectionInfo"];
OdbcConnection conn = new OdbcConnection(connectionString);
conn.Open();

OdbcCommand cmd = new OdbcCommand("SELECT Name, Address FROM Customers WHERE CustomerID= ?", conn);
OdbcParameter param = cmd.Parameters.Add("@CustomerID", OdbcType.Int);
param.Value = customerId;
OdbcDataReader reader = cmd.ExecuteReader();

the actual connection information about the database to connect to depends on the data we are trying to store or access. So you’d have the following instead

// The connection string now depends on which shard holds this customer.
string connectionString = GetDatabaseFor(customerId);
OdbcConnection conn = new OdbcConnection(connectionString);
conn.Open();

OdbcCommand cmd = new OdbcCommand("SELECT Name, Address FROM Customers WHERE CustomerID= ?", conn);
OdbcParameter param = cmd.Parameters.Add("@CustomerID", OdbcType.Int);
param.Value = customerId;
OdbcDataReader reader = cmd.ExecuteReader();

the assumption here being that the GetDatabaseFor() method knows how to map a customer ID to a physical database location. For the most part everything else should remain the same unless the application uses sharding as a way to parallelize queries.
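As a rough illustration of what such a mapping method might look like, here is a minimal sketch written in Java rather than the C# used above. It assumes numeric customer IDs and a fixed list of connection strings; the class name ShardLocator and the connection strings used with it are hypothetical and not part of any real system.

```java
import java.util.List;

/** Maps a customer ID to one of a fixed list of connection strings (hypothetical sketch). */
final class ShardLocator {
    private final List<String> connectionStrings;

    ShardLocator(List<String> connectionStrings) {
        if (connectionStrings.isEmpty()) {
            throw new IllegalArgumentException("at least one shard is required");
        }
        this.connectionStrings = connectionStrings;
    }

    /** Java counterpart of GetDatabaseFor(): pick a shard based on the customer ID. */
    String getDatabaseFor(long customerId) {
        // floorMod keeps the index non-negative even if an ID happens to be negative.
        int index = (int) Math.floorMod(customerId, (long) connectionStrings.size());
        return connectionStrings.get(index);
    }
}
```

With three connection strings in the list, customer 7 would land on the second one (index 1). The rest of the data access code only ever sees the returned connection string.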

A Look at Some Common Sharding Schemes

There are a number of different schemes one could use to decide how to break up an application database into multiple smaller DBs. Below are four of the most popular schemes used by various large scale Web applications today.

  1. Vertical Partitioning: A simple way to segment your application database is to move tables related to specific features to their own server. For example, placing user profile information on one database server, friend lists on another and user generated content like photos and blogs on a third. The key benefit of this approach is that it is straightforward to implement and has low impact on the application as a whole. The main problem with this approach is that if the site experiences additional growth then it may be necessary to further shard a feature specific database across multiple servers (e.g. handling metadata queries for 10 billion photos by 140 million users may be more than a single server can handle).
  2. Range Based Partitioning: In situations where the entire data set for a single feature or table still needs to be further subdivided across multiple servers, it is important to ensure that the data is split up in a predictable manner. One approach to ensuring this predictability is to split the data based on value ranges that occur within each entity. For example, splitting up sales transactions by the year they were created or assigning users to servers based on the first digit of their zip code. The main problem with this approach is that if the value whose range is used for partitioning isn't chosen carefully then the sharding scheme leads to unbalanced servers. In the previous example, splitting up transactions by date means that the server with the current year gets a disproportionate amount of read and write traffic. Similarly, partitioning users based on their zip code assumes that your user base will be evenly distributed across the different zip codes, which fails to account for situations where your application is popular in a particular region and for the fact that human populations vary across different zip codes.
  3. Key or Hash Based Partitioning: This is often a synonym for user based partitioning for Web 2.0 sites. With this approach, each entity has a value that can be used as input into a hash function whose output is used to determine which database server to use. A simplistic example is to consider that you have ten database servers and your user IDs are numeric values incremented by 1 each time a new user is added. In this example, the hash function could perform a modulo operation on the user ID with the number ten and then pick a database server based on the remainder. This approach should ensure a uniform allocation of data to each server. The key problem with this approach is that it effectively fixes your number of database servers, since adding new servers means changing the hash function, which without downtime is like being asked to change the tires on a moving car.
  4. Directory Based Partitioning: A loosely coupled approach to this problem is to create a lookup service which knows your current partitioning scheme and abstracts it away from the database access code. This means the GetDatabaseFor() method actually hits a web service or a database which stores and returns the mapping between each entity key and the database server it resides on. This loosely coupled approach means you can perform tasks like adding servers to the database pool or changing your partitioning scheme without impacting your application. Consider the previous example where there are ten servers and the hash function is a modulo operation. Let's say we want to add five database servers to the pool without incurring downtime. We can keep the existing hash function, add these servers to the pool and then run a script that copies data from the ten existing servers to the five new servers based on a new hash function implemented by performing the modulo operation on user IDs using the new server count of fifteen. Once the data is copied over (although this is tricky since users are always updating their data) the lookup service can switch to the new hash function without any of the calling applications being any the wiser that their database pool just grew 50% and that the database they went to for John Doe's pictures five minutes ago is different from the one they are accessing now.
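To make the ten-to-fifteen server example concrete, here is a small Java sketch of a lookup service that hides a modulo hash behind a single method. The class ShardDirectory and its method names are hypothetical, and the sketch deliberately ignores the hard part (copying data while users keep writing); it only shows the switch-over and how many users a plain modulo scheme forces you to move.

```java
/** Hypothetical directory that hides the current hash function from callers. */
final class ShardDirectory {
    private volatile int serverCount;

    ShardDirectory(int initialServerCount) {
        this.serverCount = initialServerCount;
    }

    /** Callers only ever ask "which shard?"; they never see the server count. */
    int shardFor(long userId) {
        return (int) Math.floorMod(userId, (long) serverCount);
    }

    /** Called once the data has been copied to match the new server count. */
    void switchTo(int newServerCount) {
        this.serverCount = newServerCount;
    }

    /** Counts user IDs in [0, userCount) whose shard differs between the two server counts. */
    static long usersToMove(long userCount, int oldCount, int newCount) {
        long moved = 0;
        for (long id = 0; id < userCount; id++) {
            if (Math.floorMod(id, (long) oldCount) != Math.floorMod(id, (long) newCount)) {
                moved++;
            }
        }
        return moved;
    }
}
```

For sequential IDs, usersToMove(30, 10, 15) comes out at 20: going from ten to fifteen servers with a plain modulo relocates two out of every three users, not just the share destined for the five new machines. That is a large part of why rebalancing is expensive even when a directory hides it from the application.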

Problems Common to all Sharding Schemes

Once a database has been sharded, new constraints are placed on the operations that can be performed on the database. These constraints primarily center around the fact that operations across multiple tables or multiple rows in the same table no longer will run on the same server. Below are some of the constraints and additional complexities introduced by sharding

  • Joins and Denormalization – Prior to sharding a database, any queries that require joins on multiple tables execute on a single server. Once a database has been sharded across multiple servers, it is often not feasible to perform joins that span database shards, both because of performance constraints (data has to be compiled from multiple servers) and because of the additional complexity of performing such cross-server joins. A common workaround is to denormalize the database so that queries that previously required joins can be performed from a single table. For example, consider a photo site which has a database which contains a user_info table and a photos table. Comments a user has left on photos are stored in the photos table and reference the user's ID as a foreign key. So when you go to the user's profile it takes a join of the user_info and photos tables to show the user's recent comments. After sharding the database, it now takes querying two database servers to perform an operation that used to require hitting only one server. This performance hit can be avoided by denormalizing the database. In this case, a user's comments on photos could be stored in the same table or server as their user_info AND the photos table also has a copy of the comment. That way rendering a photo page and showing its comments only has to hit the server with the photos table, while rendering a user profile page with their recent comments only has to hit the server with the user_info table. Of course, the service now has to deal with all the perils of denormalization such as data inconsistency (e.g. a user deletes a comment and the operation is successful against the user_info DB server but fails against the photos DB server because it was just rebooted after a critical security patch).
  • Referential integrity – As you can imagine, if there's a bad story around performing cross-shard queries, it is even worse trying to enforce data integrity constraints such as foreign keys in a sharded database. Most relational database management systems do not support foreign keys across databases on different database servers. This means that applications that require referential integrity often have to enforce it in application code and run regular SQL jobs to clean up dangling references once they move to using database shards. Dealing with data inconsistency issues due to denormalization and lack of referential integrity can become a significant development cost to the service.
  • Rebalancing (Updated 1/21/2009) – In some cases, the sharding scheme chosen for a database has to be changed. This could happen because the sharding scheme was improperly chosen (e.g. partitioning users by zip code) or because the application outgrows the database even after being sharded (e.g. too many requests being handled by the DB shard dedicated to photos, so more database servers are needed for handling photos). In such cases, the database shards will have to be rebalanced, which means the partitioning scheme changes AND all existing data is moved to new locations. Doing this without incurring downtime is extremely difficult and not supported by any off-the-shelf product today. Using a scheme like directory based partitioning does make rebalancing a more palatable experience, at the cost of increasing the complexity of the system and creating a new single point of failure (i.e. the lookup service/database).
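The join problem in the first bullet can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: the two collections stand in for result sets already fetched from the photos shard and the user_info shard, and the names CrossShardJoin, Comment and renderComments are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical application-side "join" of rows fetched from two different shards. */
final class CrossShardJoin {

    /** One comment row as it might come back from the photos shard. */
    static final class Comment {
        final long userId;
        final String text;

        Comment(long userId, String text) {
            this.userId = userId;
            this.text = text;
        }
    }

    /**
     * Combines comments (photos shard) with user names (user_info shard) in
     * application code, because no single database can run the join any more.
     */
    static List<String> renderComments(List<Comment> fromPhotosShard,
                                       Map<Long, String> namesFromUserShard) {
        List<String> lines = new ArrayList<String>();
        for (Comment comment : fromPhotosShard) {
            String name = namesFromUserShard.get(comment.userId);
            // A missing user is the kind of dangling reference no foreign key catches here.
            lines.add((name != null ? name : "[unknown user]") + ": " + comment.text);
        }
        return lines;
    }
}
```

The null check is the interesting part: with the data on one server a foreign key would have ruled that case out, whereas here the application has to decide what a dangling reference should look like.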

Well… I'm just snoozing at my keyboard… We will talk about building a high-performance rich-content architecture in my next post…

Just a hint for you: when the response is streamed (that means video, music, files, etc.), it is better to put a lightweight, event-driven server like lighttpd in front of the content management system :)

December 23, 2009

Conditionally Defining Spring Beans

Filed under: Uncategorized — Surya Choudhury @ 9:05 pm

One feature missing from the Spring framework that I would find handy is the ability to define a bean conditionally, based on an EL (Expression Language) expression or on whether a particular property is defined.

So without further suspense here is what a conditionally defined bean looks like:


<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:if="http://mycompany.com/springbeans/condition"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.5.xsd
   http://mycompany.com/springbeans/condition http://mycompany.com/springbeans/condition/condition.xsd">

<!-- Please note the namespace xmlns:if="http://mycompany.com/springbeans/condition" and the schema
location "http://mycompany.com/springbeans/condition http://mycompany.com/springbeans/condition/condition.xsd"
-->
<if:condition
    test="${(jms.server.type == 'activemq') &amp;&amp; (isMessageBrokerEnabled == true)}"
    varnames="jms.server.type,isMessageBrokerEnabled"
    src="META-INF/bootstrap.properties">

<bean id="apacheMQConnectionFactory" class="org.apache.activemq.ActiveMQConnectionFactory">
    <property name="brokerURL" value="tcp://e3_cloud_senxex.mycompany.com:61616"/>
    <property name="userName" value="admin"/>
    <property name="password" value="xVkE2iPpk9"/>
</bean>
</if:condition>

</beans>

What does the example above do? It is a bean definition for a Spring JMS connection factory for Apache ActiveMQ. If the EL condition is *true* then the bean will be instantiated.

The properties jms.server.type and isMessageBrokerEnabled are defined in the property file bootstrap.properties.

Of course, instead of conditional logic we could split the bean definitions into multiple files and combine them for different tasks, e.g. construct one application context for Apache ActiveMQ and another for Sun Message Broker, each including an XML file with the JMS provider specific configuration. But if you only have a small number of different beans, it may not be worth constructing multiple application contexts at build time. The same goes for products that ship with support for multiple JMS connectors, where installers are handed to the product sales/service team for installation at the customer site. In such a scenario we need to keep the installation process simple, and we don't want an individual build for each customer.

So, given that I'm restricting myself to XML configuration, I thought I'd try out the Spring 2.0 Extensible XML Authoring API. This API lets you add your own attributes to bean definitions or define beans using your own XML syntax. Using this API I got close to what I wanted, with a few limitations.

As per the Extensible XML Authoring API the infrastructure to set up the above is as follows:

Step 1) Authoring The Schema

com/mycompany/product/spring/condition.xsd

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<xsd:schema xmlns="http://mycompany.com/springbeans/condition"
            xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:beans="http://www.springframework.org/schema/beans"
            targetNamespace="http://mycompany.com/springbeans/condition"
            elementFormDefault="qualified"
            attributeFormDefault="unqualified">

    <xsd:import namespace="http://www.w3.org/XML/1998/namespace" />

    <xsd:annotation>
        <xsd:documentation><![CDATA[
            Defines the configuration elements for My Company Spring Framework's conditional bean creation.

            Limitations:

            1. The spring-beans-2.0.xsd forces you to define a <if:condition/>
               for single beans - you cannot put a <if:condition/> block around a group of beans.

            2. The spring-beans-2.0.xsd prevents you from defining two beans with the same name
               in the same XML file, even if different <if:condition/> conditions guarantee only
               one of the definitions will be in force at any given time.
        ]]></xsd:documentation>
    </xsd:annotation>

    <xsd:element name="condition">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:any minOccurs="0" />
            </xsd:sequence>
            <xsd:attribute name="test" type="xsd:string" use="required">
                <xsd:annotation>
                    <xsd:documentation><![CDATA[
                        Defines the expression or parameter that is tested as the condition for bean creation.
                        For example if '${myCondition}=true' then the following child bean will be instantiated.
                    ]]></xsd:documentation>
                </xsd:annotation>
            </xsd:attribute>
            <xsd:attribute name="varnames" type="xsd:string" use="optional">
                <xsd:annotation>
                    <xsd:documentation><![CDATA[
                        Defines the parameter names that need to be looked up in the property source provided in the src attribute.
                    ]]></xsd:documentation>
                </xsd:annotation>
            </xsd:attribute>
            <xsd:attribute name="src" type="xsd:string" use="required">
                <xsd:annotation>
                    <xsd:documentation><![CDATA[
                        Defines the property file/xml from which the test parameters can be loaded.
                    ]]></xsd:documentation>
                </xsd:annotation>
            </xsd:attribute>
        </xsd:complexType>
    </xsd:element>

</xsd:schema>






Step 2) Coding a NamespaceHandler

package com.mycompany.product.spring;

import org.springframework.beans.factory.xml.NamespaceHandlerSupport;

/**
 * @author <A href="mailto:snc_43@yahoo.com">Surya Choudhury</A>
 */
public class ConditionalBeanNamespaceHandler extends NamespaceHandlerSupport {

    /* (non-Javadoc)
     * @see org.springframework.beans.factory.xml.NamespaceHandler#init()
     */
    @Override
    public void init() {
        super.registerBeanDefinitionParser("condition", new ConditionalBeanDefinitionParser());
    }
}

Step 3) Coding a BeanDefinitionParser



package com.mycompany.product.spring;

import java.io.InputStream;
import java.util.Properties;

import org.apache.commons.jexl.Expression;
import org.apache.commons.jexl.ExpressionFactory;
import org.apache.commons.jexl.JexlContext;
import org.apache.commons.jexl.JexlHelper;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.beans.factory.config.BeanDefinition;
import org.springframework.beans.factory.config.BeanDefinitionHolder;
import org.springframework.beans.factory.support.BeanDefinitionReaderUtils;
import org.springframework.beans.factory.xml.BeanDefinitionParser;
import org.springframework.beans.factory.xml.BeanDefinitionParserDelegate;
import org.springframework.beans.factory.xml.ParserContext;
import org.springframework.core.io.Resource;
import org.springframework.util.xml.DomUtils;
import org.w3c.dom.Element;

/**
 * @author <A href="mailto:snc_43@yahoo.com">Surya Choudhury</A>
 */
public class ConditionalBeanDefinitionParser implements BeanDefinitionParser {

    private final Log cLog = LogFactory.getLog(ConditionalBeanDefinitionParser.class);
    private final Properties config = new Properties();

    /** Default placeholder prefix: "${" */
    public static final String DEFAULT_PLACEHOLDER_PREFIX = "${";
    /** Default placeholder suffix: "}" */
    public static final String DEFAULT_PLACEHOLDER_SUFFIX = "}";

    /**
     * Parse the "condition" element and check the mandatory "test" attribute. If
     * the provided resources or the system property named by test is null/empty/false
     * (i.e. not defined) then return null, which is the same as not defining the bean.
     */
    public BeanDefinition parse(Element element, ParserContext parserContext) {
        try {
            if (!DomUtils.nodeNameEquals(element, "condition")) {
                return null;
            }
            String test = element.getAttribute("test");
            String src = element.getAttribute("src");
            String varnames = element.getAttribute("varnames");

            // The src attribute is mandatory: load the properties it points to.
            if (StringUtils.isEmpty(src)) {
                throw new IllegalArgumentException("src attribute not set.");
            }
            Resource resource = parserContext.getReaderContext().getResourceLoader().getResource(src.trim());
            InputStream in = resource.getInputStream();
            try {
                config.load(in);
            } finally {
                in.close();
            }

            // With varnames present, treat test as a JEXL expression over those variables.
            if (StringUtils.isNotEmpty(test) && StringUtils.isNotBlank(varnames)) {
                // Strip the surrounding "${" and "}" to get the bare expression.
                String expression = test.substring(
                        DEFAULT_PLACEHOLDER_PREFIX.length(),
                        test.length() - DEFAULT_PLACEHOLDER_SUFFIX.length()).trim();

                JexlContext jc = JexlHelper.createContext();
                for (String varname : varnames.split(",")) {
                    varname = varname.trim();
                    jc.getVars().put(varname, config.get(varname));
                }
                Expression e = ExpressionFactory.createExpression(expression);
                Object result = e.evaluate(jc);
                if (result != null) {
                    if (result.toString().equalsIgnoreCase("true")) {
                        return parseAndRegisterBean(childBean(element), parserContext);
                    } else if (result.toString().equalsIgnoreCase("false")) {
                        return null;
                    } else if (StringUtils.isNotEmpty(getProperty(test))) {
                        return parseAndRegisterBean(childBean(element), parserContext);
                    } else {
                        cLog.warn("Condition bean creation did not happen as test or src attribute is not set.");
                    }
                }
            } else {
                // Otherwise fall back to a plain "is the property defined and not false" check.
                if (StringUtils.isNotEmpty(getProperty(test))) {
                    return parseAndRegisterBean(childBean(element), parserContext);
                } else {
                    cLog.warn("Condition bean creation did not happen as test or src attribute is not set.");
                }
            }
        } catch (Exception e) {
            cLog.error("Fail to load condition bean.", e);
        }
        return null;
    }

    /** Returns the single nested bean element of the condition element. */
    private Element childBean(Element element) {
        return DomUtils.getChildElementByTagName(element, "bean");
    }

    /**
     * Get the value of a named resource/system property (it may not be defined).
     *
     * @param strVal The name of a system property. The property may
     * optionally be surrounded in Ant/EL-style brackets. e.g. "${propertyname}"
     *
     * @return the property value, or null if it is undefined or "false"
     */
    private String getProperty(String strVal) {
        cLog.debug("Resolving property: " + strVal);
        if (StringUtils.isEmpty(strVal)) {
            return null;
        }
        if (!(strVal.startsWith(DEFAULT_PLACEHOLDER_PREFIX) && strVal.endsWith(DEFAULT_PLACEHOLDER_SUFFIX))) {
            return System.getProperty(strVal);
        }
        String name = strVal.substring(
                DEFAULT_PLACEHOLDER_PREFIX.length(),
                strVal.length() - DEFAULT_PLACEHOLDER_SUFFIX.length()).trim();

        // The properties loaded from src win; system properties are the fallback.
        String returnValue = config.getProperty(name);
        if (returnValue == null) {
            returnValue = System.getProperty(name);
        }
        // A literal "false" is treated the same as an undefined property.
        if (StringUtils.isNotEmpty(returnValue) && returnValue.trim().equalsIgnoreCase("false")) {
            returnValue = null;
        }
        if (cLog.isDebugEnabled()) {
            cLog.debug("Returned : " + returnValue);
        }
        return returnValue;
    }

    private BeanDefinition parseAndRegisterBean(Element element, ParserContext parserContext) {
        BeanDefinitionParserDelegate delegate = parserContext.getDelegate();
        BeanDefinitionHolder holder = delegate.parseBeanDefinitionElement(element);
        if (holder == null) {
            // The nested bean definition could not be parsed; nothing to register.
            return null;
        }
        BeanDefinitionReaderUtils.registerBeanDefinition(holder, parserContext.getRegistry());
        return holder.getBeanDefinition();
    }

}

Step 4) Register the Handler and the Schema

META-INF/spring.handlers

http\://mycompany.com/springbeans/condition=com.mycompany.product.spring.ConditionalBeanNamespaceHandler

META-INF/spring.schemas
http\://mycompany.com/springbeans/condition/condition.xsd=com/mycompany/product/spring/condition.xsd

Special note for developing in Tomcat: Spring looks for META-INF/spring.handlers and META-INF/spring.schemas on the classpath. webapp/META-INF is not on the Tomcat classpath, so you need to put these files inside a JAR or (hack warning) in the webapp/WEB-INF/classes/META-INF directory.

Other conditional operators such as <, <=, >, >=, ==, !=, || and && are also legal in expressions.

NOTE: Because the bean definition file is XML, characters like < and & cannot appear literally in an attribute value, so it is better to write them as XML entities: a '&&' condition is written as '&amp;&amp;', and similarly '>' as '&gt;' and '<' as '&lt;'.

The attribute varnames is optional, but when an expression is to be evaluated (as in the example above) all the variable names it uses must be passed, comma separated, as the varnames value.
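Escaping does not change what the expression evaluator sees, because the XML parser decodes the entities before the attribute value is handed to any parser class. The following standalone Java sketch shows this using only the JDK's DOM parser; the class name EscapedAttributeDemo and the sample expression are made up for the illustration and are not part of the Spring setup.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Hypothetical demo: read an escaped attribute value back through a DOM parser. */
final class EscapedAttributeDemo {

    /** Parses a one-element XML snippet and returns the value of its "test" attribute. */
    static String readTestAttribute(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            return doc.getDocumentElement().getAttribute("test");
        } catch (Exception e) {
            // Wrapped so callers need not handle the checked parser exceptions.
            throw new IllegalArgumentException("Could not parse XML snippet", e);
        }
    }
}
```

Feeding it an element whose test attribute is written with &gt;, &amp;&amp; and &lt; returns the same expression with plain >, && and <, which is the form an expression engine expects.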

Limitations

A common way of configuring an application with property replacement is to use a PropertyPlaceholderConfigurer bean. Unfortunately this is a two-pass process: the first pass parses a bean definition, the second pass does property replacement. The Extensible XML Authoring API only allows you to interact with the first pass, so we are limited to things that are defined at bean definition time, such as system properties or a properties file supplied through the "src" attribute. (The ResourceLoader of the same parser context is used to load the src properties file.)
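The lookup order this implies (properties file first, then system properties, with "false" counting as undefined) can be shown in isolation. The sketch below is a simplified, standalone approximation in plain Java, not the parser class itself: the name ConditionPropertyLookup is hypothetical, and unlike the parser it simply returns null for anything that is not a ${...} placeholder.

```java
import java.util.Properties;

/** Hypothetical, simplified version of the property lookup used for conditions. */
final class ConditionPropertyLookup {

    /**
     * Resolves "${name}" against the loaded properties first and system
     * properties second; a value of "false" is treated as undefined.
     */
    static String resolve(String placeholder, Properties loaded) {
        if (placeholder == null || !placeholder.startsWith("${") || !placeholder.endsWith("}")) {
            return null; // simplification: only ${...} placeholders are handled here
        }
        String name = placeholder.substring(2, placeholder.length() - 1).trim();
        if (name.isEmpty()) {
            return null;
        }
        String value = loaded.getProperty(name);
        if (value == null) {
            value = System.getProperty(name);
        }
        if (value != null && value.trim().equalsIgnoreCase("false")) {
            return null;
        }
        return value;
    }
}
```

Everything it consults exists before any bean is created, which is why this kind of check works in the first pass while values injected later by a PropertyPlaceholderConfigurer do not.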

Note: Both test and src are mandatory attributes for <if:condition/>

The spring-beans-2.0.xsd forces you to define a <if:condition/> for single beans – you cannot put a <if:condition/> block around a group of beans.

The spring-beans-2.0.xsd prevents you from defining two beans with the same name in the same XML file, even if different <if:condition/> conditions guarantee only one of the definitions will be in force at any given time.
