Guide to mdReporting

Purpose

Walk a user through developing reports using mdReporting.

Introduction

There are three elements to using reporting in your application: instrumentation, log processing, and report viewing. These correspond to: constructing crumb URLs that transmit data about what is being displayed on the page; CrumbLog configuration that turns logs into counts of key/value pairs; and templates that turn raw counts into end-user reports.

Instrumentation

Image requests to the reporting server are used to track site activity. These images are referred to as crumbs and take the format:

    /image/mdwtc-${PLACEMENT_CODE}-${OBJECT_TYPE}-${OBJECT_ID}?other=data&that=you&want=to&track=counts
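As a rough illustration of what log processing has to extract from each crumb request, here is a sketch of a parser in Python. This is not the real CrumbLog code; the field names and the sample URL are made up, and it simply assumes the path format shown above.

```python
from urllib.parse import urlparse, parse_qs

def parse_crumb(url):
    """Split a crumb image request into its path fields and query data.

    Illustrative only -- the real CrumbLog parser is internal. Assumes
    the /image/mdwtc-PLACEMENT-TYPE-ID path format described above.
    """
    parsed = urlparse(url)
    name = parsed.path.rsplit("/", 1)[-1]            # e.g. "mdwtc-HP1-MDRE_AD-123"
    prefix, placement, obj_type, obj_id = name.split("-", 3)
    query = {k: v[0] for k, v in parse_qs(parsed.query).items()}
    return {"placement": placement, "type": obj_type, "id": obj_id, **query}

parse_crumb("/image/mdwtc-HP1-MDRE_AD-123?track=counts")
# {"placement": "HP1", "type": "MDRE_AD", "id": "123", "track": "counts"}
```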

Configuration

The documentation for the CrumbLog program you plan to use should explain how to turn an image request into raw reporting data. See the CrumbLog tables section of the documentation in //eng/products/mdReporting/CrumbProcessing/docs/index.html for details as of this writing.

Report Viewing

See MTL tag documentation for information about REPORTING tags.

Reporting Design for Real Estate

Below is information about the design of reporting for real estate that should familiarize you with the capabilities of the mdReporting system and the concepts to consider when designing how to store the raw reporting information. What to count is usually dictated by the needs of the application, but how the information is stored can make a dramatic difference in the performance of reports.

Determine the type of report you want.

For the Jacksonville real estate reports, the data is structured in the following way:

At each level, you can see a rollup of the level immediately below it grouped by values at that level. For example, a broker can see all the info an agent can see, but instead of being broken down by listing, the listings are rolled up and the report is broken down by agent.

Property (jacksonville.com)
    |
    -- Broker
        |
        -- Agent
            |
            -- Listing
            |
            -- Listing
            |
            -- Listing

Each level contains hit counts for the selected month, broken out by listing placement. The report also contains information about leads.

Determine what information needs to be captured to create report

At this point there are usually many options for what data needs to be captured to create a report. For a first iteration, we'll just pick one set of information and we can adjust if we think we'll run into problems later.

Minimally, we need to know how many times a particular listing was shown on a given date on a given page. This also happens to be the most detailed level of our report, which is typical.
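A minimal sketch of that capture, assuming nothing about the real storage: one counter keyed by the facts we just said we need. All names here are illustrative.

```python
from collections import Counter

# Hypothetical minimal capture: one counter keyed by (date, placement, listing).
hits = Counter()

def record_view(date, placement, listing_id):
    """Count one display of a listing on a given page for a given date."""
    hits[(date, placement, listing_id)] += 1

record_view("DAY 1", "SEARCH VIEW", "A1A")
record_view("DAY 1", "SEARCH VIEW", "A1A")
record_view("DAY 1", "DETAIL VIEW", "A1A")
```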

Decide on storage structure of data

Just as you have many options for collecting data, you generally have many options for the storage format. At this point you'll probably walk through many iterations, looking for one that captures all of the information you need in a way that is efficient for building the reports while avoiding the storage of duplicate information.

Storage of tracking crumbs involves two concepts: objects and attributes. Objects exist to allow retrieval of the set of attributes that contain the information you want. Attributes are where counts are maintained, and they can be structured in parent/child relationships. Each attribute is a key/value pair and carries a hit count recording how many times it has been seen.
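A sketch of the object/attribute model in Python may make this concrete. The class and method names are invented for illustration and are not the real mdReporting schema.

```python
class Attribute:
    """Sketch of the attribute concept: a key/value pair with a hit
    count, arranged in a parent/child tree under an object. Names are
    illustrative, not the real mdReporting schema."""

    def __init__(self, key, value):
        self.key, self.value = key, value
        self.hits = 0
        self.children = {}

    def child(self, key, value):
        """Find or create the child attribute for (key, value)."""
        return self.children.setdefault((key, value), Attribute(key, value))

# An object is just a named root holding the attribute tree.
site = Attribute("OBJECT", "jacksonville homefinder")
leaf = site.child("BROKER", "B1").child("AGENT", "A1")
leaf.hits += 1
```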

One possible storage arrangement for the real estate reports is to treat each site as an object (the top level of the report) and then store the information in a hierarchy similar to our report.

OBJECT:
   jacksonville homefinder.

ATTRIBUTES (KEY - VALUE)
    BROKER - BROKER_ID
     |
     -- AGENT - AGENT_ID
         |
         -- LISTING - LISTING-ID
             |
             -- PLACEMENT - PLACEMENT CODE
The initial problem with this storage is that we haven't allowed for the concept of date. We could place the date at the top or the bottom of the tree:
    DATE - DATE VALUE
     |
     -- BROKER - BROKER_ID
         |
         -- AGENT - AGENT_ID
             |
             -- LISTING - LISTING-ID
                 |
                 -- PLACEMENT - PLACEMENT CODE

--------------------
 or
--------------------

    BROKER - BROKER_ID
     |
     -- AGENT - AGENT_ID
         |
         -- LISTING - LISTING-ID
             |
             -- PLACEMENT - PLACEMENT CODE
                |
                -- DATE - DATE VALUE

It is generally a good idea to keep attribute duplication to a minimum (the same key/value branched over several parts of the tree; this should become clearer later, so don't worry too much if it doesn't make sense at this point), so we'll choose to store the DATE at the top of the tree. Having the date as a parent attribute also makes it easier to run reports at the moment because of the way the mtl works.
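To make the duplication concern concrete, here is a rough count of how many copies of a single day's DATE attribute each layout creates. The sizes are illustrative, not real data.

```python
# Toy sizes: 2 brokers x 2 agents x 2 listings x 3 placements.
brokers, agents, listings, placements = 2, 2, 2, 3

# DATE at the top: one DATE attribute per day, shared by the whole subtree.
date_nodes_top = 1

# DATE at the bottom: the same (DATE, value) pair is repeated under
# every placement leaf in the tree.
date_nodes_bottom = brokers * agents * listings * placements

print(date_nodes_top, date_nodes_bottom)   # 1 copy vs 24 copies per day
```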

See if we've created any performance issues. (Reports will be slow or won't run at all)

A good rule of thumb is that the answer here will be yes the first couple of times through. So, how do you catch performance issues? You could just try it out, which works but is a painfully slow process, and it can take a while before the problems show up. That sucks because then you have to both start over and convert data. There are two things that will generally lead to issues in getting the reports to run.

Those descriptions can be a little hard to fathom, so we'll take a look at an example of each.

Rolling up a deep tree structure.

Let's assume that we have 2 brokers for our property, each with 2 agents, each with 2 listings, and that we've been collecting data for a month. At that point, our data would have the following structure (broken out for 1 day):

DATE - DAY 1
 |
 -- BROKER - BROKER A
     |
     -- AGENT - AGENT A1
         |
         -- LISTING - LISTING A1A
            |
            -- PLACEMENT - SEARCH VIEW - 15 hits
            -- PLACEMENT - DETAIL VIEW - 10 hits
         |
         -- LISTING - LISTING A1B
            |
            -- PLACEMENT - SEARCH VIEW - 15 hits
            -- PLACEMENT - DETAIL VIEW - 10 hits
            -- PLACEMENT - MAP VIEW    -  5 hits
     |
     -- AGENT - AGENT A2
         -- LISTING - LISTING A2A
            |
            -- PLACEMENT - SEARCH VIEW -  7 hits
            -- PLACEMENT - DETAIL VIEW -  3 hits
         |
         -- LISTING - LISTING A2B
            |
            -- PLACEMENT - SEARCH VIEW - 20 hits
            -- PLACEMENT - DETAIL VIEW -  9 hits
            -- PLACEMENT - MAP VIEW    -  3 hits
            -- PLACEMENT - PRINT LEAD  -  3 hits
 -- BROKER - BROKER B
     |
     -- AGENT - AGENT B1
         |
         -- LISTING - LISTING B1A
            |
            -- PLACEMENT - SEARCH VIEW - 23 hits
            -- PLACEMENT - DETAIL VIEW -  8 hits
            -- PLACEMENT - MAP VIEW    -  2 hits
            -- PLACEMENT - PRINT LEAD  -  2 hits
         |
         -- LISTING - LISTING B1B
            |
            -- PLACEMENT - SEARCH VIEW -  5 hits
            -- PLACEMENT - DETAIL VIEW -  2 hits
     |
     -- AGENT - AGENT B2
         -- LISTING - LISTING B2A
            |
            -- PLACEMENT - SEARCH VIEW -  9 hits
            -- PLACEMENT - DETAIL VIEW -  8 hits
            -- PLACEMENT - MAP VIEW    -  4 hits
         |
         -- LISTING - LISTING B2B
            |
            -- PLACEMENT - SEARCH VIEW -  7 hits
            -- PLACEMENT - DETAIL VIEW -  3 hits
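
To see what each report level has to compute from this data, here is a sketch that rolls up the Day 1 numbers above. Nested dicts stand in for the attribute tree; this is Python for illustration, not mtl.

```python
# Day 1 sample data from above: broker -> agent -> listing -> placement hits.
day1 = {
    "BROKER A": {
        "AGENT A1": {
            "LISTING A1A": {"SEARCH VIEW": 15, "DETAIL VIEW": 10},
            "LISTING A1B": {"SEARCH VIEW": 15, "DETAIL VIEW": 10, "MAP VIEW": 5},
        },
        "AGENT A2": {
            "LISTING A2A": {"SEARCH VIEW": 7, "DETAIL VIEW": 3},
            "LISTING A2B": {"SEARCH VIEW": 20, "DETAIL VIEW": 9,
                            "MAP VIEW": 3, "PRINT LEAD": 3},
        },
    },
    "BROKER B": {
        "AGENT B1": {
            "LISTING B1A": {"SEARCH VIEW": 23, "DETAIL VIEW": 8,
                            "MAP VIEW": 2, "PRINT LEAD": 2},
            "LISTING B1B": {"SEARCH VIEW": 5, "DETAIL VIEW": 2},
        },
        "AGENT B2": {
            "LISTING B2A": {"SEARCH VIEW": 9, "DETAIL VIEW": 8, "MAP VIEW": 4},
            "LISTING B2B": {"SEARCH VIEW": 7, "DETAIL VIEW": 3},
        },
    },
}

def rollup(tree):
    """Sum every hit count under a subtree -- what each report level shows."""
    if isinstance(tree, int):
        return tree
    return sum(rollup(child) for child in tree.values())

agent_totals = {a: rollup(t) for b in day1.values() for a, t in b.items()}
broker_totals = {b: rollup(t) for b, t in day1.items()}
```

With the numbers above this yields 55, 45, 42, and 31 hits for the four agents and 100 and 73 for the two brokers. Each level of the report repeats this summation over its subtree, which is what makes the deep structure expensive.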

Replicate this for 30 days and you can imagine the amount of information that would exist for even the smallest of sites. Using the existing tags, to create a report with broker, agent, and listing info all on one page you would have to do the following:

As you can see, that's not going to be a pretty mtl template. At minimum you are going to have 4 nested loops of attribute tags. You are also going to need code at each level to handle the cases where the value at that level changes and where you're on the last pass through that level. I've done it before and it isn't pretty. The two problems you'll run into at this point are:

This is the type of report that was required for mdTransit (substitute zipcode for broker, make for agent and model for listing). We wound up having to create a specialized table to handle it that is a prebuilt rollup of the data. Messy and time consuming and it is still barely efficient enough to run dynamically.

Fortunately the reports needed for real estate don't need the breakdown so there are a few approaches to fixing the problem.

The first is relatively straightforward: by collecting the placement counts at each level in our tree, the mtl for the report would look like this:

That cuts what was going to be 5 loops (4 nested) down to 2 loops (1 nested).

While option 1 gets us much better performance, it does so at the expense of storage space and processing time when loading. Because we'll be inserting a much greater number of attributes, it will take much longer to process the access logs.
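A sketch of what option 1 means at load time: each crumb increments a placement count at every ancestor level, not just the listing leaf. The function and key layout are invented for illustration; they are not real CrumbLog output.

```python
from collections import Counter

# Option 1 sketch: when a crumb arrives, bump the placement count at
# every level of the tree, not just the listing leaf.
counts = Counter()

def process_crumb(date, broker, agent, listing, placement):
    for level_key in [(date,),
                      (date, broker),
                      (date, broker, agent),
                      (date, broker, agent, listing)]:
        counts[level_key + (placement,)] += 1

process_crumb("DAY 1", "BROKER A", "AGENT A1", "LISTING A1A", "SEARCH VIEW")
process_crumb("DAY 1", "BROKER A", "AGENT A2", "LISTING A2A", "SEARCH VIEW")

# The broker-level report is now a direct lookup instead of a 4-level rollup,
# but every crumb writes 4 counters instead of 1.
counts[("DAY 1", "BROKER A", "SEARCH VIEW")]   # 2
```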

Let's see if option 2 is viable.

For option 2 to work, we'd need to be able to write mtl that would say:

That looks much better. Unfortunately, we don't have mtl to do that yet.

What should we use as the key for listing, agent, and broker? For broker and agent, probably CLASSIFIEDS.AD_CUSTOMER.CUSTOMER_ID. For listing, we're thinking about using the MLS#. The house street address still has to be available even if the ad is gone, and we need to determine how to handle that. We also need to handle the case where the street address changes (we could wind up with two attributes keyed on street address).

    DATE - DATE VALUE
     |
     -- BROKER - BROKER_ID
         |
         -- AGENT - AGENT_ID
             |
             -- LISTING - MLSNUMBER
                 |
                 -- STREET ADDRESS
                 |
                 -- PLACEMENT - PLACEMENT CODE

An alternative storage scheme might involve treating an agent and broker as "Objects" instead of levels of object attributes. One disadvantage to this is that the relationship between agent and broker will be somewhat lost from the reporting data.

Initial crumb setup: $(PLACEMENT_CODE)-MDRE_AD-$(SITE_ID).gif?aid=$(CUSTOMER_ID)&bid=$(CUSTOMER_PARENT_ID)&lid=$(LISTING_ID) with data format:

    DATE - DATE VALUE
     |
     -- PLACEMENT_CODE - PLACEMENT_CODE
     |
     -- BROKER - BROKER_ID
     |   |
     |   -- PLACEMENT_CODE_B - PLACEMENT_CODE
     |
     -- AGENT - AGENT_ID
         |
         -- PLACEMENT_CODE_A - PLACEMENT_CODE
         |
         -- LISTING - MLSNUMBER
             |
             -- STREET ADDRESS
             |
             -- PLACEMENT_CODE_L - PLACEMENT CODE

Where did this data structure come from? For efficiency, we are going to try duplicating all of the placement information at each level (the PLACEMENT_CODE values). Since the relationship between broker and agent is maintained in the ad_customer table, there doesn't appear to be a reason to put the agent info two levels away from the date info (hopefully a speed boost). Listings are kept underneath the agents they belong to since the ads may be gone by the time a report is run, which is also the reason the street address data is stored.

After talking with Terry the following is now known:

Put together two reports for the previous data structure. The first uses an ad_customer query as its topmost loop, then uses the ids it sees to pull together the broker/agent reports. It is not exceptionally quick but has the advantage of showing all current customers. At the listing level, the agent ID is used to gather all of the possible attribute ids that could be serving as parents for listings; those ids are then used to loop over each listing, rolling up the placement numbers.

The second report uses a new mtl tag, atl.log_object_attribute_rollup, which allows you to specify the immediate parent key, any ancestor key/value, and the child key; it sums child values grouped by parent values. It is quick but does not show agents/brokers for which there are no listings.
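The described semantics of that tag can be sketched in Python over flat (path, hits) rows. This is a conceptual model of the query only; the data representation is invented and this is not the tag's implementation.

```python
def ancestor_rollup(rows, parent_key, ancestor, child_key):
    """Sum child hit counts grouped by parent value, filtered by an
    ancestor key/value pair -- a sketch of what the
    atl.log_object_attribute_rollup tag is described as doing. Each row
    is (path, hits), where path is a tuple of (key, value) pairs from
    the root attribute down to the counted child.
    """
    totals = {}
    for path, hits in rows:
        keys = dict(path)
        if keys.get(ancestor[0]) != ancestor[1]:
            continue                      # ancestor filter fails
        if path[-1][0] != child_key or parent_key not in keys:
            continue                      # not a row for this child/parent
        parent_value = keys[parent_key]
        totals[parent_value] = totals.get(parent_value, 0) + hits
    return totals

rows = [
    ((("DATE", "DAY 1"), ("BROKER", "A"), ("AGENT", "A1"),
      ("LISTING", "A1A"), ("PLACEMENT", "SEARCH VIEW")), 15),
    ((("DATE", "DAY 1"), ("BROKER", "A"), ("AGENT", "A1"),
      ("LISTING", "A1B"), ("PLACEMENT", "SEARCH VIEW")), 15),
    ((("DATE", "DAY 1"), ("BROKER", "A"), ("AGENT", "A2"),
      ("LISTING", "A2A"), ("PLACEMENT", "SEARCH VIEW")), 7),
]

# Placement hits grouped by listing, restricted to agent A1.
ancestor_rollup(rows, "LISTING", ("AGENT", "A1"), "PLACEMENT")
# {"A1A": 15, "A1B": 15}
```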

Upon examination, the above data structure has some problems, the most notable being that the street address for each listing will be duplicated endlessly and needlessly. It also requires us to pass the street number for the listing with every crumb.

Alternate data structures to test

Alternate 1

LOG_OBJECT: MDRE_SITE_A - SITE_ID

    DATE - DATE VALUE
     |
     -- PLACEMENT_CODE - PLACEMENT_CODE
     |
     -- BROKER - BROKER_ID
     |   |
     |   -- PLACEMENT_CODE_B - PLACEMENT_CODE
     |
     -- AGENT - AGENT_ID
         |
         -- PLACEMENT_CODE_A - PLACEMENT_CODE
         |
         -- LISTING - AD ID
             |
             -- PLACEMENT_CODE_L - PLACEMENT CODE


LOG_OBJECT: CLASSIFIED_AD/REAL ESTATE - AD ID
        
    ADDRESS - STREET ADDRESS

Alternate 2

LOG_OBJECT: MDRE_SITE_B - SITE_ID

    DATE - DATE VALUE
     |
     -- BROKER - BROKER_ID
         |
         -- AGENT - AGENT_ID
             |
             -- LISTING - AD ID
                 |
                 -- PLACEMENT_CODE - PLACEMENT CODE


LOG_OBJECT: CLASSIFIED_AD/REAL ESTATE - AD ID
        
    ADDRESS - STREET ADDRESS

Alternate 3

LOG_OBJECT: MDRE_SITE_C - SITE_ID

    DATE - DATE VALUE
     |
     -- BROKER - BROKER_ID
     |   |
     |   -- PLACEMENT_CODE_B - PLACEMENT CODE
     |
     -- AGENT - AGENT_ID
         |
         -- LISTING - AD ID
             |
             -- PLACEMENT_CODE_L - PLACEMENT CODE


LOG_OBJECT: CLASSIFIED_AD/REAL ESTATE - AD ID
        
    ADDRESS - STREET ADDRESS

Note that the second LOG_OBJECT type (CLASSIFIED_AD/REAL ESTATE) is the same for all of these alternate storage strategies. This is done because we are going to be looking up the street address from the database, and the TYPE and ID are constrained when we do that. The downside is that we are going to have to use two crumbs to get the data stored this way.

I've had to add an ancestor_rollup tag that sums counts like the rollup tag, but lets you group by any ancestor.

On initial testing, the data structure for alternate 2 (MDRE_SITE_B) appears to have trouble running in the allotted time. The atl.log_object_attribute_report tag, which lets you select any child attribute value and its hits, summed and grouped by any parent attribute and filtered on the existence of any ancestor attribute key/value, is taking more than 2 minutes to run.

Conclusion

In the end, Alternate structure 1 was used for real estate reporting because of performance issues with the other structures. During testing, 7 variations of data storage were tried before Alternate 1 was selected.

Currently, the best advice for using reporting is to create sample data sets and test ruthlessly for performance issues before launching a reporting system.