Statistical Theory for High Performance Analytics

One thing that struck me when I was a student of statistics is that most theories of sampling, hypothesis testing and modeling were built in an age when data was scarce, computation was manual and tests were aimed at detecting large differences.

Looking now at the explosion of data, at the on-demand processing power enabled by cloud computing, and at the competitive dynamics of business, I venture my opinion:

1) We now have far more data, even an excess of it, than statisticians had a generation ago.

2) We now have extremely powerful computing devices, provided we can process our algorithms in parallel.

3) Even a slight uptick in modeling efficiency or mild uptick in business insight can provide huge monetary savings.
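
To make points 1 and 3 concrete, here is a minimal sketch of why tests designed for small samples behave so differently on very large ones. It assumes a two-sample z-test with a known common standard deviation and equal group sizes; the function name and the numbers (a difference of 0.01 standard deviations) are my own illustrative choices, not taken from any particular dataset.

```python
import math

def two_sample_z_pvalue(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means (known common sd, equal groups)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = abs(mean_diff) / standard_error
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))

# The same tiny effect, tested at increasing sample sizes.
for n in (100, 10_000, 1_000_000):
    p = two_sample_z_pvalue(mean_diff=0.01, sd=1.0, n_per_group=n)
    print(f"n per group = {n:>9,}  p-value = {p:.3g}")
```

The effect size never changes, only the sample size does, yet the verdict flips from "nothing there" to "overwhelmingly significant". Whether such a small difference matters is a business question, which is exactly where point 3 comes in.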

Call it High Performance Analytics, Big Data or Cloud Computing: are we sure statisticians are creating enough mathematical theory, or are we just taking it easy in our statistics classrooms, only to be subjected to something completely different when we hit the analytics workplace?

Do we need more theorists as well? Is there ANY incentive for corporations with private R&D teams to share their latest cutting-edge theoretical work outside their corporate silo?

 

Related-

“A mathematician is a machine for turning coffee into theorems”

Oracle launches its version of R #rstats

From-

http://www.oracle.com/us/corporate/press/1515738

Integrates R Statistical Programming Language into Oracle Database 11g

News Facts

Oracle today announced the availability of Oracle Advanced Analytics, a new option for Oracle Database 11g that bundles Oracle R Enterprise together with Oracle Data Mining.
Oracle R Enterprise delivers enterprise class performance for users of the R statistical programming language, increasing the scale of data that can be analyzed by orders of magnitude using Oracle Database 11g.
R has attracted over two million users since its introduction in 1995, and Oracle R Enterprise dramatically advances capability for R users. Their existing R development skills, tools, and scripts can now also run transparently, and scale against data stored in Oracle Database 11g.
Customer testing of Oracle R Enterprise for Big Data analytics on Oracle Exadata has shown up to 100x increase in performance in comparison to their current environment.
Oracle Data Mining, now part of Oracle Advanced Analytics, helps enable customers to easily build and deploy predictive analytic applications that help deliver new insights into business performance.
Oracle Advanced Analytics, in conjunction with Oracle Big Data Appliance, Oracle Exadata Database Machine and Oracle Exalytics In-Memory Machine, delivers the industry’s most integrated and comprehensive platform for Big Data analytics.

Comprehensive In-Database Platform for Advanced Analytics

Oracle Advanced Analytics brings analytic algorithms to data stored in Oracle Database 11g and Oracle Exadata as opposed to the traditional approach of extracting data to laptops or specialized servers.
With Oracle Advanced Analytics, customers have a comprehensive platform for real-time analytic applications that deliver insight into key business subjects such as churn prediction, product recommendations, and fraud alerting.
By providing direct and controlled access to data stored in Oracle Database 11g, customers can accelerate data analyst productivity while maintaining data security throughout the enterprise.
Powered by decades of Oracle Database innovation, Oracle R Enterprise helps enable analysts to run a variety of sophisticated numerical techniques on billion row data sets in a matter of seconds making iterative, speed of thought, and high-quality numerical analysis on Big Data practical.
Oracle R Enterprise drastically reduces the time to deploy models by eliminating the need to translate the models to other languages before they can be deployed in production.
Oracle R Enterprise integrates the extensive set of Oracle Database data mining algorithms, analytics, and access to Oracle OLAP cubes into the R language for transparent use by R users.
Oracle Data Mining provides an extensive set of in-database data mining algorithms that solve a wide range of business problems. These predictive models can be deployed in Oracle Database 11g and use Oracle Exadata Smart Scan to rapidly score huge volumes of data.
The tight integration between R, Oracle Database 11g, and Hadoop enables R users to write one R script that can run in three different environments: a laptop running open source R, Hadoop running with Oracle Big Data Connectors, and Oracle Database 11g.
Oracle provides single vendor support for the entire Big Data platform spanning the hardware stack, operating system, open source R, Oracle R Enterprise and Oracle Database 11g.
To enable easy enterprise-wide Big Data analysis, results from Oracle Advanced Analytics can be viewed from Oracle Business Intelligence Foundation Suite and Oracle Exalytics In-Memory Machine.

Supporting Quotes

“Oracle is committed to meeting the challenges of Big Data analytics. By building upon the analytical depth of Oracle SQL, Oracle Data Mining and the R environment, Oracle is delivering a scalable and secure Big Data platform to help our customers solve the toughest analytics problems,” said Andrew Mendelsohn, senior vice president, Oracle Server Technologies.
“We work with leading edge customers who rely on us to deliver better BI from their Oracle Databases. The new Oracle R Enterprise functionality allows us to perform deep analytics on Big Data stored in Oracle Databases. By leveraging R and its library of open source contributed CRAN packages combined with the power and scalability of Oracle Database 11g, we can now do that,” said Mark Rittman, co-founder, Rittman Mead.
Oracle Advanced Analytics, an option to Oracle Database 11g Enterprise Edition, extends the database into a comprehensive advanced analytics platform through two major components: Oracle R Enterprise and Oracle Data Mining. With Oracle Advanced Analytics, customers have a comprehensive platform for real-time analytic applications that deliver insight into key business subjects such as churn prediction, product recommendations, and fraud alerting.

Oracle R Enterprise tightly integrates the open source R programming language with the database to further extend the database with R's library of statistical functionality, and pushes down computations to the database. Oracle R Enterprise dramatically advances the capability for R users: their existing R development skills, tools and scripts can now run transparently and scale against data stored in Oracle Database 11g.
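
The idea of pushing computation down to the database, rather than extracting data to a laptop, can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for a database, it is not Oracle R Enterprise's API, and the `sales` table, its columns and both function names are hypothetical.

```python
import sqlite3

# An in-memory database standing in for the enterprise database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 20.0), ("south", 5.0)],
)

def mean_by_region_client_side(connection):
    """Traditional approach: pull every row out, then aggregate in the client."""
    rows = connection.execute("SELECT region, amount FROM sales").fetchall()
    totals = {}
    for region, amount in rows:
        running_sum, count = totals.get(region, (0.0, 0))
        totals[region] = (running_sum + amount, count + 1)
    return {region: s / c for region, (s, c) in totals.items()}

def mean_by_region_in_database(connection):
    """Pushdown approach: the database aggregates; only one row per region moves."""
    rows = connection.execute(
        "SELECT region, AVG(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows)

print(mean_by_region_client_side(conn))
print(mean_by_region_in_database(conn))
```

Both functions give the same answer. The difference is how much data crosses the wire: every row in the first case, one row per group in the second, which is what starts to matter at billions of rows.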

Oracle Data Mining provides powerful data mining algorithms that run as native SQL functions for in-database model building and model deployment. It can be accessed through the SQL Developer extension Oracle Data Miner to build, evaluate, share and deploy predictive analytics methodologies. At the same time the high-performance Oracle-specific data mining algorithms are accessible from R.

BENEFITS

  • Scalability—Allows customers to easily scale analytics as data volume increases by bringing the algorithms to where the data resides – in the database
  • Performance—With analytical operations performed in the database, R users can take advantage of the extreme performance of Oracle Exadata
  • Security—Provides data analysts with direct but controlled access to data in Oracle Database 11g, accelerating data analyst productivity while maintaining data security
  • Save Time and Money—Lowers overall TCO for data analysis by eliminating data movement and shortening the time it takes to transform “raw data” into “actionable information”
Oracle R Hadoop Connector gives R users high-performance native access to the Hadoop Distributed File System (HDFS) and the MapReduce programming framework.
It is an R package.
From the datasheet at

WPS Version 3 Released

Apparently you can now use the language of SAS on a Mac, using the British software WPS.

http://teamwpc.co.uk/press/wps3_released

WPS software ready for Big Data, Cloud Computing and Apple Mac

LONDON, UK – 2 February 2012 – World Programming today released version 3 of their leading WPS data processing and analytics software.

Big data processing at affordable prices is driving adoption of WPS in the datacentre and across the enterprise for analytics, business intelligence and prediction, data management, ETL and reporting.

WPS version 3 boosts support for the language of SAS, extending core data processing capabilities as well as analytics and graphing. Further improvements have been made to performance and scalability together with a wide range of supported platforms.

The popularity of Linux platforms continues to grow as organisations look for platform flexibility and control over costs. WPS Link technology with WPS version 3 offers the option to use the popular WPS Workbench user interface (GUI) to connect to and run programs in server, grid, cluster and cloud environments, suiting modern datacentre-driven compute facilities.

Version 3 also brings the WPS Workbench user interface to Mac OS X, Solaris, AIX and Linux platforms including Linux for System z. The WPS Workbench benefits from significant enhancements including: improved handling and display of automatically generated output; importing and exporting data; multiple concurrent program execution; code generation templates and more.

WPS statistical analysis capability continues to expand. Organisations are increasingly looking to use their data to provide the insight, prediction and intelligence to make decisions that will affect their future. WPS has the power to handle the big data volumes of the modern enterprise and to produce results that can be depended on.

WPS version 3 is available as a free upgrade to all WPS license holders.

Related Links

www.teamwpc.co.uk/support/release/wps : Summary of all the major new features in WPS version 3 plus additional downloadable documents (release notes, change log, issues).

www.teamwpc.co.uk/products/wps : Explore in more detail all the features of WPS software.

More Information About WPS

WPS is a competitively priced, high performance, highly scalable data processing and analytics software product that allows users to execute programs written in the language of SAS. WPS is supported on a wide variety of hardware and operating system platforms and can connect to and work with many types of data with ease. The WPS user interface (Workbench) is frequently praised for its ease of use and flexibility, with the option to include numerous third-party extensions.

Press Enquiries: press@teamwpc.co.uk

and

http://teamwpc.co.uk/products/wps

Overview

World Programming System (WPS)

What is WPS?

The World Programming System (WPS) is a powerful and versatile platform for working with data. WPS software can run programs written in the language of SAS.

The supported syntax covers core, statistical and graphing functionality, and makes it possible to run many applications written in the language of SAS whilst the breadth of language support in WPS continues to grow.

The WPS Workbench IDE/GUI allows you to create, edit, manage and execute scripts and view the resulting output. Scripts can also be executed from the command line or in batch mode using WPS CLI.

Integrated Modular System

Multi Platform Availability

WPS is available on a wide variety of hardware and operating system platforms, including Microsoft Windows, Apple Mac OS X, Linux (including for System z), AIX, Solaris and IBM Mainframe z/OS.

 

User Interface

WPS can be used in a number of different ways:

Handle Large Data Volumes

WPS can read and write to many of the most commonly used data file formats, databases and data warehouses. It is capable of handling huge data volumes, wherever the processing occurs, be that on a mainframe, in a cloud, cluster, grid, server or workstation.

 

and

http://teamwpc.co.uk/support/release/wps

Summary of Main New Features in WPS Version 3

For a more generalised overview of the current features of WPS, not just the ‘new’ features summarised below, please refer to the Product section.

Here are the main new features of the latest release.

  • Multi-Platform Workbench
    The WPS Workbench (IDE/GUI) is now offered on the following platforms:

    - AIX
    - Linux (x86 and System z)
    - Mac OS X
    - Solaris (x86 and Sparc)
    - Windows

  • Workbench Feature Enhancements
    The WPS Workbench has received many usability enhancements including:

    - Dataset import/export wizard.
    - 3rd-party Eclipse plugin support.
    - Rename/delete datasets.
    - Assign/deassign libraries (libnames).
    - Find values in dataset viewer.
    - Enhanced dataset viewer display.
    - Automatic management of ODS HTML and Listing output.
    - Regular expression support in ‘find’ features.
    - Improved character set/codepage support.
    - Multiple concurrent WPS servers (see below).
    - WPS Link remote server capability (see below).

  • Multiple Concurrent WPS Servers
    In previous releases of the WPS Workbench it was only possible to have one local server on which you could run your scripts. WPS version 3 allows you to set up multiple servers in the WPS Workbench and to pick which server to run any given script on. The WPS Workbench manages all the output, logs and datasets generated by each server for you.
    This enhancement, combined with the New WPS Link technology (see below) allows you to run your programs wherever you would like and control it all from the WPS Workbench.
  • Remote Server Connection
    New WPS Link technology allows the WPS Workbench to link to remote WPS Servers on other Mac, Linux or UNIX servers and to run scripts on those machines. It also allows you to view any resulting output locally in your WPS Workbench on your local machine. This enables you to make use of centralised storage and processing resources including grids and clusters of WPS processing servers and removes the requirement to process or store any data on the workstation.
  • Multi-Threading Summarisation
    Workstations and Servers with multiple CPU cores or hyper-threading can benefit from the new multi-threaded summarising engine in WPS version 3.
    This significantly improves the performance of many procedures within WPS that perform summarisation of data including PROC SUMMARY, PROC MEANS and other statistical procedures such as PROC TTEST.
  • Microsoft Windows® Installer
    WPS for Windows now allows in-place upgrade without requiring the removal of the previous version of WPS beforehand.
  • Core Language Support
    WPS version 3 continues the expansion of its language support with even more new language items.
  • Statistical Analysis
    The support in WPS Statistics has been expanded to include:

    - PROC DISTANCE
    - PROC FACTOR
    - PROC GLM
    - PROC GLMMOD
    - PROC PRINCOMP
    - PROC STDIZE
    - PROC TTEST

    PROC LOGISTIC has been improved to allow the following model selection methods: FORWARDS, BACKWARDS, STEPWISE and FAST.
    Numerous other statements and options have been added to the DATA STEP and other PROCS.

  • Financial Functions
    Support has been added for the following financial functions:

    - PMT
    - IPMT
    - PPMT
    - CUMIPMT
    - CUMPRINC
    - EFFRATE
    - NOMRATE

  • DATA Step Enhancements
    Support has been added for MODIFY and UPDATE statements within the DATA Step as well as support for NOMISS and UNIQUE constraints. Numerous other enhancements have also been added to the DATA Step like the addition of the COMPGED, CALL COMPCOST and UUIDGEN DATA Step functions.
  • Data Set Index Enhancements
    WPS support for data set indexes has been extended and optimised to offer faster index build and modification actions as well as faster index retrieval. Index creation speed has been dramatically improved. For example, on a 50 million row dataset WPS version 3 may create an index 10 times faster than WPS 2. The index files WPS version 3 produces are also significantly smaller than those produced by previous WPS versions, typically up to 50% smaller.
  • Improved WPD Library Engine
    WPS version 3 has a new, improved version of the World Programming Data Set (WPD) library engine. WPD files generated by version 3.x cannot be read by previous version 2.X releases of WPS.

    Version 2 files can be read and written to by WPS 3 using the new WPDV2 engine.
    The WPD engine in WPS 3 can read version 2 datasets without program modification, however by default the WPD engine will now write WPS 3 datasets. Please see Upgrading comments below*.

  • Sybase®
    A new WPS Engine for Sybase on Windows, Linux, Solaris and AIX platforms.
  • XML Data Support
    A new Library engine for XML in the WPS Core module provides generic XML data import/export support and use of Oracle, CDISC and XMLMAP transformations.
  • PROC IMPORT/EXPORT support for Microsoft Access and Excel
    WPS version 3 now provides full support in PROCs IMPORT/EXPORT for Microsoft Access and Excel.
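
The multi-threaded summarisation item in the list above is easier to picture with a small sketch. The release notes do not describe how WPS does this internally, so what follows is only an illustration of the general idea: summarise chunks independently, then merge the partial results with the standard pairwise update for mean and sum of squared deviations. All function names are my own, and in CPython the threads here illustrate the decomposition rather than deliver a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def summarise(chunk):
    """Return (count, mean, sum of squared deviations) for one non-empty chunk."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2

def merge(a, b):
    """Combine two partial summaries into the summary of their union."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

def parallel_mean_variance(data, n_chunks=4):
    """Mean and sample variance of data (needs at least two values)."""
    size = -(-len(data) // n_chunks)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(summarise, chunks))
    n, mean, m2 = reduce(merge, partials)
    return mean, m2 / (n - 1)

mean, variance = parallel_mean_variance([float(x) for x in range(1, 11)])
print(mean, variance)
```

The point is that the merge step only needs three numbers per chunk, so the chunks can live on different cores, or different machines, without any of them seeing the full dataset.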

New Plotters in Rapid Miner 5.2

I almost missed this because of my vacation and traveling.

Rapid Miner has a tonne of new stuff (Statutory Ethics Declaration: Rapid Miner has been an advertising partner for Decisionstats, see the right margin).

see

http://rapid-i.com/component/option,com_myblog/Itemid,172/lang,en/

Great New Graphical Plotters

and some flashy work

and a great series of educational lectures

A Simple Explanation of Decision Tree Modeling based on Entropies

Link: http://www.simafore.com/blog/bid/94454/A-simple-explanation-of-how-entropy-fuels-a-decision-tree-model

Description of some of the basics of decision trees. Simple and hardly any math, I like the plots explaining the basic idea of the entropy as splitting criterion (although we actually calculate gain ratio differently than explained…)
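
For readers who want to see entropy as a splitting criterion in numbers, here is a minimal sketch of the textbook definitions of information gain and gain ratio. As the description above notes, RapidMiner calculates gain ratio differently, so this is not RapidMiner's implementation; the function names and the toy data are my own.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )

def split_metrics(feature_values, labels):
    """Return (information gain, gain ratio) for splitting labels on a feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Entropy remaining after the split, weighted by branch size.
    remaining = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remaining
    # Split information penalises features with many small branches.
    split_info = entropy(feature_values)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    return info_gain, gain_ratio

outlook = ["sunny", "sunny", "rainy", "rainy"]
play = ["no", "no", "yes", "yes"]
print(split_metrics(outlook, play))
```

A feature that separates the classes perfectly removes all the entropy, while a feature unrelated to the label leaves the gain at zero.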

Logistic Regression for Business Analytics using RapidMiner

Link: http://www.simafore.com/blog/bid/57924/Logistic-regression-for-business-analytics-using-RapidMiner-Part-2

Same as above, but this time for modeling with logistic regression.
Easy to read and covering all basic ideas together with some examples. If you are not familiar with the topic yet, part 1 (see below) might help.

Part 1 (Basics): http://www.simafore.com/blog/bid/57801/Logistic-regression-for-business-analytics-using-RapidMiner-Part-1

Deploy Model: http://www.simafore.com/blog/bid/82024/How-to-deploy-a-logistic-regression-model-using-RapidMiner

Advanced Information: http://www.simafore.com/blog/bid/99443/Understand-3-critical-steps-in-developing-logistic-regression-models

and lastly a new research project for collaborative data mining

http://www.e-lico.eu/

e-LICO Architecture and Components

The goal of the e-LICO project is to build a virtual laboratory for interdisciplinary collaborative research in data mining and data-intensive sciences. The proposed e-lab will comprise three layers: the e-science and data mining layers will form a generic research environment that can be adapted to different scientific domains by customizing the application layer.

  1. Drag a data set into one of the slots. It will be automatically detected as training data, test data or apply data, depending on whether it has a label or not.
  2. Select a goal. The most frequent one is probably “Predictive Modelling”. All goals have comments, so you see what they can be used for.
  3. Select “Fetch plans” and wait a bit to get a list of processes that solve your problem. Once the planning completes, select one of the processes (you can see a preview at the right) and run it. Alternatively, select multiple (selecting none means selecting all) and evaluate them on your data in a batch.

The assistant strives to generate processes that are compatible with your data. To do so, it performs a lot of clever operations, e.g., it automatically replaces missing values if missing values exist and this is required by the learning algorithm or performs a normalization when using a distance-based learner.
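
As a rough picture of the two preprocessing steps mentioned above, here is a minimal sketch of mean imputation followed by min-max normalisation. It is an assumption-laden illustration, not the assistant's actual logic: missing values are represented as `None`, the replacement is the column mean, and all function names are hypothetical.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    present = [x for x in column if x is not None]
    mean = sum(present) / len(present)
    return [mean if x is None else x for x in column]

def min_max_normalise(column):
    """Rescale values to the range [0, 1]; a constant column maps to zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def prepare_for_distance_learner(column):
    """Impute only if something is missing, then normalise."""
    if any(x is None for x in column):
        column = impute_mean(column)
    return min_max_normalise(column)

print(prepare_for_distance_learner([1.0, None, 3.0]))
```

Normalisation matters for distance-based learners because an attribute measured in large units would otherwise dominate the distance.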

You can install the extension directly by using the Rapid-I Marketplace instead of the old update server. Just go to the preferences and enter http://rapidupdate.de:8180/UpdateServer as the update URL.

Of course, Rapid Miner has been one of the most professional open source analytics companies, and they have been doing it for a long time now. I am particularly impressed by the product map (see below) and the graphical user interface.

http://rapid-i.com/content/view/186/191/lang,en/

Product Map

Understanding Indian Govt attitude to Iran and Iraq wars

This is a collection of links for a geo-strategic analysis, and the economics of wars and allies. The author neither condones nor condemns current global dynamics in the balance of power.

nations don’t have friends or enemies…nations only have interests

In 2003

The war in Iraq had a unique Indian angle right at the beginning. Some members of the US administration felt they needed more troops in Iraq, and they started negotiating with India. Those negotiations broke down because the Indians wanted to fight under the UN flag, and over MONEY!!

India wanted-

  • More money per soldier deployed,
  • more share in post War Oil Contracts,
  • better diplomatic subtlety
The govt changed in India after elections in 2004 (Muslim voters are critical to any party forming a majority govt), and the Iraq war ran its tragic course without any explicit Indian support.

On 26 Nov 2008, Islamist terrorists killed US, Indian and Israeli citizens in the terror strikes of the Mumbai siege, thus proving that appeasing terrorist nations is just riding a tiger.

http://articles.timesofindia.indiatimes.com/2003-06-13/india/27203305_1_stabilisation-force-indian-troops-pentagon-delegation

NEW DELHI: There will be a lot of Iraq on the menu over the weekend before the Pentagon team arrives here on Monday to talk India into sending troops to the war-torn nation.

http://articles.timesofindia.indiatimes.com/2003-07-28/india/27176989_1_troops-issue-stabilisation-force-defence-policy-group

Jul 28, 2003, 01.28pm IST

NEW DELHI: Chairman of the US Joint Chiefs of Staff Gen Richard B Myers, who is arriving here on Monday evening on a two-day visit, will request India to reconsider its decision on sending troops to Iraq.

and

Jul 29, 2003, 07.00pm IST

NEW DELHI: Though Gen Myers flatly denied his visit had anything to do with persuading India to send troops to Iraq, it is evident that the US desperately wants Delhi to contribute a division-level force of over 15,000 combat soldiers.

http://articles.timesofindia.indiatimes.com/2003-09-10/india/27176101_1_stabilisation-force-force-under-american-control-regional-dialogue

Sep 10, 2003, 05.34pm IST

NEW DELHI: Even as the US-drafted resolution on Iraq is being heatedly debated in many countries, American Assistant Secretary of State for South Asia Christina Rocca held a series of meetings with External Affairs Ministry officials on Wednesday.

Though it was officially called “a regional dialogue”, the US request to contribute a division-level force of over 15,000 combat soldiers to the “stabilisation force” in Iraq is learnt to have figured in the discussions.

The penny-wise, pound-foolish attitude of then Defense Secretary Rumsfeld led to the breakdown in negotiations.

“Those who fail to learn from history are doomed to repeat it.” Sir Winston Churchill

In 2012

The Indian govt again faces elections, and we have 150 million Muslim voters, just as other countries have influential lobbies.

And while Israelis are again being targeted in attacks in India, India is still seeking money:

India has struck a defiant tone over new financial sanctions imposed by the United States and European Union to punish Iran for its nuclear programme, coming up with elaborate trade and barter arrangements to pay for oil supplies.

However, the president of the All India Rice Exporters’ Association said Monday’s attack on the wife of an Israeli diplomat in the Indian capital will damage trade with Iran and may complicate efforts to resolve an impasse over Iranian defaults on payments for rice imports worth around $150 million.

http://timesofindia.indiatimes.com/india/Unfazed-by-US-sanctions-India-to-step-up-ties-with-Iran/articleshow/11887691.cms

India buys $5 billion worth of oil from Iran annually. Clearly it is a critical trading partner for Iran.

It has now gotten extra sops from Iran to continue trading, and is waiting for a sweeter monetary offer from the US and/or Israel before it will even consider going through the pain of overcoming the inertia of its ties with Iran.

There are some aspects of political corruption as well, as the Indian political establishment is notoriously prone to corruption by lobbyists (apparently there is a global war on lobbyists that needs to happen).

http://timesofindia.indiatimes.com/india/Unfazed-by-US-sanctions-India-to-step-up-ties-with-Iran/articleshow/11887691.cms

Feb 14, 2012, 05.54PM IST

Unfazed by US sanctions, India to step up ties with Iran

NEW DELHI: Unfazed by US sanctions and Israel linking Tehran to the attack on an Israeli embassy car here, India is set to ramp up its energy and business ties with Iran, with a commerce ministry team heading to Tehran to explore fresh business opportunities.

The team is expected to go to Tehran later this month to discuss steps to expand India’s trade with Iran, part of a larger strategy to pay for Iranian oil, said highly-placed sources. 

Despite the US and European Union sanctions on Iran, India recently sealed a payment mechanism under which Indian companies will pay for 45 percent of their crude oil imports from Iran in rupees. 

So diplomats will argue over money in Israel, India and the US while terrorists will kill.

Against Stupidity, the Gods Themselves Contend in Vain