How to find out who is spamming you

Step 1-

We assume you have Gmail. If you don't have Gmail, you deserve the spam.

Click "Show original" in the drop-down menu of the spammy message.

 

You see a lot of mumbo jumbo.

(Or you can just pick the IP addresses from comment spam.)

Step 2-

Pick the IP addresses out of the mumbo jumbo above (called headers).

http://en.wikipedia.org/wiki/IP_address

An Internet Protocol address (IP address) is a numerical label assigned to each device (e.g., computer, printer) participating in a computer network that uses the Internet Protocol for communication. An IP address serves two principal functions: host or network interface identification and location addressing.
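Picking the addresses out by eye gets tedious, so here is a minimal Python sketch of Step 2. It assumes you have pasted the "Show original" text into a string; the function name, the sample headers, and the list of private ranges to skip are my own choices for illustration, not anything Gmail provides.

```python
import ipaddress
import re

# Loose pattern for dotted-quad candidates; real validation happens below.
CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Ranges that never identify a sender on the public internet (assumed list).
PRIVATE_NETS = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
                "127.0.0.0/8", "169.254.0.0/16")
]


def extract_public_ips(raw_headers):
    """Return IPv4 addresses found on Received: lines, in order, no repeats."""
    found = []
    # Unfold continuation lines (header lines that start with whitespace).
    unfolded = re.sub(r"\r?\n[ \t]+", " ", raw_headers)
    for line in unfolded.splitlines():
        if not line.lower().startswith("received:"):
            continue
        for text in CANDIDATE.findall(line):
            try:
                ip = ipaddress.ip_address(text)
            except ValueError:
                continue  # e.g. 999.1.1.1 is not a valid address
            if any(ip in net for net in PRIVATE_NETS):
                continue
            if text not in found:
                found.append(text)
    return found


# Made-up headers using documentation-only addresses.
SAMPLE = (
    "Delivered-To: me@example.com\n"
    "Received: from mail.example.net (mail.example.net [203.0.113.7])\n"
    "        by mx.example.com with ESMTP id abc123;\n"
    "Received: from [192.168.1.10] (unknown [198.51.100.23])\n"
    "        by mail.example.net with ESMTP\n"
    "Subject: You have won\n"
)

print(extract_public_ips(SAMPLE))
```

Keep in mind that Received lines added before the message reached your own mail provider can be forged, so treat the result as a lead and not as proof.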

Step 3-

Find out who holds that IP address using ARIN (the American Registry for Internet Numbers).

https://www.arin.net/
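If you have more than a handful of addresses, the lookup can be scripted. The sketch below assumes ARIN's RDAP service is reachable at the base URL shown and that the reply uses the standard RDAP network fields (name, handle, startAddress, endAddress); the function names are mine. Nothing is fetched until you call lookup() yourself.

```python
import ipaddress
import json
import urllib.request

# Assumed base URL of ARIN's RDAP service for IP lookups.
RDAP_BASE = "https://rdap.arin.net/registry/ip/"


def rdap_url(ip_text):
    """Build the lookup URL; raises ValueError if ip_text is not an IP address."""
    ip = ipaddress.ip_address(ip_text)
    return RDAP_BASE + str(ip)


def summarize(record):
    """Keep only the fields needed to see who holds the address block."""
    return {
        "name": record.get("name"),
        "handle": record.get("handle"),
        "range": (record.get("startAddress"), record.get("endAddress")),
    }


def lookup(ip_text, timeout=10):
    """Fetch and summarize the registration record (needs network access)."""
    request = urllib.request.Request(
        rdap_url(ip_text), headers={"Accept": "application/rdap+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return summarize(json.load(response))
```

For example, lookup("198.51.100.23") would return a small dictionary with the network name, handle, and address range, provided the service answers as assumed.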

 

Step 4-

Block those IP addresses in your computer's firewall.

http://technet.microsoft.com/en-us/library/cc733090(v=ws.10).aspx

(Or, if you have a self-hosted blog, use the cPanel IP Deny Manager.)

http://www.siteground.com/tutorials/cpanel/ip_deny_manager.htm
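For a longer block list, you can generate the rules instead of typing them. This is a rough Python sketch under two assumptions of mine: that your host accepts old-style Apache "deny from" lines in .htaccess (newer Apache versions use a different directive), and that your Windows machine has the netsh advfirewall command. The function names and the rule name are made up for illustration.

```python
import ipaddress


def _clean(ips):
    """Validate, de-duplicate, and sort IPv4 addresses."""
    return sorted({ipaddress.IPv4Address(ip) for ip in ips})


def htaccess_deny_lines(ips):
    """One old-style Apache 'deny from' line per address."""
    return ["deny from " + str(ip) for ip in _clean(ips)]


def netsh_block_command(ips, rule_name="Block spam sources"):
    """A single Windows Firewall command that blocks inbound traffic."""
    remote = ",".join(str(ip) for ip in _clean(ips))
    return (
        f'netsh advfirewall firewall add rule name="{rule_name}" '
        f"dir=in action=block remoteip={remote}"
    )


spammers = ["203.0.113.7", "198.51.100.23", "203.0.113.7"]
print("\n".join(htaccess_deny_lines(spammers)))
print(netsh_block_command(spammers))
```

Review the output before pasting it anywhere; blocking a shared or dynamic address can lock out legitimate visitors as well.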

Step 5-

 

Communicate with that IP address using IRC

http://en.wikipedia.org/wiki/Internet_Relay_Chat

Internet Relay Chat (IRC) is a protocol for real-time Internet text messaging (chat) or synchronous conferencing. It is mainly designed for group communication in discussion forums, called channels, but also allows one-to-one communication via private message as well as chat and data transfer, including file sharing.

or use HOIC to test your own firewall before people spam you

http://gizmodo.com/5883146/what-is-hoic or

http://www.decisionstats.com/occupy-the-internet/

 

Top 5 XKCD on Data Visualization

By request, an analysis of the Top 5 XKCDs on data visualization. Statisticians and Data Scientists to note-

1) DOT PLOT

 

2)  LINE PLOTS

3) FLOW CHARTS

4) PIE CHARTS and 5) BAR GRAPHS

I am not going into the big, big graphs, of course, like the Star Wars Plot data visualization at http://xkcd.com/657/ or the Money Chart at http://xkcd.com/980/ because I don't believe in data visualization to show off, but to keep it simple simply :)

Now I gotta find me a software that can write my blog for me :)

Analytics for Cyber Conflict

 

The emerging use of Analytics and Knowledge Discovery in Databases for Cyber Conflict and Trade Negotiations

 

This blog post is the first in a series of articles on cyber conflict and the use of analytics for targeting in both offense and defense in conflict situations.

 

It covers knowledge discovery in four kinds of databases (chosen because of their perceived importance, sensitivity, and criticality to the functioning of the geopolitical economic system)-

  1. Databases on Unique Identity Identifiers- including next-generation biometric databases connected to government initiatives and banking, and current-generation databases of identifiers such as government-issued documents made available online
  2. Databases on financial details- this includes not only traditional financial service providers but also online databases with payment details collected by retail product selling corporates like Sony’s PlayStation Network, Microsoft’s Xbox and
  3. Databases on contact details- including those by offline businesses collecting marketing databases and contact details
  4. Databases on social behavior- primarily collected by online businesses like Facebook and other social media platforms.

It examines the role of

  1. voluntary privacy safeguards and government regulations,

  2. weak cryptographic security of databases,

  3. weakness in balancing marketing (maximized data) with privacy (minimized data),

  4. and lastly the role of ownership patterns in database-owning corporations

A small distinction between cyber crime and cyber conflict is that cyber crime focuses on stealing data, intellectual property, and information primarily to maximize economic gains, while cyber conflict focuses on stealing information and also on disrupting the effective working of database-backed systems in order to gain notional competitive advantages in economics as well as geopolitics. Cyber terrorism is basically cyber conflict by non-state agents or by designated terrorist states as defined by the regulations of the “target” entity. A cyber attack is an offensive action related to cyber-infrastructure (like the Stuxnet worm that disabled Iran's uranium enrichment centrifuges). Cyber attacks and cyber terrorism are out of the scope of this paper; we will concentrate on cyber conflicts involving databases.

Some examples are given here-

Types of Knowledge Discovery in -

1) Databases on Unique Identifiers- including biometric databases.

Unique identifiers, or primary keys for identifying people, are critical for any intensive knowledge discovery program. The unique identifier generated must be extremely secure, and it must not be possible to recover the underlying identity by reverse engineering the cryptographic hash function used to generate it.
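To make the point about reverse engineering concrete: a plain hash of a short identifier (say, a national ID number) can often be reversed simply by hashing every possible number and comparing. One common mitigation is a keyed hash, where the result cannot be recomputed without a secret key. The Python sketch below is only an illustration under that assumption; the function names, the normalization step, and the sample ID are hypothetical and do not describe any actual identity program.

```python
import hashlib
import hmac
import secrets


def make_key():
    """Generate a random 256-bit secret; it must be stored apart from the data."""
    return secrets.token_bytes(32)


def pseudonymous_id(raw_identifier, key):
    """Derive a stable identifier that cannot be recomputed without the key."""
    # Normalize so that ' ab-123 ' and 'AB-123' map to the same person.
    normalized = raw_identifier.strip().upper().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


key = make_key()
first = pseudonymous_id(" ab-123 ", key)
second = pseudonymous_id("AB-123", key)
other_key = pseudonymous_id("AB-123", make_key())

print(first == second)     # same person, same key: identifiers match
print(first == other_key)  # different key: identifiers do not match
```

The design choice here is that security rests on keeping the key secret, not on the identifier being hard to guess, which matters when the space of possible raw identifiers is small.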

For biometric databases, an interesting possibility could be determining ethnic identity from biometric information, and also mapping relatives. The biometric information currently collected is fingerprint data, iris data, and facial data. A further feature could be adding voice data to biometric databases.

This is subject to obvious privacy safeguards.

For example, Google recently unveiled facial recognition to unlock Android 4.0 mobiles, only to find out that the security feature could easily be bypassed by using a photo of the owner.

 

 

Example of Biometric Databases

In Afghanistan, more than 2 million Afghans have contributed iris, fingerprint, and facial data to a biometric database. In India, 121 million people have already been enrolled in the largest biometric database in the world. More than half a million customers of the Tokyo Mitsubishi Bank are already using biometric verification at ATMs.

Examples of Breached Online Databases

In 2011, Sony's PlayStation Network (PSN) lost data on 77 million customers, including personal information and credit card information. Additionally, data on 24 million customers was lost by Sony Online Entertainment. The websites of open source platforms like SourceForge, WineHQ and Kernel.org were also broken into in 2011. Even retailers like McDonald's and Walgreens reported database breaches.

 

The role of cyber conflict arises in the following cases-

  1. Databases are online for access and authentication by proper users. They can be breached remotely by non-owners (or “perpetrators”) with a much lower chance of intruder identification, detection, and penalization by regulators or law enforcers (or “protectors”) than in offline modes of intellectual property theft.

  2. Databases are valuable to external agents (or “sponsors”) subsidizing the perpetrators (with finance, technology, information, motivation) for intellectual property theft. Databases contain information that can be used to disrupt the functioning of a particular economy or corporation (or “primary targets”) or for further chain or domino effects in accessing other data (or “secondary targets”).

  3. Loss of data is more expensive to database owners than the enhanced cost of security.

  4. Loss of data is more disruptive to the people whose data is contained within the database (or “customers”).

So the roles played by different people for these kinds of databases consist of-

1) Customers- who are in the database

2) Owners -who own the database. They together form the primary and secondary targets.

3) Protectors- who help customers and owners secure the databases.

and

1) Sponsors- who benefit from the theft or disruption of the database

2) Perpetrators- who execute the actual theft and disruption in the database

Topic models such as LDA are known for data reduction on text, and data visualization, including visualization tied to GPS-based location data, is well known for investigative purposes. The increasing complexity of data generation and the sophistication of machine-learning-driven data processing make this an interesting area to watch.

 

 

The next article in this series will cover-

the kinds of algorithms that are currently used or being proposed for cyber conflict, the role of non-state agents, and what precautions practitioners of knowledge discovery in databases can employ to avoid breaches of security, ethics, and regulation.

Citations-

  1. Michael A. Vatis, Cyber Attacks During the War on Terrorism: A Predictive Analysis, Institute for Security Technology Studies, Dartmouth College.
  2. Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, From Data Mining to Knowledge Discovery in Databases.

Jill Dyche on 2012

In part 3 of the series for predictions for 2012, here is Jill Dyche, Baseline Consulting/DataFlux.

Part 2 was Timo Elliott, SAP at http://www.decisionstats.com/timo-elliott-on-2012/ and Part 1 was Jim Kobielus, Forrester at http://www.decisionstats.com/jim-kobielus-on-2012/

Ajay: What are the top trends you saw happening in 2011?

 

Well, I hate to say I saw them coming, but I did. A lot of managers committed some pretty predictable mistakes in 2011. Here are a few we witnessed in 2011 live and up close:

 

1. In the spirit of “size matters,” data warehouse teams continued to trumpet the volumes of stored data on their enterprise data warehouses. But a peek under the covers of these warehouses reveals that the data isn’t integrated. Essentially this means a variety of heterogeneous virtual data marts co-located on a single server. Neat. Big. Maybe even worthy of a magazine article about how many petabytes you’ve got. But it’s not efficient, and hardly the example of data standardization and re-use that everyone expects from analytical platforms these days.

 

2. Development teams still didn’t factor data integration and provisioning into their project plans in 2011. So we saw multiple projects spawn duplicate efforts around data profiling, cleansing, and standardization, not to mention conflicting policies and business rules for the same information. Bummer, since IT managers should know better by now. The problem is that no one owns the problem. Which brings me to the next mistake…

 

3. No one’s accountable for data governance. Yeah, there’s a council. And they meet. And they talk. Sometimes there’s lunch. And then nothing happens because no one’s really rewarded (or penalized, for that matter) on data quality improvements or new policies. And so the reports spewing from the data mart are still fraught and no one trusts the resulting decisions.

 

But all is not lost since we’re seeing some encouraging signs already in 2012. And yes, I’d classify some of them as bona-fide trends.

 

Ajay: What are some of those trends?

 

Job descriptions for data stewards, data architects, Chief Data Officers, and other information-enabling roles are becoming crisper, and the KPIs for these roles are becoming more specific. Data management organizations are being divorced from specific lines of business and from IT, becoming specialty organizations—okay, COEs if you must—in their own rights. The value proposition for master data management now includes not just the reconciliation of heterogeneous data elements but the support of key business strategies. And C-level executives are holding the data people accountable for improving speed to market and driving down costs—not just delivering cleaner data. In short, data is becoming a business enabler. Which, I have to just say editorially, is better late than never!

 

Ajay: Anything surprise you, Jill?

 

I have to say that Obama mentioning data management in his State of the Union speech was an unexpected but pretty powerful endorsement of the importance of information in both the private and public sector.

 

I’m also sort of surprised that data governance isn’t being driven more frequently by the need for internal and external privacy policies. Our clients are constantly asking us about how to tightly-couple privacy policies into their applications and data sources. The need to protect PCI data and other highly-sensitive data elements has made executives twitchy. But they’re still not linking that need to data governance.

 

I should also mention that I’ve been impressed with the people who call me who’ve had their “aha!” moment and realize that data transcends analytic systems. It’s operational, it’s pervasive, and it’s dynamic. I figured this epiphany would happen in a few years once data quality tools became a commodity (they’re far from it). But it’s happening now. And that’s good for all types of businesses.

 

About-

Jill Dyché has written three books and numerous articles on the business value of information technology. She advises clients and executive teams on leveraging technology and information to enable strategic business initiatives. Last year her company Baseline Consulting was acquired by DataFlux Corporation, where she is currently Vice President of Thought Leadership. Find her blog posts on www.dataroundtable.com.

Using SAS and R Together

Proc r

I really liked this code snippet paper from JSS, enough to upload and embed it here. It shows that using R from within Base SAS is quite easy, though Phil Rack of MineQuest gets credit for writing the earliest macros for that (in the SAS language product WPS) at http://www.minequest.com/Bridge2R.html .
I also liked Phil Holland’s paper on that at http://www.hollandnumerics.co.uk/pdf/SAS2R2SAS_paper.pdf
and Sam Croker’s paper on using Time Series in both SAS and R at
A great blog on using both languages together is
SAS and R: Data Management, Statistical Analysis, and Graphics
The earliest book on the topic of R being used by SAS users was by Bob Muenchen, of course at
R for SAS and SPSS Users (Statistics and Computing)

Of course you can refer to official SAS/IML documentation as well for using SAS and R -
Case Studies on using R and SAS together can be seen from here-

 multiple case studies, and in each a comparison of R and SAS, as well as ways of combining the two together. This included calling R from SAS, and using R to generate SAS code.

Most striking for me was the comparison of SAS with R in a live, corporate financial context, and the presentation of R as a viable, robust, industrial strength option, with some unique advantages, and admitted weaknesses.

I hope that Hong can present this again to the Sydney Users of R Forum (SURF)

His presentation slides can be found here.

But the last word goes to this document on doing graphs, from

http://biostat.mc.vanderbilt.edu/wiki/pub/Main/RafeDonahue/doingmore_currentversion.pdf

drawing a plot with SAS/Graph and then modify its defaults and make it better. Along the way I will discuss issues that will arise with how the code runs and how SAS works and whatnot.
Then we’ll start over and do the whole thing all over again with R.