[Dailydave] Quick thread on SQLi

Thu Mar 8 18:47:29 EST 2012

On Mar 8, 2012, at 11:17 AM, Michal Zalewski wrote:

>> There are many SQLI patterns that are hard for automated tools to
>> find. This is an obvious point, so I'm sorry to be pedantic, but I think
>> a survey based on automated scanning is a misleading starting point
>> for the discussion.
> 
> Well, the definition of a web application is a surprisingly
> challenging problem, too. This is particularly true for any surveys
> that randomly sample Internet destinations.
> 
> Should all the default "it works!" webpages produced by webservers be
> counted as "web applications"? In naive counts, they are, but
> analyzing them for web app vulnerabilities is meaningless. In
> general, at what level of complexity does a "web application" begin,
> and how do you measure that when doing an automated scan?
> 
> Further, if there are 100 IPs that serve the same www.youtube.com
> front-end to different regions, are they separate web applications? In
> many studies, they are. On the flip side, is a single physical server
> with 10,000 parked domains a single web application? Some studies see
> it as 10,000 apps.

[more about various subdomain configurations deleted]

This is actually a well-researched topic, though in the area of massive web crawlers rather than vulnerability surveys. The reason is that a crawler needs to balance:

* Issue parallel queries to different domains for performance, but don't overload a single server that hosts many of them
* Make forward progress against different subdomains, but don't be vulnerable to a spider-trap DNS that returns $(PRNG).example.com
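
To make the two constraints concrete, here is a minimal sketch in Python. It is not the IRLbot scheme, just a toy illustration: the class name, the `min_delay` and `max_subdomains` values, and the "last two labels" notion of a parent domain are all assumptions made up for this example.

```python
from collections import defaultdict


class PoliteScheduler:
    """Toy scheduler: per-server delay plus a cap on subdomains per parent domain."""

    def __init__(self, min_delay=1.0, max_subdomains=100):
        self.min_delay = min_delay            # seconds between hits to one server (assumed value)
        self.max_subdomains = max_subdomains  # subdomain budget per parent domain (assumed value)
        self.next_ok = defaultdict(float)     # server IP -> earliest time of next request
        self.seen_subs = defaultdict(set)     # parent domain -> subdomains admitted so far

    @staticmethod
    def parent_domain(host):
        # Naive: last two labels. A real crawler would consult the Public Suffix List.
        return ".".join(host.lower().split(".")[-2:])

    def admit(self, host):
        """Return True if this host still fits in its parent domain's budget."""
        subs = self.seen_subs[self.parent_domain(host)]
        if host.lower() in subs:
            return True
        if len(subs) >= self.max_subdomains:
            return False  # budget spent: possible spider trap, stop expanding
        subs.add(host.lower())
        return True

    def try_fetch(self, host, ip, now):
        """Return True if a request to host (served from ip) may be sent at time now."""
        if not self.admit(host):
            return False
        if now < self.next_ok[ip]:
            return False  # some domain on the same server was hit too recently
        self.next_ok[ip] = now + self.min_delay
        return True
```

Keying the delay on the server IP rather than the hostname is what protects a box with thousands of parked domains, and the fixed subdomain budget is the crudest possible defense against $(PRNG).example.com. A flat cap also penalizes legitimately large sites, which is why a real crawler needs something smarter.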

The best paper on this so far is for IRLbot:

H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov, "IRLbot: Scaling to 6 Billion Pages and Beyond"
http://irl.cs.tamu.edu/people/hsin-tsang/papers/tweb2009.pdf

See sections 6 and 7 for their scheme to balance these priorities. It's quite clever how they combine this with a disk-based queue to avoid running into RAM limits. The result is a web crawler that saturates the network link and has no weak points where it sits waiting for a robots.txt response or something.

On your topic, perhaps you can apply some of their algorithms + some heuristics (exclude "it works" pages, find .php extensions, etc.) to get a fair estimate of the number of web apps at the subdomain level. This would leave out multiple web apps on a single subdomain, but at least it's a start.
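
As a rough sketch of what such heuristics might look like in Python: the marker strings, the extension list, the 1024-byte cutoff, and the function name below are all illustrative assumptions, not a vetted ruleset, and the function only looks at a page body and the links found on it.

```python
from urllib.parse import urlparse

# Illustrative, incomplete lists; both are assumptions for this sketch.
DEFAULT_PAGE_MARKERS = ("it works!", "welcome to nginx")
DYNAMIC_EXTENSIONS = (".php", ".asp", ".aspx", ".jsp", ".cgi")


def looks_like_web_app(body, links):
    """Guess whether a subdomain's front page belongs to a web application."""
    text = body.lower()
    # A short page containing a stock server banner is treated as a placeholder.
    if len(text) < 1024 and any(m in text for m in DEFAULT_PAGE_MARKERS):
        return False
    for link in links:
        parsed = urlparse(link)
        # Dynamic file extensions or query strings suggest server-side code.
        if parsed.path.lower().endswith(DYNAMIC_EXTENSIONS) or parsed.query:
            return True
    # Fall back to the presence of an HTML form as a weak signal.
    return "<form" in text
```

Counting the subdomains for which this returns True gives the subdomain-level estimate. It will miss apps behind rewritten URLs and miscount sites that only look dynamic, so treat the number as a rough bound.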

-Nate