-
Website
http://avc.com/ -
Original page
http://www.avc.com/a_vc/2009/07/hacker-news-and-the-nosql-movement.html -
Subscribe
All Comments -
Community
-
Top Commenters
-
ShanaC
643 comments · 61 points
-
William Mougayar
143 comments · 46 points
-
Daniel Ha
170 comments · 396 points
-
howardlindzon
207 comments · 69 points
-
vruz
241 comments · 19 points
-
-
Popular Threads
-
The AVC Reader Census
6 hours ago · 43 comments
-
Valuation and Option Pool
1 day ago · 89 comments
-
Throw The Bums Out
3 days ago · 220 comments
-
The Double Opt-In Introduction
4 days ago · 141 comments
-
Great Meetup Last Night
2 days ago · 62 comments
-
The AVC Reader Census
One small point worth mentioning is the value of the "ASK HN" posts - this is something no other news community offers. Entrepreneurs & hackers (even VCs) often post questions about anything related to technology or startups, and the quality of responses is amazing. People take a lot of time to write helpful and very well thought out responses in a genuine show of support. It still amazes me every time - we've used ASK HN posts for advice related to MySQL issues, sales, UI feedback and even where to stay in a new place!
Thanks for highlighting such an incredible resource.
i think all important services on the net should have their own domains
it's just a pet peeve of mine
Making the barrier of entry lower can make less vested people involved, but only peripherally. However, it also means information more freely spreads and that that information can mutate very quickly to provide newer and even more sources of information. We also came, somewhat, to the conclusion that it is not as much as a true barrier as one that is similar to that of membrane- permeable. I still take the position that much like a real cell membrane- it is permeable both ways.
As one can see from the data above about Fred's website- many people stop by. Not so many log comments. There is a barrier of entry here, I admit to some days being fire-hosed and running around trying to figure out what is going on. (I stay to be educated)
How do you think sites HackerNews and even this one should be constructed to help people who want to pass through the barrier in both directions? It is not just a domain issue, it is a much larger issue of usability, knowledge, and how to make information available in progressive steps with reinforcements for progress along the way, alongside the flip of how to make comfortable social spheres to help those through the transition processes.
Not that certain knowledge should be Elite- but how to make the eliteness more structured in a friendly way?
The Internet sometimes feels like it disguises the social issues it creates- too much of a firehose of knowledge some days, not enough on others.
How does it provide structure and engagement? The flip is not necessarily a bad system either. In any knowledge creation and curatorial system- one still needs a system in which to engage in and delve deep just as much as one that propagates the knowledge that is out there in new forms?
You once said you were trained like an apprentice. The question for me, is not how to suck people in- but rather how to give people better apprenticeship experiences through a wider system of gaining those contacts and knowledge bases, without sacrificing the tail end of extremely good knowledge and expertise that sometimes only happens when individuals and groups break off away from the mass.
newmogul.com looked interesting but has failed to build the same type of community.
they've got the right aggregation model and UI (I agree about HN's UI), but
not the community
HN has the benefit of the YC community as its seed
newmogul needs something like that
http://arclanguage.org/install
I'm not saying the article is missing the shortcomings of these different technologies but they are definitely blending them all together as if they are all the same thing and equal competitors when they are in fact not.
That said I work on the business side and don't intimately use these technologies....just sell the benefits of the data and info they provide so if anyone understands this space better I would love to hear your opinion or you correct me or set me straight on this stuff. (which i was hoping that article would do and it did not)
They're so similar as well!
1) Architecture incompatibilities the make RDBMS a difficult fit
2) Problems with the RDBMS (scalability, consistency) that prevent it being used even when it is a suitable fit
Solutions include building data architectures based loosely around DHT’s (MongoDB, Project Voldemort, Cassandra, Redis, Wistla etc) or other distributed architectures (Hive & Hadoop more generally etc), fixing some of the RBDMS’s problems (GroovyCorp, Vertica, Drizzle etc) or marrying the RDBMS with other technologies (Aster Data with Map/Reduce etc)
When the dust settles there will be the need for all.
http://bigdatamatters.com/bigdatamatters/2009/0...
It's great to see the NoSQL movement getting mainstream recognition too. I took my own stab at why startups are moving to key/value stores in The SQL Trap:
http://petewarden.typepad.com/searchbrowser/200...
Paul Grahams essay's and ycombinators framework were influencers on my view about the feasibility of a low cost business and potential open development areas
But I think you need to be careful about calling it 'NoSQL'...SQL is just the language used to talk to a traditional relational database (structured query language)...and many of the new systems, like SimpleDB, that are completely different concepts on the storage side still offer up a SQL syntax for inserting data and getting data back out...
So anyway, I think it's more about 'NoRelational' than it is 'NoSQL'...
i didn't make up the term, i just used it, which i guess is implicity saying
i like it
This tradeoff is rooted in a set of unsolved problems in computer science; if you solve the core problems, you get the best of both worlds and both the traditional RDBMS and NoSQL platforms become obsolete. Very few people in this space, even technical people, seem to fully understand the underlying theoretical context in which database technologies are embedded. If one of the core problems was solved, it would look sufficiently different from existing SQL/NoSQL art and be sufficiently technical that I doubt most VCs would give it much of a look despite the revolutionary potential. (Indeed, the couple recent startup ventures I am aware of that have made significant progress toward solving the core algorithm problems are largely ignored by VCs, and have been raising their money from defense agencies and customers with the acumen to recognize the meaning of the algorithm advancements.)
NoSQL is a technical stopgap until the real computer science and algorithm problems are addressed, and there is a lot of progress being made on that. As an investor, it seems wiser to invest money in the ventures that will obviate the use case for NoSQL (and traditional RDBMS) rather than investing in the stopgap measure. There are systems being worked on and tested that are capable of doing complex analytics even a traditional SQL database can't do well, but at scales and with throughput traditionally reserved for systems like Hadoop. From reading the coverage of such things, I expect a lot of people to be caught off-guard when products are released that do not fit into the SQL/NoSQL false dichotomy.
KV stores are not really databases at all, they're really just like the indexes used in a traditional RDBMS (except, of course, they are writable). They make sense when you just need a simple and fast way of getting info in and out of a web page. As you rightly point out, if you want to slice and dice the contents of a KV store in new ways, then you basically have to do all the processing yourself.
There is definitely scope for the creation of a system combining the speed and simplicity of a KV store with the flexibility and power if an RDBMS.
For what it's worth, we use both systems: we use a KV as the front-end and mirror the contents into an RDBMS in the background. It's not perfect, but as long as all the writes go through to KV, it works OK.
I think the nascent focus on "real-time" is going to be the straw that breaks the camel's back because it requires highly scalable updates over complex constraints, meaning that you cannot hybridize your software stack. Worse, constraints are a spatial data type, which have always performed poorly in conventional RDBMS when treated as such. Databases that focus on real-time search have been around for decades from the major vendors under a few different names (stream databases, adaptive dataflow engines, etc), but the reason no one has heard of them is that they scale and perform badly outside of very limited domains.
The best-of-both-worlds solution is simple (in a geek-ish sense) to describe: We need a database built on a parallelizable, associative space-preserving hyper-cube access method. Only a handful of database theory people are familiar with this, and only a few academics have even heard of the only access method algorithm in literature that comes close to delivering it. Nonetheless, there are a few people working on it because the potential is mind-boggling. Real-time search, complex data fusion (think Photosynth on Internet scales), deep predictive modeling, large-scale geospatial, and sensor-based contextual targeting are all applications that depend on a solution to this problem to scale well. It would make for a frankly amazing database system.
(Disclosure: this is my area of expertise, I have on worked on this problem space for Big Companies and occasionally do technology evaluations of database startups for VCs.)
Do you remember A Wrinkle in Time? A tesseract is a hypercube of the space we are in now.
Not that I understand the math (I wish there were more teaching methods of math that were both proof/theory oriented and visually oriented at the same time.), but it sounds like you are saying you want to "fold" through the data by giving it another spacial line to expand and rotate on.
I've only played with SQL (which in its most basic form can be a Relational database system with a few minor constraints on it): It seems to exists only as interconnecting two dimensional spaces. To really have deep data as well as broad data, you would need to have three or four axes to rotate around and stack on for some sort of "keying" system, because otherwise how will you cut through the stuff you don't want.
It's like having a really big spaceship above earth that is really fast, doesn't obey Physics, and you can put it down on Earth where-ever you want. That is what you are looking for in this new kind of database? (Or at least this is my image of it...)
When I was a nipper we just stuffed n-dimensional data into an n-dimensional array, but that was when computer screens came in any color as long as it was green.
But I loved the 'really fast big spaceship' analogy (whatever it means)
+1
Have you ever read Flatland by Edwin Abbott? I'm thinking I and my data live in Flatland with and I need some spaceships from an intruder from the next dimension to get to my data. (If not- http://xahlee.org/flatland/index.html)
A spaceship that can go "through" the earth? Because Flatland only ends in the third dimension- whereas the spaceship can deal with other since it is in a mysterious "above" plane.
(and I was hoping you would say you programmed on those cards, I really could use some of those cards...)
And whilst I am curious to know how you managed to connect databases and spaceships, what I would really like is for you to pass the bong over...
As for the cards, all I can suggest is eBay (they're quite boring, when you seen one you've seen em all).
Take care now.
I'm an art student- it's for a sculpture made up of various kinds of computer data storage materials throughout the "ages." My language in the end is very visual.
Consider an "employee" table with three attributes "name", "title", "employment date". The three attributes are 'dimensions', and every record in the "employee" table can be located in the space of all possible employee records using those three attributes as coordinates. In this way, the employee table describes a 3-dimensional logical space in which all possible records will exist. A query on the employee table is searching for a region of that 3-dimensional space that contains the coordinates described in the WHERE clause of your SQL.
The complication is that in traditional databases we would use a collection of 1-dimensional indexes that are merged to simulate a true 3-dimensional index, and that merge operation does not scale. This problem becomes much worse if you have true spatial data types, like geospatial or constraints (such as what one might use for real-time search).
Ideally, you want a data structure that can directly represent several
dimensions such that any subset of the space those dimensions describe can be very efficiently accessed in a distributed system *and* the data structure can efficiently store data types that can be described as a subset of that space. It is a very tricky problem that has generated thousands of pages of research over decades. We've been using workarounds for years but the rapid growth of data sets has exposed the lack of scalability intrinsic to those workarounds, forcing us to deal with the underlying problem directly.
The issue for me with a hypercube, especially once I looked it up- is its shape and its ability to rotate. If you solve the issue of what shape (because there are many possible n) and how you want to rotate it (there are more axes with each n added) (and those rotations reveal all of the potential shortest distances within the hypercube, or your where statements) then you resolve the problem.
Being "above" the shape, or having the ability to re-look at the shape from a totally different perspective is what is giving me trouble.
Ideally, the more dimensions you have, the more possible hypecubes, and imagining yourself from above the dimension(s) allows you to "play" with each block more efficiently in your head to find out shape one needs in order to really get the shortest distance from point a to point b if one can only take the lines within the hypercube of any n dimensions.
Hence the need for a really large and fast spaceship. You need to be "above n hypercube," and then think about what happens when you add a new n about how it looks and behaves. Spaceships move and are above the earth and if the earth was my hypercube- I could see and use my spaceship to figure out where to go one I got back down onto it by looking around it.
Unless this is more lost than before? I understand the idea of cube trying to stick itself into arbitrary space as we define it- it's more where do I look at the cube that I am trying to understand, or how should I look at the cube to see as much of it as possible.
But, I also think it the focus on SQL or index structures actually misses the point slightly in the context of data storage for large web applications.
The big problem is the generals.
It's not just about moving away from SQL, it's about being forced to deal with the reality that maintaining one consistent view of data stored on multiple nodes has unacceptable performance penalties. Inktomi was the first to run into this in the setting of internet services and all large internet properties are treading onward from the same trailhead they established.
Consensus protocols have unacceptable overhead, so in practice they're generally only used for maintaining critical lookup values for some larger system that leverages weaker consistency requirements.
Scalable partial query result merging doesn't matter for real systems because Read All Write One or Read One Write All end up working better.
For example, let's look at Google's inverse word index for search queries. It's too large to fit in one host, so we must partition the state in one way or another. At first glance the best way would probably seem to allocate the inverse lists for specific words to specific hosts, via hash, directory or some other method. This way a given search only needs to query the specific servers that hold inverse word lists for terms in the search.
This isn't how google actually does it.
Instead they partition by document, and a given search is broadcast to all servers. At first this seems quite wasteful, but it's actually a better fit to current hardware constraints. Each server can apply the full query against each document, avoiding the intermediate merge problem entirely and only returning full matches.
This works better because bisection bandwidth is the least scalable resource in large systems.
This system is straightforward to scale as well. Need to store more documents? Add nodes. No need to worry about a single inverse word list growing beyond a single host and requiring more complicated behavior. Need to process more searches? Replicate the entire cluster as many times as necessary, and accept that replicas will never be exactly in sync. As long as your load balancing sends users preferentially to the same replica you'll avoid making replication lag visible for most users.
Scaling and operating RAWO systems like this is simple. I'd love to see a SQL database that used the same architecture: INSERTS to random nodes, all other statements broadcast, and each replica cluster does async or partially async multi-master replication.
ROWA systems don't scale, but for systems that can live within those limits they're a far simpler method of high availability (example: atomic broadcast like zookeeper vs implementing paxos).
And of course there are many interesting approximate quorum systems that live in the region between those extremes (such as amazon S3, which I gather is R2 W4).
It is clear that this is your area of expertise, so I had to think twice (well, in fact it was even more than that) before replying to your comments.
It is my understanding that your comment implies that there is a solution that would address problems of both worlds: relational and KV. I'm not an expert in the field, but my perception is that if this solution would exists it will be once again something that will try to fit every problem. And this is exactly what happened with RDBMS (what is now called the one size fits all RDBMS).
While I cannot formulate a general rule and I'm basing this hypothesis only on my experience, I'd say that special problems will always need a special solution. Not to mention that I still don't believe in the existence of a panacea. Starting to trust again in a single approach will just limit the solution space we are looking into.
I think that what we should be looking for is a way to address the impedance mismatch between RDBMS and KV and make the two cooperate in a much easier and standardized way, to address the impedance mismatch between current programming paradigms and storage by aligning the access layer and figuring out the 'interaction language', etc.
A radical new paradigm is definitely welcome, but I think that will take a long time to be proved theoretically and even more to be proved by real live apps.
cheers.
ps: I'd love to learn more from you about this new approach, so I'm wondering if it would be possible to move this offline somehow.
thanks for a great comment
fred
The startup to watch for next-generation database technology that addresses the issues being raised here is SpaceCurve. Their database platform is based on some very novel computer science that implements something like the exotic best-of-both-worlds access method I mention up-thread. Virtually all "new" database startups are rehashing existing computer science but that one stands out as both doing something completely different and demonstrably attacking the limitations of current technology.
As a general comment, there is a pattern I have seen before with startups that have technology based on fundamentally new theoretical computer science is very hard to sell to VCs regardless of the capability. The only people that understand what those ventures are doing are other organizations that have already invested considerable R&D in solving those same theoretical problems, which in many cases become "customers". Because there is little real technical expertise outside of that pool to evaluate the technology, most VCs do not want (or can't afford) to invest the time to develop technical expertise. There may be a demonstrable market, but not understanding what that market is really about is a major hurdle because the long-term upside is hard to gauge on that basis. I understand why this happens, but a lot of money gets left on the table and a lot of technologies are under-exploited as a result.
Anyways, what I'd be even more interested to hear (probably in a separate post) is about your involvement in MongoDB project (rationale, future plans, etc.). I've been following this market for quite a while and so far my impression is that most of these projects were created to solve very particular problems (I use as a proof the fact that until recently these guys have never talked to each other and instead of working together in the spirit of open source community each have jumped to implement its own).
thanks in advance
what more would you like to know about MongoDB?
re: MongoDB
While I'm familiar with MongoDB, I don't know much about MongoDB plans, so I'll just speculate here -- in my extremely obvious attempt :-) -- to make you say more about what convinced you to invest in MongoDB (and leaving aside the team argument).
Do you think MongoDB will possibly become a MySQL equivalent (not saying replacement or alternative though) and start using the same model? Or is it that you see opportunity in the open source backed services (f.e. cloudera, springsource, etc)?
So like most things we invest in, the opportunity changed post investment
Thanks a ton
But not if you want to get something done. Sometimes it gets a bit too addictive so I just watch the tweets & very occasionally read the articles & maybe the comments.
I too have been dabbling with key-value stores, but I am still troubled by a comment I read a while ago on Hacker News (http://news.ycombinator.com/item?id=538705):
'In every business I've ever been involved with, the data is always more valuable than the code, and it always outlives the code. Too much of the hype around these alternative database technologies are throwing the baby out with the bathwater. The idea that "most" applications don't need structured data just strikes me as incredibly naive and short-sighted. Far more applications need structured data than need to scale.'
It was OODBMS vs RDBMS in 90's and this following CNET article says it all-
http://news.cnet.com/Jasmines-time-has-come/210...
(This article dated Dec'97 says time has come. That time never came for Jasmine!)
Now it's key-value vs RDBMS and I think RDBMS will again.
I agree with the comment by a key-value developer:
"A few specialized applications can and have been built on a plain DHT, but most applications built on DHTs have ended up having to customize the DHT's internals to achieve their functional or performance goals."
http://spyced.blogspot.com/2009/05/why-you-wont...
Then the computer world article is wrong. FB uses how many thousands of MySQL? Why FB hired a MySQL expert recently? Are they going to throw out all the MySQL instances in the near future? no way...
Why Facebook/Yahoo is building SQL on top of Hadoop? Why Cassandra alias is discussing SQL like for Cassendra?
Even I myself said, Good bye SQL: (i may be also wrong; it might have come out from my past OODBMS experience)
http://uds-web.blogspot.com/2008/12/good-bye-sq...
So, the question is NOT about SQL vs NoSQL and is about what type of database technologies needed for the next gen web apps?
Thanks!!
If a DBA runs Oracle/SQL Server/DB2 and looses data, he is OK.
If a DBA runs any of these un-proven DB and looses data, he will be FIRED.
So it catch-22 for any new DB's. They need big customers to prove that they can scale and sustain the load and customers wants to see the real time deployment before the adapt it!
Here is a classic example: Facebook using Oracle: (note the date- April 09)
http://www.oracle.com/customers/snapshots/faceb...
- their needs are only typical of a handful of the largest web properties
- their use of mysql is heavily sharded, which means throwing out most of what SQL gives you
- most of their actual data access is done to memcached
This architecture resembles a key value store much more than it resembles a traditional RDBMS installation. They're essentially using MySQL as a recovery log, memcached as their hot data store, and implement query processing in the application.
source datastore