I talked about the PostgreSQL backend for the State History some time ago. I've had a working implementation for a while, but I hadn't been able to test it until last week, since the scalability tests were keeping my test machine busy pretty much 24/7!
The initial results were absolutely awful: construction would take a LOT of time. This was also my first database-connected application, so, as might be expected, it wasn't very optimized at first. After a bit of fiddling, reading, and asking more knowledgeable people, we understood that with the default settings, the JDBC connector runs one database transaction for every INSERT statement. While this gives nice integrity guarantees, it also seriously limits performance!
With one transaction per insert, it was doing around 100 insertions per second. Relatively speaking, that's not bad at all. But the smallest trace I have (the 10 MB VSTF trace) still produces about 3 million state intervals, which means inserting the complete history into the database would take about 8 hours. For a 10 MB trace!
I did not want to stray too far from the default configuration, since it would become less and less representative (and I would risk hurting performance without knowing it). One simple change I did make, though, was to turn off database auto-commit and run Connection.commit() only once at the end, after all the intervals have been inserted. Of course, the .commit() operation itself takes some time, but overall it was much faster than committing at every insertion.
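As a sketch of that change, the JDBC pattern looks something like the following. The table and column names are hypothetical, and the encoding of intervals as degenerate line-string geometries is an assumption on my part (the post doesn't spell it out):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BulkInsert {

    /** Hypothetical representation of one state interval. */
    public record StateInterval(int attrib, long start, long end, String value) {
        /** Encode the interval as a flat line string on the x-axis (assumed). */
        String asWkt() {
            return "LINESTRING(" + start + " 0, " + end + " 0)";
        }
    }

    // Hypothetical table and column names, for illustration only.
    static final String INSERT_SQL =
        "INSERT INTO intervals (attrib, interval, statevalue) "
      + "VALUES (?, ST_GeomFromText(?, -1), ?)";

    public static void insertAll(Connection conn, List<StateInterval> intervals)
            throws SQLException {
        conn.setAutoCommit(false);               // disable one-transaction-per-INSERT
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (StateInterval si : intervals) {
                ps.setInt(1, si.attrib());
                ps.setString(2, si.asWkt());
                ps.setString(3, si.value());
                ps.executeUpdate();
            }
        }
        conn.commit();                           // single commit at the very end
    }
}
```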
The following results show the performance of the PostGIS backend. When building the history, I turn off auto-commit, run the INSERT statements for the state intervals, then at the end do one .commit(), build an index on the geometry column, and run a vacuum. For the table organization, see the previous blog post linked earlier. The index is created with:
CREATE INDEX ON table USING GIST (interval)
where "interval" is the name of the geometry column.
Full queries are executed using the following statement:
SELECT * FROM table WHERE interval && ST_GeomFromText('POINT((t) 0)',-1)
where (t) is the target time of the full query. In PostGIS, '&&' means "bounding boxes intersect".
Then the result is iterated over and we extract the statevalue data for all attributes.
For single queries I took the exact same statement but added:
AND attrib = (a)
where (a) is the target attribute. Then we extract and return the statevalue of this result.
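The two query shapes above can be sketched as small string builders (the table name "intervals" and the substitution of (t) and (a) directly into the string are mine; the real code would more likely use PreparedStatement parameters):

```java
public class StateQueries {

    /** Full query: all intervals whose geometry intersects the point (t, 0). */
    static String fullQuery(long t) {
        return "SELECT * FROM intervals WHERE interval && "
             + "ST_GeomFromText('POINT(" + t + " 0)', -1)";
    }

    /** Single query: the full query restricted to one target attribute. */
    static String singleQuery(long t, int attrib) {
        return fullQuery(t) + " AND attrib = " + attrib;
    }
}
```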
For comparison purposes, I also re-ran the construction test with the in-memory R-Tree backend implemented previously.
(However, don't directly compare the results posted back then with the ones posted here: the event handler has changed since then, and we store more information now.)
One more change: you might have noticed that the x-axis is now measured in millions of intervals inserted. Since we're talking about R-Trees and databases in particular, this metric makes more sense than just "size of the original trace". The three measurements correspond to the 27, 79 and 121 MB traces, respectively.
As expected, construction is still relatively slow. (This is when doing all insertions plus one commit at the end.) This could undoubtedly be optimized further, for example by doing insertions in a separate thread, so that the "java" process doesn't have to wait on the "postgresql" one, or by running periodic commits at regular intervals. However, as explained earlier, I did not want to stray too far from the default configuration. And even if creation could be optimized a bit, the end result would still be exactly the same, so it wouldn't affect query results.
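For reference, the periodic-commit idea could look like the sketch below. To be clear, this is not what the benchmarks used, and the batch size and schema are made up:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class PeriodicCommit {

    static final int COMMIT_EVERY = 100_000;  // hypothetical batch size

    // Same hypothetical schema as before: (attrib, geometry, statevalue).
    static final String INSERT_SQL =
        "INSERT INTO intervals (attrib, interval, statevalue) "
      + "VALUES (?, ST_GeomFromText(?, -1), ?)";

    /** Insert rows of (attrib, wkt, statevalue), committing every COMMIT_EVERY rows. */
    public static void insertAll(Connection conn, List<String[]> rows)
            throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            int count = 0;
            for (String[] row : rows) {
                ps.setInt(1, Integer.parseInt(row[0]));
                ps.setString(2, row[1]);
                ps.setString(3, row[2]);
                ps.executeUpdate();
                if (++count % COMMIT_EVERY == 0) {
                    conn.commit();  // bound the size of each transaction
                }
            }
        }
        conn.commit();  // flush the final partial batch
    }
}
```

The trade-off is that each commit durably flushes a bounded amount of work, at the cost of losing the single-transaction integrity guarantee over the whole run.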
Surprisingly, query times are really slow too (whereas the R-Tree queries were faster than the History Tree ones). Not only are they slow, they seem to scale linearly with the number of intervals. The indexes do seem to be used, since the single queries are, as expected, faster than the full ones, but they do not seem to benefit from the spatial positioning of the intervals. It's known that R-Trees do not perform very well in only one dimension, but I am still surprised to see them fall back to linear scalability. If anyone has a tip or an explanation as to why PostGIS behaves this way, please let me know.
Note that I usually show the "single queries" Y-axis in microseconds, but it is shown in milliseconds here. It's also noteworthy that some single-query results seem to be almost exactly 10 times faster than the full ones. The test cases were the same as usual (in both cases, 200 random queries, repeated 10 times, averaged), so I frankly can't explain this other than by pure coincidence.
Finally, I looked at disk space usage by comparing the size of the History Tree file to the size of the Postgres directory (obtained with "du -hs /var/lib/postgresql"). The relevant table was the only thing in the database at the time of the measurement, but I am not sure this comparison is meaningful, since there might be leftover stuff in the directory, like caches and indexes.
In conclusion, the PostGIS database backend is slower than the History Tree at everything we want it to do. Query times are not only slower, but they also do not seem to benefit from the logarithmic scaling the History Tree has (up to a certain point). So far, no alternative backend has been able to dethrone the History Tree at storing our state intervals, on either the high or the low end.