Solr indexing of a large data set


I have content that is about 50 TB in size, with 250 million documents in the set. The daily increment is not large: roughly 10,000 documents of varying sizes, totaling under 50 MB. The current indexing effort is taking way too long, guesstimated to complete in 100+ days!

Is this really that large a data set? To me, 50 TB of content (in this day and age) is not that large. Do you have content of this size? If you do, how did you improve the time taken for the one-time indexing? Also, how did you improve the time taken for real-time indexing?

If you can answer both, great. If you can just point me in the right direction, that is appreciated as well.

Thanks in advance.
RD

There are a number of factors to consider.

  1. You can start with the indexing client. Which client are you using: SolrJ, a framework that listens to databases (like Oracle or HBase), or a REST API? This can make a big difference; given that Solr is good at handling updates, it is the client framework and the data preparation on the client side that need to be optimized (see the SolrJ sketch after this list). For example, if you use the HBase indexer (which reads HBase tables and writes to Solr), you can expect a few million documents to be indexed in an hour or so. At that rate, it should not take long to complete 250 million.

  2. After the client, you enter the Solr environment. How many fields are you indexing per document? Do you have stored fields or other overhead in your field types?

  3. Configuration parameters such as autoCommit based on the number of records or RAM size, softCommit as mentioned in the comment above, the number of parallel threads used to index data, and the hardware are other points to consider (see the commitWithin sketch below).
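For point 1, if you index through SolrJ, batching documents and streaming them from background threads usually matters more than anything else on the client side. Here is a minimal sketch, assuming SolrJ 8.x, a hypothetical collection named `mycollection` on `localhost:8983`, and placeholder batch size, queue size, and thread count that you would tune for your own hardware:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.ArrayList;
    import java.util.List;

    public class BulkIndexer {

        public static void main(String[] args) throws Exception {
            // ConcurrentUpdateSolrClient queues documents and streams them to Solr
            // from background threads, so the producer is not blocked on every add.
            try (SolrClient client = new ConcurrentUpdateSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection")
                    .withQueueSize(10_000)   // documents buffered before flushing
                    .withThreadCount(8)      // parallel streaming threads
                    .build()) {

                List<SolrInputDocument> batch = new ArrayList<>();
                for (long i = 0; i < 1_000_000; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", Long.toString(i));
                    doc.addField("title_s", "document " + i);
                    batch.add(doc);

                    // Send documents in batches instead of one request per document.
                    if (batch.size() == 1_000) {
                        client.add(batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    client.add(batch);
                }
                client.commit();  // one explicit commit at the end of the bulk load
            }
        }
    }

The design point is simply that one HTTP request per document will never finish 250 million documents in reasonable time; batched adds plus a concurrent client keep both the client and Solr busy.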
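For the commit settings in point 3: autoCommit and softCommit normally live in solrconfig.xml, but for the small daily increment you can get similar behavior from the client with commitWithin, which avoids issuing an expensive hard commit on every request. A minimal sketch, again assuming SolrJ 8.x and the same hypothetical collection and URL:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class IncrementalIndexer {

        public static void main(String[] args) throws Exception {
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "daily-1");
                doc.addField("title_s", "incremental document");

                // Ask Solr to make this document searchable within 30 seconds
                // instead of forcing a hard commit per request.
                UpdateRequest req = new UpdateRequest();
                req.add(doc);
                req.setCommitWithin(30_000);
                req.process(client);
            }
        }
    }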

You can find a comprehensive checklist here and can verify each item. Happy designing!
