Monday, February 1, 2010

Re: remove and clean CDATA out of xml

On Mon, February 1, 2010 3:10 pm, bw wrote:
> I am looking for a way to remove the CDATA and only get the text.
> CURRENT:
> <add>
> <doc>
> <some_title>My title</some_title>
> <content><![[CDATA[
> <p>The <strong>keyword</strong> is nice to have but is not needed to
> include in a solr feed</p><p><table cellspacing="2" cellpadding="2"
> border="1" width="100%"><tbody><tr><td>&#201;tape 1&nbsp;:</td></tr>
> ]]></content>
> </doc>
> <doc>
> ....
> </doc>
> </add>
>
> WANTED:
> <add>
> <doc>
> <some_title>My title</some_title>
> <content>The keyword is nice to have but is not needed to
> include in a solr feed</content>
> </doc>
> <doc>
> ....
> </doc>
> </add>
>
> any vim tricks to do this?

If the start and end patterns are always on separate lines, you could
use something like this:
:g/\V<![[CDATA[/+,/\V]]>/-s/<\_[^>]*>//g
followed by an additional
:%s/\V<![[CDATA[\|]]>//
to remove the remaining <![[CDATA[ and ]]> delimiters.

Alternatively, you could use something like
:%s/\V<![[CDATA[\_.\{-}]]/\=substitute(submatch(0),
'\(<[^>]*>\)\|\(^\V![[CDATA[\)\|\(\V]]\$\)', '', 'g')/
(this is a single line; barely tested, but it should work for your
example case).

Both approaches, however, leave the entities in "&#201;tape 1&nbsp;:"
(&#201; and &nbsp;) in your text. So you could fold the expression
:s/&[^;]*;//
into the previous expression, which would then look like this:
:%s/\V<![[CDATA[\_.\{-}]]/\=substitute(submatch(0),
'\(<[^>]*>\)\|\(^\V![[CDATA[\)\|\(\V]]\$\)\|\m\(&[^;]*;\)', '', 'g')/
and should work. However, I have only barely tested it.
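
To check the result outside Vim, the same three steps (find each CDATA
block, drop the tags inside it, drop the entities inside it) can be
sketched in Python. This is only an illustration, not a translation of
the Vim commands above: the name strip_cdata and the sample string are
made up here, and it assumes the delimiters are spelled exactly as in
your feed ("<![[CDATA[" and "]]>").

```python
import re

# Assumption: delimiters are spelled as in the sample feed,
# "<![[CDATA[" ... "]]>", not the standard "<![CDATA[".
CDATA_BLOCK = re.compile(r"<!\[\[CDATA\[(.*?)\]\]>", re.DOTALL)
TAG = re.compile(r"<[^>]*>")      # any markup tag, may span lines
ENTITY = re.compile(r"&[^;]*;")   # entities such as &#201; or &nbsp;


def strip_cdata(text):
    """Replace each CDATA block by its body, minus tags and entities."""
    def clean(match):
        body = TAG.sub("", match.group(1))
        return ENTITY.sub("", body)
    return CDATA_BLOCK.sub(clean, text)


# Hypothetical sample, shortened from the feed in the question.
sample = (
    "<content><![[CDATA[\n"
    "<p>The <strong>keyword</strong> is nice</p>\n"
    "]]></content>"
)
print(strip_cdata(sample))
```

Text outside a CDATA block is left untouched, so the surrounding
<content> tags survive. Like the Vim pattern, &[^;]*; is crude: a bare
"&" followed later by a ";" would be removed together with everything
between them.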

regards,
Christian

--
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
