Monday, February 1, 2010

Re: remove and clean CDATA out of xml

Your last comment made me think. I would like all the HTML-encoded
parts, like &#201;, &eacute;, &rsquo; etc., to be transformed into real
UTF-8, as the feed should be UTF-8 (É, é and ’).
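
Something along these lines might be a starting point. It is an untested
sketch that assumes 'encoding' is set to utf-8, so that nr2char() yields
UTF-8 characters. The g:entities table is a hypothetical, deliberately
partial lookup made up for illustration; Vim does not ship one.

```vim
" Decimal references, e.g. &#201; -> É
" str2nr() forces base 10, so leading zeros are not read as octal.
:%s/&#\(\d\+\);/\=nr2char(str2nr(submatch(1)))/g

" Hexadecimal references, e.g. &#xC9; -> É
:%s/&#[xX]\(\x\+\);/\=nr2char(str2nr(submatch(1), 16))/g

" Named entities need a lookup table (hypothetical, extend as needed).
:let g:entities = {'Eacute': "\u00c9", 'eacute': "\u00e9", 'rsquo': "\u2019", 'nbsp': "\u00a0"}

" Unknown names are left untouched by falling back to the whole match.
:%s/&\(\a\w*\);/\=get(g:entities, submatch(1), submatch(0))/g
```

Entities such as &amp; and &lt; are left out of the table on purpose,
since decoding them outside a CDATA section would break the XML.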

Any tips here?

On 01/02/2010, Christian Brabandt <cblists@256bit.org> wrote:
> On Mon, February 1, 2010 3:10 pm, bw wrote:
>> I am looking for a way to remove the CDATA and only get the text.
>> CURRENT:
>> <add>
>> <doc>
>> <some_title>My title</some_title>
>> <content><![[CDATA[
>> <p>The <strong>keyword</strong> is nice to have but is not needed to
>> include in a solr feed</p><p><table cellspacing="2" cellpadding="2"
>> border="1" width="100%"><tbody><tr><td>&#201;tape 1&nbsp;:</td></tr>
>> ]]></content>
>> </doc>
>> <doc>
>> ....
>> </doc>
>> </add>
>>
>> WANTED:
>> <add>
>> <doc>
>> <some_title>My title</some_title>
>> <content>The keyword is nice to have but is not needed to
>> include in a solr feed</content>
>> </doc>
>> <doc>
>> ....
>> </doc>
>> </add>
>>
>> any vim tricks to do this?
>
> If the start and end patterns are always on separate lines, you could
> possibly use something like this:
> :g/\V<![[CDATA[/+,/\V]]>/-s/<\_[^>]*>//g
> followed by an additional
> :%s/\V<![[CDATA[\|]]>//
> to remove the remaining <![[CDATA start and end delimiters.
>
> Alternatively, you could use something like
> :%s/\V<![[CDATA[\_.\{-}]]/\=substitute(submatch(0),
> '\(<[^>]*>\)\|\(^\V![[CDATA[\)\|\(\V]]\$\)', '', 'g')/
> (1 line, barely tested, should work in your example case).
>
> Nevertheless, both leave the &#201;tape 1&nbsp;: parts in your text. So
> you might be able to put the expression
> :s/&[^;]*;//
> into the previous expression, which would then look like this:
> %s/\V<![[CDATA[\_.\{-}]]/\=substitute(submatch(0),
> '\(<[^>]*>\)\|\(^\V![[CDATA[\)\|\(\V]]\$\)\|\m\(&[^;]*;\)', '', 'g')/
> and should work. However, I have only barely tested it.
>
> regards,
> Christian
>
> --
> You received this message from the "vim_use" maillist.
> For more information, visit http://www.vim.org/maillist.php


--
[Bb](astia{2}n)?\s?[Ww](ak{2}ie)?$
